A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.
With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.
| Description | Value |
|---|---|
| Source | Microsoft Malware Prediction (Kaggle) |
| Training Set Size | 8,920,441 rows × 83 features |
| Test Set Size | 7,653,424 rows × 83 features |
| File Size | Approx. 8 GB for train.csv |
| Target Variable | HasDetections (1 = Malware detected, 0 = No malware detected) |
| Data Type | Tabular, mixed categorical & numerical |
| Class Balance | Nearly balanced (~50:50 ratio; still needs careful validation) |
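
The class ratio in the table can be checked without loading the full ~8 GB file into memory. The sketch below is a minimal illustration that uses pandas chunked reads (a swapped-in alternative to Dask, chosen only to keep the example small). The helper name `target_rate` and the tiny in-memory sample are hypothetical; only the column names `MachineIdentifier` and `HasDetections` come from the dataset description.

```python
import io

import pandas as pd


def target_rate(csv_source, target="HasDetections", chunksize=100_000):
    """Return the fraction of rows where `target` equals 1, reading in chunks."""
    positives = 0
    total = 0
    # Read only the target column, as a small integer type, to keep memory low.
    reader = pd.read_csv(
        csv_source,
        usecols=[target],
        dtype={target: "int8"},
        chunksize=chunksize,
    )
    for chunk in reader:
        positives += int(chunk[target].sum())
        total += len(chunk)
    return positives / total


# Tiny in-memory stand-in for train.csv (made-up rows, for illustration only).
sample = io.StringIO(
    "MachineIdentifier,HasDetections\n"
    "a,1\n"
    "b,0\n"
    "c,1\n"
    "d,0\n"
)
rate = target_rate(sample, chunksize=2)
print(f"positive rate: {rate:.2f}")
```

On the real data, the same function could be pointed at the path of `train.csv`; a rate close to 0.5 would match the ratio stated above.
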

| Category | Tools/Libraries | Reason |
|---|---|---|
| Language | Python 3.11 | Versatile and widely used for ML workflows |
| Data Handling | pandas, dask, numpy | Efficient large dataset processing |
| Visualization | seaborn, matplotlib, plotly | EDA and visual storytelling |
| Machine Learning | LightGBM | High-speed gradient boosting on large datasets |
| Evaluation Metrics | scikit-learn | Classification reports, confusion matrices |

`MachineIdentifier`.

- `num_leaves = 64`
- `learning_rate = 0.1`
- `feature_fraction = 0.8`
- `bagging_fraction = 0.8`
- `max_depth = 8`

`result.csv`.

| Metric | Validation Set Value |
|---|---|
| Accuracy | ~0.734 |
| AUC Score | ~0.79 |
| F1 Score | ~0.73 |

`SmartScreen`, `AVProductStatesIdentifier`, and `Platform`.

`Optuna` or `GridSearchCV`.

```bash
git clone https://github.com/yourusername/malware-prediction-ml.git
cd malware-prediction-ml