A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.
With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.
Description | Value |
---|---|
Source | Microsoft Malware Prediction (Kaggle) |
Training Set Size | 8,920,441 rows × 83 features |
Test Set Size | 7,653,424 rows × 83 features |
File Size | Approx. 8 GB for train.csv |
Target Variable | HasDetections (1 = Malware detected, 0 = No malware detected) |
Data Type | Tabular, mixed categorical & numerical |
Class Balance | Roughly balanced (~50:50 ratio); validation still needs care |
Category | Tools/Libraries | Reason |
---|---|---|
Language | Python 3.11 | Versatile and widely used for ML workflows |
Data Handling | `pandas`, `dask`, `numpy` | Efficient large dataset processing |
Visualization | `seaborn`, `matplotlib`, `plotly` | EDA and visual storytelling |
Machine Learning | LightGBM | High-speed gradient boosting on large datasets |
Evaluation Metrics | `scikit-learn` | Classification reports, confusion matrices |
- Used Dask to handle large CSV files without exceeding system memory.
- Loaded over 8.9 million records with 83 features in a distributed manner.
- Dropped columns with over 40% missing values.
- Removed high-cardinality columns (>500 unique values) to avoid sparse matrices.
- Label-encoded categorical columns.
- Dropped identifier columns like `MachineIdentifier`.
- Cleaned and transformed data while optimizing memory usage.
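
The preprocessing rules above can be sketched in plain pandas as follows. The function name `preprocess` and its parameters are hypothetical, and the sketch assumes the cardinality filter applies only to non-numeric columns, which the list above does not state explicitly.

```python
import pandas as pd


def preprocess(df, target="HasDetections", id_cols=("MachineIdentifier",),
               max_missing=0.40, max_cardinality=500):
    """Apply the column filters and label encoding to a pandas DataFrame."""
    # Identifier columns carry no predictive signal.
    df = df.drop(columns=[c for c in id_cols if c in df.columns])

    # Drop columns whose missing-value fraction exceeds the threshold.
    missing = df.isna().mean()
    df = df.drop(columns=missing[missing > max_missing].index)

    # Assumption: only non-numeric columns are checked for cardinality.
    cat_cols = [c for c in df.columns
                if c != target and not pd.api.types.is_numeric_dtype(df[c])]
    high_card = [c for c in cat_cols if df[c].nunique() > max_cardinality]
    df = df.drop(columns=high_card)

    # Label-encode the remaining categorical columns; missing values become -1.
    for c in cat_cols:
        if c in df.columns:
            df[c] = pd.factorize(df[c])[0].astype("int32")
    return df
```

Storing the codes as `int32` rather than the default 64-bit integers is one simple way to reduce memory use on a dataset of this size.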
- Visualized missing values using heatmaps.
- Explored target variable distribution.
- Plotted feature distributions and their relationship with malware detection.
- Analyzed cardinality of categorical variables.
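
Two of the EDA checks above reduce to short pandas summaries. The sketch below is illustrative only; `detection_rate_by_value` and `cardinality` are hypothetical helper names, and the plots themselves are omitted.

```python
import pandas as pd


def detection_rate_by_value(df, column, target="HasDetections", min_count=1):
    """Row count and mean detection rate for each value of one feature."""
    summary = (
        df.groupby(column, dropna=False)[target]  # keep missing as its own group
          .agg(count="size", detection_rate="mean")
          .sort_values("count", ascending=False)
    )
    return summary[summary["count"] >= min_count]


def cardinality(df):
    """Number of distinct non-null values per column, largest first."""
    return df.nunique().sort_values(ascending=False)
```

Raising `min_count` hides rare values whose detection rate would otherwise be estimated from only a handful of rows.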
- Implemented a LightGBM Classifier with tuned hyperparameters:
  - `num_leaves = 64`
  - `learning_rate = 0.1`
  - `feature_fraction = 0.8`
  - `bagging_fraction = 0.8`
  - `max_depth = 8`
- Split dataset into 85% train and 15% validation.
- Evaluated using:
  - Classification Reports (Precision, Recall, F1-score)
  - AUC-ROC Curve
  - Normalized Confusion Matrices
  - Feature Importance Plot
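
The numeric parts of this evaluation can be computed with scikit-learn as sketched below. The helper name `evaluate` and the 0.5 decision threshold are assumptions; plotting the ROC curve and feature importances is left out.

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)


def evaluate(y_true, y_proba, threshold=0.5):
    """Summarize validation performance from predicted probabilities."""
    # Turn probabilities into hard labels at the (assumed) threshold.
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        # AUC uses the raw probabilities, not the thresholded labels.
        "auc": roc_auc_score(y_true, y_proba),
        "report": classification_report(y_true, y_pred, digits=3),
        # normalize="true" makes each row (true class) sum to 1.
        "confusion": confusion_matrix(y_true, y_pred, normalize="true"),
    }
```

For a fitted classifier this would typically be called as `evaluate(y_val, model.predict_proba(X_val)[:, 1])`.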
- Processed test set in memory-efficient batches.
- Generated malware detection probability predictions.
- Saved results to `result.csv`.
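
One way to score the test set in batches is to stream it with pandas and append each chunk's predictions to the output file. This is a sketch under assumptions: `predict_in_batches`, the chunk size, and the two-column output layout are illustrative, and `preprocess` stands for a hypothetical callable that applies the same column filters and encodings that were fitted on the training data.

```python
import pandas as pd


def predict_in_batches(model, test_csv, out_csv, preprocess,
                       id_col="MachineIdentifier", chunksize=500_000):
    """Stream the test CSV in chunks and append predictions to out_csv."""
    first = True
    for chunk in pd.read_csv(test_csv, chunksize=chunksize):
        ids = chunk[id_col]
        features = preprocess(chunk.drop(columns=[id_col]))
        # Column 1 of predict_proba is the probability of class 1 (detection).
        proba = model.predict_proba(features)[:, 1]
        pd.DataFrame({id_col: ids, "HasDetections": proba}).to_csv(
            out_csv,
            mode="w" if first else "a",  # overwrite once, then append
            header=first,                # write the header only once
            index=False,
        )
        first = False
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the test file.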
Metric | Validation Set Value |
---|---|
Accuracy | ~0.734 |
AUC Score | ~0.79 |
F1 Score | ~0.73 |
- The LightGBM model showed a reasonable ability to discriminate between infected and safe machines (validation AUC of about 0.79).
- The Feature Importance Plot revealed critical features like `SmartScreen`, `AVProductStatesIdentifier`, and `Platform`.
- Implement cross-validation for more robust performance estimation.
- Integrate hyperparameter tuning using `Optuna` or `GridSearchCV`.
- Apply advanced missing value imputation instead of row removal.
- Try additional algorithms (XGBoost, CatBoost) for benchmarking.
- Deploy a scalable API service to accept telemetry data and predict malware probability in real-time.
- Clone the repository

  ```bash
  git clone https://github.com/yourusername/malware-prediction-ml.git
  cd malware-prediction-ml
  ```