shushantrishav/Microsoft-Malware-Prediction


🛡️ Malware Prediction Using Machine Learning

A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.


🎥Demo Video

2021-11-29.23-08-56_Trim.mp4



📊 Problem Statement

With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.


📦 Dataset Details

| Description | Value |
| --- | --- |
| Source | Microsoft Malware Prediction (Kaggle) |
| Training Set Size | 8,920,441 rows × 83 features |
| Test Set Size | 7,653,424 rows × 83 features |
| File Size | Approx. 8 GB for train.csv |
| Target Variable | HasDetections (1 = Malware detected, 0 = No malware detected) |
| Data Type | Tabular, mixed categorical & numerical |
| Class Balance | Approximately balanced (~50:50 ratio); still needs careful validation |

🛠️ Tech Stack

| Category | Tools/Libraries | Reason |
| --- | --- | --- |
| Language | Python 3.11 | Versatile and widely used for ML workflows |
| Data Handling | pandas, dask, numpy | Efficient large dataset processing |
| Visualization | seaborn, matplotlib, plotly | EDA and visual storytelling |
| Machine Learning | LightGBM | High-speed gradient boosting on large datasets |
| Evaluation Metrics | scikit-learn | Classification reports, confusion matrices |

📊 Project Workflow

1️⃣ Data Loading

  • Used Dask to handle large CSV files without exceeding system memory.
  • Loaded over 8.9 million records with 83 features in a distributed manner.
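
As a rough illustration of the loading step, the sketch below reads the CSV lazily with Dask so that only one partition at a time needs to fit in memory. The `DTYPES` mapping, the `load_telemetry` helper, the `train.csv` default path, and the 64 MB block size are assumptions made for this example, not values taken from the project code.

```python
# Assumed dtypes for a few columns named in this README; all other
# columns are left for Dask to infer from a sample of the file.
DTYPES = {
    "MachineIdentifier": "object",
    "SmartScreen": "object",
    "Platform": "object",
    "HasDetections": "int8",
}


def load_telemetry(path="train.csv", blocksize="64MB"):
    """Return a lazy Dask DataFrame; nothing is read until .compute()."""
    # Imported inside the function so the dtype mapping can be inspected
    # even on a machine without Dask installed.
    import dask.dataframe as dd

    # Each block of roughly `blocksize` bytes becomes one partition.
    return dd.read_csv(path, dtype=DTYPES, blocksize=blocksize)
```

A typical use would be `ddf = load_telemetry()` followed by `ddf["HasDetections"].mean().compute()`, which streams through the partitions instead of materializing the whole file.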

2️⃣ Data Cleaning & Preprocessing

  • Dropped columns with over 40% missing values.
  • Removed high-cardinality columns (>500 unique values) to avoid sparse matrices.
  • Label-encoded categorical columns.
  • Dropped identifier columns like MachineIdentifier.
  • Cleaned and transformed data while optimizing memory usage.
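
The cleaning rules above could be sketched in pandas roughly as follows. This is a minimal sketch, not the project's actual code: the `preprocess` helper and its parameter names are hypothetical, and it assumes the cardinality limit is applied only to non-numeric columns and that missing categories are encoded as -1.

```python
import pandas as pd


def preprocess(df, id_cols=("MachineIdentifier",), max_missing=0.40, max_unique=500):
    """Apply the README's cleaning rules and return a new DataFrame."""
    # Drop identifier columns that carry no predictive signal.
    df = df.drop(columns=[c for c in id_cols if c in df.columns])

    # Drop columns whose fraction of missing values exceeds the limit.
    missing = df.isna().mean()
    df = df.drop(columns=missing[missing > max_missing].index)

    # Assumption: only non-numeric columns are treated as categorical.
    cat_cols = df.select_dtypes(exclude=["number", "bool"]).columns
    high_card = [c for c in cat_cols if df[c].nunique() > max_unique]
    df = df.drop(columns=high_card)

    # Label-encode the remaining categorical columns (NaN becomes -1).
    for c in [c for c in cat_cols if c not in high_card]:
        df[c] = df[c].astype("category").cat.codes
    return df
```

Because every step goes through `drop`, the input frame is left unmodified and the function can be applied unchanged to each batch of the test set.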

3️⃣ Exploratory Data Analysis (EDA)

  • Visualized missing values using heatmaps.
  • Explored target variable distribution.
  • Plotted feature distributions and their relationship with malware detection.
  • Analyzed cardinality of categorical variables.
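
Two of the EDA checks above (missingness and cardinality per column, and how the detection rate varies with a feature) can be computed as plain tables before any plotting. The helper names `column_summary` and `detection_rate_by` are hypothetical, introduced only for this sketch.

```python
import pandas as pd


def column_summary(df):
    """Missing-value fraction and number of distinct values per column."""
    return pd.DataFrame({
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    }).sort_values("missing_frac", ascending=False)


def detection_rate_by(df, feature, target="HasDetections"):
    """Mean of the target (detection rate) and row count per feature value."""
    return (
        df.groupby(feature, dropna=False)[target]
        .agg(rate="mean", count="size")
        .sort_values("count", ascending=False)
    )
```

For example, `detection_rate_by(df, "SmartScreen")` yields one row per SmartScreen setting, which is the kind of table a bar plot of feature versus detection rate would be drawn from.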

4️⃣ Model Building

  • Implemented a LightGBM Classifier with tuned hyperparameters:
    • num_leaves = 64
    • learning_rate = 0.1
    • feature_fraction = 0.8
    • bagging_fraction = 0.8
    • max_depth = 8
  • Split dataset into 85% train and 15% validation.
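
The listed hyperparameters and the 85/15 split might be wired together as in the sketch below. Only the five hyperparameter values and the split ratio come from this README; the `objective`, `metric`, `bagging_freq`, boosting-round count, early-stopping patience, random seed, and the `split_and_train` helper itself are assumptions. One detail worth knowing: in LightGBM, `bagging_fraction` only takes effect when `bagging_freq` is greater than 0.

```python
PARAMS = {
    "objective": "binary",    # assumption: binary log-loss objective
    "metric": "auc",          # assumption: AUC tracked on the validation set
    "num_leaves": 64,
    "learning_rate": 0.1,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,        # assumption: needed for bagging_fraction to apply
    "max_depth": 8,
}


def split_and_train(X, y, num_boost_round=500, seed=42):
    """Hold out 15% for validation and train a LightGBM booster."""
    # Imported lazily so PARAMS can be inspected without these packages.
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split

    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.15, stratify=y, random_state=seed
    )
    train_set = lgb.Dataset(X_tr, label=y_tr)
    val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
    booster = lgb.train(
        PARAMS,
        train_set,
        num_boost_round=num_boost_round,
        valid_sets=[val_set],
        # Stop when validation AUC has not improved for 50 rounds.
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )
    return booster, (X_val, y_val)
```

With `max_depth = 8` a tree can hold at most 2^8 = 256 leaves, so `num_leaves = 64` is the binding limit on tree size here.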

5️⃣ Evaluation

  • Evaluated using:
    • Classification Reports (Precision, Recall, F1-score)
    • AUC-ROC Curve
    • Normalized Confusion Matrices
    • Feature Importance Plot
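
The numeric parts of this evaluation can be reproduced with scikit-learn as sketched below. The `evaluate` helper and the 0.5 decision threshold are assumptions for illustration; the README does not state which threshold was used.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score


def evaluate(y_true, proba, threshold=0.5):
    """Compute AUC, a row-normalized confusion matrix, and a text report."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    # Turn probabilities into hard labels at the chosen threshold.
    pred = (proba >= threshold).astype(int)
    return {
        # AUC is threshold-free and uses the raw probabilities.
        "auc": roc_auc_score(y_true, proba),
        # normalize="true" makes each row (true class) sum to 1.
        "confusion": confusion_matrix(y_true, pred, normalize="true"),
        "report": classification_report(y_true, pred, digits=3, zero_division=0),
    }
```

For a LightGBM booster trained with a binary objective, `booster.predict(X_val)` returns probabilities that can be passed directly as `proba`.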

6️⃣ Prediction & Output

  • Processed test set in memory-efficient batches.
  • Generated malware detection probability predictions.
  • Saved results to result.csv.
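
One way to keep the prediction step within memory is to stream the test file in chunks and append each batch of predictions to the output file. The sketch below assumes a model object with a `predict` method that returns probabilities, a `prepare` callable that applies the same preprocessing used in training, and an output layout of `MachineIdentifier` plus `HasDetections`; the helper name, chunk size, and column layout are all assumptions rather than the project's actual code.

```python
import pandas as pd


def predict_in_batches(model, test_path, out_path="result.csv",
                       prepare=lambda df: df, chunksize=500_000,
                       id_col="MachineIdentifier"):
    """Score the test CSV chunk by chunk and append results to out_path."""
    first = True
    for chunk in pd.read_csv(test_path, chunksize=chunksize):
        ids = chunk[id_col]
        # Apply the training-time preprocessing to the feature columns.
        features = prepare(chunk.drop(columns=[id_col]))
        proba = model.predict(features)
        out = pd.DataFrame({id_col: ids.to_numpy(), "HasDetections": proba})
        # Overwrite on the first chunk, then append without the header.
        out.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False
    return out_path
```

Only one chunk of features and one chunk of predictions are held in memory at a time, so peak usage is governed by `chunksize` rather than by the size of the test file.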

📊 Results

| Metric | Validation Set Value |
| --- | --- |
| Accuracy | ~0.734 |
| AUC Score | ~0.79 |
| F1 Score | ~0.73 |

  • With a validation AUC of ~0.79, the LightGBM model discriminates reasonably well between infected and safe machines.
  • The feature importance plot ranked SmartScreen, AVProductStatesIdentifier, and Platform among the most influential features.

🔮 Future Scope

  • Implement cross-validation for more robust performance estimation.
  • Integrate hyperparameter tuning using Optuna or GridSearchCV.
  • Apply advanced missing value imputation instead of row removal.
  • Try additional algorithms (XGBoost, CatBoost) for benchmarking.
  • Deploy a scalable API service to accept telemetry data and predict malware probability in real-time.
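
As a starting point for the cross-validation item, a stratified k-fold AUC estimate could look like the sketch below. It works with any scikit-learn compatible classifier; the `cv_auc` helper, the fold count, and the seed are assumptions, and this has not been run against the project's data.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_auc(estimator, X, y, n_splits=5, seed=42):
    """Return one ROC AUC score per stratified fold."""
    # Stratification keeps the class ratio the same in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(estimator, X, y, cv=cv, scoring="roc_auc")
```

Passing `lightgbm.LGBMClassifier(num_leaves=64, learning_rate=0.1, max_depth=8)` as the estimator would give a fold-wise counterpart to the single 85/15 validation score reported above.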

🚀 Setup Instructions

  1. Clone the repository
git clone https://github.com/shushantrishav/Microsoft-Malware-Prediction.git
cd Microsoft-Malware-Prediction
