🛡️ Malware Prediction Using Machine Learning

A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.

🎥 Demo Video

2021-11-29.23-08-56_Trim.mp4

📊 Problem Statement

With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.

📦 Dataset Details

| Description | Value |
| --- | --- |
| Source | Microsoft Malware Prediction (Kaggle) |
| Training Set Size | 8,920,441 rows × 83 features |
| Test Set Size | 7,653,424 rows × 83 features |
| File Size | Approx. 8 GB for train.csv |
| Target Variable | `HasDetections` (1 = Malware detected, 0 = No malware detected) |
| Data Type | Tabular, mixed categorical & numerical |
| Class Balance | Nearly balanced (~50:50 ratio); validation still needs care |

🛠️ Tech Stack

| Category | Tools/Libraries | Reason |
| --- | --- | --- |
| Language | Python 3.11 | Versatile and widely used for ML workflows |
| Data Handling | `pandas`, `dask`, `numpy` | Efficient large dataset processing |
| Visualization | `seaborn`, `matplotlib`, `plotly` | EDA and visual storytelling |
| Machine Learning | `LightGBM` | High-speed gradient boosting on large datasets |
| Evaluation Metrics | `scikit-learn` | Classification reports, confusion matrices |

📊 Project Workflow

1️⃣ Data Loading

Used Dask to read the large CSV files in partitions, so the full dataset never has to fit in memory at once.
Loaded over 8.9 million records with 83 features as a lazily evaluated, partitioned DataFrame.

2️⃣ Data Cleaning & Preprocessing

Dropped columns with over 40% missing values.
Removed high-cardinality columns (>500 unique values) to avoid sparse matrices.
Label-encoded categorical columns.
Dropped identifier columns like MachineIdentifier.
Cleaned and transformed data while optimizing memory usage.
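
The cleaning rules above can be sketched in pandas roughly as follows. The `preprocess` helper is hypothetical, and two choices are assumptions: the cardinality rule is applied only to non-numeric columns, and memory is reduced by downcasting floats to `float32`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def preprocess(df, id_col="MachineIdentifier", max_missing=0.40,
               max_cardinality=500):
    """Apply the cleaning rules to a pandas DataFrame and return a new one."""
    # Identifier columns carry no predictive signal.
    df = df.drop(columns=[id_col], errors="ignore")

    # Drop columns where more than 40% of the values are missing.
    missing_frac = df.isna().mean()
    df = df.drop(columns=missing_frac[missing_frac > max_missing].index)

    # Drop non-numeric columns with more than 500 distinct values.
    cat_cols = [c for c in df.columns if not is_numeric_dtype(df[c])]
    too_wide = [c for c in cat_cols if df[c].nunique() > max_cardinality]
    df = df.drop(columns=too_wide)

    # Label-encode the remaining categorical columns; missing values become -1.
    for col in cat_cols:
        if col not in too_wide:
            df[col] = pd.factorize(df[col])[0].astype("int32")

    # Downcast 64-bit floats to save memory (assumption, see lead-in).
    for col in df.select_dtypes(include="float64").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df
```

The same encoding has to be reused on the test set; fitting `pd.factorize` separately on each file would assign different codes to the same category.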

3️⃣ Exploratory Data Analysis (EDA)

Visualized missing values using heatmaps.
Explored target variable distribution.
Plotted feature distributions and their relationship with malware detection.
Analyzed cardinality of categorical variables.
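
The tabular side of this analysis (missingness, cardinality, and detection rate per category) can be sketched as below. Both helper names are hypothetical, and the plots themselves are left out.

```python
import pandas as pd


def column_overview(df):
    """Missing fraction, distinct-value count, and dtype for every column."""
    overview = pd.DataFrame({
        "missing_frac": df.isna().mean(),
        "n_unique": df.nunique(),
        "dtype": df.dtypes.astype(str),
    })
    return overview.sort_values("missing_frac", ascending=False)


def detection_rate_by_value(df, feature, target="HasDetections"):
    """Row count and mean detection rate for each value of one feature."""
    return (
        df.groupby(feature, dropna=False)[target]
        .agg(rows="count", detection_rate="mean")
        .sort_values("rows", ascending=False)
    )
```

On the full dataset these are best run on a sample or on a handful of columns at a time, since `nunique` over all 83 columns is expensive.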

4️⃣ Model Building

Implemented a LightGBM Classifier with tuned hyperparameters:
- num_leaves = 64
- learning_rate = 0.1
- feature_fraction = 0.8
- bagging_fraction = 0.8
- max_depth = 8
Split dataset into 85% train and 15% validation.

5️⃣ Evaluation

Evaluated using:
- Classification Reports (Precision, Recall, F1-score)
- AUC-ROC Curve
- Normalized Confusion Matrices
- Feature Importance Plot
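
These metrics can be computed from predicted probabilities with scikit-learn, as in the sketch below. The `evaluate` helper and the 0.5 decision threshold are assumptions; the feature importance plot is omitted.

```python
import numpy as np
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Summarize validation performance from predicted probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        # AUC uses the raw probabilities, not the thresholded labels.
        "auc": roc_auc_score(y_true, y_prob),
        "report": classification_report(y_true, y_pred, digits=3),
        # normalize="true" makes each row (actual class) sum to 1.
        "confusion": confusion_matrix(y_true, y_pred, normalize="true"),
    }
```

Because AUC is threshold-independent while precision, recall, and F1 are not, the two groups of numbers can move in different directions when the threshold changes.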

6️⃣ Prediction & Output

Processed test set in memory-efficient batches.
Generated malware detection probability predictions.
Saved results to result.csv.
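
One way to score a file that does not fit in memory is to stream it in chunks, as sketched here with `pandas.read_csv(chunksize=...)`. This is a stand-in, not necessarily how the notebook batches its data: `predict_in_batches`, `predict_fn`, `prepare_fn`, and the chunk size are all hypothetical names and values.

```python
import pandas as pd


def predict_in_batches(test_path, out_path, predict_fn, prepare_fn,
                       id_col="MachineIdentifier", chunksize=500_000):
    """Score a large CSV chunk by chunk and write the results to out_path."""
    total = 0
    reader = pd.read_csv(test_path, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # prepare_fn must apply the same cleaning and encoding as training.
        features = prepare_fn(chunk.drop(columns=[id_col]))
        result = pd.DataFrame({
            id_col: chunk[id_col].to_numpy(),
            "HasDetections": predict_fn(features),
        })
        # Write the header once, then append the following batches.
        result.to_csv(out_path, mode="w" if i == 0 else "a",
                      header=(i == 0), index=False)
        total += len(result)
    return total
```

With a trained LightGBM booster, `predict_fn` could simply be `booster.predict`, and `out_path` would be `result.csv`.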

📊 Results

| Metric | Validation Set Value |
| --- | --- |
| Accuracy | ~0.734 |
| AUC Score | ~0.79 |
| F1 Score | ~0.73 |

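The LightGBM model separates infected from clean machines clearly better than chance: a validation AUC of ~0.79 against 0.5 for random guessing, with accuracy of ~0.734 on a roughly balanced target.
The feature importance plot ranked SmartScreen, AVProductStatesIdentifier, and Platform among the most influential features.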

🔮 Future Scope

Implement cross-validation for more robust performance estimation.
Integrate hyperparameter tuning using Optuna or GridSearchCV.
Apply advanced missing value imputation instead of row removal.
Try additional algorithms (XGBoost, CatBoost) for benchmarking.
Deploy a scalable API service to accept telemetry data and predict malware probability in real-time.

🚀 Setup Instructions

Clone the repository

git clone https://github.com/yourusername/malware-prediction-ml.git
cd malware-prediction-ml

shushantrishav/Microsoft-Malware-Prediction
