Merge pull request #205 from DarshAgrawal14/main

Added PDF malware Detection Pipeline
UppuluriKalyani · Oct 10, 2024 · f5ae07f
2 parents e2bd264 + 03a9023
Showing 6 changed files with 11,415 additions and 0 deletions.
diff --git a/Prediction Models/PDF_Malware_Detection/Dataset/PDFMalware2022.csv b/Prediction Models/PDF_Malware_Detection/Dataset/PDFMalware2022.csv
diff --git a/Prediction Models/PDF_Malware_Detection/README.md b/Prediction Models/PDF_Malware_Detection/README.md
@@ -0,0 +1,62 @@
+# PDF Malware Detection
+
+This project implements a machine learning-based system for detecting potential malware in PDF files. It includes feature extraction from PDF files, model training, and a prediction script for classifying PDFs as potentially malicious or benign.
+
+## Components
+
+1. **Feature Extraction** (`pdf_feature_extraction.py`)
+   - Extracts various features from PDF files using PyMuPDF and pdfid.
+   - Features include metadata, structural elements, and presence of potentially risky elements.
+
+2. **Model Training** (`pdf_malware_dataset_training.py`)
+   - Prepares the dataset, including data cleaning and preprocessing.
+   - Trains a Random Forest classifier for malware detection.
+   - Includes code for hyperparameter tuning (commented out).
+
+3. **Prediction Script** (`predict_malware.py`)
+   - Uses the trained model to predict whether a given PDF file is potentially malicious.
+
+## Setup
+
+1. Install required dependencies:
+   ```
+   pip install numpy pandas matplotlib scikit-learn imbalanced-learn PyMuPDF pdfid joblib
+   ```
+
+2. Ensure you have the dataset file `PDFMalware2022.csv` in the `Dataset` folder.
+
+## Usage
+
+### Training the Model
+
+1. Run the `pdf_malware_dataset_training.py` script to train the model:
+   ```
+   python pdf_malware_dataset_training.py
+   ```
+   This will create a `random_forest_model.pkl` file containing the trained model.
+
+### Predicting Malware
+
+1. Use the `predict_malware` function in `predict_malware.py` to classify a PDF file:
+   ```python
+   from predict_malware import predict_malware
+
+   result = predict_malware("path/to/your/pdf_file.pdf")
+   print("Prediction (0: Benign, 1: Malicious):", result)
+   ```
+
+2. Alternatively, run the script directly:
+   ```
+   python predict_malware.py path/to/your/pdf_file.pdf
+   ```
+
+## Note
+
+This project is for educational and research purposes only. It should not be used as the sole means of determining file safety. Always use caution when handling potentially malicious files, and consult cybersecurity professionals for comprehensive security measures.
+
+## Future Improvements
+
+- Implement more advanced feature extraction techniques.
+- Explore other machine learning algorithms for potentially better performance.
+- Add a user-friendly interface for easier interaction with the prediction system.
+- Incorporate regular model updates with new malware samples to keep the detection current.
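
To illustrate the "presence of potentially risky elements" features that the README's Feature Extraction component mentions, here is a minimal standard-library sketch that counts PDF name keywords in a file's raw bytes. It is not the code from `pdf_feature_extraction.py`, which this diff view does not show, and it is not how pdfid or PyMuPDF are implemented. The keyword list and the function names `count_pdf_keywords` and `extract_keyword_features` are hypothetical, chosen only for illustration.

```python
import re
from typing import Dict

# Hypothetical keyword list for illustration only; the feature set actually
# used by the PR's extraction script is an assumption here.
RISKY_KEYWORDS = ("/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile")


def count_pdf_keywords(data: bytes) -> Dict[str, int]:
    """Count occurrences of each keyword in raw PDF bytes.

    Simplified stand-in: it does not decode hex-escaped names or look
    inside compressed object streams, so obfuscated PDFs can evade it.
    """
    counts: Dict[str, int] = {}
    for keyword in RISKY_KEYWORDS:
        # The negative lookahead stops a keyword from matching the start
        # of a longer name (for example "/AA" inside "/AAPL").
        pattern = re.compile(re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])")
        counts[keyword] = len(pattern.findall(data))
    return counts


def extract_keyword_features(pdf_path: str) -> Dict[str, int]:
    """Read a file from disk and return its keyword counts."""
    with open(pdf_path, "rb") as handle:
        return count_pdf_keywords(handle.read())


if __name__ == "__main__":
    # Tiny hand-written fragment, not a valid PDF, used only as a demo input.
    sample = b"1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n2 0 obj\n<< /S /JavaScript /JS (x) >>\nendobj\n"
    print(count_pdf_keywords(sample))
```

The resulting dictionary could be turned into a fixed-order feature row for a classifier. A real pipeline would need to handle the obfuscation cases that this sketch ignores.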