A machine learning-based system for detecting fraudulent credit card transactions. The best model to date (Random Forest) reaches 79% recall at 55% precision on fraud cases.
This project implements a comprehensive fraud detection system that:
- Processes and analyzes credit card transaction data
- Handles class imbalance using SMOTE
- Trains and evaluates multiple machine learning models
- Identifies the most important features for fraud detection
- Provides visualization tools for model performance analysis
Based on the latest execution:
- Dataset size: 1,296,675 transactions
- Fraud cases: 7,506 (0.58% of total transactions)
- Models trained: Random Forest, Gradient Boosting
- Best model: Random Forest (F1-Score: 0.6480)
- Key performance metrics:
  - Random Forest: 79% recall on fraud cases with 55% precision
  - Gradient Boosting: 86% recall on fraud cases with 18% precision
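As a quick sanity check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded percentages reported here. The helper name `f1_from` is illustrative only, and because the inputs are rounded, the result only approximately matches the reported 0.6480.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the results listed above.
rf_f1 = f1_from(precision=0.55, recall=0.79)
gb_f1 = f1_from(precision=0.18, recall=0.86)

# Share of fraud cases in the dataset.
fraud_rate = 7_506 / 1_296_675

print(f"Random Forest F1     ~ {rf_f1:.2f}")
print(f"Gradient Boosting F1 ~ {gb_f1:.2f}")
print(f"Fraud rate           ~ {fraud_rate:.2%}")
```

With these rounded inputs, Random Forest comes out near 0.65 and Gradient Boosting near 0.30, which is consistent with Random Forest being chosen as the best model despite its lower recall.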
fraud-detection/
├── main.py # Main script to run the fraud detection pipeline
├── utils.py # Utility functions for data processing and visualization
├── requirements.txt # Project dependencies
├── fraudTrain.rar # Training dataset
├── models/ # Saved models and preprocessing objects
│ ├── random_forest_20250320_205116.pkl
│ ├── gradient_boosting_20250320_205116.pkl
│ └── preprocessing_20250320_205116.pkl
└── plots/ # Generated visualizations
    ├── fraud_distribution.png
    ├── feature_importance_random_forest.png
    └── feature_importance_gradient_boosting.png
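The file names under `models/` suggest a `<name>_<YYYYMMDD_HHMMSS>.pkl` convention. The sketch below shows one way such artifacts could be written and the newest one located. The helpers `save_artifact` and `latest_artifact` are hypothetical, and whether `main.py` does it this way is an assumption.

```python
import pickle
import re
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_artifact(obj, name: str, out_dir: Path, now: Optional[datetime] = None) -> Path:
    """Pickle obj to out_dir/<name>_<YYYYMMDD_HHMMSS>.pkl and return the path."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}_{stamp}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path


def latest_artifact(name: str, out_dir: Path) -> Optional[Path]:
    """Return the newest <name>_<timestamp>.pkl in out_dir, or None if absent."""
    if not out_dir.is_dir():
        return None
    pattern = re.compile(rf"^{re.escape(name)}_\d{{8}}_\d{{6}}\.pkl$")
    # Fixed-width timestamps sort chronologically when sorted as text.
    matches = sorted(p for p in out_dir.iterdir() if pattern.match(p.name))
    return matches[-1] if matches else None


# Demo in a temporary directory, using placeholder objects instead of real models.
with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp) / "models"
    save_artifact({"demo": 1}, "random_forest", models_dir, datetime(2025, 3, 20, 20, 51, 16))
    save_artifact({"demo": 2}, "random_forest", models_dir, datetime(2025, 3, 21, 9, 0, 0))
    newest = latest_artifact("random_forest", models_dir)
    newest_name = newest.name
    with open(newest, "rb") as f:
        restored = pickle.load(f)

print(newest_name)
```

Looking up the newest file this way would avoid hard-coding a timestamp in prediction scripts. As with any pickle file, only load artifacts from a source you trust.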
- Datetime processing: Extracts hour, day, month, and day of week from transaction timestamps
- Geographic analysis: Calculates distances between customer and merchant locations
- Categorical encoding: Handles merchant, category, job, and gender features
- Feature scaling: Normalizes numerical features
- Random Forest: Optimized for balanced precision and recall
- Gradient Boosting: Provides high recall for fraud detection
- Class imbalance handling: Uses SMOTE to address the imbalanced nature of fraud data (0.58% fraud)
- Comprehensive metrics: F1-score, precision, recall, and confusion matrix
- Feature importance analysis: Identifies the most important features for fraud detection
- Cross-validation: Checks that model performance holds across several data splits rather than on a single partition
- Transaction amount (amt): By far the most significant indicator of fraud
- Transaction category: Different categories have varying fraud risks
- Hour of transaction: Time of day significantly impacts fraud likelihood
- City population (city_pop): Transactions associated with certain city population sizes show higher fraud rates
- Merchant: Specific merchants may have higher fraud rates
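Two of the engineered inputs mentioned above, the time-of-transaction fields and the customer-to-merchant distance, can be sketched with the standard library alone. The `YYYY-MM-DD HH:MM:SS` timestamp format, the helper names, and the choice of the haversine formula are assumptions for illustration. The project's `utils.py` may compute these features differently.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def time_features(timestamp: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM:SS' string into hour, day, month, day of week."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,
        "day": ts.day,
        "month": ts.month,
        "day_of_week": ts.weekday(),  # Monday = 0 ... Sunday = 6
    }


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Made-up example values, not taken from the dataset.
features = time_features("2019-01-01 00:00:18")
one_degree_km = haversine_km(0.0, 0.0, 0.0, 1.0)

print(features)
print(f"{one_degree_km:.1f} km")
```

On a full dataset the same logic would normally be vectorized with pandas and NumPy instead of being applied row by row.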
- Clone the repository:
git clone https://github.com/yourusername/fraud-detection.git
cd fraud-detection
- Create and activate a virtual environment:
For Windows:
python -m venv .venv
.venv\Scripts\activate
For Mac/Linux:
python -m venv .venv
source .venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
To run the full fraud detection pipeline:
python main.py
This will:
- Load and preprocess the data from fraudTrain.csv
- Handle class imbalance with SMOTE
- Train Random Forest and Gradient Boosting models
- Evaluate model performance
- Save the trained models to the models/ directory
- Generate visualizations in the plots/ directory
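To make the class-imbalance step above more concrete, here is a from-scratch illustration of the interpolation idea behind SMOTE. Each synthetic fraud sample is placed on the line segment between a real minority sample and one of its nearest minority neighbours. This is a simplified teaching sketch with hypothetical names (`smote_like`, `k`, `seed`), not the code the pipeline runs, which presumably relies on a library implementation.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a real minority sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class with made-up, already scaled values.
fraud = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.15), (0.95, 0.05)]
new_points = smote_like(fraud, n_new=6)
print(len(new_points))
```

Oversampling of this kind is typically applied to the training split only, so that evaluation data keeps the real fraud rate of about 0.58%.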
To use the trained models for prediction with your own script:
import os
import pickle
import subprocess

import pandas as pd

from utils import preprocess_data

# Load the trained model and the fitted preprocessing objects
with open('models/random_forest_20250320_205116.pkl', 'rb') as f:
    model = pickle.load(f)
with open('models/preprocessing_20250320_205116.pkl', 'rb') as f:
    preprocessing = pickle.load(f)

# Extract fraudTrain.csv from fraudTrain.rar if it is not already present.
# This requires 7-Zip (Windows) or unrar (Linux/Mac) to be installed and on PATH.
if not os.path.exists('fraudTrain.csv'):
    if os.name == 'nt':
        subprocess.run(['7z', 'x', 'fraudTrain.rar'], check=True)
    else:
        subprocess.run(['unrar', 'x', 'fraudTrain.rar'], check=True)

# Load and preprocess new transaction data
new_data = pd.read_csv('fraudTrain.csv')
preprocessed_data = preprocess_data(new_data, preprocessing)

# Make predictions
predictions = model.predict(preprocessed_data)

The system is designed to handle large transaction volumes efficiently. The pipeline includes:
- Optimized preprocessing steps
- Efficient model training with parallel processing (Random Forest uses n_jobs=-1)
- Fast prediction capabilities for real-time fraud detection
- Implement deep learning models for improved performance
- Add anomaly detection techniques for identifying new fraud patterns
- Develop real-time monitoring dashboard
- Implement model explainability features for better interpretability
- Create an API for real-time fraud detection
The system was trained using the fraudTrain.csv dataset and evaluated on fraudTest.csv, which contain credit card transaction data with the following features:
- Transaction details (date, time, amount)
- Credit card information
- Merchant information
- Customer demographics
- Geographic coordinates
- Fraud labels (0.58% of transactions labeled as fraud)