The project aims to build and evaluate machine learning models that can effectively classify websites as phishing or legitimate based on their features. The use of data balancing, visualization, and various models demonstrates a comprehensive approach to tackling the phishing detection problem.
- Dataset Loading: The project begins by loading the `phishing.csv` dataset.
- Exploratory Analysis:
  - Used functions like `head()`, `info()`, and `isnull().sum()` to understand the data structure and check for missing values.
- The target variable (`class`), which indicates whether a website is phishing or legitimate, was found to be imbalanced.
- To address this, the `RandomOverSampler` technique was employed, ensuring equal representation of both classes for better model training.
Several visualizations were created to gain insights into the data:
- Bar Plots:
  - Distribution of the target variable before and after balancing.
  - Comparison of counts for `LongURL` and `ShortURL` features.
- Feature Importance:
  - Used `ExtraTreesClassifier` to visualize the most influential features.
- Feature Relationships:
  - Explored the relationship between `AnchorURL` and `HTTPS` using a stacked bar chart.
- Correlation Heatmap:
  - Displayed relationships between different features to identify patterns.
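
The numbers behind these plots can be computed as in the following sketch. It uses synthetic data rather than the real dataset: the value encoding (`-1`, `0`, `1`), the toy label rule, and the hyperparameters are all assumptions. Only the column names `AnchorURL`, `HTTPS`, `LongURL`, and `ShortURL` come from the project description.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in features; the {-1, 0, 1} encoding is an assumption.
X = pd.DataFrame({
    "AnchorURL": rng.choice([-1, 0, 1], size=n),
    "HTTPS": rng.choice([-1, 0, 1], size=n),
    "LongURL": rng.choice([-1, 0, 1], size=n),
    "ShortURL": rng.choice([-1, 1], size=n),
})
# Toy label that depends only on HTTPS, so HTTPS should rank as most important.
y = np.where(X["HTTPS"] == 1, 1, -1)

# Feature importance from an extremely randomized trees ensemble.
model = ExtraTreesClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Contingency table: the counts a stacked bar chart of AnchorURL vs. HTTPS shows.
table = pd.crosstab(X["AnchorURL"], X["HTTPS"])
print(table)

# Pairwise correlations: the matrix a correlation heatmap displays.
corr = X.corr()
print(corr.round(2))
```

With matplotlib installed, `importances.plot(kind="bar")` and `table.plot(kind="bar", stacked=True)` turn the first two results into bar charts, and the `corr` matrix can be passed to a heatmap function such as `seaborn.heatmap`.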
Various machine learning models were trained to detect phishing websites:
- Random Forest
- Decision Tree
- SVM (Support Vector Machine)
- Naive Bayes
- Gradient Boosting
- LightGBM
- CatBoost
- Accuracy scores and classification reports were generated for each model to assess their performance.
- A bar plot was used to compare the accuracies of different models.
- This helped identify the best-performing model(s).
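
A training and comparison loop along these lines could produce the accuracy scores and classification reports. This is a sketch under stated assumptions, not the project's actual code: the data is synthetic (`make_classification`), the 80/20 split and default hyperparameters are illustrative, and `GaussianNB` is one possible choice for "Naive Bayes". LightGBM and CatBoost are left out so the example depends only on scikit-learn; their `LGBMClassifier` and `CatBoostClassifier` classes follow the same `fit`/`predict` pattern and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the phishing features.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Models from the list above that ship with scikit-learn.
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and record its test-set accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(name)
    print(classification_report(y_test, y_pred))

# Pick the model with the highest accuracy.
best = max(accuracies, key=accuracies.get)
print("Best model:", best, round(accuracies[best], 3))
```

The `accuracies` dictionary maps directly onto a bar plot (model names on one axis, accuracy on the other), which is the comparison described above.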
- The project explored hybrid models by combining different base models using `VotingClassifier`.
- The accuracies of these hybrid models were compared to assess improvements in performance.
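
One way to build and evaluate such a hybrid is sketched below. The particular combination of base models, the hard-voting setting, and the synthetic data are assumptions; the project description does not say which models were combined or how the votes were aggregated.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the phishing features.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical choice of base models for the hybrid.
base_models = [
    ("rf", RandomForestClassifier(random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
]

# Hard voting: each base model casts one vote and the majority class wins.
hybrid = VotingClassifier(estimators=base_models, voting="hard")
hybrid.fit(X_train, y_train)
hybrid_acc = accuracy_score(y_test, hybrid.predict(X_test))

# Fit a fresh copy of each base model alone for a side-by-side comparison.
base_acc = {}
for name, estimator in base_models:
    fitted = clone(estimator).fit(X_train, y_train)
    base_acc[name] = accuracy_score(y_test, fitted.predict(X_test))

print("Base models:", base_acc)
print("Hybrid:", hybrid_acc)
```

Setting `voting="soft"` averages predicted class probabilities instead of counting votes; it requires every base model to implement `predict_proba`.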
Thank you 😊