PhishDataset

This is the dataset distributed with my paper "Segmentation-based Phishing URL Detection", published in WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. The paper is available at https://doi.org/10.1145/3486622.3493983.

Phishing Dataset: We collected phishing URLs from PhishTank, the most popular site distributing phishing website URLs, from May 2021 to June 2021.

Legitimate Dataset: Legitimate URLs were prepared by the following steps:

1. Legitimate domains were chosen randomly from the set of domains that appeared consistently in the IP2Location dataset from January 2021 to March 2021.
2. Each chosen domain was crawled with the Apache Nutch crawler to gather at most 100 web pages located in that domain.
3. One legitimate URL was randomly chosen from the URLs gathered in each domain.

Note that URLs in IP2Location consist of both legitimate and phishing URLs; however, we assume that most URLs are legitimate.
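
The final sampling step above can be sketched roughly as follows. This is only an illustration, not the code used for the paper: the function name `sample_one_url_per_domain`, the `seed` parameter, and the choice to treat the URL host as the domain are all assumptions made for this example.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PAGES_PER_DOMAIN = 100  # crawl cap stated in the steps above


def sample_one_url_per_domain(crawled_urls, seed=None):
    """Return {domain: url} with one randomly chosen URL per domain."""
    rng = random.Random(seed)
    pages_by_domain = defaultdict(list)
    for url in crawled_urls:
        # Assumption: the host part of the URL identifies the domain.
        domain = urlsplit(url).hostname
        if domain is None:
            continue  # skip strings that are not absolute URLs
        # Keep at most MAX_PAGES_PER_DOMAIN pages for each domain.
        if len(pages_by_domain[domain]) < MAX_PAGES_PER_DOMAIN:
            pages_by_domain[domain].append(url)
    # Pick exactly one URL from the pages kept for each domain.
    return {d: rng.choice(pages) for d, pages in pages_by_domain.items()}
```

Passing a fixed `seed` makes the selection repeatable, which may help when rebuilding a comparable sample from a fresh crawl.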

A balanced dataset with 10,000 legitimate and 10,000 phishing URLs and an imbalanced dataset with 50,000 legitimate and 5,000 phishing URLs were prepared.

Label 0 represents a legitimate URL.

Label 1 represents a phishing URL.
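
As a quick sanity check on the label encoding, the class counts can be tallied along these lines. This is a minimal sketch: the helper names `class_counts` and `load_labels` are hypothetical, and the name of the label column in the spreadsheets is not documented here, so it must be supplied by the caller. Reading `.xlsx` files this way assumes pandas and openpyxl are installed.

```python
from collections import Counter

LABEL_NAMES = {0: "legitimate", 1: "phishing"}  # encoding stated above


def class_counts(labels):
    """Count URLs per class name, rejecting labels other than 0 and 1."""
    counts = Counter()
    for label in labels:
        label = int(label)
        if label not in LABEL_NAMES:
            raise ValueError(f"unexpected label: {label!r}")
        counts[LABEL_NAMES[label]] += 1
    return dict(counts)


def load_labels(path, label_column):
    """Read one label column from a spreadsheet (needs pandas and openpyxl)."""
    import pandas as pd  # imported lazily so class_counts works without pandas

    # label_column is an assumption: check the actual header in the file.
    return pd.read_excel(path)[label_column].tolist()
```

For example, `class_counts(load_labels("data_bal - 20000.xlsx", "label"))` would tally the balanced file, where `"label"` is a placeholder for whatever the label column is actually called.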

Files:

data_bal - 20000.xlsx: balanced dataset (10,000 legitimate and 10,000 phishing URLs)

data_imbal - 55000.xlsx: imbalanced dataset (50,000 legitimate and 5,000 phishing URLs)
