Gun Incidents in the USA

Data Mining - A.Y. 2023/2024

This README provides an overview of the Gun Incidents in the USA project for the academic year 2023/2024, including dataset details and the tasks to be completed. For more detailed information, please refer to the project documentation and code (still in progress).

The project involves data analysis using data mining tools and must be completed by a team of three students. The primary programming language for this project is Python. The project guidelines define a set of tasks, and the results must be reported in a single paper with a total length of 25 pages, including figures. Students are also required to deliver well-commented Python notebooks together with the paper.

Dataset Description

The project utilizes three CSV files. The main dataset, incidents.csv, contains information about gun incidents in the USA. It includes the following variables:

date: Date of incident occurrence
state: State where the incident took place
city_or_county: City or county where the incident took place
address: Address where the incident took place
latitude: Latitude of the incident
longitude: Longitude of the incident
congressional_district: Congressional district where the incident took place
state_house_district: State house district where the incident took place
state_senate_district: State senate district where the incident took place
participant_age1: Exact age of one (randomly chosen) participant in the incident
participant_age_group1: Exact age group of one (randomly chosen) participant in the incident
participant_gender1: Exact gender of one (randomly chosen) participant in the incident
min_age_participants: Minimum age of the participants in the incident
avg_age_participants: Average age of the participants in the incident
max_age_participants: Maximum age of the participants in the incident
n_participants_child: Number of child participants (0-11)
n_participants_teen: Number of teen participants (12-17)
n_participants_adult: Number of adult participants (18+)
n_males: Number of male participants
n_females: Number of female participants
n_killed: Number of people killed
n_injured: Number of people injured
n_arrested: Number of arrested participants
n_unharmed: Number of unharmed participants
n_participants: Number of participants in the incident
notes: Additional notes about the incident
incident_characteristics1: Incident characteristics
incident_characteristics2: Incident characteristics (not all incidents have two available characteristics)

The second file, povertyByStateYear.csv, contains information about the poverty percentage for each USA state and year, with the following variables:

state
year
povertyPercentage: Poverty percentage for the corresponding state and year

The third file, year_state_district_house.csv, contains information about the winner of the congressional elections in the USA for each year, state, and congressional district. It includes the following variables:

year
state
congressional_district
party: Winning party for the corresponding congressional district in the state, in the corresponding year
candidateVotes: Number of votes obtained by the winning party in the corresponding election
totalVotes: Total number of votes for the corresponding election
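
The three files share the state, year, and congressional district keys, so they can be combined into one table. The following is a minimal sketch, not the project's actual loading code. It assumes that state names may differ in letter case between files, that the incident year is derived from the date column, and that a left join (keeping every incident) is the desired behavior. The function name merge_sources is hypothetical.

```python
import pandas as pd


def merge_sources(incidents, poverty, elections):
    """Attach poverty and election data to each incident (left joins)."""
    inc = incidents.copy()
    # Unparseable dates become NaT instead of raising an error.
    inc["date"] = pd.to_datetime(inc["date"], errors="coerce")
    # Keys are cast to float so that missing values (NaN) are allowed.
    inc["year"] = inc["date"].dt.year.astype(float)
    # Assumption: state spelling is the same across files up to case/whitespace.
    inc["state_key"] = inc["state"].str.strip().str.upper()
    inc["district_key"] = inc["congressional_district"].astype(float)

    pov = poverty.rename(columns={"state": "state_key"}).copy()
    pov["state_key"] = pov["state_key"].str.strip().str.upper()
    pov["year"] = pov["year"].astype(float)

    ele = elections.rename(
        columns={"state": "state_key", "congressional_district": "district_key"}
    ).copy()
    ele["state_key"] = ele["state_key"].str.strip().str.upper()
    ele["year"] = ele["year"].astype(float)
    ele["district_key"] = ele["district_key"].astype(float)

    merged = inc.merge(pov, on=["state_key", "year"], how="left")
    merged = merged.merge(ele, on=["state_key", "year", "district_key"], how="left")
    return merged.drop(columns=["state_key", "district_key"])
```

In practice the inputs would come from pd.read_csv on incidents.csv, povertyByStateYear.csv, and year_state_district_house.csv. Incidents without a matching poverty or election row keep missing values in the added columns.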

Tasks

Task 1: Data Understanding and Preparation (30 points)

Task 1.1: Data Understanding

Explore the incidents dataset using analytical tools and write a concise "data understanding" report that assesses data quality, the distribution of variables, and pairwise correlations.

Subtasks of Data Understanding:

Data semantics for each feature not described above and the new ones defined by the team
Distribution of the variables and statistics
Assessing data quality (missing values, outliers, duplicated records, errors)
Variable transformations
Pairwise correlations and possible elimination of redundant variables
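
Several of the subtasks above (missing values, duplicated records, pairwise correlations) can be checked with a few pandas calls. The sketch below is illustrative only: the function names and the 0.9 correlation threshold are assumptions, not values taken from the project guidelines.

```python
import pandas as pd


def quality_report(df):
    """Return missing-value share per column, duplicate count, and correlations."""
    # Fraction of missing values in each column, worst columns first.
    missing_share = df.isna().mean().sort_values(ascending=False)
    # Number of rows that exactly repeat an earlier row.
    n_duplicates = int(df.duplicated().sum())
    # Pearson correlation between numeric columns only.
    corr = df.select_dtypes(include="number").corr(method="pearson")
    return missing_share, n_duplicates, corr


def redundant_pairs(corr, threshold=0.9):
    """List column pairs whose absolute correlation exceeds the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr.loc[a, b]) > threshold:
                pairs.append((a, b))
    return pairs
```

A pair returned by redundant_pairs is only a candidate for elimination; whether to drop one of the two variables remains a judgment call to be justified in the report.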

For this task, we followed this checklist (work in progress):

Type of data
Type of attribute
Data Quality
Correlation analysis
Outlier detection and handling
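
As an illustration of the outlier detection step, one common rule flags values that lie more than 1.5 interquartile ranges outside the quartiles. This is a sketch of that generic rule, not necessarily the method used in the notebook; the function name and the factor k are assumptions.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value is an IQR outlier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Missing values compare as False, so they are never flagged.
    return (series < lower) | (series > upper)
```

Applied to a column such as participant_age1, the mask can be used either to inspect the flagged rows or to replace implausible values with missing values.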

For Task 1.1, see the corresponding notebook, Task 1.1 - Data Understanding.

Task 1.2: Data Preparation

Improve the quality of your data and prepare it by extracting new features that help describe the incidents. Some examples of indicators to be computed are:

How many males are involved in incidents relative to the total number of males for the same city and in the same period?
How many injured and killed people have been involved relative to the total injured and killed people in the same congressional district in a given period of time?
Ratio of the number of killed people in the incidents relative to the number of participants in the incident
Ratio of unharmed people in the incidents relative to the average of unharmed people in the same period

Note that these examples are not mandatory, and teams can define their own indicators. Each indicator must be accompanied by a description and, when necessary, its mathematical formulation. The extracted variables will be useful for the clustering analysis in the project's second task. Once the set of indicators is computed, the team should explore the new features through a statistical analysis, including distributions, outliers, visualizations, and correlations.
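
Two of the example indicators can be sketched as follows. This is a hedged illustration: the function name, the new column names, and the choice of the calendar year as "the same period" are assumptions made for the example. Ratios with a zero denominator are left as missing values rather than set to zero.

```python
import pandas as pd


def add_indicators(df):
    """Add two example indicators to a copy of the incidents table."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    # Assumption: "period" means the calendar year of the incident.
    out["year"] = out["date"].dt.year

    # Killed people relative to participants in the same incident.
    participants = out["n_participants"].where(out["n_participants"] > 0)
    out["killed_ratio"] = out["n_killed"] / participants

    # Males in the incident relative to all males involved in incidents
    # in the same state, city, and year.
    males_total = out.groupby(["state", "city_or_county", "year"])["n_males"].transform("sum")
    out["males_share_city_year"] = out["n_males"] / males_total.where(males_total > 0)
    return out
```

Grouping by state as well as city_or_county avoids merging cities that share a name across different states.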

Informative visualizations and insights can be obtained by exploiting the latitude and longitude features.

For this task, we followed this checklist (work in progress):

Data aggregation
Reduction of dimensionality
Data cleaning
Discretization
Data transformation
Principal Component Analysis via Covariance Matrix
Data Similarity via Entropy and proximity coefficients
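
To make the "Principal Component Analysis via Covariance Matrix" item concrete, here is a minimal NumPy sketch of the standard procedure: center the data, compute the covariance matrix, and project onto its leading eigenvectors. The function name is hypothetical, and the notebook may implement this step differently. Features on very different scales would normally be standardized first.

```python
import numpy as np


def pca_covariance(X, n_components=2):
    """Project X onto the top eigenvectors of its covariance matrix."""
    X = np.asarray(X, dtype=float)
    # Center each feature at zero mean.
    Xc = X - X.mean(axis=0)
    # Covariance matrix with features as columns.
    cov = np.cov(Xc, rowvar=False)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort eigenvalues in descending order and keep the leading ones.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained = eigvals[order] / eigvals.sum()
    return Xc @ components, explained
```

The second return value gives the share of total variance explained by each retained component, which helps decide how many components to keep.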

See the corresponding Notebook in Task 1.2 - Data Preparation.

Task 2: Clustering Analysis (30 points, 32 with the optional subtask)

Based on the features extracted in the previous task, explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches.

Subtasks:

Clustering analysis by K-means on the entire dataset:
   1. Identification of the best value of k
   2. Characterization of the obtained clusters using analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset
   3. Evaluation of the clustering results
Analysis by density-based clustering. In this task, choose one state in the dataset:
   1. Study the clustering parameters
   2. Characterize and interpret the obtained clusters
Analysis by hierarchical clustering. In this task, choose one state in the dataset:
   1. Compare different clustering results obtained using different versions of the algorithm
   2. Show and discuss different dendrograms using different algorithms
Final evaluation of the best clustering approach and comparison of the clustering obtained
Optional (2 points): Explore the opportunity to use alternative clustering techniques in the library pyclustering.
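
The three main clustering subtasks can be sketched with scikit-learn and SciPy. This is an illustrative starting point under stated assumptions, not the project's implementation: the silhouette score is only one possible criterion for choosing k, the DBSCAN parameters and the Ward linkage are placeholder defaults, and all function names are hypothetical. The input X is assumed to be a numeric feature matrix that has already been scaled.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values=range(2, 7), seed=0):
    """Fit K-means for each k and return the k with the highest silhouette."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


def density_clusters(X, eps=0.5, min_samples=5):
    """Run DBSCAN; the label -1 marks points treated as noise."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)


def hierarchical_clusters(X, n_clusters, method="ward"):
    """Build a linkage matrix and cut it into at most n_clusters clusters."""
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return labels, Z
```

The linkage matrix Z returned by hierarchical_clusters can be passed to scipy.cluster.hierarchy.dendrogram to draw the dendrograms, and changing method (for example "single", "complete", or "average") gives the different versions of the algorithm to compare.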

Note: The final report must be delivered by the end of December and can also improve the already delivered tasks.
