This README provides an overview of the Gun Incidents in the USA project for the academic year 2023/2024, including dataset details and the tasks to be completed. For more detailed information, please refer to the project documentation and code (still in progress).
The project involves data analysis using data mining tools and must be completed by a team of three students. The primary programming language for this project is Python. The project guidelines define a set of tasks, and the results must be reported in a single paper with a total length of 25 pages of text, including figures. Students must deliver both the paper and well-commented Python notebooks.
The project utilizes three CSV files. The main dataset, incidents.csv, contains information about gun incidents in the USA. It includes the following variables:
- date: Date of incident occurrence
- state: State where the incident took place
- city_or_county: City or county where the incident took place
- address: Address where the incident took place
- latitude: Latitude of the incident
- longitude: Longitude of the incident
- congressional_district: Congressional district where the incident took place
- state_house_district: State house district
- state_senate_district: State senate district where the incident took place
- participant_age1: Exact age of one (randomly chosen) participant in the incident
- participant_age_group1: Exact age group of one (randomly chosen) participant in the incident
- participant_gender1: Exact gender of one (randomly chosen) participant in the incident
- min_age_participants: Minimum age of the participants in the incident
- avg_age_participants: Average age of the participants in the incident
- max_age_participants: Maximum age of the participants in the incident
- n_participants_child: Number of child participants (0-11)
- n_participants_teen: Number of teen participants (12-17)
- n_participants_adult: Number of adult participants (18+)
- n_males: Number of male participants
- n_females: Number of female participants
- n_killed: Number of people killed
- n_injured: Number of people injured
- n_arrested: Number of arrested participants
- n_unharmed: Number of unharmed participants
- n_participants: Number of participants in the incident
- notes: Additional notes about the incident
- incident_characteristics1: Incident characteristics
- incident_characteristics2: Incident characteristics (not all incidents have two available characteristics)
The second file, povertyByStateYear.csv, contains information about the poverty percentage for each USA state and year, with the following variables:
- state
- year
- povertyPercentage: Poverty percentage for the corresponding state and year
The third file, year_state_district_house.csv, contains information about the winner of the congressional elections in the USA for each year, state, and congressional district. It includes the following variables:
- year
- state
- congressional_district
- party: Winning party for the corresponding congressional district in the state, in the corresponding year
- candidateVotes: Number of votes obtained by the winning party in the corresponding election
- totalVotes: Total number of votes for the corresponding election
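The three files share the keys listed above (state and year for poverty; year, state, and congressional_district for elections), so they can be combined into one table. The following is a minimal sketch rather than the project's actual loading code: the function name `join_incidents_with_context` is hypothetical, and it assumes that the key columns use matching spellings and types across the files, which should be checked on the real data.

```python
import pandas as pd


def join_incidents_with_context(incidents, poverty, elections):
    """Attach poverty and election information to each incident (sketch)."""
    out = incidents.copy()
    # Derive the year from the incident date; unparseable dates become NaT.
    out["year"] = pd.to_datetime(out["date"], errors="coerce").dt.year
    # Poverty is keyed by (state, year).
    out = out.merge(poverty, on=["state", "year"], how="left")
    # Election winners are keyed by (year, state, congressional_district).
    out = out.merge(
        elections,
        on=["year", "state", "congressional_district"],
        how="left",
    )
    return out
```

With the real files, the three inputs would come from `pd.read_csv`, with paths that depend on the repository layout. Because both joins are left joins, incidents without a matching poverty or election row are kept and simply carry missing values in the added columns.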
Explore the incidents dataset using analytical tools and write a concise "data understanding" report that assesses data quality, the distribution of variables, and pairwise correlations.
Subtasks of Data Understanding:
- Data semantics for each feature not described above and the new ones defined by the team
- Distribution of the variables and statistics
- Assessing data quality (missing values, outliers, duplicated records, errors)
- Variable transformations
- Pairwise correlations and possible elimination of redundant variables
For this task we followed this checklist (work in progress):
- Type of data
- Type of attribute
- Data Quality
- Correlation analysis
- Outliers detection and manipulation
For task 1.1 see the corresponding Notebook in Task 1.1 - Data Understanding.
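The data quality and correlation checks listed above can be sketched with pandas as follows. This is an illustrative sketch, not the content of the notebook: the helper names are hypothetical, the 1.5 × IQR rule is only one common outlier criterion, and the 0.9 correlation threshold is an arbitrary assumption.

```python
import pandas as pd


def quality_report(df):
    """Per-column dtype, missing-value share, cardinality, and IQR outliers."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(),
    })
    # Aligned on column names; non-numeric columns get NaN here.
    report["iqr_outliers"] = outside.sum()
    return report


def duplicated_rows(df):
    """Number of rows that exactly repeat an earlier row."""
    return int(df.duplicated().sum())


def correlated_pairs(df, threshold=0.9):
    """Pairs of numeric columns whose absolute Pearson correlation >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs
```

Highly correlated pairs are only candidates for elimination; which variable of a pair to drop remains a decision to justify in the report.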
Improve the quality of the data and prepare it by extracting new features that are useful for describing the incidents. Some examples of indicators to be computed are:
- How many males are involved in incidents relative to the total number of males for the same city and in the same period?
- How many injured and killed people have been involved relative to the total injured and killed people in the same congressional district in a given period of time?
- Ratio of the number of killed people in the incidents relative to the number of participants in the incident
- Ratio of unharmed people in the incidents relative to the average of unharmed people in the same period
Note that these examples are not mandatory, and teams can define their own indicators. Each indicator must be accompanied by a description and, when necessary, its mathematical formulation. The extracted variables will be useful for the clustering analysis in the second task of the project. Once the set of indicators is computed, the team should explore the new features with a statistical analysis, including distributions, outliers, visualizations, and correlations.
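As an illustration of how such indicators could be computed, the sketch below derives two of the example indicators from the incidents.csv columns. It rests on assumptions that are not part of the guidelines: the "period" is taken to be the calendar year, the city is identified by the pair (state, city_or_county), and the function and output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_indicators(df):
    """Add two illustrative indicators to a copy of the incidents table."""
    out = df.copy()
    # Assumption: the "period" is the calendar year of the incident.
    out["year"] = pd.to_datetime(out["date"], errors="coerce").dt.year

    # killed_ratio = n_killed / n_participants, left undefined (NaN)
    # when an incident has no recorded participants.
    participants = out["n_participants"].replace(0, np.nan)
    out["killed_ratio"] = out["n_killed"] / participants

    # Males in the incident relative to all males recorded in incidents
    # for the same state, city or county, and year.
    group_keys = ["state", "city_or_county", "year"]
    total_males = out.groupby(group_keys)["n_males"].transform("sum")
    out["males_share_city_year"] = out["n_males"] / total_males.replace(0, np.nan)
    return out
```

Replacing zero denominators with NaN avoids infinite values, which would otherwise distort the distribution and correlation analysis of the new features.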
Informative visualizations and insights can be obtained by exploiting the latitude and longitude features (example).
For this task we followed this checklist (work in progress):
- Data aggregation
- Reduction of dimensionality
- Data cleaning
- Discretization
- Data transformation
- Principal Component Analysis via Covariance Matrix
- Data Similarity via Entropy and proximity coefficients
See the corresponding Notebook in Task 1.2 - Data Preparation.
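The "Principal Component Analysis via Covariance Matrix" item in the checklist above can be illustrated with plain NumPy. This is a generic sketch of the technique, not the notebook's implementation: the function name is hypothetical, the input is assumed to be a numeric matrix with at least two features and no missing values, and standardising the features first is an assumption made here so that large-scale variables do not dominate.

```python
import numpy as np


def pca_via_covariance(X, n_components=2):
    """Project X onto the leading eigenvectors of its covariance matrix."""
    X = np.asarray(X, dtype=float)
    # Standardise each feature (constant features are left at zero).
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # Covariance matrix of the standardised features (features in columns).
    cov = np.cov(Z, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()
    scores = Z @ eigvecs[:, :n_components]
    return scores, explained[:n_components]
```

The returned explained-variance ratios can guide how many components to keep for the dimensionality reduction step.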
Based on the features extracted in the previous task, explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches.
Subtasks:
- Clustering Analysis by K-means on the entire dataset:
  - Identification of the best value of k
  - Characterization of the obtained clusters using analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset
  - Evaluation of the clustering results
- Analysis by density-based clustering. In this task, choose one state in the dataset:
  - Study the clustering parameters
  - Characterize and interpret the obtained clusters
- Analysis by hierarchical clustering. In this task, choose one state in the dataset:
  - Compare different clustering results obtained using different versions of the algorithm
  - Show and discuss different dendrograms using different algorithms
- Final evaluation of the best clustering approach and comparison of the clustering obtained
- Optional (2 points): Explore the opportunity to use alternative clustering techniques in the library pyclustering.
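The three main clustering subtasks above can be sketched with scikit-learn and SciPy as follows. This is one possible approach under stated assumptions, not the method prescribed by the guidelines: the helper names are hypothetical, X is assumed to be a numeric feature matrix that has already been scaled, the silhouette score is only one of several criteria for choosing k (the SSE "elbow" is another), and DBSCAN stands in for "density-based clustering" in general.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def best_k_by_silhouette(X, k_values=range(2, 8), seed=0):
    """Return (best_k, {k: silhouette}) for K-means over candidate k values."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


def k_distance_curve(X, min_samples=5):
    """Sorted distance from each point to the last of its min_samples neighbours.

    Querying the training data counts each point as its own neighbour, which
    matches DBSCAN's min_samples (it also includes the point itself). The
    "knee" of this curve is a common starting point for choosing eps.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])


def dbscan_summary(X, eps, min_samples=5):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return n_clusters, int(np.sum(labels == -1))


def linkage_comparison(X, n_clusters,
                       methods=("single", "complete", "average", "ward")):
    """Silhouette score of the flat clustering obtained from each linkage."""
    results = {}
    for method in methods:
        # Z is the linkage matrix that scipy's dendrogram() can plot.
        Z = linkage(X, method=method)
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        results[method] = silhouette_score(X, labels)
    return results
```

For the density-based and hierarchical subtasks, X would be restricted to the incidents of the chosen state. Computing the silhouette score on the full dataset can be expensive, so evaluating it on a sample is a reasonable variation.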
Note: The final report must be delivered by the end of December and can also improve the already delivered tasks.