Welcome to my NYC Taxi Trip Data Analysis project! This repository showcases the work I have done as a junior data scientist to analyze and model New York City taxi trip data for January 2025. Below, I explain what I did step by step and summarize the results of my analysis.
- Import libraries
- Import data
- Data visualization
- Data cleaning
- Data preparation
  - Train-test split
- Benchmark model
I started by importing essential Python libraries such as:
- `pandas` for data manipulation
- `numpy` for numerical operations
- `matplotlib` and `seaborn` for data visualization
- `scikit-learn` for machine learning modeling
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
I worked with the NYC Yellow Taxi trip dataset for January 2025, which was provided in .parquet format. The dataset contains detailed information about taxi trips, including:
- Pickup and drop-off times
- Trip distances
- Passenger counts
- Fare details (e.g., fare amount, tips, total amount)
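As a minimal sketch of the loading step: the file name `yellow_tripdata_2025-01.parquet` is an assumed local path, and the helper names `load_trips` and `describe_trips` are illustrative rather than taken from the notebook. Reading parquet with pandas also assumes an engine such as `pyarrow` is installed.

```python
import pandas as pd

# Assumed local path; point this at wherever the January 2025 file is stored.
DATA_PATH = "yellow_tripdata_2025-01.parquet"


def load_trips(path: str = DATA_PATH) -> pd.DataFrame:
    """Read the trip records (needs a parquet engine such as pyarrow)."""
    return pd.read_parquet(path)


def describe_trips(df: pd.DataFrame) -> dict:
    """Return basic structural facts about a trips DataFrame."""
    return {
        "n_rows": len(df),
        "columns": list(df.columns),
        "missing_per_column": df.isna().sum().to_dict(),
    }
```

Calling `describe_trips(load_trips())` gives a quick overview of row count, column names, and missing values before any plotting.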
I explored the dataset to understand its structure and identify key patterns:
- Visualized distributions of trip distances, fare amounts, and passenger counts.
- Analyzed trends over time (e.g., hourly, daily patterns).
- Identified outliers such as negative fares or unusually high values.
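The exploration steps above might look roughly like the sketch below. It assumes the column names `tpep_pickup_datetime` and `fare_amount` from the public TLC yellow taxi schema, and the 500 USD cap is an arbitrary illustrative threshold, not a value from the notebook. The tiny `sample` frame is made up so the code runs without the real file.

```python
import pandas as pd


def trips_per_hour(df: pd.DataFrame, time_col: str = "tpep_pickup_datetime") -> pd.Series:
    """Count trips for each pickup hour (0-23)."""
    hours = pd.to_datetime(df[time_col]).dt.hour
    return hours.value_counts().sort_index()


def flag_suspicious(df: pd.DataFrame, max_fare: float = 500.0) -> pd.Series:
    """Boolean mask for negative fares or fares above an assumed cap."""
    return (df["fare_amount"] < 0) | (df["fare_amount"] > max_fare)


# Made-up rows, for illustration only.
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": [
            "2025-01-06 08:15:00",
            "2025-01-06 08:40:00",
            "2025-01-06 18:05:00",
        ],
        "fare_amount": [12.5, -3.0, 900.0],
    }
)
hourly = trips_per_hour(sample)
suspicious = flag_suspicious(sample)
```

On the real data, `hourly.plot(kind="bar")` would give a quick view of the hourly pattern, and `sample[suspicious]` style indexing lists the rows worth inspecting.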
To ensure data quality, I performed the following steps:
- Removed rows with missing or invalid values (e.g., negative fares).
- Filtered out extreme outliers that could skew the analysis.
- Ensured consistency in categorical variables like payment types.
- Split the data into training and testing sets to evaluate model performance.
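The cleaning steps listed above (everything except the train-test split) could be sketched as follows. The column names follow the public TLC yellow taxi schema, and the thresholds `max_fare` and `max_distance` are assumptions chosen for illustration; the notebook may use different rules. This sketch also treats zero fares and zero distances as invalid, which is a stricter assumption than "negative fares".

```python
import pandas as pd


def clean_trips(
    df: pd.DataFrame, max_fare: float = 500.0, max_distance: float = 100.0
) -> pd.DataFrame:
    """Drop rows with missing, non-positive, or extreme values."""
    required = ["trip_distance", "fare_amount", "passenger_count"]
    cleaned = df.dropna(subset=required)
    valid = (
        (cleaned["fare_amount"] > 0)
        & (cleaned["fare_amount"] <= max_fare)
        & (cleaned["trip_distance"] > 0)
        & (cleaned["trip_distance"] <= max_distance)
        & (cleaned["passenger_count"] > 0)
    )
    return cleaned[valid].reset_index(drop=True)


# Made-up rows: one valid, one negative fare, one missing distance,
# one extreme distance, one valid.
raw = pd.DataFrame(
    {
        "trip_distance": [1.2, 3.0, None, 250.0, 2.0],
        "fare_amount": [8.0, -5.0, 10.0, 40.0, 9.5],
        "passenger_count": [1, 1, 2, 1, 1],
    }
)
cleaned = clean_trips(raw)
```

Keeping the thresholds as function parameters makes it easy to check how sensitive later results are to the outlier cutoffs.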
I trained a simple machine learning model to predict taxi fare amounts based on trip distance, time of day, and other features:
- Used a `DecisionTreeRegressor` as the benchmark model.
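A minimal sketch of such a benchmark is shown below. It runs on synthetic stand-in data, so the printed error is not comparable to the project's result. The feature name `pickup_hour`, the 80/20 split, `max_depth=5`, and the fare rule used to generate the data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data so the sketch runs without the real file.
rng = np.random.default_rng(0)
n = 1000
trips = pd.DataFrame(
    {
        "trip_distance": rng.uniform(0.5, 20.0, n),
        "pickup_hour": rng.integers(0, 24, n),
    }
)
# Made-up fare rule: base charge plus a per-mile rate plus noise.
trips["fare_amount"] = 3.0 + 2.5 * trips["trip_distance"] + rng.normal(0.0, 1.0, n)

X = trips[["trip_distance", "pickup_hour"]]
y = trips["fare_amount"]

# Hold out 20% of the rows to evaluate the model on unseen trips.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE on synthetic data: {mae:.2f}")
```

Fixing `random_state` in both the split and the tree keeps the benchmark reproducible, which matters when later models are compared against it.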
- Trip Patterns:
  - Most trips occurred during rush hours (morning and evening).
  - Weekends had fewer trips compared to weekdays.
- Fare Trends:
  - Fare amounts were generally proportional to trip distances.
  - Short trips within Manhattan had higher average fares per mile compared to longer trips.
- Impact of Weather:
  - Rainy days showed a slight increase in trip fares due to longer travel times.
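One way to check the fare-per-mile observation is to group trips into distance buckets and average the per-mile fare in each. The sketch below assumes the TLC column names `trip_distance` and `fare_amount`; the bucket edges and the sample rows are made up, and it does not filter to Manhattan, which would need pickup and drop-off zone information.

```python
import pandas as pd


def fare_per_mile_by_bucket(df: pd.DataFrame) -> pd.Series:
    """Average fare per mile within assumed distance buckets."""
    per_mile = df["fare_amount"] / df["trip_distance"]
    buckets = pd.cut(
        df["trip_distance"],
        bins=[0, 2, 5, 10, float("inf")],
        labels=["0-2 mi", "2-5 mi", "5-10 mi", "10+ mi"],
    )
    # observed=True keeps only buckets that actually contain trips.
    return per_mile.groupby(buckets, observed=True).mean()


# Made-up trips, for illustration only.
sample_trips = pd.DataFrame(
    {
        "trip_distance": [1.0, 1.5, 4.0, 12.0],
        "fare_amount": [8.0, 9.0, 16.0, 36.0],
    }
)
per_mile_summary = fare_per_mile_by_bucket(sample_trips)
```

This should be run on cleaned data, since a zero `trip_distance` would produce infinite per-mile values.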
The benchmark model achieved the following results:
- Mean Absolute Error (MAE): ~3.50 USD
In other words, the model's predictions differed from the actual fare amounts by about $3.50 on average.
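For reference, MAE is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on made-up fares (not values from the dataset); the function name `mean_absolute_error_manual` is illustrative.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up fares in USD, for illustration only.
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 17.0, 35.5]

# Absolute errors are 2.0, 3.0, and 5.5, so the mean is 10.5 / 3 = 3.5.
mae_example = mean_absolute_error_manual(actual_fares, predicted_fares)
```

Because MAE is in the same unit as the target, it reads directly as dollars of error per trip.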
- Clone this repository:
git clone https://github.com/your-repo-name/nyc-taxi-analysis.git
cd nyc-taxi-analysis
- Install required dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn
- Open the Jupyter Notebook (`notebook_taxi_nyc.ipynb`) to explore the analysis and results step by step.
This project helped me build skills in:
- Cleaning and preparing real-world datasets.
- Performing exploratory data analysis (EDA) using visualizations.
- Building and evaluating machine learning models.
- Understanding how external factors (e.g., weather) impact predictions.