NYC Taxi Trip Data Analysis

Welcome to my NYC Taxi Trip Data Analysis project! This repository showcases the work I have done as a junior data scientist to analyze and model New York City taxi trip data for January 2025. Below, I explain what I did step by step and summarize the results of my analysis.

What I Did

1. Imported Libraries

I started by importing essential Python libraries such as:

pandas for data manipulation
numpy for numerical operations
matplotlib and seaborn for data visualization
scikit-learn for machine learning modeling

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

2. Loaded the Dataset

I worked with the NYC Yellow Taxi trip dataset for January 2025, which was provided in .parquet format. The dataset contains detailed information about taxi trips, including:

Pickup and drop-off times
Trip distances
Passenger counts
Fare details (e.g., fare amount, tips, total amount)
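As a rough sketch of this step, the snippet below reads a subset of columns with pandas. The file path and the column names are assumptions: they follow the public TLC yellow-taxi naming and are not confirmed by the notebook, and `load_trips` is a hypothetical helper. Reading parquet also requires a parquet engine such as pyarrow to be installed.

```python
from pathlib import Path

import pandas as pd

# Assumed location and column names (public TLC yellow-taxi schema);
# adjust both to match the file actually used in the notebook.
DATA_PATH = Path("data/yellow_tripdata_2025-01.parquet")
TRIP_COLUMNS = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "fare_amount",
    "tip_amount",
    "total_amount",
]


def load_trips(path=DATA_PATH, columns=TRIP_COLUMNS):
    """Read only the listed columns from a parquet file into a DataFrame."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Parquet file not found: {path}")
    # pandas delegates the actual read to a parquet engine (pyarrow or fastparquet).
    return pd.read_parquet(path, columns=columns)
```

Calling `load_trips()` and then `df.info()` on the result is a quick way to check column types and row counts before exploring further.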

3. Data Exploration and Visualization

I explored the dataset to understand its structure and identify key patterns:

Visualized distributions of trip distances, fare amounts, and passenger counts.
Analyzed trends over time (e.g., hourly, daily patterns).
Identified outliers such as negative fares or unusually high values.
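The exploration steps above could look something like the following sketch. It uses a tiny made-up sample instead of the real data, assumed TLC-style column names, and the 1.5 × IQR rule as one possible outlier definition; the notebook may use different checks.

```python
import pandas as pd

# Tiny made-up sample standing in for the real trips; column names are assumed.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2025-01-06 08:15", "2025-01-06 08:40", "2025-01-06 17:05",
        "2025-01-11 13:30", "2025-01-12 23:50",
    ]),
    "trip_distance": [1.2, 3.4, 2.0, 0.8, 55.0],
    "fare_amount": [9.3, 18.4, 12.1, -5.0, 250.0],
})

# Hourly pattern: number of pickups per hour of day.
trips_per_hour = trips["tpep_pickup_datetime"].dt.hour.value_counts().sort_index()

# Daily pattern: number of pickups per weekday name.
trips_per_weekday = trips["tpep_pickup_datetime"].dt.day_name().value_counts()

# Candidate outliers: negative fares, or fares above the 1.5 * IQR upper fence.
q1, q3 = trips["fare_amount"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
suspicious = trips[(trips["fare_amount"] < 0) | (trips["fare_amount"] > upper_fence)]
```

On the real data, `trips_per_hour.plot(kind="bar")` would turn the hourly counts into a bar chart, and `suspicious` gives a first list of rows to inspect before cleaning.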

4. Data Cleaning

To ensure data quality, I performed the following steps:

Removed rows with missing or invalid values (e.g., negative fares).
Filtered out extreme outliers that could skew the analysis.
Ensured consistency in categorical variables like payment types.
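A minimal sketch of these cleaning rules is shown below. The column names, the 99th-percentile cutoff, and the set of valid payment codes are all assumptions made for illustration, and `clean_trips` is a hypothetical helper, not the notebook's own function.

```python
import pandas as pd

# Assumed set of valid payment codes; check the dataset's data dictionary.
VALID_PAYMENT_TYPES = {1, 2, 3, 4, 5, 6}


def clean_trips(df, upper_quantile=0.99):
    """Drop missing or invalid rows, then trim extreme values (assumed rules)."""
    cols = ["fare_amount", "trip_distance", "passenger_count"]
    out = df.dropna(subset=cols)
    # Keep only plausible trips: positive fare, distance, and passenger count.
    out = out[
        (out["fare_amount"] > 0)
        & (out["trip_distance"] > 0)
        & (out["passenger_count"] > 0)
    ]
    # Keep only rows whose payment code is in the assumed valid set.
    if "payment_type" in out.columns:
        out = out[out["payment_type"].isin(VALID_PAYMENT_TYPES)]
    # Trim the upper tail of fares and distances; the cutoff is an arbitrary choice.
    for col in ["fare_amount", "trip_distance"]:
        out = out[out[col] <= out[col].quantile(upper_quantile)]
    return out.reset_index(drop=True)
```

Comparing `len(df)` before and after calling `clean_trips(df)` shows how many rows each run removes, which helps judge whether the thresholds are too aggressive.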

5.1. Train-test split

I split the data into training and testing sets so that the model could be evaluated on trips it had not seen during training.
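One common way to do this is scikit-learn's `train_test_split`, sketched here on synthetic data. The feature names, the 80/20 ratio, and the random seed are assumptions, since the README does not state which settings the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned trips; feature names are assumptions.
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20.0, size=100),
    "pickup_hour": rng.integers(0, 24, size=100),
})
trips["fare_amount"] = 3.0 + 2.5 * trips["trip_distance"]

X = trips[["trip_distance", "pickup_hour"]]  # features
y = trips["fare_amount"]                     # target

# Hold out 20% of the trips for testing; ratio and seed are arbitrary choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

Fixing `random_state` keeps the split reproducible, so later model comparisons are made on the same held-out trips.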

6. Built a Benchmark Model

I trained a simple machine learning model to predict taxi fare amounts based on trip distance, time of day, and other features:

Used a DecisionTreeRegressor as the benchmark model.
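The sketch below shows how such a benchmark might be trained and scored. It runs on synthetic fares, not the real dataset, and the feature names, `max_depth=5`, and the mean-fare baseline are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic trips: fare = base fee + per-mile rate + noise (not real tariff values).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20.0, size=n),
    "pickup_hour": rng.integers(0, 24, size=n),
})
y = 3.0 + 2.5 * X["trip_distance"] + rng.normal(0.0, 1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_depth is an assumed setting that limits how closely the tree fits the noise.
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Score the tree on held-out trips.
mae = mean_absolute_error(y_test, model.predict(X_test))

# Naive reference point: always predict the mean training fare.
baseline_mae = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))
print(f"tree MAE: {mae:.2f}, mean-fare baseline MAE: {baseline_mae:.2f}")
```

Comparing against a constant-prediction baseline is a quick sanity check that the tree has learned something beyond the average fare.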

Results

Insights from Data Analysis

Trip Patterns:
- Most trips occurred during rush hours (morning and evening).
- Weekends had fewer trips compared to weekdays.
Fare Trends:
- Fare amounts were generally proportional to trip distances.
- Short trips within Manhattan had higher average fares per mile compared to longer trips.
Impact of Weather:
- Rainy days showed a slight increase in trip fares due to longer travel times.
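For the fare-per-mile observation, a comparison of this kind can be computed by bucketing trips by distance, as in the sketch below. The numbers are invented purely to show the computation and are not results from the dataset; the bucket edges and column names are assumptions, and the Manhattan restriction (which would need pickup and drop-off zones) is left out.

```python
import pandas as pd

# Invented example trips; not taken from the real data.
trips = pd.DataFrame({
    "trip_distance": [0.8, 1.5, 2.5, 6.0, 12.0, 18.0],
    "fare_amount": [7.2, 10.5, 15.0, 27.0, 48.0, 70.2],
})
trips["fare_per_mile"] = trips["fare_amount"] / trips["trip_distance"]

# Distance buckets in miles; the edges are arbitrary choices.
bins = [0, 2, 5, 10, 30]
labels = ["0-2 mi", "2-5 mi", "5-10 mi", "10-30 mi"]
trips["distance_bucket"] = pd.cut(trips["trip_distance"], bins=bins, labels=labels)

# Average fare per mile within each distance bucket.
fare_per_mile_by_bucket = (
    trips.groupby("distance_bucket", observed=True)["fare_per_mile"].mean()
)
```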

Model Performance

The benchmark model achieved the following results:

Mean Absolute Error (MAE): ~3.50 USD
This indicates that the model's predictions were off by an average of $3.50 compared to actual fare amounts.
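To make the metric concrete, here is a small worked example with made-up fares. MAE is the mean of the absolute differences between actual and predicted values.

```python
import numpy as np

# Made-up fares in USD, purely to illustrate the metric.
actual = np.array([10.0, 20.0, 15.0, 8.0])
predicted = np.array([12.0, 17.0, 15.0, 9.0])

# Absolute errors are 2, 3, 0 and 1, so the mean is 6 / 4.
mae = np.mean(np.abs(actual - predicted))
print(mae)  # 1.5
```

Because MAE is expressed in the same unit as the target, an MAE of 1.5 here reads directly as "off by $1.50 per trip on average".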

How to Use This Project

Clone this repository:

git clone https://github.com/hachemboudoukha/NYC-Taxi-Analysis.git
cd NYC-Taxi-Analysis

Install required dependencies:

pip install pandas numpy matplotlib seaborn scikit-learn pyarrow

Open the Jupyter Notebook (notebook_taxi_nyc.ipynb) to explore the analysis and results step by step.

What I Learned

This project helped me build skills in:

Cleaning and preparing real-world datasets.
Performing exploratory data analysis (EDA) using visualizations.
Building and evaluating machine learning models.
Understanding how external factors (e.g., weather) impact predictions.

hachemboudoukha/NYC-Taxi-Analysis
