hachemboudoukha/NYC-Taxi-Analysis

NYC Taxi Trip Data Analysis

Welcome to my NYC Taxi Trip Data Analysis project! This repository showcases the work I have done as a junior data scientist to analyze and model New York City taxi trip data for January 2025. Below, I explain what I did step by step and summarize the results of my analysis.


Table of Contents

  1. Import libraries
  2. Import data
  3. Data visualization
  4. Data cleaning
  5. Data preparation
    5.1. Train-test split
  6. Benchmark model

What I Did

1. Imported Libraries

I started by importing essential Python libraries such as:

  • pandas for data manipulation
  • numpy for numerical operations
  • matplotlib and seaborn for data visualization
  • scikit-learn for machine learning modeling

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

2. Loaded the Dataset

I worked with the NYC Yellow Taxi trip dataset for January 2025, which was provided in .parquet format. The dataset contains detailed information about taxi trips, including:

  • Pickup and drop-off times
  • Trip distances
  • Passenger counts
  • Fare details (e.g., fare amount, tips, total amount)
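
Loading the file could look roughly like the sketch below. The file name follows the usual TLC naming convention and the column names come from the public yellow-taxi schema; both are assumptions here, since the notebook's exact path and columns are not listed in this README. Note that pandas needs a Parquet engine such as pyarrow or fastparquet installed for this call.

```python
import pandas as pd

# Assumed file name (TLC naming convention); adjust to wherever the file lives.
DATA_PATH = "yellow_tripdata_2025-01.parquet"

# Assumed column names from the public TLC yellow-taxi schema.
COLUMNS = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "fare_amount",
    "tip_amount",
    "total_amount",
    "payment_type",
]


def load_trips(path=DATA_PATH, columns=COLUMNS):
    """Read only the listed columns from a Parquet file into a DataFrame."""
    # Requires a Parquet engine (pyarrow or fastparquet) to be installed.
    return pd.read_parquet(path, columns=columns)
```

Calling `load_trips()` and then `.info()` on the result gives a quick overview of column types and missing values.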

3. Data Exploration and Visualization

I explored the dataset to understand its structure and identify key patterns:

  • Visualized distributions of trip distances, fare amounts, and passenger counts.
  • Analyzed trends over time (e.g., hourly, daily patterns).
  • Identified outliers such as negative fares or unusually high values.
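
As a rough illustration of these exploration steps, the sketch below counts trips per pickup hour and flags negative fares. The tiny DataFrame is hand-made sample data, and the column names are assumed from the public TLC schema rather than taken from the notebook.

```python
import pandas as pd

# Hand-made sample standing in for the real data (values are illustrative only).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2025-01-06 08:15",
        "2025-01-06 08:40",
        "2025-01-06 18:05",
        "2025-01-11 13:30",
    ]),
    "trip_distance": [1.2, 3.4, 2.0, 0.8],
    "fare_amount": [9.3, 19.1, -5.0, 7.2],
})


def trips_per_hour(df):
    """Count trips by pickup hour (0-23), sorted by hour."""
    return df["tpep_pickup_datetime"].dt.hour.value_counts().sort_index()


def suspicious_fares(df):
    """Return rows whose fare is negative, a first cue for invalid records."""
    return df[df["fare_amount"] < 0]


hourly = trips_per_hour(trips)
bad = suspicious_fares(trips)
```

The `hourly` Series can be passed straight to `hourly.plot(kind="bar")` to get an hourly demand chart.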

4. Data Cleaning

To ensure data quality, I performed the following steps:

  • Removed rows with missing or invalid values (e.g., negative fares).
  • Filtered out extreme outliers that could skew the analysis.
  • Ensured consistency in categorical variables like payment types.
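
A minimal sketch of this kind of cleaning is shown below. The column names, the "strictly positive" rule, and the 0.99 quantile cutoff are illustrative assumptions, not the thresholds used in the notebook.

```python
import numpy as np
import pandas as pd


def clean_trips(df, upper_quantile=0.99):
    """Drop missing values, non-positive rows, and the extreme upper tail."""
    cols = ["fare_amount", "trip_distance", "passenger_count"]
    out = df.dropna(subset=cols)
    # Keep only strictly positive fares and distances.
    out = out[(out["fare_amount"] > 0) & (out["trip_distance"] > 0)]
    # Trim extreme values; the quantile cutoff is an illustrative choice.
    fare_cap = out["fare_amount"].quantile(upper_quantile)
    dist_cap = out["trip_distance"].quantile(upper_quantile)
    out = out[(out["fare_amount"] <= fare_cap) & (out["trip_distance"] <= dist_cap)]
    return out.reset_index(drop=True)


# Hand-made sample: one negative fare, one missing fare, one extreme fare.
raw = pd.DataFrame({
    "fare_amount": [10.0, 12.0, -4.0, np.nan, 900.0, 8.0],
    "trip_distance": [2.0, 2.5, 1.0, 1.1, 3.0, 1.5],
    "passenger_count": [1, 2, 1, 1, 1, 1],
})
cleaned = clean_trips(raw)
```

On real data, a quantile cutoff always discards a small share of legitimate long trips, so a fixed domain-based limit may be preferable in some cases.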

5. Data Preparation

5.1. Train-test split

  • Split the data into training and testing sets to evaluate model performance.
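
The split can be sketched with scikit-learn's `train_test_split`. The synthetic data, the feature names, the 80/20 ratio, and the fixed seed below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the fare rule is made up for this example.
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 10.0, size=100)
data = pd.DataFrame({
    "trip_distance": distance,
    "pickup_hour": rng.integers(0, 24, size=100),
    "fare_amount": 3.0 + 2.5 * distance,
})

X = data[["trip_distance", "pickup_hour"]]  # features
y = data["fare_amount"]                     # target

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```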

6. Built a Benchmark Model

I trained a simple machine learning model to predict taxi fare amounts based on trip distance, time of day, and other features:

  • Used a DecisionTreeRegressor as the benchmark model.
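
Fitting such a benchmark might look like the sketch below. The data is synthetic, and the two features and `max_depth=5` are assumptions; the notebook's actual feature set and hyperparameters are not stated here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the fare rule is made up for this example.
rng = np.random.default_rng(0)
distance = rng.uniform(0.5, 10.0, size=200)
hour = rng.integers(0, 24, size=200)
X = np.column_stack([distance, hour])
y = 3.0 + 2.5 * distance

# Use the first 160 rows for training and the last 40 for prediction.
X_train, y_train = X[:160], y[:160]
X_test = X[160:]

# A shallow tree keeps the benchmark simple and limits overfitting.
model = DecisionTreeRegressor(max_depth=5, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Because each leaf predicts the mean of its training targets, a regression tree never predicts a fare outside the range seen during training.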

Results

Insights from Data Analysis

  1. Trip Patterns:

    • Most trips occurred during rush hours (morning and evening).
    • Weekends had fewer trips than weekdays.
  2. Fare Trends:

    • Fare amounts were generally proportional to trip distances.
    • Short trips within Manhattan had higher average fares per mile than longer trips.
  3. Impact of Weather:

    • Rainy days showed a slight increase in trip fares due to longer travel times.

Model Performance

The benchmark model achieved the following results:

  • Mean Absolute Error (MAE): ~3.50 USD
    This indicates that the model's predictions were off by an average of $3.50 compared to actual fare amounts.
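
For reference, MAE is the mean of the absolute differences between predicted and actual fares. The small worked example below uses made-up numbers (not the project's results) and checks a manual calculation against scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up fares in USD, for illustration only.
y_true = np.array([10.0, 20.0, 15.0, 8.0])
y_pred = np.array([12.0, 17.0, 15.0, 12.5])

# Absolute errors are 2.0, 3.0, 0.0, 4.5; their mean is 9.5 / 4 = 2.375.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
```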

How to Use This Project

  1. Clone this repository:
git clone https://github.com/hachemboudoukha/NYC-Taxi-Analysis.git
cd NYC-Taxi-Analysis
  2. Install required dependencies:
pip install pandas numpy matplotlib seaborn scikit-learn
  3. Open the Jupyter Notebook (notebook_taxi_nyc.ipynb) to explore the analysis and results step by step.

What I Learned

This project helped me build skills in:

  1. Cleaning and preparing real-world datasets.
  2. Performing exploratory data analysis (EDA) using visualizations.
  3. Building and evaluating machine learning models.
  4. Understanding how external factors (e.g., weather) impact predictions.
