Yellow taxicabs are the only vehicles that have the right to pick up street-hailing and prearranged passengers anywhere in New York City. My objective is to upload the collected dataset to the Hadoop ecosystem, then analyse and explore it while answering some important questions.
You can see my HiveQL queries here.
The dataset used in this Hadoop-Hive case study was collected from the official website of the NYC Taxi and Limousine Commission (TLC) and covers the year 2015. The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
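As a rough illustration of how such records could be exposed to Hive, here is a minimal sketch of an external table over the raw CSV files. The table name `yellow_trips_2015` and the HDFS location are hypothetical, and the column names and types are assumptions based on the fields listed above and the TLC data dictionary for 2015; the actual schema used in this case study may differ.

```sql
-- Sketch only: table name, HDFS path, and column types are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS yellow_trips_2015 (
  vendorid               INT,
  tpep_pickup_datetime   TIMESTAMP,
  tpep_dropoff_datetime  TIMESTAMP,
  passenger_count        INT,
  trip_distance          DOUBLE,
  pickup_longitude       DOUBLE,
  pickup_latitude        DOUBLE,
  ratecodeid             INT,
  store_and_fwd_flag     STRING,
  dropoff_longitude      DOUBLE,
  dropoff_latitude       DOUBLE,
  payment_type           INT,
  fare_amount            DOUBLE,
  extra                  DOUBLE,
  mta_tax                DOUBLE,
  tip_amount             DOUBLE,
  tolls_amount           DOUBLE,
  improvement_surcharge  DOUBLE,
  total_amount           DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
-- Hypothetical HDFS directory holding the uploaded CSV files.
LOCATION '/user/hive/warehouse/nyc_taxi/yellow_2015'
-- Skip the CSV header row in each file.
TBLPROPERTIES ('skip.header.line.count'='1');
```

An external table leaves the uploaded files in place on HDFS, so dropping the table does not delete the data.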
In this case study I explore the following questions:
- What is the total number of trips (equal to the number of rows)?
- What is the total revenue generated by all the trips?
- What fraction of the total revenue is paid in tolls?
- What fraction of the total revenue is driver tips?
- What is the average trip amount?
- What is the average distance of the trips?
- How many different payment types are used?
- For each payment type, display the following details:
  - Average fare generated
  - Average tip
  - Average tax
- On average, which hour of the day generates the highest revenue?
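
The questions above map fairly directly onto HiveQL aggregates. The following is a minimal sketch, not the queries actually used in this case study. It assumes a hypothetical table `yellow_trips_2015` with TLC-style column names (`total_amount`, `tolls_amount`, `tip_amount`, `fare_amount`, `mta_tax`, `trip_distance`, `payment_type`, `tpep_pickup_datetime`). It also assumes that "tax" means the `mta_tax` field and that the last question asks for the pick-up hour with the highest average revenue per trip.

```sql
-- Total number of trips (one row per trip).
SELECT COUNT(*) AS total_trips
FROM yellow_trips_2015;

-- Total revenue, toll and tip fractions, average amount and distance,
-- and the number of distinct payment types, in a single pass.
SELECT
  SUM(total_amount)                     AS total_revenue,
  SUM(tolls_amount) / SUM(total_amount) AS tolls_fraction,
  SUM(tip_amount)   / SUM(total_amount) AS tips_fraction,
  AVG(total_amount)                     AS avg_trip_amount,
  AVG(trip_distance)                    AS avg_trip_distance,
  COUNT(DISTINCT payment_type)          AS payment_type_count
FROM yellow_trips_2015;

-- Average fare, tip, and tax for each payment type.
-- Assumption: "tax" refers to the mta_tax column.
SELECT
  payment_type,
  AVG(fare_amount) AS avg_fare,
  AVG(tip_amount)  AS avg_tip,
  AVG(mta_tax)     AS avg_tax
FROM yellow_trips_2015
GROUP BY payment_type;

-- Pick-up hour with the highest average revenue per trip.
SELECT
  HOUR(tpep_pickup_datetime) AS pickup_hour,
  AVG(total_amount)          AS avg_revenue
FROM yellow_trips_2015
GROUP BY HOUR(tpep_pickup_datetime)
ORDER BY avg_revenue DESC
LIMIT 1;
```

If the last question is instead read as "which hour collects the most revenue overall", replacing `AVG(total_amount)` with `SUM(total_amount)` in the final query gives that interpretation.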