Skip Navigation

NYC Taxi & Limousine Commission - yellow taxi trip records

NYC Taxi TLC yellow

The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

Volume and Retention

This dataset is stored in Parquet format. There are about 1.5B rows (50GB) in total as of 2018.

This dataset contains historical records accumulated from 2009 to 2018. You can use parameter settings in our SDK to fetch data within a specific time range.

Storage Location

This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.

Additional Information

NYC Taxi and Limousine Commission (TLC):

The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

Additional information about TLC Trip Record Data can be found from here and here.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN “AS IS” BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.

Access

Available inWhen to use
Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Azure Databricks

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

vendorID tpepPickupDateTime tpepDropoffDateTime passengerCount tripDistance puLocationId doLocationId rateCodeId storeAndFwdFlag paymentType fareAmount extra mtaTax improvementSurcharge tipAmount tollsAmount totalAmount puYear puMonth
2 1/24/2088 12:25:39 AM 1/24/2088 7:28:25 AM 1 4.05 24 162 1 N 2 14.5 0 0.5 0.3 0 0 15.3 2088 1
2 1/24/2088 12:15:42 AM 1/24/2088 12:19:46 AM 1 0.63 41 166 1 N 2 4.5 0 0.5 0.3 0 0 5.3 2088 1
2 11/4/2084 12:32:24 PM 11/4/2084 12:47:41 PM 1 1.34 238 236 1 N 2 10 0 0.5 0.3 0 0 10.8 2084 11
2 11/4/2084 12:25:53 PM 11/4/2084 12:29:00 PM 1 0.32 238 238 1 N 2 4 0 0.5 0.3 0 0 4.8 2084 11
2 11/4/2084 12:08:33 PM 11/4/2084 12:22:24 PM 1 1.85 236 238 1 N 2 10 0 0.5 0.3 0 0 10.8 2084 11
2 11/4/2084 11:41:35 AM 11/4/2084 11:59:41 AM 1 1.65 68 237 1 N 2 12.5 0 0.5 0.3 0 0 13.3 2084 11
2 11/4/2084 11:27:28 AM 11/4/2084 11:39:52 AM 1 1.07 170 68 1 N 2 9 0 0.5 0.3 0 0 9.8 2084 11
2 11/4/2084 11:19:06 AM 11/4/2084 11:26:44 AM 1 1.3 107 170 1 N 2 7.5 0 0.5 0.3 0 0 8.3 2084 11
2 11/4/2084 11:02:59 AM 11/4/2084 11:15:51 AM 1 1.85 113 137 1 N 2 10 0 0.5 0.3 0 0 10.8 2084 11
2 11/4/2084 10:46:05 AM 11/4/2084 10:50:09 AM 1 0.62 231 231 1 N 2 4.5 0 0.5 0.3 0 0 5.3 2084 11
Name Data type Unique Values (sample) Description
doLocationId string 265 161
236

TLC Taxi Zone in which the taximeter was disengaged.

endLat double 961,994 41.366138
40.75
endLon double 1,144,935 -73.137393
-73.9824
extra double 877 0.5
1.0

Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.

fareAmount double 18,935 6.5
4.5

The time-and-distance fare calculated by the meter.

improvementSurcharge string 60 0.3
0

$0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015.

mtaTax double 360 0.5
-0.5

$0.50 MTA tax that is automatically triggered based on the metered rate in use.

passengerCount int 64 1
2

The number of passengers in the vehicle. This is a driver-entered value.

paymentType string 6,282 CSH
CRD

A numeric code signifying how the passenger paid for the trip.

1= Credit card;

2= Cash;

3= No charge;

4= Dispute;

5= Unknown;

6= Voided trip.

puLocationId string 266 237
161

TLC Taxi Zone in which the taximeter was engaged.

puMonth int 12 3
5
puYear int 29 2012
2011
rateCodeId int 56 1
2

The final rate code in effect at the end of the trip.

1= Standard rate;

2= JFK;

3= Newark;

4= Nassau or Westchester;

5= Negotiated fare;

6= Group ride.

startLat double 833,016 41.366138
40.7741
startLon double 957,428 -73.137393
-73.9824
storeAndFwdFlag string 8 N
0

This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server.

Y= store and forward trip;

N= not a store and forward trip.

tipAmount double 12,121 1.0
2.0

This field is automatically populated for credit card tips. Cash tips are not included.

tollsAmount double 6,634 5.33
4.8

Total amount of all tolls paid in trip.

totalAmount double 39,707 7.0
7.8

The total amount charged to passengers. Does not include cash tips.

tpepDropoffDateTime timestamp 290,185,010 2013-11-03 01:25:00
2009-12-28 18:12:00

The date and time when the meter was disengaged.

tpepPickupDateTime timestamp 289,948,585 2009-12-28 18:12:00
2013-11-03 01:15:00

The date and time when the meter was engaged.

tripDistance double 14,003 1.0
0.9

The elapsed trip distance in miles reported by the taximeter.

vendorID string 7 VTS
CMT

A code indicating the TPEP provider that provided the record.

1= Creative Mobile Technologies, LLC;

2= VeriFone Inc.

Select your preferred service:

Azure Notebooks

Azure Databricks

Azure Notebooks

Package: Language: Python Python
In [1]:
# This is a package in preview.
from azureml.opendatasets import NycTlcYellow

from datetime import datetime
from dateutil import parser


end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
ActivityStarted, to_pandas_dataframe ActivityStarted, to_pandas_dataframe_in_worker Target paths: ['/puYear=2018/puMonth=5/', '/puYear=2018/puMonth=6/'] Looking for parquet files... Reading them into Pandas dataframe... Reading yellow/puYear=2018/puMonth=5/part-00087-tid-4962944523873006564-6d1b261c-5f96-4819-ba4d-a034cf2bc6ec-12005.c000.snappy.parquet under container nyctlc Reading yellow/puYear=2018/puMonth=6/part-00171-tid-4962944523873006564-6d1b261c-5f96-4819-ba4d-a034cf2bc6ec-12089.c000.snappy.parquet under container nyctlc Done. ActivityCompleted: Activity=to_pandas_dataframe_in_worker, HowEnded=Success, Duration=137433.5 [ms] ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=137510.05 [ms]
In [2]:
nyc_tlc_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 10695823 entries, 189 to 5673232 Data columns (total 21 columns): vendorID object tpepPickupDateTime datetime64[ns] tpepDropoffDateTime datetime64[ns] passengerCount int32 tripDistance float64 puLocationId object doLocationId object startLon float64 startLat float64 endLon float64 endLat float64 rateCodeId int32 storeAndFwdFlag object paymentType object fareAmount float64 extra float64 mtaTax float64 improvementSurcharge object tipAmount float64 tollsAmount float64 totalAmount float64 dtypes: datetime64[ns](2), float64(11), int32(2), object(6) memory usage: 1.7+ GB
# Pip install packages
import os, sys

!{sys.executable} -m pip install azure-storage
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas

# COMMAND ----------

# Azure storage access info
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = r""
container_name = "nyctlc"
folder_name = "yellow"

# COMMAND ----------

from azure.storage.blob import BlockBlobService

if azure_storage_account_name is None or azure_storage_sas_token is None:
    raise Exception("Provide your specific name and key for your Azure Storage account--see the Prerequisites section earlier.")

print('Looking for the first parquet under the folder ' + folder_name + ' in container "' + container_name + '"...')
blob_service = BlockBlobService(account_name = azure_storage_account_name, sas_token = azure_storage_sas_token,)
blobs = blob_service.list_blobs(container_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
targetBlobName=''
for blob in sorted_blobs:
    if blob.name.startswith(folder_name) and blob.name.endswith('.parquet'):
        targetBlobName = blob.name
        break

print('Target blob to download: ' + targetBlobName)
_, filename = os.path.split(targetBlobName)
parquet_file=blob_service.get_blob_to_path(container_name, targetBlobName, filename)

# COMMAND ----------

# Read the local parquet file into Pandas data frame
import pyarrow.parquet as pq
import pandas as pd

appended_df = []
print('Reading the local parquet file into Pandas data frame')
df = pq.read_table(filename).to_pandas()

# COMMAND ----------

# you can add your filter at below
print('Loaded as a Pandas data frame: ')
df

# COMMAND ----------


Azure Databricks

Package: Language: Python Python
In [1]:
# This is a package in preview.
# You need to pip install azureml-opendatasets in Databricks cluster. https://docs.microsoft.com/en-us/azure/data-explorer/connect-from-databricks#install-the-python-library-on-your-azure-databricks-cluster
from azureml.opendatasets import NycTlcYellow

from datetime import datetime
from dateutil import parser


end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()
ActivityStarted, to_spark_dataframe ActivityStarted, to_spark_dataframe_in_worker ActivityCompleted: Activity=to_spark_dataframe_in_worker, HowEnded=Success, Duration=91957.61 [ms] ActivityCompleted: Activity=to_spark_dataframe, HowEnded=Success, Duration=91961.14 [ms]
In [2]:
display(nyc_tlc_df.limit(5))
vendorIDtpepPickupDateTimetpepDropoffDateTimepassengerCounttripDistancepuLocationIddoLocationIdstartLonstartLatendLonendLatrateCodeIdstoreAndFwdFlagpaymentTypefareAmountextramtaTaximprovementSurchargetipAmounttollsAmounttotalAmountpuYearpuMonth
22018-06-05T23:37:15.000+00002018-06-06T00:15:57.000+0000121.65132142nullnullnullnull2N152.00.00.50.38.05.7666.5620186
22018-05-31T18:04:19.000+00002018-06-01T17:56:15.000+000019.95230138nullnullnullnull1N135.01.00.50.38.515.7651.0720186
22018-05-31T18:58:20.000+00002018-06-01T18:55:28.000+000021.9148239nullnullnullnull1N212.01.00.50.30.00.013.820186
22018-05-31T16:45:53.000+00002018-06-01T16:06:26.000+000062.79230125nullnullnullnull1N214.01.00.50.30.00.015.820186
22018-05-31T15:08:53.000+00002018-06-01T14:21:29.000+000011.723674nullnullnullnull1N110.01.00.50.32.360.014.1620186
# Databricks notebook source
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "yellow"
blob_sas_token = r""

# COMMAND ----------

# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)

# COMMAND ----------

# SPARK read parquet, note that it won't load any data yet by now
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')

# COMMAND ----------

# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))