NYC Taxi & Limousine Commission - For-Hire Vehicle (FHV) trip records

NYC TLC Taxi FHV

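For-Hire Vehicle ("FHV") trip records include fields capturing the dispatching base license number and the pickup date, time, and taxi zone location ID (shape file below). These records are generated from the FHV trip record submissions made by bases.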

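Volume and retention

This dataset is stored in Parquet format. As of 2018, it contains about 500 million rows (5 GB).

This dataset contains historical records accumulated from 2009 to 2018. You can use parameter settings in our SDK to fetch data within a specific time range.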

Storage location

This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.

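Additional information

NYC Taxi and Limousine Commission (TLC):

The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and the TLC makes no representations as to the accuracy of these data.

For additional information about TLC trip record data, see here and here.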

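Notices

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.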

Access

Available in  When to use
Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Azure Databricks

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Azure Synapse

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

dispatchBaseNum pickupDateTime dropOffDateTime puLocationId doLocationId srFlag puYear puMonth
B03157 6/30/2019 11:59:57 PM 7/1/2019 12:07:21 AM 264 null null 2019 6
B01667 6/30/2019 11:59:56 PM 7/1/2019 12:28:06 AM 264 null null 2019 6
B02849 6/30/2019 11:59:55 PM 7/1/2019 12:14:10 AM 264 null null 2019 6
B02249 6/30/2019 11:59:53 PM 7/1/2019 12:15:53 AM 264 null null 2019 6
B00887 6/30/2019 11:59:48 PM 7/1/2019 12:29:29 AM 264 null null 2019 6
B01626 6/30/2019 11:59:45 PM 7/1/2019 12:18:20 AM 264 null null 2019 6
B01259 6/30/2019 11:59:44 PM 7/1/2019 12:03:15 AM 264 null null 2019 6
B01145 6/30/2019 11:59:43 PM 7/1/2019 12:11:15 AM 264 null null 2019 6
B00887 6/30/2019 11:59:42 PM 7/1/2019 12:34:21 AM 264 null null 2019 6
B00821 6/30/2019 11:59:40 PM 7/1/2019 12:02:57 AM 264 null null 2019 6
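
As a rough illustration of how the preview rows above map onto the columns, the following sketch parses one row and derives the trip duration. The `FhvTrip` class and `parse_preview_row` helper are hypothetical names introduced here, and the timestamp layout is assumed from the preview display only; the Parquet files themselves store real timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Timestamp layout as displayed in the preview table, e.g. "6/30/2019 11:59:57 PM".
PREVIEW_TS_FORMAT = "%m/%d/%Y %I:%M:%S %p"


@dataclass
class FhvTrip:
    """Hypothetical container for the first six preview columns."""
    dispatch_base_num: str
    pickup: datetime
    dropoff: datetime
    pu_location_id: Optional[str]
    do_location_id: Optional[str]
    sr_flag: Optional[str]

    @property
    def duration(self) -> timedelta:
        return self.dropoff - self.pickup


def _nullable(token: str) -> Optional[str]:
    # The preview prints missing values as the literal text "null".
    return None if token == "null" else token


def parse_preview_row(line: str) -> FhvTrip:
    """Parse one whitespace-separated preview row (illustrative helper)."""
    t = line.split()
    # Each timestamp spans three tokens: date, time, AM/PM.
    pickup = datetime.strptime(" ".join(t[1:4]), PREVIEW_TS_FORMAT)
    dropoff = datetime.strptime(" ".join(t[4:7]), PREVIEW_TS_FORMAT)
    return FhvTrip(t[0], pickup, dropoff,
                   _nullable(t[7]), _nullable(t[8]), _nullable(t[9]))


trip = parse_preview_row(
    "B03157 6/30/2019 11:59:57 PM 7/1/2019 12:07:21 AM 264 null null 2019 6")
print(trip.duration)  # 0:07:24
```

Note that this trip crosses midnight, so its pickup falls in the June partition (`puYear=2019`, `puMonth=6`) even though the drop-off is in July.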
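Name  Data type  Unique  Values (sample)  Description
dispatchBaseNum  string  1,144  B02510, B02764  The TLC base license number of the base that dispatched the trip.
doLocationId  string  267  265, 132  The TLC taxi zone in which the trip ended.
dropOffDateTime  timestamp  57,110,352  2017-07-31 23:59:00, 2017-10-15 00:44:34  The date and time of the trip drop-off.
pickupDateTime  timestamp  111,270,396  2016-08-16 00:00:00, 2016-08-17 00:00:00  The date and time of the trip pickup.
puLocationId  string  266  79, 161  The TLC taxi zone in which the trip began.
puMonth  int  12  1, 12
puYear  int  5  2018, 2017
srFlag  string  44  1, 2  Indicates whether the trip was part of a shared ride chain offered by a High Volume FHV company (for example, Uber Pool, Lyft Line). For shared trips, the value is 1. For non-shared trips, the value is null.

Note: For most High Volume FHV companies, only shared rides that were requested and matched to another shared-ride request over the course of the journey are flagged. However, Lyft (base license numbers B02510 + B02844) also flags rides for which a shared ride was requested but no other passenger was successfully matched to share the trip. Therefore, trip records with SR_Flag=1 from these two bases can indicate either the first trip in a shared trip chain or a trip for which a shared ride was requested but never matched. Users should anticipate an overcount of successfully shared trips completed by Lyft.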
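
One way to account for the Lyft caveat above when counting shared trips is to keep the two Lyft bases in a separate bucket. The sketch below is illustrative only: the category names and the `classify_shared` and `summarize` helpers are made up for this example, the sample records are invented, and treating every `srFlag` value other than `"1"` as not flagged is an assumption.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# Lyft bases named in the note above; their srFlag=1 records are ambiguous.
LYFT_BASES = {"B02510", "B02844"}


def classify_shared(base: str, sr_flag: Optional[str]) -> str:
    """Bucket a trip by how much its srFlag says about actual sharing."""
    if sr_flag != "1":
        # Assumption: anything other than "1" (including null) is not flagged.
        return "not_flagged"
    if base in LYFT_BASES:
        # Requested as shared, but possibly never matched with another rider.
        return "requested_maybe_unmatched"
    return "matched_shared"


def summarize(trips: Iterable[Tuple[str, Optional[str]]]) -> Counter:
    """Count trips per bucket from (dispatchBaseNum, srFlag) pairs."""
    return Counter(classify_shared(base, flag) for base, flag in trips)


# Invented sample records, for illustration only.
sample = [("B02510", "1"), ("B02764", "1"), ("B02764", None), ("B02844", "1")]
counts = summarize(sample)
print(counts["matched_shared"],
      counts["requested_maybe_unmatched"],
      counts["not_flagged"])  # 1 2 1
```

Reporting the ambiguous bucket separately gives a lower bound (matched only) and an upper bound (matched plus ambiguous) on successfully shared trips.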

Azure Notebooks

In [1]:
# This is a package in preview.
from azureml.opendatasets import NycTlcFhv

from datetime import datetime
from dateutil import parser


end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcFhv(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
ActivityStarted, to_pandas_dataframe
ActivityStarted, to_pandas_dataframe_in_worker
Target paths: ['/puYear=2018/puMonth=5/', '/puYear=2018/puMonth=6/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading fhv/puYear=2018/puMonth=5/part-00087-tid-3759768514304694104-985150ea-1bd1-4f3d-b368-27cbe0c377b0-12952.c000.snappy.parquet under container nyctlc
Reading fhv/puYear=2018/puMonth=6/part-00171-tid-3759768514304694104-985150ea-1bd1-4f3d-b368-27cbe0c377b0-13036.c000.snappy.parquet under container nyctlc
Done.
ActivityCompleted: Activity=to_pandas_dataframe_in_worker, HowEnded=Success, Duration=737522.49 [ms]
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=737620.1 [ms]
In [2]:
nyc_tlc_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25142904 entries, 0 to 21089615
Data columns (total 6 columns):
dispatchBaseNum    object
pickupDateTime     datetime64[ns]
dropOffDateTime    datetime64[ns]
puLocationId       object
doLocationId       object
srFlag             object
dtypes: datetime64[ns](2), object(4)
memory usage: 1.3+ GB
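
The "Target paths" line in the log output above suggests that a date range is resolved to one `puYear`/`puMonth` partition folder per calendar month. The sketch below reproduces that mapping for illustration; `partition_paths` is a hypothetical helper written for this example, not the SDK's actual implementation.

```python
from datetime import date
from typing import List


def partition_paths(start: date, end: date) -> List[str]:
    """List '/puYear=Y/puMonth=M/' for each month from start to end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    paths = []
    year, month = start.year, start.month
    # Walk month by month until the end date's month has been included.
    while (year, month) <= (end.year, end.month):
        paths.append(f"/puYear={year}/puMonth={month}/")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


print(partition_paths(date(2018, 5, 1), date(2018, 6, 6)))
# ['/puYear=2018/puMonth=5/', '/puYear=2018/puMonth=6/']
```

Because whole month folders are read, narrowing the date range to fewer calendar months is likely the main way to reduce load time and memory use.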
In [1]:
# Pip install packages
import os, sys

!{sys.executable} -m pip install azure-storage-blob
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas
In [2]:
# Azure storage access info
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = r""
container_name = "nyctlc"
folder_name = "fhv"
In [3]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

if azure_storage_account_name is None or azure_storage_sas_token is None:
    raise Exception(
        "Provide your specific name and key for your Azure Storage account--see the Prerequisites section earlier.")

print('Looking for the first parquet under the folder ' +
      folder_name + ' in container "' + container_name + '"...')
container_url = f"https://{azure_storage_account_name}.blob.core.windows.net/"
blob_service_client = BlobServiceClient(
    container_url, azure_storage_sas_token if azure_storage_sas_token else None)

container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs(folder_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
targetBlobName = ''
for blob in sorted_blobs:
    if blob.name.startswith(folder_name) and blob.name.endswith('.parquet'):
        targetBlobName = blob.name
        break

print('Target blob to download: ' + targetBlobName)
_, filename = os.path.split(targetBlobName)
blob_client = container_client.get_blob_client(targetBlobName)
with open(filename, 'wb') as local_file:
    blob_client.download_blob().download_to_stream(local_file)
In [4]:
# Read the parquet file into Pandas data frame
import pandas as pd

print('Reading the parquet file into Pandas data frame')
df = pd.read_parquet(filename)
In [5]:
# You can add your own filter below
print('Loaded as a Pandas data frame: ')
df

Azure Databricks

In [1]:
# This is a package in preview.
# You need to pip install azureml-opendatasets in Databricks cluster. https://docs.microsoft.com/en-us/azure/data-explorer/connect-from-databricks#install-the-python-library-on-your-azure-databricks-cluster
from azureml.opendatasets import NycTlcFhv

from datetime import datetime
from dateutil import parser


end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcFhv(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()
ActivityStarted, to_spark_dataframe ActivityStarted, to_spark_dataframe_in_worker ActivityCompleted: Activity=to_spark_dataframe_in_worker, HowEnded=Success, Duration=27284.71 [ms] ActivityCompleted: Activity=to_spark_dataframe, HowEnded=Success, Duration=27288.78 [ms]
In [2]:
display(nyc_tlc_df.limit(5))
dispatchBaseNum pickupDateTime dropOffDateTime puLocationId doLocationId srFlag puYear puMonth
B02800 2018-05-08T13:31:30.000+0000 2018-05-08T13:57:23.000+0000 42 237 1 2018 5
B02800 2018-05-08T13:31:30.000+0000 2018-05-08T13:48:36.000+0000 33 261 null 2018 5
B02800 2018-05-08T13:31:38.000+0000 2018-05-08T13:59:23.000+0000 164 97 1 2018 5
B02800 2018-05-08T13:31:41.000+0000 2018-05-08T13:38:35.000+0000 144 211 1 2018 5
B02800 2018-05-08T13:31:43.000+0000 2018-05-08T13:44:23.000+0000 186 48 1 2018 5
In [1]:
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "fhv"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Spark reads Parquet lazily; no data is loaded at this point
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
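
To make the string formatting in the cells above easier to follow, here is a small standalone sketch that builds the same `wasbs` path and SAS configuration key, plus a path to a single month. The three helper functions are hypothetical names introduced for this example, and the `puYear=`/`puMonth=` folder layout is assumed from the file paths printed in the Azure Notebooks log output.

```python
def wasbs_path(container: str, account: str, relative_path: str) -> str:
    """Build the wasbs:// URL that Spark uses to address a blob folder."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{relative_path}"


def sas_conf_key(container: str, account: str) -> str:
    """Build the Spark configuration key under which the SAS token is set."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


def month_partition(base_path: str, year: int, month: int) -> str:
    """Append the assumed puYear/puMonth partition folders to a base path."""
    return f"{base_path.rstrip('/')}/puYear={year}/puMonth={month}"


root = wasbs_path("nyctlc", "azureopendatastorage", "fhv")
print(root)
# wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/fhv
print(sas_conf_key("nyctlc", "azureopendatastorage"))
# fs.azure.sas.nyctlc.azureopendatastorage.blob.core.windows.net
print(month_partition(root, 2018, 5))
# wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/fhv/puYear=2018/puMonth=5
```

When reading a single partition folder directly, Spark does not include `puYear` and `puMonth` as columns unless the `basePath` read option is set to the dataset root. Reading the root and filtering on those columns is usually the simpler choice.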

Azure Synapse

In [29]:
# This is a package in preview.
from azureml.opendatasets import NycTlcFhv

from datetime import datetime
from dateutil import parser


end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcFhv(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()
In [30]:
# Display top 5 rows
display(nyc_tlc_df.limit(5))
In [1]:
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "fhv"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Spark reads Parquet lazily; no data is loaded at this point
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))