NYC Taxi & Limousine Commission - For-Hire Vehicle (FHV) trip records
The For-Hire Vehicle (FHV) trip records include fields capturing the dispatching base license number and the pickup date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases.
Volume and retention
This dataset is stored in Parquet format. There are about 500M rows (5 GB) in total as of 2018.
This dataset contains historical records accumulated from 2009 to 2018. You can use parameter settings in our SDK to fetch data within a specific time range.
Storage location
This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.
Additional information
NYC Taxi and Limousine Commission (TLC):
The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and the TLC makes no representations as to the accuracy of these data.
You can find additional information about the TLC trip record data here and here.
Notices
MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.
This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.
Access
Available in | When to use |
---|---|
Azure Notebooks | Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine. |
Azure Databricks | Use this when you need the scale of an Azure managed Spark cluster to process the dataset. |
Azure Synapse | Use this when you need the scale of an Azure managed Spark cluster to process the dataset. |
Preview
dispatchBaseNum | pickupDateTime | dropOffDateTime | puLocationId | doLocationId | srFlag | puYear | puMonth |
---|---|---|---|---|---|---|---|
B03157 | 6/30/2019 11:59:57 PM | 7/1/2019 12:07:21 AM | 264 | null | null | 2019 | 6 |
B01667 | 6/30/2019 11:59:56 PM | 7/1/2019 12:28:06 AM | 264 | null | null | 2019 | 6 |
B02849 | 6/30/2019 11:59:55 PM | 7/1/2019 12:14:10 AM | 264 | null | null | 2019 | 6 |
B02249 | 6/30/2019 11:59:53 PM | 7/1/2019 12:15:53 AM | 264 | null | null | 2019 | 6 |
B00887 | 6/30/2019 11:59:48 PM | 7/1/2019 12:29:29 AM | 264 | null | null | 2019 | 6 |
B01626 | 6/30/2019 11:59:45 PM | 7/1/2019 12:18:20 AM | 264 | null | null | 2019 | 6 |
B01259 | 6/30/2019 11:59:44 PM | 7/1/2019 12:03:15 AM | 264 | null | null | 2019 | 6 |
B01145 | 6/30/2019 11:59:43 PM | 7/1/2019 12:11:15 AM | 264 | null | null | 2019 | 6 |
B00887 | 6/30/2019 11:59:42 PM | 7/1/2019 12:34:21 AM | 264 | null | null | 2019 | 6 |
B00821 | 6/30/2019 11:59:40 PM | 7/1/2019 12:02:57 AM | 264 | null | null | 2019 | 6 |
Name | Data type | Unique | Values (sample) | Description |
---|---|---|---|---|
dispatchBaseNum | string | 1,144 | B02510 B02764 | The TLC Base License Number of the base that dispatched the trip. |
doLocationId | string | 267 | 265 132 | TLC Taxi Zone in which the trip ended. |
dropOffDateTime | timestamp | 57,110,352 | 2017-07-31 23:59:00 2017-10-15 00:44:34 | The date and time of the trip drop-off. |
pickupDateTime | timestamp | 111,270,396 | 2016-08-16 00:00:00 2016-08-17 00:00:00 | The date and time of the trip pick-up. |
puLocationId | string | 266 | 79 161 | TLC Taxi Zone in which the trip began. |
puMonth | int | 12 | 1 12 | |
puYear | int | 5 | 2018 2017 | |
srFlag | string | 44 | 1 2 | Indicates whether the trip was part of a shared-ride chain offered by a High Volume FHV company (for example, Uber, Lyft Line). For shared trips, the value is 1. For non-shared rides, this field is null. NOTE: For most High Volume FHV companies, only shared rides that were requested AND matched to another shared-ride request over the course of the trip are flagged. However, Lyft (base license numbers B02510 and B02844) also flags rides for which a shared ride was requested but another passenger was never matched to share the trip; trip records with SR_Flag=1 from those two bases may therefore indicate EITHER a first trip in a shared-trip chain OR a trip for which a shared ride was requested but never matched. Users should expect an overcount of successful shared trips completed by Lyft. A pandas sketch of this distinction follows the table. |
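The srFlag caveat above matters when counting shared rides. The following is a minimal pandas sketch, not part of the original sample code, assuming a dataframe nyc_tlc_df loaded with the SDK as shown in the Azure Notebooks section below, with column names matching the schema table.
# Minimal sketch (assumption: nyc_tlc_df was loaded via the azureml-opendatasets SDK below).
# Lyft base license numbers taken from the srFlag description above.
lyft_bases = {"B02510", "B02844"}
# Trips flagged as part of a shared-ride chain.
shared = nyc_tlc_df[nyc_tlc_df["srFlag"] == "1"]
# For Lyft bases, srFlag = 1 may also mean "shared ride requested but never matched",
# so report those bases separately to gauge the possible overcount.
lyft_flagged = shared[shared["dispatchBaseNum"].isin(lyft_bases)]
other_flagged = shared[~shared["dispatchBaseNum"].isin(lyft_bases)]
print("Flagged trips from Lyft bases (may include unmatched requests):", len(lyft_flagged))
print("Flagged trips from other bases (matched shared rides):", len(other_flagged))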
Azure Notebooks
# This is a package in preview.
from azureml.opendatasets import NycTlcFhv
from datetime import datetime
from dateutil import parser
end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcFhv(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
nyc_tlc_df.info()
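Once loaded, the pandas dataframe can be summarized directly. A minimal sketch, assuming the cell above ran and that pickupDateTime was parsed as a datetime column:
# Count trips per pickup day within the requested date window.
trips_per_day = nyc_tlc_df.groupby(nyc_tlc_df["pickupDateTime"].dt.date).size()
print(trips_per_day.head())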
# Pip install packages
import os, sys
!{sys.executable} -m pip install azure-storage-blob
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas
# Azure storage access info
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = r""
container_name = "nyctlc"
folder_name = "fhv"
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient
if azure_storage_account_name is None or azure_storage_sas_token is None:
raise Exception(
"Provide your specific name and key for your Azure Storage account--see the Prerequisites section earlier.")
print('Looking for the first parquet under the folder ' +
folder_name + ' in container "' + container_name + '"...')
container_url = f"https://{azure_storage_account_name}.blob.core.windows.net/"
blob_service_client = BlobServiceClient(
container_url, azure_storage_sas_token if azure_storage_sas_token else None)
container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs(folder_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
targetBlobName = ''
for blob in sorted_blobs:
if blob.name.startswith(folder_name) and blob.name.endswith('.parquet'):
targetBlobName = blob.name
break
print('Target blob to download: ' + targetBlobName)
_, filename = os.path.split(targetBlobName)
blob_client = container_client.get_blob_client(targetBlobName)
with open(filename, 'wb') as local_file:
blob_client.download_blob().download_to_stream(local_file)
# Read the parquet file into Pandas data frame
import pandas as pd
print('Reading the parquet file into Pandas data frame')
df = pd.read_parquet(filename)
# You can add your filter below (see the example after this cell)
print('Loaded as a Pandas data frame: ')
df
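As an example of such a filter, the snippet below narrows the dataframe using columns from the schema table. This is a minimal sketch, assuming df from the cell above and that the file's column names match the schema (puLocationId, doLocationId, and srFlag are strings).
# Example filter: trips that ended in taxi zone 132 and were not flagged as shared rides.
filtered = df[(df["doLocationId"] == "132") & (df["srFlag"].isnull())]
print("Filtered rows:", len(filtered))
filtered.head()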
Azure Databricks
# This is a package in preview.
# You need to pip install azureml-opendatasets on your Databricks cluster. https://docs.microsoft.com/en-us/azure/data-explorer/connect-from-databricks#install-the-python-library-on-your-azure-databricks-cluster
from azureml.opendatasets import NycTlcFhv
from datetime import datetime
from dateutil import parser
end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcFhv(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()
display(nyc_tlc_df.limit(5))
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "fhv"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
print('Remote blob path: ' + wasbs_path)
# SPARK reads the parquet files; note that no data is loaded yet at this point
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
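With the temporary view registered, Spark SQL can aggregate the full dataset. A minimal sketch, assuming the source view from the cell above and the puYear/puMonth columns listed in the schema table:
# Count trips per year and month via the registered temporary view.
display(spark.sql('SELECT puYear, puMonth, COUNT(*) AS trips FROM source GROUP BY puYear, puMonth ORDER BY puYear, puMonth'))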
Azure Synapse
# This is a package in preview.
from azureml.opendatasets import NycTlcFhv
from datetime import datetime
from dateutil import parser
end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcFhv(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_spark_dataframe()
# Display top 5 rows
display(nyc_tlc_df.limit(5))
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "fhv"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
print('Remote blob path: ' + wasbs_path)
# SPARK reads the parquet files; note that no data is loaded yet at this point
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
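The same data can also be narrowed with the DataFrame API instead of SQL. A minimal sketch, assuming df from the cell above and the puYear/puMonth columns listed in the schema table:
from pyspark.sql import functions as F
# Narrow to a single month using the puYear/puMonth columns, then inspect a few rows.
june_2018 = df.filter((F.col('puYear') == 2018) & (F.col('puMonth') == 6))
display(june_2018.limit(10))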