
NOAA Global Forecast System (GFS)


15-day, hourly US weather forecast data (for example: temperature, precipitation, wind) produced by the Global Forecast System (GFS) from the National Oceanic and Atmospheric Administration (NOAA).

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). Dozens of atmospheric and soil variables are available through this dataset, from temperatures, winds, and precipitation to soil moisture and atmospheric ozone concentration. The GFS covers the entire globe at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which weather forecasters use to predict weather up to 16 days in advance. The horizontal resolution drops to 44 miles (70 kilometers) between grid points for forecasts between one and two weeks out. This dataset is specifically sourced from GFS4.

Volume and retention

This dataset is stored in Parquet format. It is updated daily with 15-day, forward-looking forecast data. As of 2019, it contains about 9 billion rows (200 GB) in total.

This dataset contains historical records accumulated from December 2018 to the present. You can use parameter settings in our SDK to fetch data within a specific time range.
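
For example, here is a minimal sketch of pulling a one-day window through the SDK (the dates are illustrative; the full notebook walkthroughs are in the Access section below):

# Minimal sketch: pull one day of forecasts through the SDK.
# The dates are illustrative; see the Access section for full examples.
from azureml.opendatasets import NoaaGfsWeather
from dateutil import parser

gfs = NoaaGfsWeather(parser.parse('2019-01-01'), parser.parse('2019-01-02'))
gfs_df = gfs.to_pandas_dataframe()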

Storage location

This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.

Additional information

This dataset is sourced from the NOAA Global Forecast System. Additional information about this dataset can be found here and here. If you have any questions about the data source, email the contact address given there.

Notices

MICROSOFT PROVIDES AZURE OPEN DATASETS ON AN "AS IS" BASIS. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, GUARANTEES OR CONDITIONS WITH RESPECT TO YOUR USE OF THE DATASETS. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAW, MICROSOFT DISCLAIMS ALL LIABILITY FOR ANY DAMAGES OR LOSSES, INCLUDING DIRECT, CONSEQUENTIAL, SPECIAL, INDIRECT, INCIDENTAL OR PUNITIVE, RESULTING FROM YOUR USE OF THE DATASETS.

This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.

Access

Available in | When to use
Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Azure Databricks

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Azure Synapse

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

currentDatetime forecastHour latitude longitude precipitableWaterEntireAtmosphere seaLvlPressure temperature windSpeedGustSurface totalCloudCoverConvectiveCloud year month day
11/18/2020 6:00:00 AM 96 -45 90 26.2195 98565.3 281.607 13.7958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 94.5 28.9195 99093.3 282.107 15.7958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 90.5 26.7195 98619.7 281.507 11.7958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 91 27.1195 98629.3 281.607 14.0958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 91.5 27.5195 98685.3 281.306 16.8958 19 2020 11 18
11/18/2020 6:00:00 AM 96 -45 92 28.1195 98750.9 281.206 14.7958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 92.5 28.5195 98821.3 281.107 13.6958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 93 28.7195 98874.1 281.206 13.0958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 93.5 28.8195 98957.3 281.507 14.3958 0 2020 11 18
11/18/2020 6:00:00 AM 96 -45 94 28.5195 99027.7 281.806 15.4958 0 2020 11 18
Name Data type Unique Values (sample) Description
currentDatetime timestamp 2,520 2018-12-08 18:00:00
2018-12-09 18:00:00

The forecast model cycle run time.

day int 31 1
5

Day of currentDatetime.

forecastHour int 129 336
102

Hours since currentDatetime; the forecast or observed time.

latitude double 361 42.0
66.0

Latitude, degrees_north.

longitude double 1,079 81.0
87.0

Longitude, degrees_east.

month int 12 12
9

Month of currentDatetime.

precipitableWaterEntireAtmosphere double 5,168,395 0.5
0.2

Precipitable water in the entire atmosphere. Units: kg.m-2

seaLvlPressure double 8,564,867 101104.0
101096.0

Pressure at the ground or water surface. Units: Pa

snowDepthSurface double 1,209 nan
1.0

Snow depth at the ground or water surface. Units: m

temperature double 5,840,572 273.0
273.1

Temperature at the ground or water surface. Units: K

totalCloudCoverConvectiveCloud double 82 1.0
2.0

Total cloud cover of convective cloud. Units: %

windSpeedGustSurface double 19,064,943 4.5
5.0

Wind speed (gust) at the ground or water surface. Units: m/s

year int 4 2019
2020

Year of currentDatetime.
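
To show how these columns are typically combined, here is a small, illustrative pandas sketch. It assumes a gfs_df frame loaded via the SDK as in the Access examples below; the bounding box and forecast-hour cutoff are arbitrary example values.

# Illustrative use of the schema above; assumes gfs_df was loaded via the
# SDK as in the Access section. Bounds and cutoff are example values only.
subset = gfs_df[
    gfs_df['latitude'].between(30.0, 50.0)
    & gfs_df['longitude'].between(230.0, 300.0)
    & (gfs_df['forecastHour'] <= 24)
].copy()
# temperature is in Kelvin (see the table above); add a Celsius column
subset['temperatureCelsius'] = subset['temperature'] - 273.15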


Azure Notebooks

Language: Python
In [1]:
# This is a package in preview.
from azureml.opendatasets import NoaaGfsWeather

from dateutil import parser


start_date = parser.parse('2018-12-20')
end_date = parser.parse('2018-12-21')
gfs = NoaaGfsWeather(start_date, end_date)
gfs_df = gfs.to_pandas_dataframe()
ActivityStarted, to_pandas_dataframe
Due to size, we only allow getting 1-day data into pandas dataframe! We are taking the latest day: /year=2018/month=12/day=21/
Target paths: ['/year=2018/month=12/day=21/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading GFSWeather/GFSProcessed/year=2018/month=12/day=21/part-00000-tid-570650763889113128-ff3109d0-23cf-4024-a096-63964952b0c7-4397-c000.snappy.parquet under container gfsweatherdatacontainer
Reading GFSWeather/GFSProcessed/year=2018/month=12/day=21/part-00001-tid-570650763889113128-ff3109d0-23cf-4024-a096-63964952b0c7-4398-c000.snappy.parquet under container gfsweatherdatacontainer
...
Reading GFSWeather/GFSProcessed/year=2018/month=12/day=21/part-00199-tid-570650763889113128-ff3109d0-23cf-4024-a096-63964952b0c7-4596-c000.snappy.parquet under container gfsweatherdatacontainer
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=91914.45 [ms]
In [2]:
gfs_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24172560 entries, 0 to 120634
Data columns (total 10 columns):
currentDatetime                      datetime64[ns]
forecastHour                         int32
latitude                             float64
longitude                            float64
precipitableWaterEntireAtmosphere    float64
seaLvlPressure                       float64
snowDepthSurface                     float64
temperature                          float64
windSpeedGustSurface                 float64
totalCloudCoverConvectiveCloud       float64
dtypes: datetime64[ns](1), float64(8), int32(1)
memory usage: 1.9 GB
In [1]:
# Pip install packages
import os, sys

!{sys.executable} -m pip install azure-storage-blob
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas
In [2]:
# Azure storage access info
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = r""
container_name = "gfsweatherdatacontainer"
folder_name = "GFSWeather/GFSProcessed"
In [3]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

if azure_storage_account_name is None or azure_storage_sas_token is None:
    raise Exception(
        "Provide your specific name and key for your Azure Storage account--see the Prerequisites section earlier.")

print('Looking for the first parquet under the folder ' +
      folder_name + ' in container "' + container_name + '"...')
container_url = f"https://{azure_storage_account_name}.blob.core.windows.net/"
blob_service_client = BlobServiceClient(
    container_url, azure_storage_sas_token if azure_storage_sas_token else None)

container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs(folder_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
targetBlobName = ''
for blob in sorted_blobs:
    if blob.name.startswith(folder_name) and blob.name.endswith('.parquet'):
        targetBlobName = blob.name
        break

print('Target blob to download: ' + targetBlobName)
_, filename = os.path.split(targetBlobName)
blob_client = container_client.get_blob_client(targetBlobName)
with open(filename, 'wb') as local_file:
    blob_client.download_blob().download_to_stream(local_file)
In [4]:
# Read the parquet file into Pandas data frame
import pandas as pd

print('Reading the parquet file into Pandas data frame')
df = pd.read_parquet(filename)
In [5]:
# You can add your filter logic below
print('Loaded as a Pandas data frame: ')
df
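
For example, here is one possible filter over the downloaded file, using column names from the schema above (the 20 m/s gust threshold is an arbitrary illustration):

# Example filter; the threshold is illustrative, not part of the original
gusty = df[df['windSpeedGustSurface'] > 20.0]
print('Rows with surface gusts above 20 m/s:', len(gusty))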

Azure Databricks

Language: Python
In [1]:
# This is a package in preview.
# You need to pip install azureml-opendatasets in Databricks cluster. https://docs.microsoft.com/en-us/azure/data-explorer/connect-from-databricks#install-the-python-library-on-your-azure-databricks-cluster
from azureml.opendatasets import NoaaGfsWeather

from dateutil import parser


start_date = parser.parse('2018-12-20')
end_date = parser.parse('2018-12-21')
gfs = NoaaGfsWeather(start_date, end_date)
gfs_df = gfs.to_spark_dataframe()
ActivityStarted, to_spark_dataframe
ActivityCompleted: Activity=to_spark_dataframe, HowEnded=Success, Duration=92636.3 [ms]
In [2]:
display(gfs_df.limit(5))
currentDatetime forecastHour latitude longitude precipitableWaterEntireAtmosphere seaLvlPressure snowDepthSurface temperature windSpeedGustSurface totalCloudCoverConvectiveCloud year month day
2018-12-20T00:00:00.000+0000 0 -90.0 79.0 3.548314332962036 71160.9765625 1.0099999904632568 260.6067810058594 12.820813179016113 null 2018 12 20
2018-12-20T00:00:00.000+0000 0 -90.0 268.0 3.548314332962036 71160.9765625 1.0099999904632568 260.6067810058594 12.820813179016113 null 2018 12 20
2018-12-20T00:00:00.000+0000 0 -89.5 36.5 3.4483141899108887 70757.7734375 1.0099999904632568 258.6067810058594 12.620813369750977 null 2018 12 20
2018-12-20T00:00:00.000+0000 0 -89.5 43.0 3.3483142852783203 70597.7734375 1.0099999904632568 258.3067932128906 12.720812797546387 null 2018 12 20
2018-12-20T00:00:00.000+0000 0 -89.5 144.0 3.248314380645752 69701.7734375 1.0099999904632568 259.50677490234375 12.620813369750977 null 2018 12 20
In [1]:
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "gfsweatherdatacontainer"
blob_relative_path = "GFSWeather/GFSProcessed"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the parquet files with Spark; evaluation is lazy, so no data is loaded yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
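
Once the source view is registered, any Spark SQL can be run against it. For instance, here is a sketch of a daily mean-temperature aggregation (the grouping choice is illustrative, not part of the original walkthrough):

# Example aggregation over the temporary view; the grouping is illustrative
display(spark.sql('''
    SELECT year, month, day, AVG(temperature) AS meanTemperatureK
    FROM source
    GROUP BY year, month, day
    ORDER BY year, month, day
'''))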

Azure Synapse

Language: Python
In [23]:
# This is a package in preview.
from azureml.opendatasets import NoaaGfsWeather

from dateutil import parser


start_date = parser.parse('2018-12-20')
end_date = parser.parse('2018-12-21')
gfs = NoaaGfsWeather(start_date, end_date)
gfs_df = gfs.to_spark_dataframe()
In [24]:
# Display top 5 rows
display(gfs_df.limit(5))
In [1]:
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "gfsweatherdatacontainer"
blob_relative_path = "GFSWeather/GFSProcessed"
blob_sas_token = r""
In [2]:
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the parquet files with Spark; evaluation is lazy, so no data is loaded yet
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
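
As in the Databricks section, the source view accepts arbitrary Spark SQL. Here is a sketch of a bounding-box filter (the coordinates are arbitrary example values):

# Example spatial filter on the view; the coordinates are illustrative
display(spark.sql('''
    SELECT currentDatetime, forecastHour, latitude, longitude, temperature
    FROM source
    WHERE latitude BETWEEN 30 AND 50
      AND longitude BETWEEN 230 AND 300
    LIMIT 10
'''))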