
NOAA Global Forecast System (GFS)


15-day US hourly weather forecast data (for example, temperature, precipitation, wind) produced by the Global Forecast System (GFS) from the National Oceanic and Atmospheric Administration (NOAA).

The Global Forecast System (GFS) is a weather forecast model produced by the National Centers for Environmental Prediction (NCEP). This dataset provides dozens of atmospheric and land-soil variables, from temperature, wind, and precipitation to soil moisture and atmospheric ozone concentration. The GFS covers the entire globe at a base horizontal resolution of 18 miles (28 kilometers) between grid points, which operational forecasters use to predict weather up to 16 days in the future. For forecasts between one week and two weeks out, the horizontal resolution drops to 44 miles (70 kilometers) between grid points. This dataset is sourced specifically from GFS4.

Volume and retention

This dataset is stored in Parquet format and is updated daily with 15-day, forward-looking forecast data. As of 2019, it contains roughly 9 billion rows (200 GB) in total.

This dataset contains historical records accumulated from December 2018 to the present. In our SDK, you can use parameter settings to retrieve data within a specific time range.
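For example, a minimal sketch of such a parameterized retrieval, using the azureml-opendatasets SDK that the samples below rely on (the dates here are illustrative):

from dateutil import parser
from azureml.opendatasets import NoaaGfsWeather

# Illustrative dates; start_date/end_date bound the partitions that are read.
gfs = NoaaGfsWeather(parser.parse('2019-01-01'), parser.parse('2019-01-02'))
gfs_df = gfs.to_pandas_dataframe()  # pandas loads are limited to one day of data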

Storage location

This dataset is stored in the East US Azure region. Allocating compute resources in East US is recommended for affinity.

Additional information

This dataset is sourced from NOAA's Global Forecast System. For additional information about this dataset, see here and here. If you have any questions about the data source, send email to

Notices

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties, express or implied, and no guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental, or punitive damages, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Access

Available in | When to use
Azure Notebooks

Quickly explore the dataset with Jupyter notebooks hosted on Azure or your local machine.

Azure Databricks

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Azure Synapse

Use this when you need the scale of an Azure managed Spark cluster to process the dataset.

Preview

currentDatetime | forecastHour | latitude | longitude | precipitableWaterEntireAtmosphere | seaLvlPressure | snowDepthSurface | temperature | windSpeedGustSurface | year | month | day
2/24/2021 12:00:00 PM | 0 | -90 | 0 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 4.5 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 0.5 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 1 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 1.5 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 2 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 2.5 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 3 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 3.5 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
2/24/2021 12:00:00 PM | 0 | -90 | 4 | 0.604883 | 68805.5 | 1 | 233.113 | 6.31342 | 2021 | 2 | 24
Name | Data type | Unique | Values (sample) | Description
currentDatetime | timestamp | 2,863 | 2018-12-08 12:00:00, 2018-12-08 18:00:00 | The forecast model cycle runtime.
day | int | 31 | 1, 5 | The day of currentDateTime.
forecastHour | int | 129 | 336, 102 | Hours elapsed since currentDatetime; the forecast or observation time.
latitude | double | 361 | 50.0, 13.0 | Latitude, degrees_north.
longitude | double | 1,079 | 124.0, 116.0 | Longitude, degrees_east.
month | int | 12 | 12, 1 | The month of currentDateTime.
precipitableWaterEntireAtmosphere | double | 5,520,555 | 0.5, 1.0 | Precipitable water in the entire atmosphere layer. Units: kg.m-2
seaLvlPressure | double | 8,568,607 | 101088.0, 101080.0 | Pressure at the ground or water surface. Units: Pa
snowDepthSurface | double | 1,245 | nan, 1.0 | Snow depth at the ground or water surface. Units: m
temperature | double | 5,840,727 | 273.0, 273.1 | Temperature at the ground or water surface. Units: K
totalCloudCoverConvectiveCloud | double | 82 | 1.0, 2.0 | Total cloud cover of the convective cloud layer. Units: %
windSpeedGustSurface | double | 19,188,997 | 4.5, 5.0 | Wind speed (gust) at the ground or water surface. Units: m/s
year | int | 5 | 2019, 2020 | The year of currentDateTime.
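As an illustration of working with these columns, a sketch assuming a pandas frame gfs_df loaded via the SDK samples below (the conversion factors are standard; the bounding box is only an example):

# Hypothetical post-processing; column names and units come from the schema above.
gfs_df['temperatureC'] = gfs_df['temperature'] - 273.15        # K -> degrees Celsius
gfs_df['gustMph'] = gfs_df['windSpeedGustSurface'] * 2.23694   # m/s -> mph
# longitude is degrees_east (0-360), so a rough CONUS box is about 235-294 east
conus = gfs_df[gfs_df['latitude'].between(24, 50) & gfs_df['longitude'].between(235, 294)]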

Select your preferred service:

Azure Notebooks

Azure Databricks

Azure Synapse

Azure Notebooks

Package: Language: Python
In [1]:
# This is a package in preview.
from azureml.opendatasets import NoaaGfsWeather

from dateutil import parser


start_date = parser.parse('2018-12-20')
end_date = parser.parse('2018-12-21')
gfs = NoaaGfsWeather(start_date, end_date)
gfs_df = gfs.to_pandas_dataframe()
ActivityStarted, to_pandas_dataframe
Due to size, we only allow getting 1-day data into pandas dataframe! We are taking the latest day: /year=2018/month=12/day=21/
Target paths: ['/year=2018/month=12/day=21/']
Looking for parquet files...
Reading them into Pandas dataframe...
Reading GFSWeather/GFSProcessed/year=2018/month=12/day=21/part-00000-tid-570650763889113128-ff3109d0-23cf-4024-a096-63964952b0c7-4397-c000.snappy.parquet under container gfsweatherdatacontainer
Reading GFSWeather/GFSProcessed/year=2018/month=12/day=21/part-00001-tid-570650763889113128-ff3109d0-23cf-4024-a096-63964952b0c7-4398-c000.snappy.parquet under container gfsweatherdatacontainer
...
Reading GFSWeather/GFSProcessed/year=2018/month=12/day=21/part-00199-tid-570650763889113128-ff3109d0-23cf-4024-a096-63964952b0c7-4596-c000.snappy.parquet under container gfsweatherdatacontainer
Done.
ActivityCompleted: Activity=to_pandas_dataframe, HowEnded=Success, Duration=91914.45 [ms]
In [2]:
gfs_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24172560 entries, 0 to 120634
Data columns (total 10 columns):
currentDatetime                      datetime64[ns]
forecastHour                         int32
latitude                             float64
longitude                            float64
precipitableWaterEntireAtmosphere    float64
seaLvlPressure                       float64
snowDepthSurface                     float64
temperature                          float64
windSpeedGustSurface                 float64
totalCloudCoverConvectiveCloud       float64
dtypes: datetime64[ns](1), float64(8), int32(1)
memory usage: 1.9 GB
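At about 24 million rows, the single-day frame already uses roughly 1.9 GB; one way to shrink it (a sketch, not part of the SDK) is to downcast the float64 columns to float32:

import numpy as np

float_cols = gfs_df.select_dtypes(include='float64').columns
gfs_df[float_cols] = gfs_df[float_cols].astype(np.float32)  # roughly halves memory for these columns
gfs_df.info()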
In [1]:
# Pip install packages
import os, sys

!{sys.executable} -m pip install azure-storage-blob
!{sys.executable} -m pip install pyarrow
!{sys.executable} -m pip install pandas
In [2]:
# Azure storage access info
azure_storage_account_name = "azureopendatastorage"
azure_storage_sas_token = r""
container_name = "gfsweatherdatacontainer"
folder_name = "GFSWeather/GFSProcessed"
In [3]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

if azure_storage_account_name is None or azure_storage_sas_token is None:
    raise Exception(
        "Provide your specific name and key for your Azure Storage account--see the Prerequisites section earlier.")

print('Looking for the first parquet under the folder ' +
      folder_name + ' in container "' + container_name + '"...')
container_url = f"https://{azure_storage_account_name}.blob.core.windows.net/"
blob_service_client = BlobServiceClient(
    container_url, azure_storage_sas_token if azure_storage_sas_token else None)

container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs(folder_name)
sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)
targetBlobName = ''
for blob in sorted_blobs:
    if blob.name.startswith(folder_name) and blob.name.endswith('.parquet'):
        targetBlobName = blob.name
        break

print('Target blob to download: ' + targetBlobName)
_, filename = os.path.split(targetBlobName)
blob_client = container_client.get_blob_client(targetBlobName)
with open(filename, 'wb') as local_file:
    blob_client.download_blob().download_to_stream(local_file)
In [4]:
# Read the parquet file into Pandas data frame
import pandas as pd

print('Reading the parquet file into Pandas data frame')
df = pd.read_parquet(filename)
In [5]:
# You can add your filter below
print('Loaded as a Pandas data frame: ')
df
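For instance, a hypothetical filter that keeps only the analysis hour in the northern hemisphere (column names from the schema above):

df_filtered = df[(df['forecastHour'] == 0) & (df['latitude'] > 0.0)]
df_filtered.head()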

Azure Databricks

Package: Language: Python
In [1]:
# This is a package in preview.
# You need to pip install azureml-opendatasets in Databricks cluster. https://docs.microsoft.com/en-us/azure/data-explorer/connect-from-databricks#install-the-python-library-on-your-azure-databricks-cluster
from azureml.opendatasets import NoaaGfsWeather

from dateutil import parser


start_date = parser.parse('2018-12-20')
end_date = parser.parse('2018-12-21')
gfs = NoaaGfsWeather(start_date, end_date)
gfs_df = gfs.to_spark_dataframe()
ActivityStarted, to_spark_dataframe
ActivityCompleted: Activity=to_spark_dataframe, HowEnded=Success, Duration=92636.3 [ms]
In [2]:
display(gfs_df.limit(5))
currentDatetime | forecastHour | latitude | longitude | precipitableWaterEntireAtmosphere | seaLvlPressure | snowDepthSurface | temperature | windSpeedGustSurface | totalCloudCoverConvectiveCloud | year | month | day
2018-12-20T00:00:00.000+0000 | 0 | -90.0 | 79.0 | 3.548314332962036 | 71160.9765625 | 1.0099999904632568 | 260.6067810058594 | 12.820813179016113 | null | 2018 | 12 | 20
2018-12-20T00:00:00.000+0000 | 0 | -90.0 | 268.0 | 3.548314332962036 | 71160.9765625 | 1.0099999904632568 | 260.6067810058594 | 12.820813179016113 | null | 2018 | 12 | 20
2018-12-20T00:00:00.000+0000 | 0 | -89.5 | 36.5 | 3.4483141899108887 | 70757.7734375 | 1.0099999904632568 | 258.6067810058594 | 12.620813369750977 | null | 2018 | 12 | 20
2018-12-20T00:00:00.000+0000 | 0 | -89.5 | 43.0 | 3.3483142852783203 | 70597.7734375 | 1.0099999904632568 | 258.3067932128906 | 12.720812797546387 | null | 2018 | 12 | 20
2018-12-20T00:00:00.000+0000 | 0 | -89.5 | 144.0 | 3.248314380645752 | 69701.7734375 | 1.0099999904632568 | 259.50677490234375 | 12.620813369750977 | null | 2018 | 12 | 20
In [1]:
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "gfsweatherdatacontainer"
blob_relative_path = "GFSWeather/GFSProcessed"
blob_sas_token = r""
In [2]:
# Allow Spark to read from the blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet files with Spark; no data is loaded yet because evaluation is lazy
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
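Building on the temporary view registered above, an illustrative aggregation (not part of the original sample) that averages surface temperature per model cycle:

display(spark.sql("""
    SELECT currentDatetime, AVG(temperature) AS avgTemperature
    FROM source
    GROUP BY currentDatetime
    ORDER BY currentDatetime
    LIMIT 10
"""))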

Azure Synapse

Package: Language: Python
In [23]:
# This is a package in preview.
from azureml.opendatasets import NoaaGfsWeather

from dateutil import parser


start_date = parser.parse('2018-12-20')
end_date = parser.parse('2018-12-21')
gfs = NoaaGfsWeather(start_date, end_date)
gfs_df = gfs.to_spark_dataframe()
In [24]:
# Display top 5 rows
display(gfs_df.limit(5))
In [1]:
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "gfsweatherdatacontainer"
blob_relative_path = "GFSWeather/GFSProcessed"
blob_sas_token = r""
In [2]:
# Allow Spark to read from the blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)
In [3]:
# Read the Parquet files with Spark; no data is loaded yet because evaluation is lazy
df = spark.read.parquet(wasbs_path)
print('Register the DataFrame as a SQL temporary view: source')
df.createOrReplaceTempView('source')
In [4]:
# Display top 10 rows
print('Displaying top 10 rows: ')
display(spark.sql('SELECT * FROM source LIMIT 10'))
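Because the files are laid out in /year=.../month=.../day=... partitions (visible in the paths logged earlier), filtering on those columns lets Spark prune whole directories instead of scanning the full 200 GB. A sketch, with the date chosen only as an example:

# Partition pruning: only the files under year=2018/month=12/day=21 are read.
df_day = spark.read.parquet(wasbs_path).filter('year = 2018 AND month = 12 AND day = 21')
print(df_day.count())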