This repository provides the preprocessed datasets used in the study "Temperature forecasting by deep learning methods" by Gong et al. (2022). It allows users to reproduce the presented results without running the preprocessing chain on the raw ERA5 data.
Data description
The datasets used to train, validate, and test the deep neural networks are based on the ERA5 reanalysis data provided by the European Centre for Medium-Range Weather Forecasts (ECMWF). Five different datasets have been created. All incorporate data between the years 2007 and 2019, but they cover slightly varying domains over Central Europe and include different meteorological variables.
The datasets are made available as compressed tar archives (see Storage Location URL below). The file names encode some meta-information using the following naming convention:
ERA5-Y[yyyy]-[yyyy]M[mm]to[mm]-[nx]x[ny]-[nn.nn]N[ee.ee]E-[var1]_[var2]_[var3]
where
- Y[yyyy]-[yyyy]M[mm]to[mm] denotes the years and the months describing the data period,
- [nx]x[ny] is the number of grid points/pixels of the target domain in longitude and latitude direction,
- [nn.nn]N[ee.ee]E stands for the geographical coordinates (in degrees) of the target domain's south-west corner, and
- [var1]_[var2]_[var3] denotes the short names of the variables according to ECMWF's parameter database.
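As an illustration, the naming convention can be parsed programmatically. The following sketch (the regular expression and the returned field names are our own, not part of the dataset) extracts the encoded metadata from one of the archive names listed below; note that the coordinates are encoded without a decimal point, e.g. 3840N corresponds to 38.40°N:

```python
import re

# Pattern for the naming convention described above (our own construction).
PATTERN = re.compile(
    r"era5-Y(?P<y0>\d{4})-(?P<y1>\d{4})"
    r"M(?P<m0>\d{2})to(?P<m1>\d{2})"
    r"-(?P<nx>\d+)x(?P<ny>\d+)"
    r"-(?P<lat>\d{4})N(?P<lon>\d{4})E"
    r"-(?P<vars>.+)\.tar\.bz2"
)

def parse_archive_name(name):
    """Extract the metadata encoded in an archive file name."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized archive name: {name}")
    g = m.groupdict()
    return {
        "years": (int(g["y0"]), int(g["y1"])),
        "months": (int(g["m0"]), int(g["m1"])),
        "grid": (int(g["nx"]), int(g["ny"])),
        # Coordinates are encoded without the decimal point: 3840N -> 38.40 deg N.
        "sw_corner": (int(g["lat"]) / 100.0, int(g["lon"]) / 100.0),
        "variables": g["vars"],
    }

meta = parse_archive_name(
    "era5-Y2007-2019M01to12-92x56-3840N0000E-2t_tcc_t850.tar.bz2"
)
print(meta["grid"], meta["sw_corner"])  # (92, 56) (38.4, 0.0)
```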
In particular, the following datasets are provided:
1) era5-Y2007-2019M01to12-92x56-3840N0000E-2t_tcc_t850.tar.bz2: The target domain extends from 38.4°N to 54.9°N and 0.0°E to 27.3°E (92x56 grid points). The 2m-temperature (2t), the total cloud cover (tcc), and the 850 hPa temperature (t850) are included as variables. This data corresponds to Dataset IDs 1-3 in Table A1 of the manuscript.
2) era5-Y2007-2019M01to12-80x48-3960N0180E-2t_tcc_t850.tar.bz2: The target domain extends from 39.6°N to 53.7°N and 1.8°E to 25.5°E (80x48 grid points). The 2t, tcc, and t850 are included as variables. This data corresponds to Dataset ID 4 in Table A1 of the manuscript.
3) era5-Y2007-2019M01to12-72x44-4020N0300E-2t_tcc_t_850.tar.bz2: The target domain extends from 40.2°N to 53.1°N and 3.0°E to 24.3°E (72x44 grid points). The 2t, tcc, and t_850 are included as variables. This data corresponds to Dataset ID 5 in Table A1 of the manuscript.
4) era5-Y2007-2019M01to12-80x48-3960N0180E-2t_t850.tar.bz2: The target domain extends from 39.6°N to 53.7°N and 1.8°E to 25.5°E (80x48 grid points). The 2t and the t850 are the only variables included. This dataset is a subset of No. 2. This data corresponds to Dataset ID 6 in Table A1 of the manuscript.
5) era5-Y2007-2019M01to12-80x48-3960N0180E-2t.tar.bz2: The target domain extends from 39.6°N to 53.7°N and 1.8°E to 25.5°E (80x48 grid points). Only 2t is included. This dataset is also a subset of No. 2. This data corresponds to Dataset ID 7 in Table A1 of the manuscript.
Data creation
The original ERA5 data can be retrieved from the MARS archive. Once access is granted, the data can be downloaded by specifying a resolution of 0.3° in the retrieval script.
The datasets provided in this repository are the processed ERA5 data after the extraction step and the two preprocessing steps of the Atmospheric Machine learning Benchmarking System (AMBS) workflow tool (more details are provided in the README of the corresponding code repository). The data is available in the TFRecord format, which is used directly in the training step.
Data access and decompression
Data are stored in the archived and compressed tar.bz2 format and are available via:
https://datapub.fz-juelich.de/esde/esde-nfs/online_publication/2mT_by_DL/
After downloading, the compressed archives can be unpacked on Linux using
tar xjf [filename].tar.bz2
On Windows, decompression can be performed using WinZip.
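As a platform-independent alternative to the commands above, the archives can also be unpacked with Python's standard library. The sketch below builds a tiny throwaway bz2-compressed tar archive on the fly purely to demonstrate the extraction step; the file names inside it are placeholders, not the real dataset contents:

```python
import io
import tarfile
import tempfile
from pathlib import Path

def extract_archive(archive_path, target_dir):
    """Unpack a .tar.bz2 archive, equivalent to 'tar xjf <archive>'."""
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        tar.extractall(path=target_dir)

# Demonstration with a throwaway archive (not one of the ERA5 files).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "demo.tar.bz2"
    with tarfile.open(archive, mode="w:bz2") as tar:
        payload = b"dummy payload"
        info = tarfile.TarInfo(name="pickle/X_01.pkl")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    extract_archive(archive, tmp / "out")
    print((tmp / "out" / "pickle" / "X_01.pkl").read_bytes())
```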
Dataset content
After decompressing, the following subdirectory structure is created from each compressed tar-archive:
- tfrecords_seq_len_[sequence_length]: This folder holds the TFRecord files that are streamed to the deep neural networks during training and postprocessing. Each TFRecord file contains 10 samples, where each sample comprises a sequence over [sequence_length] hours.
- pickle: This folder contains the normalized hourly data saved in monthly pickle files (X_[month].pkl). The corresponding timestamps are included in T_[month].pkl. Furthermore, statistical information for each month is provided in the files stat_[month].json.
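The monthly pickle files can be inspected with Python's standard pickle module. The loader below is a sketch only: it assumes plain pickle serialization and a particular month-numbering in the file names ("X_01.pkl" etc.), neither of which is guaranteed by this description, and it demonstrates the round trip with synthetic stand-in data rather than the real ERA5 arrays:

```python
import pickle
import tempfile
from pathlib import Path

def load_month(data_dir, month):
    """Load normalized data and timestamps for one month.

    Assumes plain pickle files named X_<month>.pkl / T_<month>.pkl
    (the exact naming is an assumption, see the lead-in above).
    """
    data_dir = Path(data_dir)
    with open(data_dir / f"X_{month:02d}.pkl", "rb") as f:
        x = pickle.load(f)
    with open(data_dir / f"T_{month:02d}.pkl", "rb") as f:
        t = pickle.load(f)
    return x, t

# Demonstration with synthetic stand-in data (not the real ERA5 arrays).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    with open(tmp / "X_01.pkl", "wb") as f:
        pickle.dump([[0.1, 0.2], [0.3, 0.4]], f)
    with open(tmp / "T_01.pkl", "wb") as f:
        pickle.dump(["2007-01-01T00:00", "2007-01-01T01:00"], f)
    x, t = load_month(tmp, 1)
    print(len(x), len(t))  # 2 2
```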
- metadata.json: This file provides important meta-information, including the coordinates of the target domain, the included variables (e.g. 2t and t_850), and the origin of the processed data.
- statistic.json: This file includes the statistical information (maximum, minimum, and average values) used for normalizing the data. It also includes other information such as the total number of timestamps (nfiles) and the list of JSON files (stat_[month].json) used to compute the statistics.
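The exact normalization applied by the AMBS workflow is documented in the code repository. Purely as an illustration of how the maximum and minimum values from the statistics file could be used, the following sketches a min-max normalization (the function and its key assumptions are ours, not AMBS's implementation):

```python
def minmax_normalize(values, vmin, vmax):
    """Scale values to [0, 1] given min/max from a statistics file.

    Illustrative sketch only; AMBS's actual normalization may differ.
    """
    span = vmax - vmin
    if span == 0:
        raise ValueError("degenerate statistics: max == min")
    return [(v - vmin) / span for v in values]

# Example with 2m-temperature values in Kelvin (numbers are made up).
normalized = minmax_normalize([250.0, 275.0, 300.0], vmin=250.0, vmax=300.0)
print(normalized)  # [0.0, 0.5, 1.0]
```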
Data integrity and verification
The tar archives have been recursively checksummed with the md5 hash function. The generated checksum file is uploaded alongside the data to ensure the integrity of the files and that the dataset has not been altered. To verify the integrity of the downloaded data, run the following in the download directory:
find -type f -exec md5sum '{}' \; > md5sum.txt
This generates a single text file that should be identical to the checksum file provided in this entry.
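On systems without the md5sum utility, the same checksums can be computed with Python's hashlib. The helper below is a sketch; the demonstration hashes a throwaway file rather than one of the archives:

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 checksum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demonstration with a throwaway file (not one of the archives).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")
print(md5_of_file(path))  # 5d41402abc4b2a76b9719d911017c592
os.remove(path)
```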
License
Original data by ECMWF: "© 2022 European Centre for Medium-Range Weather Forecasts (ECMWF)". Source: www.ecmwf.int. This data is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) license:
https://creativecommons.org/licenses/by/4.0/
Contact
Bing Gong (b.gong@fz-juelich.de)