Ensemble numerical summary measures with GEFS

As Atmospheric Data Scientists, we want to perform some statistical methods on the GEFS members with the intent that we can observe how the behavior and statistical features of the atmosphere change by various initial conditions and disturbances. This blog is going to examine the Numerical Summary Measures of the GEFS data.

# Importing required libraries
import xarray as xr # dealing with atmospheric data
import numpy as np # dealing with arrays
from visjobs.datas import get_MODEL # getting model data
import rioxarray # dealing with raster files
import geopandas as gpd # dealing with shapefiles
from shapely.geometry import mapping # dealing with shapefiles 
from scipy import stats # statistics

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

Get GEFS data

Visjobs is an exceptional library for fetching the latest atmospheric model datasets remotely. We created this hilarious library for those atmospheric scientists struggling or who do not want to download the atmospheric model outputs. With a minimum number of parameters, Visjobs will bring you the data you desire.

We are going to define only the hour parameter, the rest is going to be handled by latest=True parameter in visjobs :)

# Define date parameters

hour = '00' 

# so easily getting the latest GEFS data
data  = get_MODEL.pick_data(hour   = '00',
                            latest = True,
                            model  = 'GEFS',
                            chunks = {'time': -1,
                                      'lon' : 80,
                                      'lat' : 80,
                                      'lev' : -1 })
# look at inside of the data
data
Addressing Data:  http://nomads.ncep.noaa.gov:80/dods/gefs/gefs20211018/gefs_pgrb2bp5_all_00z
Connected GEFS Data via OpenDAP
<xarray.Dataset>
Dimensions:         (ens: 31, time: 65, lev: 31, lat: 361, lon: 720)
Coordinates:
  * ens             (ens) float64 1.0 2.0 3.0 4.0 5.0 ... 28.0 29.0 30.0 31.0
  * time            (time) datetime64[ns] 2021-10-18 ... 2021-11-03
  * lev             (lev) float64 1e+03 975.0 950.0 925.0 ... 5.0 3.0 2.0 1.0
  * lat             (lat) float64 -90.0 -89.5 -89.0 -88.5 ... 89.0 89.5 90.0
  * lon             (lon) float64 0.0 0.5 1.0 1.5 ... 358.0 358.5 359.0 359.5
Data variables: (12/279)
    absvprs         (ens, time, lev, lat, lon) float32 dask.array<chunksize=(31, 65, 31, 80, 80), meta=np.ndarray>
    no4lftxsfc      (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    no5wavh500mb    (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    acpcpsfc        (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    albdosfc        (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    aptmp2m         (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    ...              ...
    vwshneg1p5pv    (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    vwsh2pv         (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    vwshneg2pv      (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    vwshtrop        (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    watrsfc         (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
    wiltsfc         (ens, time, lat, lon) float32 dask.array<chunksize=(31, 65, 80, 80), meta=np.ndarray>
Attributes:
    title:        ALL GFS Ensemble .5 Degree (Additional Parms) starting from...
    Conventions:  COARDS\nGrADS
    dataType:     Grid
    history:      Mon Oct 18 18:35:45 UTC 2021 : imported by GrADS Data Serve...

We have just used Xarray’s Dask feature to fetch the remote data lazily. Dask will bring you lots of ease while dealing with big-sized datasets.

Quick look at inside the GEFS data

# Some of the variables
for i, (key, value) in enumerate(data.data_vars.items()):
    print(key)
    
    if i == 5: break
    
print('\nvariable length is: ', len(data.data_vars))
absvprs
no4lftxsfc
no5wavh500mb
acpcpsfc
albdosfc
aptmp2m

variable length is:  279
# Dimensions of the dataset
data.dims
Frozen({'ens': 31, 'time': 65, 'lev': 31, 'lat': 361, 'lon': 720})

Here, the ens dimension corresponds to the 31 ensemble members of the GEFS data. That means we have 31 unique disturbed model runs in total.

Shrink the data

# Let's choose the variables we are going to study with
variables = ['tmpsfc', 'gustsfc']
time_range = 30

# shrink lat lon to nearest Turkey values
lat_range = [34, 44]
lon_range = [24, 47]

data_shrinked = data[variables].isel(time = slice(0,time_range))\
                                .sel(lat  = slice(lat_range[0], lat_range[1]),
                                     lon  = slice(lon_range[0], lon_range[1]))
# quick look
data_shrinked['tmpsfc'][0, 0, :, :].plot()
<matplotlib.collections.QuadMesh at 0x1b548772c10>
../_images/numerical-summary-measures-with-gefs_14_1.png

Clip data into the shape of Turkey

The data itself is big-sized that we can not, remotely, deal with the whole part of it. Thus, we are going to clip the data into the shape of Turkey.

# check for the dataset reference system
data_shrinked.rio.crs

Notice that nothing appeared that means the data does not belong to any geographic reference system yet. We are going to assign WGS-84 as the CRS of the GEFS data. For more info on WGS-84, visit: https://en.wikipedia.org/wiki/World_Geodetic_System#WGS84

# assign WGS-84 (epsg:4326)
data_shrinked = data_shrinked.rio.write_crs('epsg:4326')
# now check the reference system
data_shrinked.rio.crs
CRS.from_epsg(4326)
# Open the shapefile data on which the data is going to be clipped.
turkey_shape = r'C:/Users/USER/JupyterLab/AFAD_PROJECT/Codes/Threshold_Analysis/datasets/Shapefiles/tr_iller_ilceler_adjusted/tr_iller_ilceler2.shp'
shpdata = gpd.read_file(turkey_shape)
# tell rioxarray that our x and y dimensions are those:
data_shrinked = data_shrinked.rio.set_spatial_dims(x_dim='lon', y_dim='lat')
# clip the data into Turkey shapefile
clipped_data = data_shrinked.rio.clip(shpdata.geometry.apply(mapping), shpdata.crs, drop=True)

Numerical summary measures

A bunch of the numerical summary measure methods (location, spread, skewness, etc.) is defined in various statistical textbooks (Statistical Methods in Atmospheric Sciences)
Here, we are going to calculate and compare some of them:

  • For location measures: mean, 50th percentile (0.5 quantiles), and trimean.

  • For spread measures: IQR, standard deviation, median absolute deviation.

  • For Skewness measures: Yule-Kendall skewness

Define

# Class for calculating numerical summary measures
class numericalSummaryMeasures:
    """
    Calculates location, spread, symmetry and kurtosis over ensemble members.
    """
    
    def __init__(self, clippedData):
        self.data = clippedData
        # quantiles
        self.q025 = self.data.quantile(0.25, dim='ens')
        self.q05 = self.data.quantile(0.5, dim='ens')
        self.q075 = self.data.quantile(0.75, dim='ens')
        
    def location(self, ):
        """
        Calculates mean of the given dataset with location methods.
        """
        # mean
        mean = self.data.mean(dim='ens')
        # trimean
        trimean = (self.q025 + 2*self.q05 + self.q075) / 4
        
        # attributes
        mean.attrs['NSM']     = 'Mean'
        self.q05.attrs['NSM'] = 'Median'
        trimean.attrs['NSM']  = 'Trimean'
        
        return mean, self.q05, trimean.drop('quantile')
    
    def spread(self, ):
        """
        Calculates spread of the given dataset with spread methods.
        """
        # IQR
        iqr = self.q075 - self.q025
        
        # Std Deviation
        std = self.data.std(dim='ens')
        
        # Median Absolute Deviation
        xi_minus_q05_abs = np.abs(self.data - self.q05)
        mad = xi_minus_q05_abs.quantile(0.5, dim='ens')
        
        # attributes
        iqr.attrs['NSM'] = 'Interquantile Range (IQR)'
        std.attrs['NSM'] = 'Standard Deviation'
        mad.attrs['NSM'] = 'Median Absolute Deviation'

        return iqr, std, mad.drop('quantile')
    
    def symmetry_curtosis(self, ):
        """
        Calculates symmetry-curtosis of the given dataset.
        """
        # yule-kendal skewness coefficient
        YK = (self.q025 - 2*self.q05 + self.q075) / (self.q075 - self.q025)
        
        # attributes
        YK.attrs['NSM'] = 'Yule-Kendall Skewness'
        
        return YK.drop('quantile')

Calculate

Calculate NSMs using defined class

# compute NSM

# initiate instance
nsm      = numericalSummaryMeasures(clipped_data)

mean, q05, trimean = nsm.location()
iqr, std, mad      = nsm.spread()
yk                 = nsm.symmetry_curtosis()

NSM Plots

Without plotting the results we will not be able to see how ensemble members differ from each other. Let’s see their spatial features.

# variable to be plotted
variable = 'tmpsfc'

Location

Mean

# do not plot whole time range but 1 in 5.
mean[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                               col_wrap=3, cmap = 'rainbow',
                               robust=True,
                               cbar_kwargs={
                                    "label": "Temperature (K)",
                                     })
<xarray.plot.facetgrid.FacetGrid at 0x1b5471ba340>
../_images/numerical-summary-measures-with-gefs_36_1.png

Median

# do not plot whole time range but 1 in 5.
q05[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                              col_wrap=3, cmap = 'rainbow',
                              robust=True,
                              cbar_kwargs={
                                   "label": "Temperature (K)",
                                    })
<xarray.plot.facetgrid.FacetGrid at 0x1b5498ade20>
../_images/numerical-summary-measures-with-gefs_38_1.png

Trimean

# do not plot whole time range but 1 in 5.
trimean[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                              col_wrap=3, cmap = 'rainbow',
                              robust=True,
                              cbar_kwargs={
                                   "label": "Temperature (K)",
                                    })
<xarray.plot.facetgrid.FacetGrid at 0x1b54e221e20>
../_images/numerical-summary-measures-with-gefs_40_1.png

Realize the daily surface temperature cycle! :)

Spread

IQR

# do not plot whole time range but 1 in 5.
iqr[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                              col_wrap=3, cmap = 'rainbow',
                              vmin = 0, vmax = 2.5,
                              robust=True,
                              cbar_kwargs={
                                   "label": "IQR (K)",
                                    })
<xarray.plot.facetgrid.FacetGrid at 0x1b54d12ff70>
../_images/numerical-summary-measures-with-gefs_44_1.png

Standard Deviation

# do not plot whole time range but 1 in 5.
std[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                              col_wrap=3, cmap = 'rainbow',
                              vmin = 0, vmax = 2.5,
                              robust=True,
                              cbar_kwargs={
                                   "label": "std (K)",
                                    })
<xarray.plot.facetgrid.FacetGrid at 0x1b54b977df0>
../_images/numerical-summary-measures-with-gefs_46_1.png

Median Absolute Deviation

# do not plot whole time range but 1 in 5.
mad[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                              col_wrap=3, cmap = 'rainbow',
                              vmin = 0, vmax = 2.5,
                              robust=True,
                              cbar_kwargs={
                                   "label": "mad (K)",
                                    })
<xarray.plot.facetgrid.FacetGrid at 0x1b5540defd0>
../_images/numerical-summary-measures-with-gefs_48_1.png

Agreement between ensemble members is accumulating by time, right?
Also, realize the daily cycle of the spread!

Skewness

Yule-Kendall

# do not plot whole time range but 1 in 5.
yk[variable][::5, :, :].plot(x="lon", y="lat", col="time",
                             col_wrap=3, cmap = 'rainbow',
                             robust=True,
                             cbar_kwargs={
                                  "label": "YK skewness",
                                   })
<xarray.plot.facetgrid.FacetGrid at 0x1b54992ea60>
../_images/numerical-summary-measures-with-gefs_52_1.png

Hmm, this plot is colorful!
What could mean the negative and positive values of skewness?

It is Over!

Now, we know how to deal with ensemble datasets and calculate some of the very important numerical summary measures.
It is your turn to interpret what these plots mean??


Blog by:
Kutay DÖNMEZ : LinkedIn | Github | Twitter | Instagram
Berkay DÖNMEZ : Linkedin | Github | Twitter | Instagram

ABOUT US