Python Library for Accessing Modern Climate Data and Machine Learning Models


Extreme weather has become commonplace in recent years. Climate change is the main driver of such events, from the torrential downpours that submerged large portions of Pakistan to the exceptional heat waves that fueled wildfires across Portugal and Spain. If the proper actions are not taken soon, the Earth’s average surface temperature is predicted to rise by about four degrees Celsius by the end of the century. According to scientists, this temperature rise will make extreme weather events even more frequent.

General circulation models (GCMs) are the tools scientists use to forecast future weather and climate. A GCM is a system of differential equations that can be integrated over time to produce forecasts for variables such as temperature, wind speed, and precipitation. These models are well understood and produce reasonably accurate results. However, running the simulations requires significant computational power, and the models become difficult to fine-tune when a large amount of training data is available.

This is where machine learning techniques prove useful. Particularly in “weather forecasting” and “spatial downscaling,” these algorithms have proven competitive with more established climate models. Weather forecasting refers to predicting future values of climate variables. For instance, given the daily rainfall (in cm) in Meghalaya for the previous week, the task is to forecast the rainfall for the upcoming week. Spatial downscaling is the problem of refining spatially coarse climate model projections to a finer resolution, for instance, from a grid of 100 km x 100 km to 1 km x 1 km.
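The two tasks above can be framed as supervised learning problems. The following is a minimal, dependency-free sketch rather than ClimateLearn's own code: the function names `make_forecast_pairs` and `coarsen` and all sample values are hypothetical. It shows how a rainfall series can be cut into (past week, next week) training pairs, and how a fine grid can be block-averaged into a coarse one so that a downscaling model has a coarse input and a fine target.

```python
def make_forecast_pairs(series, history, horizon):
    """Split a 1-D series into (past window, future window) training pairs."""
    pairs = []
    for start in range(len(series) - history - horizon + 1):
        past = series[start:start + history]
        future = series[start + history:start + history + horizon]
        pairs.append((past, future))
    return pairs


def coarsen(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2-D grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Hypothetical daily rainfall (cm) for three weeks.
rainfall = [float(day) for day in range(21)]
# Each pair maps 7 observed days to the 7 days that follow them.
pairs = make_forecast_pairs(rainfall, history=7, horizon=7)

# Hypothetical 4 x 4 fine-resolution field. A downscaling model would
# receive coarse_field as input and be trained to recover fine_field.
fine_field = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 6.0, 6.0],
    [2.0, 2.0, 6.0, 6.0],
]
coarse_field = coarsen(fine_field, factor=2)
```

In practice the coarse and fine fields often come from different datasets or model runs; block averaging is only one simple way to build such pairs for experimentation.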

Forecasting and downscaling are analogous to a variety of computer vision (CV) tasks. However, the main distinction from other CV tasks is that the machine learning model must use exogenous inputs in several modalities. For instance, future surface temperatures depend not only on historical surface temperatures but also on factors such as humidity and wind speed. These variables must be provided as inputs to the model alongside surface temperatures.
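One common way to supply several variables to an image-style model is to stack them as channels, much like the red, green, and blue channels of a picture. The sketch below illustrates that idea under stated assumptions: the helper `stack_channels`, the variable names, and the grid values are all hypothetical, and plain nested lists stand in for the array type a real pipeline would use.

```python
def stack_channels(fields, variable_order):
    """Return a (channels, lat, lon) nested list, one channel per variable."""
    expected_shape = None
    stacked = []
    for name in variable_order:
        grid = fields[name]
        shape = (len(grid), len(grid[0]))
        if expected_shape is None:
            expected_shape = shape
        elif shape != expected_shape:
            raise ValueError(
                f"{name} has shape {shape}, expected {expected_shape}"
            )
        # Copy rows so the stacked input does not alias the source grids.
        stacked.append([list(row) for row in grid])
    return stacked


# Hypothetical 2 x 3 (lat x lon) grids for three variables.
fields = {
    "surface_temperature": [[288.0, 289.5, 290.1], [285.2, 286.0, 287.3]],
    "relative_humidity": [[0.61, 0.58, 0.55], [0.70, 0.72, 0.69]],
    "wind_speed": [[3.2, 4.1, 2.8], [5.0, 4.4, 3.9]],
}
order = ["surface_temperature", "relative_humidity", "wind_speed"]

model_input = stack_channels(fields, order)
```

Fixing the variable order matters because the model learns to associate each channel index with one physical quantity.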

Deep learning research has exploded in recent years, and researchers in both machine learning and climate science are now investigating how deep learning techniques might address weather forecasting and spatial downscaling. The two communities take contrasting approaches to applying machine learning. Machine learning researchers place more emphasis on which architectures suit which problems and on how to process data in a way that suits modern machine learning methods, whereas climate scientists rely more on physical equations and keep the necessary evaluation metrics in mind.

However, ambiguous terminology (“bias” in climate modeling versus “bias” in machine learning), a lack of standardization in how machine learning is applied to climate science problems, and limited expertise in analyzing climate data have kept the two communities from realizing their full potential. To address these issues, researchers at the University of California, Los Angeles (UCLA) have developed ClimateLearn, a Python package that provides easy, standardized access to large climate datasets and cutting-edge machine learning models. The package offers a variety of datasets, state-of-the-art baseline models, and a set of metrics and visualizations, which together enable large-scale benchmarking of weather forecasting and spatial downscaling techniques.

ClimateLearn delivers data in a format that current deep learning architectures can easily use. The package includes data from ERA5, the fifth-generation reanalysis of historical global climate and weather produced by the European Centre for Medium-Range Weather Forecasts (ECMWF). A reanalysis dataset uses modeling and data assimilation techniques to merge historical observations into global estimates. Because it combines real data with modeling, a reanalysis product can provide complete global coverage with reasonable accuracy. In addition to the raw ERA5 data, ClimateLearn supports preprocessed ERA5 data from WeatherBench, a benchmark dataset for data-driven weather forecasting.

The baseline models implemented in ClimateLearn are well tuned for climate tasks and can be easily extended to other downstream pipelines in climate science. ClimateLearn supports a range of standard methods, from simple statistical techniques like linear regression, persistence, and climatology to more sophisticated deep learning architectures like residual convolutional neural networks, U-Nets, and vision transformers. The package also supports quick evaluation of model predictions using metrics like (latitude-weighted) root mean squared error, anomaly correlation coefficient, and Pearson’s correlation coefficient. Additionally, ClimateLearn can visualize model predictions, the ground truth, and the discrepancy between the two.
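To make the simplest baselines and the latitude weighting concrete, here is a small standard-library sketch. It is not ClimateLearn's implementation: the function names and sample values are hypothetical, and the weighting follows the common convention of scaling each latitude row by its cosine divided by the mean cosine, so that grid cells near the poles, which cover less area on a regular latitude-longitude grid, count for less.

```python
import math


def latitude_weights(latitudes_deg):
    """Weight each latitude row by cos(lat), normalized to have mean 1."""
    cosines = [math.cos(math.radians(lat)) for lat in latitudes_deg]
    mean_cos = sum(cosines) / len(cosines)
    return [c / mean_cos for c in cosines]


def lat_weighted_rmse(prediction, truth, latitudes_deg):
    """RMSE over a (lat, lon) grid with cosine-latitude weighting."""
    weights = latitude_weights(latitudes_deg)
    total, count = 0.0, 0
    for weight, pred_row, true_row in zip(weights, prediction, truth):
        for pred, true in zip(pred_row, true_row):
            total += weight * (pred - true) ** 2
            count += 1
    return math.sqrt(total / count)


def persistence_forecast(history):
    """Predict that the next field equals the most recent observed field."""
    return [list(row) for row in history[-1]]


def climatology_forecast(history):
    """Predict the per-cell mean over all fields in the history."""
    n = len(history)
    rows, cols = len(history[0]), len(history[0][0])
    return [
        [sum(field[r][c] for field in history) / n for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical temperature fields (K) on a 3 x 2 grid at these latitudes.
latitudes = [-60.0, 0.0, 60.0]
history = [
    [[280.0, 280.0], [280.0, 280.0], [280.0, 280.0]],
    [[282.0, 282.0], [282.0, 282.0], [282.0, 282.0]],
]
truth = [[283.0, 283.0], [283.0, 283.0], [283.0, 283.0]]

persistence_error = lat_weighted_rmse(
    persistence_forecast(history), truth, latitudes
)
climatology_error = lat_weighted_rmse(
    climatology_forecast(history), truth, latitudes
)
```

Baselines like these give a floor for comparison: a learned model that cannot beat persistence or climatology on the same metric is not yet adding value.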

The researchers’ primary goal in developing ClimateLearn was to close the gap between the climate science and machine learning communities by making climate datasets easily accessible and by providing baseline models for easy comparison, along with metrics and visualizations for understanding model outputs. In the near future, the researchers intend to add support for new datasets such as CMIP6 (the sixth phase of the Coupled Model Intercomparison Project). The team will also support probabilistic forecasting with new uncertainty quantification metrics and several machine learning methods, such as Bayesian neural networks and diffusion models. The researchers are enthusiastic about the opportunities that open up when machine learning researchers learn more about model performance, expressiveness, and robustness. Additionally, climate scientists will be able to understand how altering the values of the input variables changes the distributions of the results. The team also plans to make the package open source and looks forward to contributions from the community.
