Gaussian Process Regression on Mauna Loa CO2 data
Now let’s get to the point of applying GPR on the Mauna Loa CO2 data. Moana Loa is a volcano present in the US state of Hawaii. Mauna Loa is considered as Earth’s largest active volcano in terms of volume and the area covered. Mauna Loa, due to its remote location is known for the atmospheric research which measures the continuous monitoring of carbon dioxide (CO2) levels on its surface. The dataset observing the increase in the CO2 concerns the Global Warming aspect and the involvement on human activities in it.
The dataset that we are going to be considering will be the daily increase in the level of CO2 on the Mauna Loa surface. We will try to visualize the Gaussian Process Regressor’s prediction with the actual observed data of CO2 concentration. In the modelling process we will be considering the Month, Year and CO2 concentration as the columns, where Month, Year will be our main feature and CO2 concentration will be the label value which we will be predicting.
Here’s the step-by-step code of GPR on Mauna Loa CO2 data with explanation:
Importing Libraries:
First, we will be importing important libraries which will be useful in this method. Here we will be importing pandas for data handling and manipulation, numpy for mathematical operations on arrays, matplotlib for the visualization of prediction, datetime for date-time manipulation, and from the sklearn library’s gaussian_process we will be importing GaussianProcessRegressor which will be our model on which data will be fed and fetch_openml to get the Mauna Loa CO2 dataset.
Python3
# Importing Necessary Libraries import pandas as pd import matplotlib.pyplot as plt from sklearn.datasets import fetch_openml from sklearn.gaussian_process import GaussianProcessRegressor import datetime import numpy as np |
Fetch and Display Dataset:
The CO2 dataset of Mauna Loa is fetched from openml using its data ID which is 41187. The parameter as_frame is set to True to make sure that the data is loaded as a pandas dataframe. The parser parameter is set to pandas so that the data parser used in other code is pandas. Finally, the first five rows of the dataset are printed using head().
Python3
data = fetch_openml(data_id = 41187 ).frame data.head() |
Output:
year month day weight flag station co2
0 1958 3 29 4 0 MLO 316.1
1 1958 4 5 6 0 MLO 317.3
2 1958 4 12 4 0 MLO 317.6
3 1958 4 19 6 0 MLO 317.5
4 1958 4 26 2 0 MLO 316.4
Convert to Datetime and Set as Index:
The year, month, and day columns are combined into a single “date” column using pandas to_datetime method. The ‘date’ column is set to index of the pandas dataframe and the columns date and co2 are filtered such that all the other columns are removed. At last, the first five rows of the new dataframe are printed.
Python3
data[ "date" ] = pd.to_datetime(data[[ "year" , "month" , "day" ]]) data = data[[ "date" , "co2" ]].set_index( "date" ) data.head() |
Output:
co2
date
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
1958-04-19 317.5
1958-04-26 316.4
Resampling and Plotting Monthly Trend:
The data is resampled to a monthly frequency, calculating the mean CO2 concentration for each month. dropna() removes any rows with missing values (NaNs). After resampling the monthly CO2 average data is plotted with respect to each month.
Python3
# Resampling by month co2_weekly = data.resample( "W" ).mean().dropna() # Ploting monthly data plt.plot(co2_weekly[ 'co2' ]) plt.ylabel( "Weekly average of CO$_2$ concentration (ppm)" ) plt.title( "Weekly average of air samples measurements" ) plt.show() |
Output:
Preparing Data for Gaussian Process Regression:
After proper visualization of the dataset, data is be prepared so that it could be fed to the gaussian process regression model, variables input – ‘X’ and output – ‘y’ are prepared in this process. ‘X’ is created by combining the year and the fractional part of the month, effectively creating a continuous time variable. Whereas ‘y’ is the CO2 concentration levels.
Python3
X = (data.index.year + data.index.month / 12 ).to_numpy().reshape( - 1 , 1 ) y = data[ "co2" ].to_numpy() |
Split the train and Test dataset
Python3
X_train = X[: int (X.shape[ 0 ] * 0.7 )] y_train = y[: int (X.shape[ 0 ] * 0.7 )] X_test = X[ int (X.shape[ 0 ] * 0.7 ):] y_test = y[ int (X.shape[ 0 ] * 0.7 ):] print (f "Training data shape :{X_train.shape},{y_train.shape}" ) print (f "Testing data shape :{X_test.shape},{y_test.shape}" ) |
Output:
Training data shape :(1557, 1),(1557,)
Testing data shape :(668, 1),(668,)
Defining Gaussian Process Kernels:
Different kernels are defined which captures different aspects of the data used in this example. The kernels include RBF kernel, periodic kernel, white kernel and RationalQuadratic kernel. Each kernel carries different aspects of the data some are mentioned here: the long_term_trend_kernel captures the long-term trends in the data, the seasonal_kernel models seasonal variation in data, the irregularities_kernel marks the irregularities in the dataset and the noise_kernel models the noise component of the data.
The final combined_kernel is the combination of all the kernels mentioned.
Python3
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, RationalQuadratic, WhiteKernel long_term_trend_kernel = 40.0 * * 2 * RBF(length_scale = 40.0 ) seasonal_kernel = ( 2.0 * * 2 * RBF(length_scale = 100.0 ) * ExpSineSquared( length_scale = 1.0 , periodicity = 1.0 , periodicity_bounds = "fixed" )) irregularities_kernel = 0.5 * * 2 * RationalQuadratic(length_scale = 1.0 , alpha = 1.5 ) noise_kernel = 0.1 * * 2 * RBF(length_scale = 0.1 ) + \ WhiteKernel(noise_level = 0.1 * * 2 , noise_level_bounds = ( 1e - 5 , 1e5 )) combined_kernel = long_term_trend_kernel + \ seasonal_kernel + irregularities_kernel + noise_kernel |
Training the Gaussian Process Regressor:
Here, first the mean of the CO2 level value is computed and then GaussianProcessRegressor is instantiated with the help of combined_kernel defined in the previous step and normalize_y parameter is set to False since y value will be centered by the substraction of ‘y.mean()’ value manually and the gaussian_process model will be trained on this data.
Python3
gaussian_process = GaussianProcessRegressor( kernel = combined_kernel, normalize_y = False ) gaussian_process.fit(X_train, y_train - y_train.mean()) |
Making Predictions:
Python3
mean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std = True ) mean_y_pred + = y_train.mean() plt.plot(X_train, y_train, color = "gray" , label = "Training data" ) plt.plot(X_test, mean_y_pred, color = "orange" , alpha = 0.4 , label = "Gaussian process Prediction" ) plt.plot(X_test, y_test, color = "green" , alpha = 0.4 , label = "Actual values" ) plt.legend() plt.xlabel( "Year" ) plt.ylabel( "Monthly average of CO$_2$ concentration in ppm" ) plt.title( "Monthly average of air samples measurements from the Mauna Loa Observatory" ) plt.show() |
Output:
Forecasting
First, we will be determining the current month and year to create a future timeline for prediction. ‘forecast_x’ is an array of time points (from 2001 to the current month) for which we want to predict CO2 levels. The ‘predict’ method of the Gaussian Process Regressor is used to get predictions (‘mean_y_pred’) and their standard deviations (‘std_y_pred’). return_std=True enables the return of standard deviation estimates for the predictions. The mean predictions are adjusted by adding back the ‘y_mean’ to reverse the earlier centering.
Python3
today = datetime.datetime.now() current_month = today.year + today.month / 12 forecast_x = np.linspace(start = 2001 , stop = current_month, num = 365 ).reshape( - 1 , 1 ) mean_y_forecast, std_y_forecast = gaussian_process.predict(forecast_x, return_std = True ) mean_y_forecast + = y_train.mean() |
Plotting the Results:
The predictions are shown as a line plot with a shaded area representing the uncertainty (± standard deviation). The ‘fill_between’ method creates a shaded area around the prediction line, representing the uncertainty in the predictions. The plot is then labeled with a title, legend, and axis labels to complete the visualization.
Python3
plt.plot(X_train[ 1000 :], y_train[ 1000 :], color = "gray" , label = "Training data" ) plt.plot(X_test, mean_y_pred, color = "orange" , alpha = 0.4 , label = "Gaussian process Prediction" ) plt.plot(X_test, y_test, color = "green" , alpha = 0.4 , label = "Actual values" ) plt.plot(forecast_x, mean_y_forecast, color = "red" , alpha = 0.3 , label = "Forecasting values" ) plt.fill_between( forecast_x.ravel(), mean_y_forecast - std_y_forecast, mean_y_forecast + std_y_forecast, color = "red" , alpha = 0.1 , ) plt.legend() plt.xlabel( "Year" ) plt.ylabel( "Monthly average of CO$_2$ concentration in ppm" ) plt.title( "Forcasting Mauna Loa Observatory" ) plt.show() |
Output:
Hence, through the output visualization, we can see that the prediction matches the original observation in a pretty good way, and the Gaussian Process Regression is effective in monitoring and predicting the CO2 rise level on Mauna Loa volcano.
Gaussian Process Regression (GPR) on Mauna Loa CO2 data
In article explores the application of Gaussian Process Regression (GPR) on the Mauna Loa CO2 dataset using Scikit-Learn.