Gaussian Process Regression on Mauna Loa CO2 data

Now let's apply GPR to the Mauna Loa CO2 data. Mauna Loa is a volcano in the US state of Hawaii and is considered Earth's largest active volcano in terms of volume and area covered. Because of its remote location, Mauna Loa hosts an atmospheric research station that continuously monitors carbon dioxide (CO2) levels. The steady rise in CO2 recorded in this dataset is directly relevant to global warming and to the role human activities play in it.

The dataset we will use contains CO2 concentration measurements recorded at the Mauna Loa Observatory. We will visualize the Gaussian Process Regressor's predictions against the actually observed CO2 concentrations. In the modelling process we will work with the month, year, and CO2 concentration columns, where month and year (combined into a single continuous time value) form the feature and the CO2 concentration is the label we will predict.

Here is the step-by-step code for applying GPR to the Mauna Loa CO2 data, with explanations:

Importing Libraries:

First, we import the libraries used throughout this example: pandas for data handling and manipulation, numpy for mathematical operations on arrays, matplotlib for visualizing the predictions, datetime for date-time manipulation, and, from sklearn's gaussian_process module, GaussianProcessRegressor, the model the data will be fed to, along with fetch_openml to download the Mauna Loa CO2 dataset.

Python3

# Importing Necessary Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.gaussian_process import GaussianProcessRegressor
import datetime
import numpy as np

                    

Fetch and Display Dataset:

The Mauna Loa CO2 dataset is fetched from OpenML using its data ID, 41187. The as_frame parameter is set to True to ensure the data is loaded as a pandas dataframe, and the parser parameter is set to "pandas" so that pandas is used to parse the downloaded file. Finally, the first five rows of the dataset are printed using head().

Python3

data = fetch_openml(data_id=41187, as_frame=True, parser="pandas").frame
data.head()

                    

Output:

   year  month  day  weight  flag station    co2
0  1958      3   29       4     0     MLO  316.1
1  1958      4    5       6     0     MLO  317.3
2  1958      4   12       4     0     MLO  317.6
3  1958      4   19       6     0     MLO  317.5
4  1958      4   26       2     0     MLO  316.4
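
Before moving on, it can help to confirm what was actually downloaded. The short sketch below is optional and not part of the main workflow; it simply prints the shape and the column types of the dataframe.

Python3

# Optional sanity check on the fetched dataframe
print(data.shape)   # number of rows and columns
print(data.dtypes)  # data type of each column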

Convert to Datetime and Set as Index:

The year, month, and day columns are combined into a single "date" column using pandas' to_datetime method. The 'date' column is then set as the index of the dataframe, and only the date and co2 columns are kept, dropping all the others. Finally, the first five rows of the new dataframe are printed.

Python3

data["date"] = pd.to_datetime(data[["year", "month", "day"]])
data = data[["date", "co2"]].set_index("date")
data.head()

                    

Output:

              co2
date
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
1958-04-19 317.5
1958-04-26 316.4

Resampling and Plotting the Weekly Trend:

The data is resampled to a weekly frequency, calculating the mean CO2 concentration for each week. dropna() removes any rows with missing values (NaNs). After resampling, the weekly CO2 averages are plotted over time.

Python3

# Resampling by week
co2_weekly = data.resample("W").mean().dropna()
 
# Plotting weekly data
plt.plot(co2_weekly['co2'])
plt.ylabel("Weekly average of CO$_2$ concentration (ppm)")
plt.title("Weekly average of air samples measurements")
plt.show()

                    

Output:

Weekly average of air samples measurements
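
The same resample call works for other frequencies as well. As a minimal alternative sketch (not used in the rest of this example), the data can be aggregated to monthly averages by passing the "MS" (month start) frequency instead of "W".

Python3

# Alternative sketch: monthly averages instead of weekly ones
co2_monthly = data.resample("MS").mean().dropna()
print(co2_monthly.head())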

Preparing Data for Gaussian Process Regression:

After visualizing the dataset, the data is prepared so that it can be fed to the Gaussian process regression model. The input variable 'X' and output variable 'y' are built in this step. 'X' is created by combining the year and the fractional part of the month, effectively creating a continuous time variable, whereas 'y' holds the CO2 concentration levels.

Python3

X = (data.index.year + data.index.month / 12).to_numpy().reshape(-1, 1)
y = data["co2"].to_numpy()

                    

Splitting the Train and Test Datasets:

The first 70% of the observations (in chronological order) are used for training and the remaining 30% are held out for testing.

Python3

X_train = X[:int(X.shape[0]*0.7)]
y_train = y[:int(X.shape[0]*0.7)]
 
X_test = X[int(X.shape[0]*0.7):]
y_test = y[int(X.shape[0]*0.7):]
 
print(f"Training data shape :{X_train.shape},{y_train.shape}")
print(f"Testing data shape :{X_test.shape},{y_test.shape}")

                    

Output:

Training data shape :(1557, 1),(1557,)
Testing data shape :(668, 1),(668,)
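
For reference, the same chronological split can be written with scikit-learn's train_test_split by disabling shuffling, which keeps the time order intact. This is just an equivalent sketch, not part of the original code.

Python3

# Equivalent chronological split using scikit-learn (sketch)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, shuffle=False)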

Defining Gaussian Process Kernels:

Several kernels are defined, each capturing a different aspect of the data. The building blocks used here are the RBF kernel, the periodic ExpSineSquared kernel, the WhiteKernel, and the RationalQuadratic kernel. They are combined into four named components: long_term_trend_kernel captures the long-term trend in the data, seasonal_kernel models the seasonal variation, irregularities_kernel accounts for medium-term irregularities, and noise_kernel models the noise component of the data.

The final combined_kernel is the sum of all the components above.

Python3

from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, RationalQuadratic, WhiteKernel
 
long_term_trend_kernel = 40.0**2 * RBF(length_scale=40.0)
 
seasonal_kernel = (2.0**2 * RBF(length_scale=100.0) * ExpSineSquared(
    length_scale=1.0, periodicity=1.0, periodicity_bounds="fixed"))
 
irregularities_kernel = 0.5**2 * RationalQuadratic(length_scale=1.0, alpha=1.5)
 
noise_kernel = 0.1**2 * RBF(length_scale=0.1) + \
    WhiteKernel(noise_level=0.1**2, noise_level_bounds=(1e-5, 1e5))
 
combined_kernel = long_term_trend_kernel + \
    seasonal_kernel + irregularities_kernel + noise_kernel

                    

Training the Gaussian Process Regressor:

Here, GaussianProcessRegressor is instantiated with the combined_kernel defined in the previous step. The normalize_y parameter is set to False because the target is centered manually by subtracting y_train.mean() before fitting; the gaussian_process model is then trained on this centered training data.

Python3

gaussian_process = GaussianProcessRegressor(
    kernel=combined_kernel, normalize_y=False)
gaussian_process.fit(X_train, y_train - y_train.mean())
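
After fitting, the optimiser has tuned the kernel hyperparameters to the training data. The optional sketch below prints the fitted kernel and its log-marginal likelihood, which can be used to compare different kernel choices.

Python3

# Optional diagnostics after fitting
print(gaussian_process.kernel_)  # kernel with optimised hyperparameters
print(gaussian_process.log_marginal_likelihood(gaussian_process.kernel_.theta))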

                    

Making Predictions:

The trained model is used to predict the mean and standard deviation of the CO2 concentration for the test inputs. The training mean is added back to the predicted means to undo the earlier centering, and the predictions are plotted alongside the training data and the actual test values.

Python3

mean_y_pred, std_y_pred = gaussian_process.predict(X_test, return_std=True)
mean_y_pred += y_train.mean()
 
plt.plot(X_train, y_train, color="gray", label="Training data")
plt.plot(X_test, mean_y_pred, color="orange", alpha=0.4,
         label="Gaussian process Prediction")
plt.plot(X_test, y_test, color="green", alpha=0.4,
         label="Actual values")
 
plt.legend()
plt.xlabel("Year")
plt.ylabel("Monthly average of CO$_2$ concentration in ppm")
 
plt.title(
    "Monthly average of air samples measurements from the Mauna Loa Observatory"
)
plt.show()

                    

Output:

Monthly average of air samples measurements
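
Beyond the visual comparison, the fit on the test period can be summarised numerically. The short sketch below (an optional addition, assuming the variables from the previous step) computes the mean absolute error and root-mean-squared error of the predictions.

Python3

# Optional: quantify the test-set error of the GPR predictions
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, mean_y_pred)
rmse = mean_squared_error(y_test, mean_y_pred) ** 0.5
print(f"MAE: {mae:.2f} ppm, RMSE: {rmse:.2f} ppm")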

Forecasting:

First, we determine the current month and year to create a future timeline for prediction. 'forecast_x' is an array of 365 evenly spaced time points, running from 2001 up to the current month, at which we want to predict CO2 levels. The 'predict' method of the Gaussian Process Regressor returns the mean predictions ('mean_y_forecast') and their standard deviations ('std_y_forecast'); return_std=True enables the return of the standard deviation estimates. The mean predictions are adjusted by adding back y_train.mean() to reverse the earlier centering.

Python3

today = datetime.datetime.now()
current_month = today.year + today.month / 12
forecast_x = np.linspace(start=2001, stop=current_month, num=365).reshape(-1, 1)
 
mean_y_forecast, std_y_forecast = gaussian_process.predict(forecast_x, return_std=True)
mean_y_forecast += y_train.mean()
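
As a quick usage example (optional), the forecast at the most recent time point can be reported together with a rough one-standard-deviation band.

Python3

# Optional: report the latest forecast with its uncertainty
print(f"Forecast at {forecast_x[-1, 0]:.2f}: "
      f"{mean_y_forecast[-1]:.1f} +/- {std_y_forecast[-1]:.1f} ppm")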

                    

Plotting the Results:

The forecast is shown as a line plot with a shaded band representing the uncertainty (± one standard deviation). The 'fill_between' method creates this shaded area around the forecast line. The plot is then finished with a title, legend, and axis labels to complete the visualization.

Python3

plt.plot(X_train[1000:], y_train[1000:], color="gray", label="Training data")
plt.plot(X_test, mean_y_pred, color="orange", alpha=0.4,
         label="Gaussian process Prediction")
plt.plot(X_test, y_test, color="green", alpha=0.4,
         label="Actual values")
 
plt.plot(forecast_x, mean_y_forecast, color="red", alpha=0.3,
         label="Forecasting values")
 
plt.fill_between(
    forecast_x.ravel(),
    mean_y_forecast - std_y_forecast,
    mean_y_forecast + std_y_forecast,
    color="red",
    alpha=0.1,
)
plt.legend()
plt.xlabel("Year")
plt.ylabel("Monthly average of CO$_2$ concentration in ppm")
 
plt.title("Forcasting Mauna Loa Observatory")
plt.show()

                    

Output:

Forecasting Mauna Loa Observatory



Hence, the output visualization shows that the predictions track the original observations closely, and that Gaussian Process Regression is effective for monitoring and predicting the rise in CO2 levels at Mauna Loa.


