Generating the dataset

The following code uses the Pandas and NumPy libraries to create a synthetic dataset of weather observations over a 100-hour period starting from April 1, 2006. A random weather condition and random values for temperature, humidity, wind speed, pressure, visibility, and apparent temperature are generated for each hour and stored in a Pandas DataFrame. Each row corresponds to a specific hour, with columns for the formatted timestamp, the weather condition, the meteorological parameters, and a daily summary.

Python
import pandas as pd
import numpy as np
import random

# Generate example data: 100 hourly timestamps starting April 1, 2006
np.random.seed(0)
random.seed(0)
dates = pd.date_range('2006-04-01', periods=100, freq='h')
formatted_dates = [date.strftime('%Y-%m-%d %H:%M:%S.000 +0200') for date in dates]

# Randomly pick a weather condition and a daily summary for each hour
condition_options = ['Partly cloudy', 'Sunny', 'Rainy', 'Cloudy']
weather_conditions = [random.choice(condition_options) for _ in range(100)]

daily_summary = [' '.join([random.choice(condition_options), 'throughout the day.']) for _ in range(100)]

temperature = np.random.randint(50, 100, size=100)
humidity = np.random.randint(40, 90, size=100)
wind_speed = np.random.randint(0, 15, size=100)
pressure = np.random.randint(980, 1050, size=100)
visibility = np.random.randint(0, 15, size=100)
apparent_temperature = np.random.randint(50, 100, size=100)

# Create DataFrame
df = pd.DataFrame({
    'Formatted Date': formatted_dates,
    'Weather Conditions': weather_conditions,
    'Temperature (C)': temperature,
    'Humidity': humidity,
    'Wind Speed (km/h)': wind_speed,
    'Pressure (mbar)': pressure,
    'Visibility (km)': visibility,
    'Apparent Temperature (C)': apparent_temperature,
    'Daily Summary': daily_summary
})

print(df.head())


Output:

                  Formatted Date Weather Conditions  Temperature (C)  Humidity  Wind Speed (km/h)  Pressure (mbar)  Visibility (km)  Apparent Temperature (C)                      Daily Summary
0  2006-04-01 00:00:00.000 +0200      Partly cloudy               94        45                  8             1016                2                        80          Rainy throughout the day.
1  2006-04-01 01:00:00.000 +0200             Cloudy               97        81                  8             1028               13                        58  Partly cloudy throughout the day.
2  2006-04-01 02:00:00.000 +0200              Rainy               50        75                  9             1005                7                        70         Cloudy throughout the day.
3  2006-04-01 03:00:00.000 +0200              Rainy               53        40                  2             1047                8                        57          Rainy throughout the day.
4  2006-04-01 04:00:00.000 +0200              Rainy               53        71                  8             1015                4                        53          Rainy throughout the day.
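
Note that the 'Formatted Date' column is stored as plain strings. For time series work it is often convenient to parse it into a proper DatetimeIndex first; here is a minimal, optional sketch (the examples below do not require it):

Python
# Optional: parse the string timestamps and use them as the index.
# utc=True folds the '+0200' offset into a single UTC timeline.
df['Formatted Date'] = pd.to_datetime(df['Formatted Date'], utc=True)
df = df.set_index('Formatted Date')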

Now, let's apply transformations to our data.

1. Time Series Data Transformation Using Power Transform:

The power transform is mainly used to make the variance of the data constant. It mathematically transforms the data so that its distribution becomes more Gaussian (normal). This is particularly useful when the data has a skewed distribution or heteroscedasticity (varying variance).

The following code applies a power transformation using scikit-learn's PowerTransformer with the Yeo-Johnson method, which (unlike Box-Cox) also supports zero and negative values.

Python
from sklearn.preprocessing import PowerTransformer
# Compute variance of the original 'Temperature (C)' column
original_variance = df['Temperature (C)'].var()

# Apply the Yeo-Johnson power transform to the 'Temperature (C)' column.
# Note: PowerTransformer also rescales the output to zero mean and unit
# variance by default (standardize=True).
pt = PowerTransformer(method='yeo-johnson')
df['Temperature (C)'] = pt.fit_transform(df[['Temperature (C)']])

# Compute variance of the transformed 'Temperature (C)' column
transformed_variance = df['Temperature (C)'].var()

print("Original Variance:", original_variance)
print("Transformed Variance:", transformed_variance)

Output:

Original Variance: 217.82828282828282 
Transformed Variance: 1.0101010101010097

The variance of the ‘Temperature (C)’ column dropped from 217.82828282828282 to 1.0101010101010097 after the Yeo-Johnson power transform. Note that the unit variance is largely a consequence of PowerTransformer's standardize=True default, which rescales the output to zero mean and unit variance; the Yeo-Johnson mapping itself is what reshapes the distribution to be more Gaussian.
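
Since the unit variance above comes mostly from the standardize=True default, a more direct check of the "more Gaussian" claim is skewness. Below is a minimal sketch using scipy.stats.skew; it recovers the original values via the fitted transformer's inverse_transform. (On this roughly uniform synthetic column the change may be small; on genuinely skewed real-world data the effect is much more pronounced.)

Python
from scipy.stats import skew

# Recover the original values from the fitted transformer, then
# compare skewness before and after the Yeo-Johnson transform
original = pt.inverse_transform(df[['Temperature (C)']]).ravel()
print("Original skewness:   ", skew(original))
print("Transformed skewness:", skew(df['Temperature (C)']))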

2. Time Series Data Transformation Using Difference Transform:

The difference transform is a technique used to make time series data stationary by computing the differences between consecutive observations (y't = yt - yt-1). This transformation is useful for removing trends or seasonal patterns in the data, making it easier to model with techniques like ARIMA.
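
As a tiny worked example of the idea, first-order differencing replaces each value with its change from the previous one (the first value has no predecessor, so it becomes NaN):

Python
import pandas as pd

# First-order differencing: each value minus its predecessor
s = pd.Series([5, 7, 4])
print(s.diff())   # NaN, 2.0, -3.0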

The following code applies first-order differencing to the ‘Humidity’ column of our DataFrame and performs the Augmented Dickey-Fuller (ADF) test to check for stationarity.

Python
from statsmodels.tsa.stattools import adfuller
# Apply difference transform to 'Humidity' column
df['Humidity difference'] = df['Humidity'].diff()

# Perform Dickey-Fuller test for stationarity
result = adfuller(df['Humidity difference'].dropna())
print("Humidity difference ADF Statistic:", result[0])
print("Humidity difference p-value:", result[1])
print("Humidity difference Critical Values:")
for key, value in result[4].items():
    print(f"   {key}: {value}")

Output:

Humidity difference ADF Statistic: -6.594772523405528
Humidity difference p-value: 6.969838186303788e-09
Humidity difference Critical Values:
   1%: -3.50434289821397
   5%: -2.8938659630479413
   10%: -2.5840147047458037

Here, we performed the Augmented Dickey-Fuller test for stationarity after applying the differencing transformation.

The results for the ‘Humidity difference’ column indicate that the data is likely stationary. This is supported by the very low p-value (6.969838186303788e-09), well below the typical significance level of 0.05. Additionally, the ADF statistic is lower than the critical values at the 1%, 5%, and 10% levels, so we can reject the null hypothesis of non-stationarity.
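
For contrast, the same test can be run on the undifferenced column, as in the minimal sketch below. Note that this synthetic series has no trend, so the raw column may well test as stationary too; real-world series with trends or seasonality typically will not.

Python
from statsmodels.tsa.stattools import adfuller

# Baseline: ADF test on the raw 'Humidity' series, for comparison
result_raw = adfuller(df['Humidity'])
print("Humidity ADF Statistic:", result_raw[0])
print("Humidity p-value:", result_raw[1])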

3. Time Series Data Transformation Using Standardization:

Standardization, also known as z-score normalization, is a preprocessing technique that rescales the features of a dataset to have a mean of 0 and a standard deviation of 1, by computing z = (x - mean) / standard deviation for each value. This transformation is useful when features have different scales, as it brings them all to a comparable scale.

This code demonstrates how to use StandardScaler from scikit-learn to standardize the ‘Humidity’ and ‘Pressure (mbar)’ columns of the DataFrame.

Python
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler and transform the 'Humidity' and 'Pressure (mbar)' columns
# (StandardScaler scales each column independently)
df[['Humidity standardized', 'Pressure standardized']] = scaler.fit_transform(
    df[['Humidity', 'Pressure (mbar)']]
)

# Display the transformed DataFrame
print(df[['Humidity standardized', 'Pressure standardized']].head())

Output:

   Humidity standardized  Pressure standardized
0              -1.264019               0.045409
1               1.151303               0.664619
2               0.748750              -0.522201
3              -1.599480               1.645036
4               0.480381              -0.006192

The ‘Humidity standardized’ and ‘Pressure standardized’ columns now have a mean of 0 and a standard deviation of 1, bringing both features to the same scale.
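
As a sanity check, the scaler's output can be reproduced by hand from the z-score formula; here is a minimal sketch (StandardScaler uses the population standard deviation, i.e. ddof=0):

Python
# Reproduce StandardScaler's output manually: z = (x - mean) / std
manual = (df['Humidity'] - df['Humidity'].mean()) / df['Humidity'].std(ddof=0)
print(manual.head())  # matches df['Humidity standardized']
print("Mean:", round(df['Humidity standardized'].mean(), 10))  # ~0.0
print("Std: ", round(df['Humidity standardized'].std(ddof=0), 10))  # ~1.0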

4. Time Series Data Transformation Using Normalization

Normalization is another data preprocessing technique used to scale the features of a dataset to a fixed range, typically [0, 1]. This is achieved by subtracting the minimum value of the feature and then dividing by the feature's range: x' = (x - min) / (max - min). Normalization is particularly useful when the features have different ranges and units.

The following code fits a MinMaxScaler to the ‘Humidity’ column, transforms it, and prints the first few rows of the transformed data.

Python
from sklearn.preprocessing import MinMaxScaler
# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit the scaler to the data and transform the 'Humidity' column
df['Humidity normalized'] = scaler.fit_transform(df[['Humidity']])

# Display the transformed DataFrame
print(df['Humidity normalized'].head())

Output:

0    0.102041
1    0.836735
2    0.714286
3    0.000000
4    0.632653
Name: Humidity normalized, dtype: float64

The ‘Humidity normalized’ column now lies in the [0, 1] range, which helps magnitude-sensitive models (for example, gradient-based or distance-based methods) train more stably and efficiently.
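
Again, the result can be verified against the min-max formula directly; a minimal sketch:

Python
# Reproduce MinMaxScaler's output manually: (x - min) / (max - min)
h = df['Humidity']
manual = (h - h.min()) / (h.max() - h.min())
print(manual.head())  # matches df['Humidity normalized']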

Conclusion

In conclusion, time series data transformation is a crucial step in time series analysis and forecasting: it converts raw time series data into a format suitable for analysis and modeling. We applied power, difference, standardization, and normalization transforms to a sample dataset, showing how each one affects the data and its suitability for modeling. Together, these steps prepare time series data for accurate and effective forecasting.


