Cosine Similarity

Cosine similarity measures how similar two non-zero vectors in an inner product space are. In time-series analysis, the cosine similarity between two data sets is obtained by treating each data set as a vector and computing the cosine of the angle between the two vectors. Cosine similarity is most often employed in text mining and information retrieval, but it can also be useful for identifying shape-based similarities in time-series research. Its value ranges from -1 (vectors pointing in exactly opposite directions) to 1 (vectors pointing in the same direction), with 0 indicating orthogonal, unrelated vectors.

The following is the mathematical formula for cosine similarity between two vectors x and y:

cosine_similarity(x, y) = (x · y) / (||x|| * ||y||)

where ‘·’ denotes the dot product of the two vectors and ||x|| and ||y|| denote the Euclidean norms of x and y, respectively.
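To sanity-check the formula, it can be computed directly with NumPy (the two example vectors here are arbitrary). Note that without the normalization step introduced below, the raw vectors [1, 2, 3] and [4, 5, 6] give a similarity slightly below 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# cosine_similarity(x, y) = (x . y) / (||x|| * ||y||)
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)   # ≈ 0.9746
```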

The following is a pseudo-code for determining cosine similarity between two time-series data sets:

  1. Normalize each time-series data set by subtracting its mean and dividing by its standard deviation.
  2. Compute the dot product of the two normalized time series.
  3. Compute the Euclidean norm of each normalized time series.
  4. Divide the dot product by the product of the two Euclidean norms to obtain the cosine similarity.

Here’s some Python code for calculating cosine similarity between two time-series data sets:

Python3

import numpy as np
 
def cosine_similarity(A, B):
    # Normalize each time series (subtract the mean, divide by the standard deviation).
    A_norm = (A - np.mean(A)) / np.std(A)
    B_norm = (B - np.mean(B)) / np.std(B)
 
    # Compute the dot product of the normalized time series.
    dot_product = np.dot(A_norm, B_norm)
 
    # Compute the Euclidean norm of each normalized time series.
    norm_A = np.linalg.norm(A_norm)
    norm_B = np.linalg.norm(B_norm)
 
    # Divide the dot product by the product of the Euclidean norms
    # to get the cosine similarity of the normalized time series.
    cosine_sim = dot_product / (norm_A * norm_B)
 
    return cosine_sim
 
# Now let's define two time-series data sets
time_series_A = np.array([1, 2, 3])
time_series_B = np.array([4, 5, 6])
 
cosine_sim = cosine_similarity(time_series_A, time_series_B)
print("Cosine similarity:", cosine_sim)

Output:

Cosine similarity: 1.0

Cosine similarity simply computes the cosine of the angle between the two time-series vectors, capturing their shape similarity independent of amplitude or offset. Here both series are linear with the same shape, so after normalization the similarity is exactly 1.0.
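This invariance is easy to verify: scaling and shifting a series (a positive linear transform) leaves its z-scores, and hence the normalized cosine similarity, unchanged. A minimal sketch (the helper name norm_cosine is just for this example):

```python
import numpy as np

def norm_cosine(a, b):
    # z-normalize each series, then take the cosine of the angle
    a = (a - np.mean(a)) / np.std(a)
    b = (b - np.mean(b)) / np.std(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

base = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
scaled_shifted = 10.0 * base + 100.0   # same shape, different amplitude and offset
print(norm_cosine(base, scaled_shifted))   # 1.0 (up to floating-point error)
```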

The advantages of cosine similarity are as follows:

  • Measures the similarity of two time-series vectors based on the angle between them; 
  • insensitive to differences in amplitude and (after normalization) offset; 
  • commonly used in text and image analysis.

The limitations of cosine similarity are as follows:

  • Requires both time series to have the same length; 
  • ignores differences in magnitude, which may be important in some applications.

Pearson Correlation: 

Pearson correlation measures the linear relationship between two variables. In time-series analysis, the Pearson correlation is evaluated between two data sets with the same number of observations. The Pearson correlation captures linear relationships between time-series data sets; however, it may not capture shape-based similarities.

The Pearson correlation coefficient may be determined between two time-series data sets x and y as follows:

pearson_correlation(x, y) = sum((x - mean(x)) * (y - mean(y))) / (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

where mean(x) and mean(y) are the mean values of x and y, respectively.
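The formula can be translated line by line into NumPy and cross-checked against the built-in np.corrcoef (the sample series here are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Numerator: sum of products of the centered series
num = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator: product of the square roots of the sums of squares
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
r = num / den
print(r)   # ≈ 0.9594

# Cross-check against NumPy's correlation matrix
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```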

The following is a pseudo-code for determining the Pearson correlation between two time-series data sets x and y:

  1. Compute the mean values of x and y.
  2. Subtract the respective means from x and y to center them.
  3. Compute the dot product of the centered x and y (the covariance term).
  4. Compute the sum of squares of the centered x and of the centered y.
  5. Take the square root of each sum of squares.
  6. Divide the dot product from step 3 by the product of the two square roots to obtain the Pearson correlation coefficient.

Here's some Python code for calculating the Pearson correlation between two time-series data sets:

Python3

import numpy as np
 
def pearson_similarity(x, y):
    # Calculate the mean of each time series.
    mean_x = np.mean(x)
    mean_y = np.mean(y)
     
    # Calculate the standard deviation of each time series.
    std_x = np.std(x)
    std_y = np.std(y)
     
    # Calculate the covariance of the two time series
    # (the average product of the centered values).
    cov = np.mean((x - mean_x) * (y - mean_y))
     
    # Compute the Pearson correlation coefficient.
    if std_x == 0 or std_y == 0:
        # The correlation is undefined for a constant series;
        # return 0 by convention.
        return 0
    else:
        return cov / (std_x * std_y)
     
# Now let's define two time-series data sets
time_series_A = np.array([1, 2, 3])
time_series_B = np.array([4, 5, 6])
 
pearson_sim = pearson_similarity(time_series_A, time_series_B)
print("Pearson correlation:", pearson_sim)

Output:

Pearson correlation: 1.0

The Pearson correlation coefficient is calculated by first computing the mean and standard deviation of each time series, then computing the covariance between them (the mean product of the centered series), and finally dividing the covariance by the product of the standard deviations. The two series here are perfectly linearly related, so the coefficient is exactly 1.0. The correlation is set to zero if either standard deviation is zero (which happens when one of the time series is constant).
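The zero-standard-deviation guard is worth a quick demonstration: NumPy's np.corrcoef yields nan for a constant series, since the correlation is mathematically undefined there, which is why the function above falls back to 0 for that case:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
const = np.full(3, 5.0)   # a constant series: its standard deviation is 0

# Suppress the expected divide-by-zero floating-point warnings
with np.errstate(invalid="ignore", divide="ignore"):
    r = np.corrcoef(x, const)[0, 1]

print(r)   # nan: the correlation is undefined when one series is constant
```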

In summary, similarity measures in time-series analysis are crucial for assessing the degree of similarity or dissimilarity between two or more time-series data sets. The choice of similarity measure depends on the specific application, the size and complexity of the data set, and the amount of noise and outliers in the data. Some of the most commonly used similarity metrics in time-series analysis are the Euclidean distance, DTW, shape-based techniques, cosine similarity, and Pearson correlation.

The advantages of Pearson correlation are as follows:

  • Measures the linear relationship between two time series; 
  • can handle time series with different amplitudes and offsets.

The limitations of Pearson correlation are as follows:

  • Assumes the time series are approximately normally distributed and linearly related; 
  • sensitive to outliers and noise.
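The outlier sensitivity is easy to see with a small experiment: a single corrupted reading is enough to flip a perfect correlation negative (the data here is illustrative):

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = x.copy()          # perfectly linear relationship: r = 1.0
y_outlier = x.copy()
y_outlier[-1] = -50.0       # a single corrupted reading

r_clean = np.corrcoef(x, y_clean)[0, 1]
r_outlier = np.corrcoef(x, y_outlier)[0, 1]
print(r_clean)      # 1.0
print(r_outlier)    # negative (≈ -0.39): one outlier flips the sign
```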

Similarity Search for Time-Series Data

Time-series analysis is a statistical approach for analyzing data that is ordered in time. It entails analyzing past data to detect patterns, trends, and anomalies, then applying this knowledge to forecast future trends. Time-series analysis has many uses, including in finance, economics, engineering, and healthcare.

Time-series datasets are collections of data points that are recorded over time, such as stock prices, weather patterns, or sensor readings. In many real-world applications, it is often necessary to compare multiple time-series datasets to find similarities or differences between them.

Similarity search, which involves determining the degree of similarity between two or more time-series data sets, is a fundamental task in time-series analysis. It is an essential step in a variety of applications, including anomaly detection, clustering, and forecasting. In anomaly detection, for example, we may wish to find data points that differ considerably from the predicted trend. In clustering, we may wish to group together time-series data sets that have similar patterns, and in forecasting, we may want to find the most similar historical data in order to reliably anticipate future trends.

In time-series analysis, there are numerous approaches for searching for similarities, including the Euclidean distance, dynamic time warping (DTW), and shape-based methods like the Fourier transform and Symbolic Aggregate ApproXimation (SAX). The approach chosen depends on the specific application, the size and complexity of the data set, and the amount of noise and outliers in the data.

Although time-series analysis and similarity search are powerful tools, they are not without their drawbacks. Handling missing data, dealing with large and complicated data sets, and selecting appropriate similarity metrics can all be challenging. Yet these obstacles can be addressed with thorough data preparation and the selection of suitable methods.

Types of similarity measures

Time-series analysis is the process of reviewing previous data to detect patterns, trends, and anomalies and then utilizing this knowledge to forecast future trends. Similarity search, which includes determining the degree to which similarities exist among two or more time-series data sets, is an essential problem in time-series analysis. 

Similarity metrics, which quantify the degree of similarity or dissimilarity between two time-series data sets, are critical to this task. This article covers the main types of similarity metrics that are commonly employed in time-series analysis.

Euclidean Distance

Euclidean distance is a distance metric that is widely used to calculate the similarity of two data points in an n-dimensional space. In time-series analysis, the Euclidean distance is used to determine the degree of similarity between two time-series data sets with the same number of observations. This distance metric is sensitive to noise and outliers, and it may not be effective at capturing shape-based similarities. The Euclidean distance between two points A(x1, y1) and B(x2, y2) is calculated as the square root of the sum of the squared differences between the corresponding dimensions of the two points....
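For two equal-length series, the computation is a one-liner in NumPy (the example values are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 2.0, 4.0, 5.0])

# Square root of the sum of squared pointwise differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)   # same value as np.linalg.norm(a - b)
```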

Dynamic Time Warping (DTW)

Dynamic Time Warping (DTW) is a prominent similarity metric in time-series analysis, particularly when the data sets are of varying durations or exhibit phase shifts or time warping. Unlike the Euclidean distance, DTW allows non-linear warping of the time axis to match similar patterns in time-series data sets. DTW is commonly used in speech recognition, signal processing, and finance....

Shape-based Methods

Shape-based approaches are a class of similarity measures in which time-series data sets are transformed into a new representation, such as the Fourier transform or Symbolic Aggregate ApproXimation (SAX), and then compared based on their shape. These approaches are effective at capturing shape-based similarities and are commonly used in pattern recognition, clustering, and anomaly detection. Nevertheless, the success of shape-based approaches depends on the transformation used and the amount of noise and outliers in the data....
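A minimal sketch of the dynamic-programming recurrence behind DTW, using an absolute-difference local cost (production DTW implementations add windowing constraints and other optimizations):

```python
import numpy as np

def dtw_distance(a, b):
    # D[i, j] = cost of the best alignment of a[:i] with b[:j]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three predecessor alignments
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Unlike the Euclidean distance, DTW aligns series of different lengths:
# the repeated sample in the second series costs nothing.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))   # 0.0
```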

Preprocessing techniques for time-series data

...

Applications of similarity search in time-series analysis

...

Challenges in similarity search

...

Tools and libraries (In Python, C++, R & Java)

...