What is Jaccard Similarity?
Jaccard Similarity, also known as the Jaccard index, is a statistic that measures the similarity between two data sets. It is computed as the size of the intersection of the two sets divided by the size of their union.
For example, given two sets A and B, their Jaccard Similarity is given by:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Where:
- $|A \cap B|$ is the cardinality (size) of the intersection of sets A and B.
- $|A \cup B|$ is the cardinality (size) of the union of sets A and B.
Jaccard Similarity, also called the Jaccard coefficient, takes values between 0 and 1. A value of 0 means the two sets share no elements, values closer to 1 indicate increasing similarity, and a value of 1 means the two sets are identical.
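To make this range concrete, the following minimal sketch evaluates a disjoint pair, a partially overlapping pair, and an identical pair of sets. The helper name `jaccard` and the choice to return 0.0 when both sets are empty are assumptions made for illustration; the ratio is undefined for two empty sets, and other conventions exist.

```python
def jaccard(a, b):
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumption: treat two empty sets as having similarity 0.0,
        # since the ratio 0/0 is undefined.
        return 0.0
    return len(a & b) / len(union)


disjoint = jaccard({1, 2}, {3, 4})          # no shared elements
partial = jaccard({1, 2, 3}, {2, 3, 4})     # 2 shared out of 4 distinct
identical = jaccard({1, 2, 3}, {1, 2, 3})   # same elements

print(disjoint, partial, identical)  # 0.0 0.5 1.0
```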
Computing Jaccard Similarity
EXAMPLE: 1
Python
A = {1, 2, 3, 4, 6}
B = {1, 2, 5, 8, 9}

# The intersection and union of two sets can also be computed
# with the & and | operators.
C = A.intersection(B)
D = A.union(B)

print('AnB =', C)
print('AUB =', D)
print('J(A,B) =', len(C) / len(D))
Output:
AnB = {1, 2}
AUB = {1, 2, 3, 4, 5, 6, 8, 9}
J(A,B) = 0.25
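The comment in the code above notes that the `&` and `|` operators work as well. A minimal sketch of the same computation written that way, reusing the sets from Example 1 (the variable names `intersection`, `union`, and `similarity` are chosen here for illustration):

```python
A = {1, 2, 3, 4, 6}
B = {1, 2, 5, 8, 9}

intersection = A & B   # same result as A.intersection(B)
union = A | B          # same result as A.union(B)

similarity = len(intersection) / len(union)
print(similarity)  # 0.25
```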
EXAMPLE: 2
Jaccard similarity can also be used to compare two pieces of text, each represented as a set of unique terms.
Python3
def jaccard_similarity(set1, set2):
    # Size of the intersection of the two sets
    intersection = len(set1.intersection(set2))
    # Size of the union of the two sets
    union = len(set1.union(set2))
    return intersection / union


# Duplicate literals such as the repeated "Geeks" collapse to a single element.
set_a = {"Geeks", "for", "Geeks", "NLP", "DSc"}
set_b = {"Geek", "for", "Geeks", "DSc.", "ML", "DSA"}

similarity = jaccard_similarity(set_a, set_b)
print("Jaccard Similarity:", similarity)
Output:
Jaccard Similarity: 0.25
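In practice the word sets usually come from raw text. The sketch below shows one way to get there, assuming a simple tokenizer that lowercases the text and splits on whitespace. The helper `tokens` and the sample sentences are made up for illustration, and real pipelines typically also strip punctuation or apply stemming.

```python
def tokens(text):
    """Assumed tokenizer: lowercase the text and split on whitespace."""
    return set(text.lower().split())


def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union


doc_a = "the cat sat on the mat"
doc_b = "the cat lay on the rug"

set_a = tokens(doc_a)   # 5 unique words: the, cat, sat, on, mat
set_b = tokens(doc_b)   # 5 unique words: the, cat, lay, on, rug

# 3 shared words (the, cat, on) out of 7 distinct words overall
print(jaccard_similarity(set_a, set_b))
```

Because the texts are reduced to sets, repeated words such as "the" count only once, and word order has no effect on the score.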
Significance of Jaccard Similarity
The Jaccard similarity is especially effective when the order of items is irrelevant and only the presence or absence of elements is examined. It is extensively used in:
- Text Analysis: Jaccard similarity can be used in natural language processing to compare texts, text samples, or even individual words.
- Recommendation Systems: Jaccard similarity can help in finding similar items or products based on user behavior.
- Data Deduplication: Jaccard similarity can be used to find duplicate or near-duplicate records in a dataset.
- Social Network Analysis: Jaccard similarity can be used in social networks to detect similarities between user profiles or groups.
- Genomics: Jaccard similarity is employed to compare gene sets in biological studies.
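As an illustration of the deduplication use case, the following sketch flags pairs of records whose word sets have a Jaccard similarity at or above a threshold. The records, the 0.6 threshold, and the helper names are hypothetical, and the pairwise loop is quadratic in the number of records, so this approach only suits small datasets.

```python
from itertools import combinations


def jaccard_similarity(set1, set2):
    union = set1 | set2
    # Assumption: two empty sets are treated as not similar (0.0).
    return len(set1 & set2) / len(union) if union else 0.0


def near_duplicates(records, threshold):
    """Return (i, j, score) for every pair of records scoring >= threshold."""
    word_sets = [set(r.lower().split()) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        score = jaccard_similarity(word_sets[i], word_sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical sample records
records = [
    "Acme Corp 12 Main Street Springfield",
    "ACME Corp 12 Main Street Springfield IL",
    "Globex Inc 99 Ocean Avenue Shelbyville",
]

for i, j, score in near_duplicates(records, threshold=0.6):
    print(i, j, round(score, 2))  # 0 1 0.86
```

The first two records share 6 of 7 distinct words, so they are reported as a near-duplicate pair, while the third record shares no words with either and is left out.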
How to Calculate Jaccard Similarity in Python
In data science, measuring the similarity between two sets is a common task. Jaccard Similarity is one of the most widely used similarity measures in machine learning, natural language processing, and recommendation systems. This article explains what Jaccard similarity is, why it is important, and how to compute it with Python.