Cosine similarity is a metric that measures how similar two vectors are, and it is widely used for text similarity and clustering tasks in machine learning. The cosine similarity between two vectors is defined as the cosine of the angle between them, which can be computed as the dot product of the vectors divided by the product of their magnitudes.
Here is the formula for cosine similarity between two vectors A and B:
Cosine Similarity = (A · B) / (∥A∥ ∥B∥)
Where:
- A · B is the dot product of the vectors A and B.
- ∥A∥ and ∥B∥ are the magnitudes (or norms) of the vectors A and B.
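To make the formula concrete, here is a minimal from-scratch sketch using only NumPy: np.dot gives the numerator and np.linalg.norm gives the magnitudes in the denominator. The function name manual_cosine_similarity is our own choice for illustration, not part of any library:
import numpy as np

def manual_cosine_similarity(a, b):
    # Numerator of the formula: the dot product A · B
    dot = np.dot(a, b)
    # Denominator: the product of the vector magnitudes ∥A∥ ∥B∥
    norms = np.linalg.norm(a) * np.linalg.norm(b)
    return dot / norms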
Implementation Using Python:
Import Libraries:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
- numpy is used for numerical operations and creating arrays.
- cosine_similarity from sklearn.metrics.pairwise computes the cosine similarity between two vectors.
Define Embeddings:
embedding_1 = np.array([1, 2, 3])
embedding_2 = np.array([4, 5, 6])
embedding_3 = np.array([7, 8, 9])
- These are example embeddings represented as 3-dimensional vectors. In practice, such vectors would come from an embedding model like Word2Vec, GloVe, or BERT.
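For instance, if you have the sentence-transformers package installed, real text embeddings could be obtained roughly as follows. The model name below is just one common choice, used purely as an illustration; any encoder works:
from sentence_transformers import SentenceTransformer

# Example model name; swap in whichever encoder suits your task
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding_1, embedding_2 = model.encode(["first sentence", "second sentence"])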
Compute Cosine Similarity:
similarity_1_2 = cosine_similarity([embedding_1], [embedding_2])
similarity_1_3 = cosine_similarity([embedding_1], [embedding_3])
similarity_2_3 = cosine_similarity([embedding_2], [embedding_3])
- The cosine_similarity function computes the similarity between each pair of embeddings.
- The function returns a 2D array (matrix), but since we are comparing single pairs of vectors, it’s a 1×1 matrix. We extract the value with [0][0].
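As an aside, the three pairwise comparisons above can also be done in a single call by stacking the vectors (defined earlier) into one matrix; with a single argument, cosine_similarity compares every row against every other row and returns the full 3×3 similarity matrix:
similarity_matrix = cosine_similarity(np.vstack([embedding_1, embedding_2, embedding_3]))
# similarity_matrix[i][j] holds the cosine similarity between vector i and vector j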
Print Results:
print(f"Cosine Similarity between embedding_1 and embedding_2: {similarity_1_2[0][0]}")
print(f"Cosine Similarity between embedding_1 and embedding_3: {similarity_1_3[0][0]}")
print(f"Cosine Similarity between embedding_2 and embedding_3: {similarity_2_3[0][0]}")
Results Interpretation
The cosine_similarity function computes the similarity between the given vectors. The output values range between -1 and 1:
- 1 means the vectors are identical (i.e., they point in the same direction).
- 0 means the vectors are orthogonal (i.e., they are at 90 degrees to each other, no similarity).
- -1 means the vectors are diametrically opposite.
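These three cases are easy to verify with simple 2-dimensional vectors, for example:
print(cosine_similarity([[1, 0]], [[1, 0]])[0][0])   # 1.0  (identical direction)
print(cosine_similarity([[1, 0]], [[0, 1]])[0][0])   # 0.0  (orthogonal)
print(cosine_similarity([[1, 0]], [[-1, 0]])[0][0])  # -1.0 (opposite direction)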
Given the embeddings:
- embedding_1 = [1, 2, 3]
- embedding_2 = [4, 5, 6]
- embedding_3 = [7, 8, 9]
The similarity calculations are:
- Cosine similarity between embedding_1 and embedding_2: similarity_1_2 = (1·4 + 2·5 + 3·6) / (√(1² + 2² + 3²) · √(4² + 5² + 6²)) = 32 / (√14 · √77) = 32 / √1078 ≈ 0.9746
- Cosine similarity between embedding_1 and embedding_3: similarity_1_3 = (1·7 + 2·8 + 3·9) / (√(1² + 2² + 3²) · √(7² + 8² + 9²)) = 50 / (√14 · √194) = 50 / √2716 ≈ 0.9594
- Cosine similarity between embedding_2 and embedding_3: similarity_2_3 = (4·7 + 5·8 + 6·9) / (√(4² + 5² + 6²) · √(7² + 8² + 9²)) = 122 / (√77 · √194) = 122 / √14938 ≈ 0.9982
The embeddings all point in similar directions, so every pairwise cosine similarity is close to 1, with embedding_2 and embedding_3 being the most closely aligned pair. This indicates a high degree of similarity between all pairs of vectors.
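If you want to double-check the hand calculations, the same numbers fall out of NumPy directly. This sketch reuses np and the embedding arrays defined earlier:
pairs = {
    "1 vs 2": (embedding_1, embedding_2),
    "1 vs 3": (embedding_1, embedding_3),
    "2 vs 3": (embedding_2, embedding_3),
}
for name, (a, b) in pairs.items():
    # Apply the cosine similarity formula term by term
    value = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    print(name, round(value, 4))  # prints 0.9746, 0.9594, 0.9982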