
Cosine Similarity For Very Large Dataset

I am having trouble calculating cosine similarity between a large list of 100-dimensional vectors. When I use from sklearn.metrics.pairwise import cosine_similarity, I get a MemoryError.

Solution 1:

Even though your (500000, 100) array (the parent and its children) fits into memory, any pairwise metric on it won't. The reason is that a pairwise metric, as the name suggests, computes the distance between every pair of children. To store these distances you would need a (500000, 500000) array of floats, which, at 8 bytes per 64-bit float, comes to roughly 2 TB of memory.
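As a quick sanity check on that figure, here is the arithmetic (assuming numpy's default 64-bit floats):

import numpy as np

n = 500_000
bytes_needed = n * n * np.dtype(np.float64).itemsize  # 8 bytes per float64
print(bytes_needed / 1e12)  # ~2.0, i.e. about 2 TB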

Thankfully, there is an easy solution to your problem. If I understand you correctly, you only want the distance between each child and the parent, which results in a vector of length 500000 that easily fits into memory.

To do this, you simply need to pass a second argument to cosine_similarity containing only the parent_vector:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Here I assume the parent vector is stored as the first row of the dataframe,
# but you could also store it separately.
df = pd.DataFrame(np.random.rand(500000, 100))
df['distances'] = cosine_similarity(df, df.iloc[0:1])

n = 10  # or however many children you want
# This contains the parent itself as the most similar entry, hence n + 1 to get n children.
n_largest = df['distances'].nlargest(n + 1)
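If you also want to drop the parent row itself from that result (it always appears first, with similarity 1.0), one way, assuming the parent sits at index 0 as above, is:

top_children = n_largest.iloc[1:]  # skip the first entry (the parent), keep the n most similar children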

Hope that solves your problem.


Solution 2:

I couldn't even fit the entire corpus in memory, so my solution was to load it gradually and compute cosine similarity on smaller batches, always retaining the n least (or most, depending on your use case) similar items:

import numpy as np
from scipy.sparse import csr_matrix


def get_bottom_k(corpus: list, k: int):
    pairwise_similarity = make_similarity_matrix(corpus)  # returns a sparse pairwise similarity matrix
    # Similarity score for each item in the batch: bigger means more similar to the other texts.
    sums = csr_matrix.sum(pairwise_similarity, axis=1)
    sums = np.squeeze(np.asarray(sums))
    # Bottom k in terms of similarity (use -k and [-k:] for the top k instead).
    indexes = np.argpartition(sums, k, axis=0)[:k]
    return [corpus[i] for i in indexes]


data = []
iterations = 0
with open('/media/corpus.txt', 'r') as f:
    for line in f:
        data.append(line)
        if len(data) <= 1000:
            pass
        else:
            print('Getting bottom k, iteration {x}'.format(x=iterations))
            data = get_bottom_k(data, 500)
            iterations += 1
filtered = get_bottom_k(data, 500)  # final 500 most dissimilar texts in the corpus
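make_similarity_matrix isn't shown in the answer; a minimal sketch of such a helper, assuming a plain-text corpus and TF-IDF features (all get_bottom_k needs is a sparse pairwise similarity matrix), might look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def make_similarity_matrix(corpus: list):
    # Vectorize the batch, then compute all pairwise cosine similarities.
    # dense_output=False keeps the result as a sparse matrix, which is what
    # csr_matrix.sum() in get_bottom_k expects.
    tfidf = TfidfVectorizer().fit_transform(corpus)
    return cosine_similarity(tfidf, dense_output=False)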

This is far from an optimal solution, but it's the easiest I have found so far; maybe it will be of help to someone.


Solution 3:

This solution is insanely fast:

import numpy as np

# Stack the 500,000 child vectors into a (500000, 100) array;
# parent_vector is assumed to have shape (1, 100).
child_vectors = np.array([child_vector_1, child_vector_2, ..., child_vector_500000])
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1)[:, np.newaxis]
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1)[:, np.newaxis]
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities
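Note that the sort above discards the child indices. If you also need to know which children are the most similar, a small variation (again assuming parent_vector has shape (1, 100)) is to keep the raw scores and use np.argpartition:

similarities = np.dot(input_norm, embed_norm.T)[0]            # one score per child, shape (500000,)
n = 10
top_idx = np.argpartition(similarities, -n)[-n:]              # indices of the n most similar children (unordered)
top_idx = top_idx[np.argsort(similarities[top_idx])[::-1]]    # order those n indices by decreasing similarity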
