Cosine Similarity For Very Large Dataset
Solution 1:
Even though your (500000, 100) array (the parent and its children) fits into memory, a full pairwise metric on it won't. A pairwise metric, as the name suggests, computes the distance between every two rows. To store these distances you would need a (500000, 500000) array of floats, which at 8 bytes per value would take about 2 TB of memory.
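The size of the full pairwise matrix can be checked with a few lines of arithmetic. This is only a sketch, assuming dense storage with 8-byte (float64) or 4-byte (float32) values; the variable names are illustrative.

```python
n_rows = 500_000

# A full pairwise matrix has one entry per pair of rows.
n_entries = n_rows * n_rows

bytes_float64 = n_entries * 8  # 8 bytes per float64 value
bytes_float32 = n_entries * 4  # 4 bytes per float32 value

print(f"float64: {bytes_float64 / 1e12:.1f} TB")  # float64: 2.0 TB
print(f"float32: {bytes_float32 / 1e12:.1f} TB")  # float32: 1.0 TB

# By contrast, one similarity value per row is tiny.
single_column_bytes = n_rows * 8
print(f"one column: {single_column_bytes / 1e6:.1f} MB")  # one column: 4.0 MB
```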
Thankfully, there is an easy solution to your problem. If I understand you correctly, you only want the similarity between the parent and each child, which is a vector of length 500000 that is easily stored in memory.
To get it, simply pass a second argument to cosine_similarity containing only the parent vector:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df = pd.DataFrame(np.random.rand(500000,100))
# Here I assume that the parent vector is stored as the first row in the dataframe, but you could also store it separately
df['distances'] = cosine_similarity(df, df.iloc[0:1]).ravel() # the result has shape (500000, 1), so flatten it; despite the name, this column holds similarities
n = 10 # or however many you want
# This contains the parent itself as the most similar entry, hence n + 1 to get n children
n_largest = df['distances'].nlargest(n + 1)
Hope that answers your question.
Solution 2:
I couldn't even fit the entire corpus in memory, so my solution was to load it gradually and compute cosine similarity on smaller batches, always retaining the n least or most similar items (depending on your use case):
import numpy as np

def get_bottom_k(corpus: list, k: int):
    if len(corpus) <= k:
        return corpus  # nothing to filter out yet
    # make_similarity_matrix (not shown here) returns the pairwise similarity matrix
    pairwise_similarity = make_similarity_matrix(corpus)
    # Similarity index for each item in corpus. Bigger means more similar to the other texts.
    sums = np.squeeze(np.asarray(pairwise_similarity.sum(axis=1)))
    # Bottom k in terms of similarity (use -k and [-k:] for the top k)
    indexes = np.argpartition(sums, k, axis=0)[:k]
    return [corpus[i] for i in indexes]

data = []
iterations = 0
with open('/media/corpus.txt', 'r') as f:
    for line in f:
        data.append(line)
        # Once the batch grows past 1000 texts, shrink it back to the 500 least similar
        if len(data) > 1000:
            print('Getting bottom k, iteration {x}'.format(x=iterations))
            data = get_bottom_k(data, 500)
            iterations += 1
filtered = get_bottom_k(data, 500)  # final most different 500 texts in corpus
This is far from an optimal solution, but it's the easiest I have found so far; maybe it will be of help to someone.
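The make_similarity_matrix helper is not shown in this answer. As a rough idea of what it might do, here is a hypothetical stand-in that computes cosine similarity between plain bag-of-words count vectors using only NumPy. The tokenisation (lowercase, split on whitespace) and the example texts are assumptions for illustration, and it returns a dense NumPy array rather than a SciPy sparse matrix, so the row sums are taken with the array's own .sum(axis=1).

```python
import numpy as np

def make_similarity_matrix(corpus):
    """Hypothetical stand-in: cosine similarity between bag-of-words count vectors."""
    # Map each distinct lowercase token to a column index.
    vocab = {}
    for text in corpus:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))

    # Fill a dense (n_texts, n_tokens) matrix of token counts.
    counts = np.zeros((len(corpus), len(vocab)))
    for row, text in enumerate(corpus):
        for token in text.lower().split():
            counts[row, vocab[token]] += 1

    # Scale every row to unit length; empty texts keep an all-zero row.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = counts / norms

    # Dot products of unit vectors are cosine similarities.
    return unit @ unit.T

texts = ["the cat sat", "the cat sat down", "stock prices fell"]
sim = make_similarity_matrix(texts)

# Row sums act as the "similarity index" of each text against the others.
sums = sim.sum(axis=1)
most_different = int(np.argmin(sums))
print(most_different)  # 2
```

A dense matrix like this is only practical for small batches, which is what the batching loop feeds it.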
Solution 3:
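This solution is insanely fast:
import numpy as np
# Stack all child vectors into one (500000, dim) array; parent_vector is assumed to have shape (1, dim)
child_vectors = np.array([child_vector_1, child_vector_2, ....., child_vector_500000])
# Scale the parent and every child to unit length
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1)[:, np.newaxis]
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1)[:, np.newaxis]
# Cosine similarities, rounded to 3 decimals and sorted from most to least similar
cosine_similarities = np.sort(np.round(np.dot(input_norm, embed_norm.T), 3)[0])[::-1]
pairwise_distances = 1 - cosine_similarities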
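Sorting the similarity values directly discards which child each value belongs to. If the indices of the most similar children are needed, a variant along these lines may help. It is a sketch with random stand-in data; the array sizes and n are assumptions, and np.argpartition is used here in place of the full sort.

```python
import numpy as np

rng = np.random.default_rng(0)
parent_vector = rng.random((1, 100))        # stand-in parent, shape (1, dim)
child_vectors = rng.random((10_000, 100))   # stand-in children, shape (n_children, dim)

# Scale every vector to unit length so that dot products are cosine similarities.
input_norm = parent_vector / np.linalg.norm(parent_vector, axis=-1, keepdims=True)
embed_norm = child_vectors / np.linalg.norm(child_vectors, axis=-1, keepdims=True)
similarities = (input_norm @ embed_norm.T)[0]  # shape (n_children,)

n = 10
# Pick the n largest similarities without sorting the whole array ...
top_idx = np.argpartition(similarities, -n)[-n:]
# ... then order just those n from most to least similar.
top_idx = top_idx[np.argsort(similarities[top_idx])[::-1]]
top_similarities = similarities[top_idx]
```

Here top_idx[0] is the row of child_vectors closest to the parent, and top_similarities holds the matching cosine similarities.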