Compute Edit Distance For A Dataframe Which Has Only One Column And Multiple Rows In Python
I have a dataframe which has one column and more than 2000 rows. How can I compute the edit distance between every pair of rows in that column? My dataframe looks like this: Name Joh
Solution 1:
This is a neat trick I learned courtesy of Adirio. You can use itertools.product to generate every ordered pair of names, and then calculate the edit distance for each pair in a loop.
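import editdistance
import numpy as np
import pandas as pd
from itertools import product
# One slot per ordered pair of names (n * n in total).
dist = np.empty(df.shape[0]**2, dtype=int)
for i, x in enumerate(product(df.Name, repeat=2)):
    dist[i] = editdistance.eval(*x)
# Reshape the flat array into an n x n matrix.
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))
dist_df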
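       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
0      0   8   6   4   5   7   5   5   5   6   4   5   6   5   6
1      8   0   7   7   7   6   8   8   7   8   7   7   8   8   8
2      6   7   0   3   4   5   5   6   6   6   6   5   5   5   4
3      4   7   3   0   4   6   5   5   5   6   4   4   6   4   5
4      5   7   4   4   0   6   5   5   5   6   5   3   5   4   4
5      7   6   5   6   6   0   6   6   6   7   6   5   7   7   6
6      5   8   5   5   5   6   0   2   6   6   5   5   3   6   5
7      5   8   6   5   5   6   2   0   6   6   5   5   4   6   6
8      5   7   6   5   5   6   6   6   0   1   1   5   5   5   6
9      6   8   6   6   6   7   6   6   1   0   2   5   6   6   6
10     4   7   6   4   5   6   5   5   1   2   0   4   5   4   5
11     5   7   5   4   3   5   5   5   5   5   4   0   4   4   3
12     6   8   5   6   5   7   3   4   5   6   5   4   0   4   4
13     5   8   5   4   4   7   6   6   5   6   4   4   4   0   1
14     6   8   4   5   4   6   5   6   6   6   5   3   4   1   0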
np.empty allocates an array without initialising its contents, which you then fill in with each call to editdistance.eval.
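Since edit distance is symmetric and every name is at distance 0 from itself, the loop only really needs to visit each unordered pair once, which roughly halves the work for 2000+ rows. Here is a minimal sketch of that idea. Note the assumptions: levenshtein is a plain-Python stand-in I wrote for illustration (I'm assuming editdistance.eval returns the same standard Levenshtein distance), and pairwise_distances is a hypothetical helper name, not something from any library.

```python
from itertools import combinations


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (stand-in for editdistance.eval)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def pairwise_distances(names):
    """Hypothetical helper: fill both halves of the matrix from one call per pair."""
    n = len(names)
    dist = [[0] * n for _ in range(n)]  # diagonal stays 0
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        dist[i][j] = dist[j][i] = levenshtein(a, b)
    return dist


names = ["chetna", "chetan", "mansi", "mansvi"]
matrix = pairwise_distances(names)
```

The nested list can be passed straight to pd.DataFrame(matrix, index=names, columns=names) if you want the same kind of frame as above.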
Borrowing senderle's cartesian_product, we can achieve some speed gains:
def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)

v = np.apply_along_axis(func1d=lambda x: editdistance.eval(*x),
                        arr=cartesian_product(df.Name, df.Name),
                        axis=1).reshape(-1, df.shape[0])
dist_df = pd.DataFrame(v)
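To see why the final reshape(-1, df.shape[0]) yields the right matrix, it helps to look at the row order that cartesian_product produces. A small sketch with two made-up integer arrays (not the question's data):

```python
import numpy as np


def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)


# Illustrative inputs only: every element of the first array is paired
# with every element of the second, first array varying slowest.
pairs = cartesian_product(np.array([1, 2]), np.array([10, 20]))
print(pairs.tolist())  # [[1, 10], [1, 20], [2, 10], [2, 20]]
```

The pairs come out in the same order as itertools.product, so each run of n consecutive results corresponds to one row of the n x n distance matrix.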
Alternatively, you could define a function to compute edit distance and vectorise it:
def f(x, y):
    return editdistance.eval(x, y)

v = np.vectorize(f)
arr = cartesian_product(df.Name, df.Name).T
arr = v(arr[0, :], arr[1, :])
dist_df = pd.DataFrame(arr.reshape(-1, df.shape[0]))
If you need an annotated index and columns, you can add them when constructing dist_df:
dist_df = pd.DataFrame(..., index=df.Name, columns=df.Name)
dist_df
Name       John  Mrinmayee  rituja  ritz  divya  priyanka  chetna  chetan  \
Name
John          0          8       6     4      5         7       5       5
Mrinmayee     8          0       7     7      7         6       8       8
rituja        6          7       0     3      4         5       5       6
ritz          4          7       3     0      4         6       5       5
divya         5          7       4     4      0         6       5       5
priyanka      7          6       5     6      6         0       6       6
chetna        5          8       5     5      5         6       0       2
chetan        5          8       6     5      5         6       2       0
mansi         5          7       6     5      5         6       6       6
mansvi        6          8       6     6      6         7       6       6
mani          4          7       6     4      5         6       5       5
aliya         5          7       5     4      3         5       5       5
shelia        6          8       5     6      5         7       3       4
Dilip         5          8       5     4      4         7       6       6
Dilipa        6          8       4     5      4         6       5       6

Name       mansi  mansvi  mani  aliya  shelia  Dilip  Dilipa
Name
John           5       6     4      5       6      5       6
Mrinmayee      7       8     7      7       8      8       8
rituja         6       6     6      5       5      5       4
ritz           5       6     4      4       6      4       5
divya          5       6     5      3       5      4       4
priyanka       6       7     6      5       7      7       6
chetna         6       6     5      5       3      6       5
chetan         6       6     5      5       4      6       6
mansi          0       1     1      5       5      5       6
mansvi         1       0     2      5       6      6       6
mani           1       2     0      4       5      4       5
aliya          5       5     4      0       4      4       3
shelia         5       6     5      4       0      4       4
Dilip          5       6     4      4       4      0       1
Dilipa         6       6     5      3       4      1       0
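Once the frame is labelled, you can query it by name. A rough sketch using three of the names and their distances from the output above; the variable names masked and nearest are just for illustration:

```python
import numpy as np
import pandas as pd

# Three names and their pairwise distances, copied from the table above.
names = ["chetna", "chetan", "mansi"]
dist_df = pd.DataFrame([[0, 2, 6],
                        [2, 0, 6],
                        [6, 6, 0]], index=names, columns=names)

# Look up a single pair by label.
d = dist_df.loc["chetna", "chetan"]

# Hide the zero diagonal (each name vs. itself), then pick the
# closest *other* name for every row.
masked = dist_df.mask(np.eye(len(dist_df), dtype=bool))
nearest = masked.idxmin(axis=1)
```

Masking the diagonal matters here: without it, every name's nearest match would simply be itself at distance 0.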