
Python: Keeping Only The Outer Loop Max Result When Comparing String Similarity Of Two Lists

I have two tables with an unequal number of columns but in the same order; let's call them old and new. old has more columns than new. The difference between them is that

Solution 1:

I am here to suggest a different approach altogether. Since you are using Levenshtein distance, I suggest using the fuzzywuzzy package, which has a faster implementation of the same, as well as some ready-made methods that fit your purpose well.

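from fuzzywuzzy import process
from fuzzywuzzy.fuzz import ratio

old = ['Item number', 'Item name', 'Item status', 'Stock volume EUR',
       'Stock volume USD', 'Location']

new = ['Item_number', 'Item', 'Item_status', 'Stock volume EUR', 'Location']

# Pair each new header with its closest old header; extractOne returns (match, score).
mapper = [(new_hdr, process.extractOne(new_hdr, old, scorer=ratio)[0]) for new_hdr in new]
# df is assumed to be the existing DataFrame that has the old column names.
df = df[[i[1] for i in mapper]]      # keep only the matched old columns
df.columns = [i[0] for i in mapper]  # rename them to the new headers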

The solution is much more concise and readable. However, depending on the exact strings, the extractOne method may fail to identify the correct mapping in some cases. Check whether that happens with your data; if it does, we may have to customize the scorer argument in extractOne.
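One way to check for bad matches is to look at the score of each best match and to see whether two new headers picked the same old column. The sketch below is only an illustration: it uses difflib from the standard library as a stand-in for fuzz.ratio, so the scores may differ from what fuzzywuzzy gives, and the names similarity, best_match and THRESHOLD (with its cutoff of 80) are hypothetical choices, not part of any library.

```python
from collections import Counter
from difflib import SequenceMatcher

old = ['Item number', 'Item name', 'Item status', 'Stock volume EUR',
       'Stock volume USD', 'Location']
new = ['Item_number', 'Item', 'Item_status', 'Stock volume EUR', 'Location']


def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(header, candidates):
    """Return (candidate, score) for the highest-scoring candidate."""
    return max(((c, similarity(header, c)) for c in candidates),
               key=lambda pair: pair[1])


# For every new header, keep only its single best old header and the score.
matches = {hdr: best_match(hdr, old) for hdr in new}

# Hypothetical cutoff; tune it for your own headers.
THRESHOLD = 80
weak = [hdr for hdr, (_, score) in matches.items() if score < THRESHOLD]

# An old column chosen by more than one new header signals a bad mapping.
counts = Counter(col for col, _ in matches.values())
clashes = [col for col, n in counts.items() if n > 1]

for hdr, (col, score) in matches.items():
    print(f"{hdr!r} -> {col!r} (score {score})")
print("weak matches:", weak)
print("clashing old columns:", clashes)
```

Any header listed as weak or any clashing old column is worth inspecting by hand before renaming the DataFrame columns. If such cases show up, that is the point where a different scorer may be worth trying.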
