Lemmatization Of All Pandas Cells
I have a pandas DataFrame with one column, let's call it 'col'. Each entry of this column is a list of words: ['word1', 'word2', etc.]. How can I efficiently compute the lemma o
Solution 1:
You can use apply from pandas with a function that lemmatizes each word in a given string. Note that there are many ways to tokenize your text. If you use a whitespace tokenizer, punctuation stays attached to the neighbouring word, so you might have to remove symbols like . yourself.
Below is an example of how to lemmatize a column of an example DataFrame.
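import nltk
import pandas as pd
# The WordNet corpus must be available; run nltk.download('wordnet') once if it is not.
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace, then lemmatize each token (as a noun, the default).
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df = pd.DataFrame(['this was cheesy', 'she likes these books', 'wow this is great'], columns=['text'])
df['text_lemmatized'] = df.text.apply(lemmatize_text)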
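The remark about symbols can be made concrete. With whitespace tokenization a token such as 'great.' keeps its trailing period and may not be recognized by the lemmatizer. The following is a minimal sketch of one way to clean tokens first, using only the standard library; the helper name strip_punctuation is hypothetical and not part of NLTK or pandas.

```python
import string

def strip_punctuation(tokens):
    """Strip leading/trailing punctuation from each token and drop empty results."""
    stripped = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in stripped if tok]

# Whitespace splitting leaves "wow," and "." as tokens; the helper cleans them up.
tokens = "wow, this is great .".split()
print(strip_punctuation(tokens))  # ['wow', 'this', 'is', 'great']
```

Under these assumptions, lemmatize_text could pass w_tokenizer.tokenize(text) through strip_punctuation before lemmatizing. Punctuation inside a word (for example an apostrophe) is left untouched.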
Solution 2:
|col|
['Sushi Bars', 'Restaurants']
['Burgers', 'Fast Food', 'Restaurants']
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
The code below defines a function that takes a list of words and returns the list of lemmatized words. This should work.
def lemmatize(s):
    '''For lemmatizing each word in the list s
    '''
    s = [wnl.lemmatize(word) for word in s]
    return s

dataset = dataset.assign(col_lemma=dataset.col.apply(lemmatize))
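Since the question asks about efficiency, one option worth trying is to cache the per-word result so that words repeated across rows (such as 'Restaurants') are lemmatized only once. This is a sketch, not a benchmarked solution: make_list_lemmatizer and toy_lemmatize are hypothetical names, and toy_lemmatize is a stand-in that merely strips a trailing "s" so the example runs without NLTK data.

```python
from functools import lru_cache

def make_list_lemmatizer(lemmatize_word):
    """Wrap a word-level lemmatizer so each distinct word is computed only once."""
    cached = lru_cache(maxsize=None)(lemmatize_word)

    def lemmatize_list(words):
        return [cached(word) for word in words]

    return lemmatize_list

# Stand-in for wnl.lemmatize (assumption: NOT a real lemmatizer).
calls = []
def toy_lemmatize(word):
    calls.append(word)  # record every uncached call
    return word[:-1] if word.endswith("s") else word

lemmatize_list = make_list_lemmatizer(toy_lemmatize)
rows = [["Burgers", "Restaurants"], ["Restaurants"]]
result = [lemmatize_list(row) for row in rows]
print(result)  # [['Burger', 'Restaurant'], ['Restaurant']]
print(calls)   # ['Burgers', 'Restaurants']
```

With the real lemmatizer this would become lemmatize_list = make_list_lemmatizer(wnl.lemmatize), followed by dataset.col.apply(lemmatize_list). Whether it helps in practice depends on how often words repeat in the column.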