Remove Stopwords From Words Frequency
I am trying to remove stopwords from this data:
DateTime    Clean
2020-01-07  then  28
            and   28
Solution 1:
- Use the stopwords from nltk.
- They load as a list.
- Update the nltk collections with import nltk and then nltk.download().
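The loading step in the list above can be sketched as a small helper. This is only a sketch under stated assumptions: `load_stop_words` and `FALLBACK_STOP_WORDS` are hypothetical names introduced here, and the fallback set is a tiny hand-written stand-in, not the nltk list. Passing `download=True` fetches just the `stopwords` corpus instead of every collection.

```python
# Hypothetical fallback, used only when nltk or its stopwords corpus is missing.
FALLBACK_STOP_WORDS = {"a", "all", "and", "of", "should", "the", "then", "to", "you"}


def load_stop_words(language="english", download=False):
    """Return a set of stop words, preferring the nltk list when it is available."""
    try:
        import nltk
        from nltk.corpus import stopwords

        if download:
            # Fetch only the stopwords corpus (needs network access).
            nltk.download("stopwords", quiet=True)
        # A set makes the later "w not in stop" checks faster than a list.
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # nltk is not installed, or the corpus has not been downloaded yet.
        return set(FALLBACK_STOP_WORDS)


stop = load_stop_words()
```

Converting the list to a set is a design choice, not something the answer requires; membership tests on a set are faster, which may matter on larger texts.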
import pandas as pd
from nltk.corpus import stopwords
# stop words list
stop = stopwords.words('english')
# data and dataframe
data = {'Text': ['all information regarding the state of art',
'all information regarding the state of art',
'to get a good result you should'],
'DateTime': ['2020-01-07', '2020-02-04', '2020-03-06']}
df = pd.DataFrame(data)
# all strings to lowercase, strip whitespace from the ends, and split on space
df.Text = df.Text.str.lower().str.strip().str.split()
# remove stop words from Text
df['Clean'] = df.Text.apply(lambda x: [w.strip() for w in x if w.strip() not in stop])
# explode lists
df = df.explode('Clean')
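# group by DateTime and Clean, then count the rows in each group;
# size() avoids aggregating on a grouping column, which newer pandas versions may reject
dfg = df.groupby(['DateTime', 'Clean']).size().to_frame('Clean')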
                        Clean
DateTime   Clean
2020-01-07 art              1
           information      1
           regarding        1
           state            1
2020-02-04 art              1
           information      1
           regarding        1
           state            1
2020-03-06 get              1
           good             1
           result           1
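For comparison, the same per-date counts can be sketched without pandas, using only the standard library. The `STOP_WORDS` set below is a small hypothetical stand-in for `stopwords.words('english')`, and `count_words_by_date` is a name made up for this illustration.

```python
from collections import Counter

# Hypothetical stand-in for the nltk list; the real list is much longer.
STOP_WORDS = {"all", "the", "of", "to", "a", "you", "should"}

# Same sample rows as the DataFrame above: (DateTime, Text).
rows = [
    ("2020-01-07", "all information regarding the state of art"),
    ("2020-02-04", "all information regarding the state of art"),
    ("2020-03-06", "to get a good result you should"),
]


def count_words_by_date(rows, stop_words):
    """Map each date to a Counter of its non-stop words."""
    counts = {}
    for date, text in rows:
        # Lowercase, split on whitespace, and drop stop words.
        words = [w for w in text.lower().split() if w not in stop_words]
        counts.setdefault(date, Counter()).update(words)
    return counts


counts = count_words_by_date(rows, STOP_WORDS)
for date, counter in counts.items():
    for word, n in sorted(counter.items()):
        print(date, word, n)
```

This version keeps one Counter per date, so rows that share a date are merged into the same tally, much as the groupby does.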