
Remove Stopwords From Words Frequency

I am trying to remove stopwords from these data DateTime Clean 2020-01-07 then 28 and 28

Solution 1:

  • Use stopwords from nltk
    • They load as a list
  • Download or update the nltk collections by running import nltk and then nltk.download(); calling nltk.download('stopwords') fetches only the stop word corpus
import pandas as pd
from nltk.corpus import stopwords

# stop words, converted from a list to a set for fast membership tests
stop = set(stopwords.words('english'))

# data and dataframe
data = {'Text': ['all information regarding the state of art',
                 'all information regarding the state of art',
                 'to get a good result you should'],
        'DateTime': ['2020-01-07', '2020-02-04', '2020-03-06']}

df = pd.DataFrame(data)

# all strings to lowercase, strip whitespace from the ends, and split on space
df.Text = df.Text.str.lower().str.strip().str.split()

# remove stop words from Text (split() already strips whitespace from each token)
df['Clean'] = df.Text.apply(lambda x: [w for w in x if w not in stop])

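# explode lists so each remaining word gets its own row
df = df.explode('Clean')

# group by DateTime and Clean, then count how often each word occurs per date
dfg = df.groupby(['DateTime', 'Clean']).agg({'Clean': 'count'})
print(dfg)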

                        Clean
DateTime   Clean             
2020-01-07 art              1
           information      1
           regarding        1
           state            1
2020-02-04 art              1
           information      1
           regarding        1
           state            1
2020-03-06 get              1
           good             1
           result           1
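
As a cross-check on the counts above, here is a minimal sketch that reproduces them with only the standard library. It is not part of the original answer: the `STOP` set is a small hand-written stand-in for `stopwords.words('english')`, and `word_counts` is a hypothetical helper name used for illustration.

```python
from collections import Counter

# Hypothetical stand-in for stopwords.words('english'); only the words
# needed for this sample data are listed. A set keeps lookups fast.
STOP = {"all", "the", "of", "to", "a", "you", "should"}

# Same sample data as the DataFrame above, as (date, text) pairs.
rows = [
    ("2020-01-07", "all information regarding the state of art"),
    ("2020-02-04", "all information regarding the state of art"),
    ("2020-03-06", "to get a good result you should"),
]


def word_counts(rows, stop):
    """Count (date, word) pairs, skipping any word found in `stop`."""
    counts = Counter()
    for date, text in rows:
        for word in text.lower().split():
            if word not in stop:
                counts[(date, word)] += 1
    return counts


counts = word_counts(rows, STOP)

# Print one line per (date, word) pair, sorted by date and then word.
for (date, word), n in sorted(counts.items()):
    print(date, word, n)
```

With real data, passing `set(stopwords.words('english'))` in place of `STOP` should give the same filtering as the pandas version, assuming the text is tokenized the same way.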
