Using FreqDist And Writing To CSV

I'm trying to use nltk and pandas to find the top 100 words from another csv and list them on a new CSV. I am able to plot the words but when I print to CSV I get word | count 52

Solution 1:

Here you go. The code is quite compressed, so feel free to expand if you like.

First, ensure the source file is actually a CSV file (i.e. comma separated). I copied/pasted the sample text from the question into a text file and added commas (as shown below).
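One quick way to check this is to look at how the header row splits. The sketch below uses only the standard library; `header_fields` is a hypothetical helper name, and the in-memory strings stand in for the real file.

```python
import csv
import io

def header_fields(stream):
    """Return the column names from the first row of a comma-separated stream."""
    return next(csv.reader(stream), [])

# In-memory stand-ins for the real file (assumed layout: filename,text).
good = io.StringIO("filename,text\nAAL_0000004515_10Q_20200331,generally industry may affected\n")
bad = io.StringIO("filename text\nAAL_0000004515_10Q_20200331 generally industry may affected\n")

print("text" in header_fields(good))  # comma-separated: header splits into two fields
print("text" in header_fields(bad))   # whitespace-separated: header stays a single field
```

If the second case applies to your file, pandas will not find a 'text' column, so add the commas (or pass a different separator to read_csv) before going further.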

Breaking the code down line by line:

  • Read the CSV into a DataFrame
  • Extract the text column, join it into a single string, and tokenise it
  • Pull the 100 most common words
  • Write the results to a new CSV file

Code:

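import pandas as pd
from nltk import FreqDist, word_tokenize

# Note: word_tokenize needs NLTK's Punkt tokenizer data to be downloaded
# first (the resource name depends on the NLTK version).

# Read the CSV into a DataFrame
df = pd.read_csv('./SECParse3.csv')
# Join the text column into one string and tokenise it
words = word_tokenize(' '.join(df['text'].to_numpy()))
# Pull the 100 most common words as (word, count) pairs
common = FreqDist(words).most_common(100)
# Write the results to a new CSV file
pd.DataFrame(common, columns=['word', 'count']).to_csv('words_out.csv', index=False)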

Sample Input:

filename,text
AAL_0000004515_10Q_20200331,generally industry may affected
AAL_0000004515_10Q_20200331,material decrease demand international air travel
AAPL_0000320193_10Q_2020032,february following initial outbreak virus china
AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early

Output:

word,count
cost,2
generally,1
industry,1
may,1
affected,1
material,1
decrease,1
...
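
If you would rather not depend on NLTK's tokenizer data, here is a rough standard-library sketch of the same idea. It is not the code above: it swaps FreqDist for collections.Counter (which FreqDist subclasses) and splits on whitespace instead of calling word_tokenize, which is only reasonable when the text is already cleaned like the sample. `top_words` is a hypothetical helper name, and the in-memory strings stand in for the real input and output files.

```python
import csv
import io
from collections import Counter

def top_words(rows, n=100):
    """Count whitespace-separated words in each row's 'text' field."""
    counts = Counter()
    for row in rows:
        # Rows with a missing text field contribute no words.
        counts.update((row.get("text") or "").split())
    return counts.most_common(n)

# In-memory stand-in for the input CSV (two rows from the sample input).
sample = io.StringIO(
    "filename,text\n"
    "AAL_0000004515_10Q_20200331,generally industry may affected\n"
    "AAP_0001158449_10Q_20200418,restructuring cost cost primarily relating early\n"
)
common = top_words(csv.DictReader(sample))

# Write a header row followed by one (word, count) row per entry.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["word", "count"])
writer.writerows(common)
print(out.getvalue())
```

To work with real files, replace the StringIO objects with open(..., newline='') handles.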