Removing Stop Words Without Using NLTK Corpus
I am trying to remove stop words from a text file without using NLTK. I have three text files, f1, f2, and f3: f1 contains the text line by line, f2 contains the list of stop words, and f3 is an empty file. I want to read f1 line by line, drop every word that appears in f2, and write the remaining words to f3.
Solution 1:
You can use the Linux sed command to remove the stop words:
sed -f <(sed 's/.*/s|\\\<&\\\>||g/' stopwords.txt) all_lo.txt > all_remove1.txt
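The inner sed appears to turn each line of stopwords.txt into a substitution command of the form s|\<word\>||g (the \< and \> anchors match word boundaries, so only whole words are removed), and the outer sed -f then applies that generated script to all_lo.txt. For example, if stopwords.txt contained the words "the" and "an" (hypothetical contents), the generated script would look roughly like this on GNU sed:
s|\<the\>||g
s|\<an\>||g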
Solution 2:
What I would personally do is loop through the list of stop words (f2) and append each word to a list in your script. Ex:
stoplist = []
file1 = open('f1.txt', 'r')
file2 = open('f2.txt', 'r')
file3 = open('f3.txt', 'a')  # append mode, so anything already in f3 is kept

# build the stop word list from f2
for line in file2:
    w = line.split()
    for word in w:
        stoplist.append(word)

# write every word from f1 that is not a stop word to f3
for line in file1:
    w = line.split()
    for word in w:
        if word in stoplist:
            continue
        else:
            file3.write(word + ' ')

file1.close()
file2.close()
file3.close()
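A slightly tighter sketch of the same idea (assuming the same f1.txt, f2.txt and f3.txt names) uses a set, so the membership test stays fast on long stop word lists, and with blocks, so the files are closed automatically; it also keeps the original line breaks:
# read the stop words into a set for O(1) lookups
with open('f2.txt', 'r') as stop_file:
    stoplist = set(stop_file.read().split())

# copy every non-stop word from f1.txt to f3.txt, one output line per input line
with open('f1.txt', 'r') as source, open('f3.txt', 'w') as result:
    for line in source:
        kept = [word for word in line.split() if word not in stoplist]
        result.write(' '.join(kept) + '\n')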
Solution 3:
Your first for loop is wrong: after for word in words: t = word, the variable t holds only the last word, not all of them. words is already a list, so work with the list itself. Also, if your files contain multiple lines, your list won't contain all of the words. Do it like this instead; it works correctly:
f1 = open("a.txt", "r")
f2 = open("b.txt", "r")
f3 = open("c.txt", "w")

first_words = []
second_words = []

# collect every word of the text file
for line in f1:
    words = line.split()
    for w in words:
        first_words.append(w)

# collect every stop word
for line in f2:
    w = line.split()
    for i in w:
        second_words.append(i)

# iterate over a copy so that removing items does not skip any words
for word1 in first_words[:]:
    for word2 in second_words:
        if word1 == word2:
            first_words.remove(word2)

for word in first_words:
    f3.write(word)
    f3.write(' ')

f1.close()
f2.close()
f3.close()
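One note on the design: calling remove while looping over the same list is easy to get wrong (the first_words[:] copy above is what keeps it safe), and the nested loops are quadratic in the number of words. As a sketch, the whole removal block can be replaced by a single filtering comprehension that produces the same first_words list and is then written out exactly as above:
second_set = set(second_words)
first_words = [word for word in first_words if word not in second_set]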