
Word Counts In Python Using Regular Expression

What is the correct way to count English words in a document using a regular expression? I tried with:

words=re.findall('\w+', open('text.txt').read().lower())
len(words)

but it see

Solution 1:

Using \w+ won't correctly count words containing apostrophes or hyphens; e.g., "can't" will be counted as 2 words ("can" and "t"). It will also count numbers (strings of digits): "12,345" and "6.7" will each count as 2 words ("12" and "345", "6" and "7").
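One possible way around both problems is sketched below. The pattern and the names WORD_RE and count_words are assumptions made for illustration, not part of the original answer: it matches runs of ASCII letters, optionally joined by a single apostrophe or hyphen, so digits are ignored and accented letters are not handled.

```python
import re

# Assumed pattern (not from the original answer): one or more ASCII letters,
# optionally followed by groups of a single apostrophe or hyphen plus letters.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def count_words(text):
    """Return the number of letter-based words found in text."""
    return len(WORD_RE.findall(text))

sample = "I can't believe the well-known price is 12,345 dollars."

print(count_words(sample))              # letters-only pattern
print(len(re.findall(r"\w+", sample)))  # plain \w+ for comparison
```

On this sample the letters-only pattern treats "can't" and "well-known" as one word each and skips "12,345", while plain \w+ splits all three. Whether numbers should count as words depends on what you need, so adjust the pattern accordingly.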

Solution 2:

This seems to work as expected.

>>> import re
>>> words=re.findall('\w+', open('/usr/share/dict/words').read().lower())
>>> len(words)
234936

bash-3.2$ wc /usr/share/dict/words
  234936  234936 2486813 /usr/share/dict/words

Why are you lowercasing your words? What does that have to do with the count?

I'd submit that the following would be more efficient, since it skips the unnecessary .lower() pass over the text:

words=re.findall(r'\w+', open('/usr/share/dict/words').read())
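As a rough sketch of the same idea, the snippet below counts \w+ matches without lowercasing and without building the full list of words. The names count_tokens and count_tokens_in_file and the utf-8 default are assumptions for illustration, not something from the answer above.

```python
import re

TOKEN_RE = re.compile(r"\w+")

def count_tokens(text):
    """Count \\w+ matches lazily instead of materializing a list of words."""
    return sum(1 for _ in TOKEN_RE.finditer(text))

def count_tokens_in_file(path, encoding="utf-8"):
    """Count \\w+ matches in a file; the encoding default is an assumption."""
    # The context manager closes the file even if reading raises an error.
    with open(path, encoding=encoding) as handle:
        return count_tokens(handle.read())

sample = "The Quick brown Fox"

# For plain ASCII text like this, lowercasing does not change the count.
print(count_tokens(sample), count_tokens(sample.lower()))
```

Lowercasing only matters if you later need case-insensitive frequencies (for example, treating "The" and "the" as the same word); for a bare total it is extra work.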
