Removing Non-ascii Characters From File Text
Python experts: I have a sentence like 'this time air\u00e6\u00e3o was filled\u00e3o' and I wish to remove the non-ASCII Unicode characters. I can just the following code an
Solution 1:
I suspect that the text in your file does not contain the actual non-ASCII characters. Instead, it contains the literal escape sequence for each one: a backslash, a 'u', and four hex digits (for example \u00e6), all of which are plain ASCII. So when your code reads the text character by character, every character passes the check and the filter leaves them all in place.
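To check which situation you are in, a small sketch like the one below may help. It compares the same visible text stored as real characters and as literal escape text. The helper name is_ascii and the two sample strings are made up for illustration.

```python
# The same visible text in two different forms.
real = "air\u00e6\u00e3o"       # Python interprets the escapes: two real non-ASCII characters
literal = r"air\u00e6\u00e3o"   # raw string: backslash, 'u', and hex digits kept as plain text

def is_ascii(s):
    """Return True if every character in s has a code point below 128."""
    return all(ord(ch) < 128 for ch in s)

print(len(real), is_ascii(real))        # 6 False
print(len(literal), is_ascii(literal))  # 16 True
```

If your file behaves like the second string, a filter that tests each character for being ASCII has nothing to remove.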
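If this is the case, use this:
import re

def removeNonAscii(s):
    # Remove each literal escape: a backslash, 'u', and the four characters after it
    return re.sub(r'\\u\w{4}', '', s)

This removes every occurrence of the '\u----' pattern from the string.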
Example:
>>> with open(r'C:\Users\...\file.txt', 'r') as f:
...     for line in f:
...         print(re.sub(r'\\u\w{4}', '', line))
...
this time airo was filledo
where file.txt has:
this time air\u00e6\u00e3o was filled\u00a3o
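
If you are not sure whether the file holds literal escapes or real non-ASCII characters, one option is to handle both. This is only a sketch under that assumption: strip_non_ascii and ESCAPE_RE are hypothetical names, and the pattern is narrowed to four hex digits rather than the \w{4} used above.

```python
import re

# Matches a literal backslash, 'u', and exactly four hexadecimal digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def strip_non_ascii(text):
    """Hypothetical helper: drop literal \\uXXXX escapes, then real non-ASCII characters."""
    without_escapes = ESCAPE_RE.sub('', text)
    # Encoding to ASCII with errors='ignore' silently discards anything outside ASCII.
    return without_escapes.encode('ascii', 'ignore').decode('ascii')

# Literal escape text (raw string) and real characters give the same result.
print(strip_non_ascii(r'this time air\u00e6\u00e3o was filled\u00a3o'))
print(strip_non_ascii('this time air\u00e6\u00e3o was filled\u00a3o'))
```

Both calls print: this time airo was filledo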