Removing Non-ASCII Characters From File Text
Python experts: I have a sentence like: 'this time air\u00e6\u00e3o was filled\u00e3o' I wish to remove the non-ASCII Unicode characters. I can just the following code an
Solution 1:
I suspect that your file does not contain the actual non-ASCII characters. Instead, the text holds the Unicode escape sequence for each character, written out literally: where you expect a single character such as æ, the file has the six ASCII characters \u00e6. When your code reads the file, every character it sees is already plain ASCII, so the filter leaves them all in place.
If this is the case, use this:
import re
def removeNonAscii(s):
    # Remove each literal backslash-u escape: "\u" followed by four hex digits.
    return re.sub(r'\\u[0-9a-fA-F]{4}', '', s)
This removes every instance of a literal '\u----' sequence (a backslash, a u, and the four characters of the code) from the string.
Example:
>>> with open(r'C:\Users\...\file.txt', 'r') as f:
...     for line in f:
...         print(re.sub(r'\\u[0-9a-fA-F]{4}', '', line))
...
this time airo was filledo
where file.txt contains:
this time air\u00e6\u00e3o was filled\u00a3o
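If you are not sure which of the two situations applies, a small helper can cover both. The following is only a sketch: the name strip_non_ascii and the two sample strings are made up for illustration. It first removes literal escape text, then drops any real characters outside the ASCII range using str.encode with errors='ignore'.

```python
import re

# A literal backslash, the letter "u", and exactly four hexadecimal digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def strip_non_ascii(text):
    # Hypothetical helper, not part of the original answer.
    # Case 1: the text holds the six-character escape sequence itself.
    text = ESCAPE_RE.sub('', text)
    # Case 2: the text holds real characters outside the ASCII range;
    # errors='ignore' silently drops anything that cannot be encoded.
    return text.encode('ascii', errors='ignore').decode('ascii')

# Raw string: the escapes stay as plain ASCII text, as in the file above.
literal = r'this time air\u00e6\u00e3o was filled\u00e3o'
# Normal string: Python turns each escape into one real character.
decoded = 'this time air\u00e6\u00e3o was filled\u00e3o'

print(strip_non_ascii(literal))  # this time airo was filledo
print(strip_non_ascii(decoded))  # this time airo was filledo
```

Comparing len(literal) with len(decoded) is a quick way to tell the cases apart: the literal form is longer, because each escape takes six characters instead of one.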