
Removing Non-ascii Characters From File Text

Python experts: I have a sentence like: 'this time air\u00e6\u00e3o was filled\u00e3o'. I wish to remove the non-ASCII Unicode characters. I can just the following code an

Solution 1:

I have a feeling that your file does not contain the actual non-ASCII characters, but the literal text of their Unicode escape sequences. That is, instead of whatever character you think is there, the file holds the six ASCII characters \u00--. When your code reads the file, every character it sees is plain ASCII, so the filter leaves them all in place.
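To make the difference concrete, here is a minimal sketch. The keep_ascii filter is hypothetical: it stands in for whatever filter the question uses, which is not shown in full.

```python
# A raw string keeps the escape as literal text: backslash, 'u', four hex digits.
literal = r"air\u00e6"
# A normal string turns the escape into one real non-ASCII character.
actual = "air\u00e6"

def keep_ascii(s):
    # Hypothetical filter: keep only characters with code points below 128.
    return "".join(ch for ch in s if ord(ch) < 128)

print(len(literal), len(actual))  # 9 4
print(keep_ascii(literal))        # air\u00e6  (nothing removed, it is all ASCII)
print(keep_ascii(actual))         # air
```

The literal form passes through such a filter untouched, which would explain why the code in the question appears to do nothing.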

If this is the case, use this:

import re

def removeNonAscii(s):
    # Remove every literal "\u" followed by four word characters
    return re.sub(r'\\u\w{4}', '', s)

This removes every instance of '\u----' from the string.

Example:

>>> with open(r'C:\Users\...\file.txt', 'r') as f:
...     for line in f:
...         print(re.sub(r'\\u\w{4}', '', line))
...
this time airo was filledo

where file.txt has:

this time air\u00e6\u00e3o was filled\u00a3o
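If you are not sure which of the two cases you have, a sketch like the following handles both. The name strip_non_ascii and the stricter hex-only pattern are my own choices, not part of the code above: it first drops literal \uXXXX escapes, then drops any real non-ASCII characters by encoding to ASCII with errors ignored.

```python
import re

# Literal backslash, 'u', then exactly four hex digits (stricter than \w{4}).
ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")

def strip_non_ascii(s):
    # Step 1: remove escape sequences stored as plain text.
    s = ESCAPE.sub("", s)
    # Step 2: remove actual non-ASCII characters.
    return s.encode("ascii", "ignore").decode("ascii")

sample = r"this time air\u00e6\u00e3o was filled\u00a3o"
print(strip_non_ascii(sample))            # this time airo was filledo
print(strip_non_ascii("caf\u00e9 time"))  # caf time
```

Using [0-9a-fA-F] instead of \w avoids eating text such as '\uwxyz' that merely looks like an escape. Whether that matters depends on what is in your file.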
