Codecs.open(utf-8) Fails To Read Plain Ascii File
Solution 1:
Found your problem:
When passed an encoding, codecs.open returns a StreamReaderWriter, which is really just a wrapper around (not a subclass of; it's a "composed of" relationship, not inheritance) StreamReader and StreamWriter. The problem is:
- StreamReaderWriter provides a "normal" read method (that is, it takes a size parameter and that's it)
- It delegates to the internal StreamReader.read method, where the size argument is only a hint as to the number of bytes to read, not a limit; the second argument, chars, is a strict limiter, but StreamReaderWriter never passes that argument along (it doesn't accept it)
- When size is hinted but not capped using chars, if StreamReader has buffered data and it's large enough to match the size hint, StreamReader.read blindly returns the contents of the buffer rather than limiting it in any way based on the size hint (after all, only chars imposes a maximum return size)
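The wrapper relationship described above can be checked directly. The following is a minimal sketch, assuming Python 3 (for inspect.signature) and using an in-memory io.BytesIO stream as a stand-in for a real file; the variable names are illustrative only. It builds a StreamReaderWriter by hand and inspects it, without trying to reproduce the over-read itself.

```python
import codecs
import inspect
import io

# Stand-in for a file opened via codecs.open: wrap an in-memory byte stream
# (the BytesIO stream and its contents are assumptions made for this demo).
raw = io.BytesIO(b"first line\nsecond line\n")
wrapper = codecs.StreamReaderWriter(
    raw, codecs.getreader("utf-8"), codecs.getwriter("utf-8")
)

# Composition, not inheritance: the wrapper holds a StreamReader in .reader.
is_subclass = issubclass(codecs.StreamReaderWriter, codecs.StreamReader)
holds_reader = isinstance(wrapper.reader, codecs.StreamReader)

# Compare the parameters each read() method accepts.
wrapper_params = list(inspect.signature(wrapper.read).parameters)
reader_params = list(inspect.signature(wrapper.reader.read).parameters)

print("subclass of StreamReader:", is_subclass)
print("holds a StreamReader:", holds_reader)
print("wrapper.read parameters:", wrapper_params)
print("reader.read parameters:", reader_params)
```

Only the inner reader's read exposes chars, which is why the cap can't be requested through the wrapper.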
The API of StreamReader.read and the meaning of size/chars for that API are the only documented things here; the fact that codecs.open returns StreamReaderWriter is not contractual, nor is the fact that StreamReaderWriter wraps StreamReader. I just used ipython's ?? magic to read the source code of the codecs module to verify this behavior. But documented or not, that's what it's doing (feel free to read the source code for StreamReaderWriter; it's all Python level, so it's easy).
The best solution is to switch to io.open, which is faster and more correct in every standard case (codecs.open supports the weirdo codecs that don't convert between bytes [Py2 str] and str [Py2 unicode], but rather handle str to str or bytes to bytes encodings, but that's an incredibly limited use case; most of the time, you're converting between bytes and str). All you need to do is import io instead of codecs, and change the codecs.open line to:
f = io.open("test.py", encoding="utf-8")
The rest of your code can remain unchanged (and will likely run faster to boot).
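As a rough illustration of the io.open approach, here is a self-contained sketch that mixes readline with single-character reads. The temporary file and its contents are assumptions standing in for the question's test.py.

```python
import io
import os
import tempfile

# Create a throwaway ASCII file standing in for the question's test.py.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(b"print('hello')\nx = 1\n")
    path = tmp.name

try:
    with io.open(path, encoding="utf-8") as f:
        first_line = f.readline()
        chars = []
        while True:
            c = f.read(1)  # here the argument is a real cap on characters
            if not c:      # empty string means end of file
                break
            chars.append(c)
finally:
    os.remove(path)

print(repr(first_line))
print(chars)
```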
As an alternative, you could explicitly bypass StreamReaderWriter to get the StreamReader's read method and pass the limiting argument directly, e.g. change:
c = f.read(1)
to:
# Pass second, character limiting argument after size hint
c = f.reader.read(6, 1)  # 6 is sort of arbitrary; should ensure a full char read in one go
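If you stay with codecs.open, the bypass can be wrapped in a small helper. This is a sketch only: read_char is a hypothetical name, the temporary file stands in for test.py, and it relies on the undocumented .reader attribute discussed above. Recent Python versions may also emit a DeprecationWarning for codecs.open.

```python
import codecs
import os
import tempfile


def read_char(f):
    """Return at most one decoded character from a codecs.open file object."""
    # size=6 is only a byte-count hint; chars=1 is the hard cap.
    return f.reader.read(6, 1)


# Throwaway ASCII file standing in for the question's test.py.
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(b"print('hello')\nx = 1\n")
    path = tmp.name

try:
    f = codecs.open(path, encoding="utf-8")
    try:
        first_line = f.readline()  # fills the reader's internal buffer
        chars = []
        while True:
            c = read_char(f)
            if not c:  # empty string means end of file
                break
            chars.append(c)
    finally:
        f.close()
finally:
    os.remove(path)

print(repr(first_line))
print(chars)
```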
I suspect Python Bug #8260, which covers intermingling readline and read on codecs.open created file objects, applies here. Officially, it's "fixed", but if you read the comments, the fix wasn't complete (and may not be possible to complete given the documented API); arbitrarily weird combinations of read and readline will be able to break it.
Again, just use io.open; as long as you're on Python 2.6 or higher, it's available, and it's just plain better.