TypeError: 'str' does not support the buffer interface in html2text

May 10, 2024

I'm using python3 to do some web scraping. I want to save a webpage and convert it to text using the following code: import urllib import html2text url='http://www.google.com' page

Solution 1:

I took the time to investigate this, and it turns out to be easily resolved.

Why You Got This Error

The problem is one of bad input: when you called page.read(), a byte string was returned, rather than a regular string.

Byte strings (bytes objects) are how Python 3 represents raw binary data, including text whose character encoding it has not been told. Data read from a network connection arrives as bytes, and the same bytes can stand for different characters depending on the encoding.

Because Python doesn't know which encoding applies, it leaves the data as raw bytes - this is how all data is represented internally anyway - and lets the programmer decide how to decode it.

String methods called on these byte strings with regular string arguments - such as replace(), which html2text tried to use - fail: a bytes object only accepts bytes arguments, and passing a str raises this TypeError.
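The failure can be reproduced without any network access. The sketch below uses a made-up byte string in place of the real page; on recent Python 3 releases the message reads "a bytes-like object is required, not 'str'", which is newer wording for the same error.

```python
raw = b"<p>Hello</p>"   # hypothetical stand-in for what page.read() returns

try:
    raw.replace("<p>", "")          # str arguments on a bytes object
    error_message = None
except TypeError as exc:
    error_message = str(exc)        # wording depends on the Python version

text = raw.decode("ascii")          # decode to str first ...
cleaned = text.replace("<p>", "").replace("</p>", "")   # ... then str methods work
```

After decoding, text is an ordinary str and every string method behaves as expected.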

Solution

html_content = page.read().decode('iso-8859-1')

Padraic Cunningham's solution in the comments is correct in its essence: you first have to tell Python which character encoding to use to map these bytes to the correct characters.

Unfortunately, this particular page isn't encoded as UTF-8, so asking Python to decode it as UTF-8 throws an error.
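As a small illustration of that mismatch, the sketch below encodes a made-up sample string as iso-8859-1 and then tries both decodings. The sample text is an assumption chosen only because it contains a non-ASCII character.

```python
raw = "caf\u00e9".encode("iso-8859-1")   # b'caf\xe9', a made-up sample

try:
    raw.decode("utf-8")                  # 0xE9 alone is not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

latin1_text = raw.decode("iso-8859-1")   # decodes cleanly
```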

The correct encoding is usually declared in the response headers, in the charset parameter of the Content-Type header. This is a standard HTTP header, although servers are not guaranteed to send it or to include a charset.

Calling page.info().get_content_charset() returns that charset (or None if the header does not declare one), which in this case is iso-8859-1. From there, you can decode the bytes using iso-8859-1, so that regular string tools can operate on the result.
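The object returned by page.info() is an http.client.HTTPMessage, a subclass of email.message.Message, so the method can be tried offline. The sketch below builds the headers by hand; the header values are assumptions for illustration, not captured from a real response.

```python
from email.message import Message

# Hand-built headers standing in for page.info()
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"
charset = headers.get_content_charset()   # returned in lower case

# A response that declares no charset at all
bare = Message()
bare["Content-Type"] = "text/html"
missing = bare.get_content_charset()      # None
```

The second case is worth keeping in mind: passing None to decode() raises a TypeError of its own.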

A More Generic Solution

charset_encoding = page.info().get_content_charset()
html_content = page.read().decode(charset_encoding)
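If the server declares no charset, or declares one that does not match the bytes, the two lines above will raise. One possible way to guard against that is sketched below; decode_body and its fallback parameter are hypothetical names, and falling back to UTF-8 with replacement characters is an assumption rather than a rule.

```python
def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode bytes with the declared charset, or a fallback if that fails."""
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not fit the codec:
        # decode with the fallback and substitute undecodable bytes.
        return raw.decode(fallback, errors="replace")
```

With a real response this would be called as decode_body(page.read(), page.info().get_content_charset()).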

Solution 2:

The data read from urlopen is a byte string, which shows up as a b before the opening quote when it is converted with str(). If you strip that b, as in the code below, the result seems to work as input for html2text.

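import urllib.request
import html2text

url = 'http://www.google.com'
with urllib.request.urlopen(url) as page:
    html_content = page.read()  # bytes, not str
    charset_encoding = page.info().get_content_charset()

# The hack: str() on bytes yields "b'...'", and [1:] drops the leading b.
# Caveat: in current html2text releases the second parameter is baseurl,
# not an encoding.
rendered_content = html2text.html2text(str(html_content)[1:], charset_encoding)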

Revised using suggestions about encoding. Yes, it's a hack, but it runs. Not using str() means the original TypeError problem remains.
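To make the cost of the hack concrete, the sketch below compares it with decoding on a made-up byte string. The sample bytes are an assumption; the point is only that str() keeps the quotes and escape sequences of the bytes repr.

```python
raw = b"<p>hi</p>\n"                 # made-up sample bytes

hacked = str(raw)[1:]                # repr minus the b: quotes and a literal \n stay
decoded = raw.decode("iso-8859-1")   # the actual text, with a real newline
```

So the hack feeds html2text the surrounding quotes and a backslash followed by n, whereas decoding gives it the page text itself.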
