Python: How to fix the Unexpected UTF-8 BOM error when using json.loads

With Python, it is really easy to retrieve data from third-party API services, so I wrote a script for this purpose. The script worked without any issue for many different API URLs, but recently, when I passed the server's response from one specific API URL to the json.loads method, it threw an "Unexpected UTF-8 BOM" error. In this article, we will examine what the error means and several ways to solve it.

To retrieve the data from a third-party API service, I use this code in my Python script:

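import json

import requests

url = "API_ENDPOINT_URL"  # placeholder for the real API endpoint
r = requests.get(url)  # fetch the response from the server
data = json.loads(r.text)  # parse the response text as JSON
# ...do something with the data...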

The code above uses the requests library to read the data from the URL, then uses the json.loads method to deserialize the server's string response containing JSON data into a Python object.

Until this particular case, the above code worked just fine, but now I was getting the following error:

json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

The error was raised by the json.loads(r.text) call, so I examined the value of r.text, which looked like this:

\ufeff\n{retrieved data from the api call}
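
The same error can be reproduced without any network call. The sketch below uses a made-up JSON payload as a stand-in for r.text; only the leading \ufeff character matters.

```python
import json

# Simulated response text: a BOM, a newline, then a JSON document.
# The payload is invented for illustration and is not from the real API.
text = '\ufeff\n{"status": "ok"}'

try:
    json.loads(text)
    error_message = None
except json.JSONDecodeError as exc:
    # json.loads rejects a str that starts with U+FEFF
    error_message = str(exc)

print(error_message)
```

Running this should print the same "Unexpected UTF-8 BOM" message shown above.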

The server's response contained the data from the API, but it also had that strange \ufeff Unicode character at the beginning. It turns out that the Unicode character U+FEFF (encoded in UTF-8 as the three bytes \xef\xbb\xbf) is the byte order mark (BOM).
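
The relationship between the character and its bytes can be checked in the interpreter. This is a small sketch using only the standard library; the JSON payload is an invented placeholder.

```python
import codecs

bom_char = '\ufeff'

# Encoding the single character U+FEFF as UTF-8 yields three bytes.
bom_bytes = bom_char.encode('utf-8')
print(bom_bytes)

# The standard library exposes the same byte sequence as a constant.
same_as_constant = bom_bytes == codecs.BOM_UTF8

# Decoding with plain 'utf-8' keeps the BOM as the first character,
# which is how it can end up at the start of a decoded string.
decoded = (codecs.BOM_UTF8 + b'{"status": "ok"}').decode('utf-8')
starts_with_bom = decoded.startswith(bom_char)
```

Both same_as_constant and starts_with_bom come out True here.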

What is BOM?

According to Wikipedia, the BOM is an optional value at the beginning of a text stream, and its presence can mean different things. With UTF-8 text streams, for example, it can be used to signal that the text is encoded in UTF-8, while with UTF-16 and UTF-32, the BOM signals the byte order of the stream.
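
To make the byte-order role concrete, here is a short sketch with UTF-16. The sample text 'A' is arbitrary; the BOM constants come from the standard codecs module.

```python
import codecs

text = 'A'

# The same character has two possible byte layouts in UTF-16.
little_endian = text.encode('utf-16-le')  # low byte first
big_endian = text.encode('utf-16-be')     # high byte first

# Prefixing the matching BOM lets the generic 'utf-16' decoder
# work out the byte order on its own and drop the BOM.
from_le = (codecs.BOM_UTF16_LE + little_endian).decode('utf-16')
from_be = (codecs.BOM_UTF16_BE + big_endian).decode('utf-16')

print(little_endian, big_endian, from_le, from_be)
```

In UTF-8 there is only one byte order, which is why the BOM there acts purely as a signature.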

In my case, the data was in UTF-8 and had already been received, so having that BOM character in r.text seemed unnecessary, and since it was causing the json.loads method to throw the JSONDecodeError, I wanted to get rid of it.

The hint on how to solve this problem can be found in the Python error itself. It mentions "decode using utf-8-sig", so let's examine this next.

What is utf-8-sig?

The utf-8-sig codec is Python's variant of UTF-8: when encoding, it writes the BOM before anything else; when decoding, it skips the UTF-8 BOM if one is present, which is exactly what I needed.
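
Both directions can be seen in a few lines. The payload below is a made-up example, not data from the API in question.

```python
payload = '{"status": "ok"}'

# Encoding with utf-8-sig prepends the three BOM bytes.
with_bom = payload.encode('utf-8-sig')
without_bom = payload.encode('utf-8')
has_bom_prefix = with_bom[:3] == b'\xef\xbb\xbf'

# Decoding with utf-8-sig strips the BOM when present
# and leaves BOM-free input unchanged.
decoded_with_bom = with_bom.decode('utf-8-sig')
decoded_without_bom = without_bom.decode('utf-8-sig')

print(has_bom_prefix, decoded_with_bom == decoded_without_bom)
```

Because decoding BOM-free input is harmless, utf-8-sig is a reasonable choice when it is unknown whether the server sends a BOM.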

So the solution is simple. We just need to decode the data using utf-8-sig encoding, which will get rid of the BOM value. There are several ways to accomplish that.

Solution 1 - using codecs module

First, I tried to use the codecs module, which is part of the Python standard library. It contains encoders and decoders, mostly for converting text. We can use the codecs.decode() function to decode the data using the utf-8-sig encoding. Something like this:

import codecs
decoded_data=codecs.decode(r.text, 'utf-8-sig')

Unfortunately, codecs.decode doesn't accept strings for this codec, and it threw the following error:

TypeError: decoding with 'utf-8-sig' codec failed (TypeError: a bytes-like object is required, not 'str')

Next, I tried to convert the string into a bytes object. This can be done using the encode() method available for strings. If no encoding argument is provided, it uses UTF-8, which is the default for str.encode() in Python 3 on every platform:

decoded_data=codecs.decode(r.text.encode(), 'utf-8-sig')
data = json.loads(decoded_data)

The decoded_data variable finally contained the data without the BOM character, and I was able to pass it to json.loads.

So, this worked, but I didn't like that I was using an extra module just to get rid of one Unicode BOM character.

Solution 2 - without using the codecs module

It turns out there is a way to encode and decode strings without importing the codecs module. We can simply call the decode() method on the return value of the string's encode() method, so we can just do this:

decoded_data=r.text.encode().decode('utf-8-sig') 
data = json.loads(decoded_data)
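
To compare Solutions 1 and 2 side by side without calling an API, the sketch below uses a simulated string in place of r.text. The payload is invented; the point is only that both routes strip the BOM and give the same result.

```python
import codecs
import json

# Stand-in for r.text: BOM, newline, then a made-up JSON document.
text = '\ufeff\n{"status": "ok"}'

# Passing a str to codecs.decode with this codec fails, as described above.
try:
    codecs.decode(text, 'utf-8-sig')
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Solution 1: go through bytes, then decode with the codecs module.
via_codecs = codecs.decode(text.encode('utf-8'), 'utf-8-sig')

# Solution 2: the same round trip using only str and bytes methods.
via_methods = text.encode('utf-8').decode('utf-8-sig')

data = json.loads(via_methods)
print(raised_type_error, via_codecs == via_methods, data)
```

Note that this round trip assumes the text was decoded as UTF-8 in the first place, so that the BOM appears as the single character \ufeff.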

Let's try to simplify this further.

Solution 3 - using the requests.Response content property

So far, the code in this article used r.text, which contains the response content as a string. We can skip the encoding step altogether by using r.content instead, as this property already contains the server's response content in bytes. We then simply call the decode() method on r.content:

decoded_data = r.content.decode('utf-8-sig')
data = json.loads(decoded_data)
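
Working from the raw bytes also avoids depending on how the text was decoded earlier. The sketch below uses a simulated byte string in place of r.content; the ISO-8859-1 case is an illustrative assumption about a wrong encoding guess, not something observed in the original script.

```python
import json

# Stand-in for r.content: the UTF-8 BOM bytes, a newline, then made-up JSON.
raw = b'\xef\xbb\xbf\n{"status": "ok"}'

# Decoding the bytes directly with utf-8-sig drops the BOM.
data = json.loads(raw.decode('utf-8-sig'))

# If the same bytes were decoded with a single-byte encoding such as
# ISO-8859-1, the BOM would turn into three unrelated characters,
# and re-encoding the text as UTF-8 would no longer restore the BOM bytes.
mis_decoded = raw.decode('iso-8859-1')
bom_as_latin1 = mis_decoded[:3]

print(data, bom_as_latin1)
```

In that situation only the bytes-based approach recovers the JSON cleanly.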

Solution 4 - using the requests.Response encoding property

We can skip calling the encode() and decode() methods altogether and instead use the encoding property of the requests.Response object. We just need to make sure we set the value before accessing r.text, as shown below:

r.encoding='utf-8-sig'
data = json.loads(r.text)

Conclusion

If the json.loads() method throws an Unexpected UTF-8 BOM error, it is because a BOM is present at the start of the stream or file. In this article, we first examined what the BOM is, then touched on the utf-8-sig encoding, and finally looked at four ways to solve the problem.
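
As one possible way to wrap this up, the ideas above could be folded into a small helper. The function name loads_bom_safe is hypothetical and not part of json or requests; it is a sketch that assumes the payload is UTF-8.

```python
import json


def loads_bom_safe(payload):
    """Parse JSON from str or bytes, ignoring a leading UTF-8 BOM.

    Hypothetical helper for illustration; assumes UTF-8 input.
    """
    if isinstance(payload, (bytes, bytearray)):
        # utf-8-sig skips the BOM if present, otherwise acts like utf-8.
        payload = bytes(payload).decode('utf-8-sig')
    elif payload.startswith('\ufeff'):
        # For text that was already decoded, drop the BOM character.
        payload = payload[1:]
    return json.loads(payload)


from_text = loads_bom_safe('\ufeff{"status": "ok"}')
from_bytes = loads_bom_safe(b'\xef\xbb\xbf{"status": "ok"}')
from_clean = loads_bom_safe('{"status": "ok"}')
print(from_text, from_bytes, from_clean)
```

With a response object, this could be called as loads_bom_safe(r.content) or loads_bom_safe(r.text).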
