Python: How to fix the Unexpected UTF-8 BOM error when using json.loads

With Python, it is really easy to retrieve data from third-party API services, so I wrote a script for this purpose. The script worked without any issue for many different API URLs, but recently, when I passed the response from one specific API URL to the json.loads method, it threw an "Unexpected UTF-8 BOM" error. In this article, we will examine what the error means and several ways to solve it.

To retrieve the data from a third-party API service, I use this code in my Python script:

import requests
import json

url = "API_ENDPOINT_URL"
r = requests.get(url)
data = json.loads(r.text)
# ... do something with the data ...

The above code uses the requests library to read the data from the URL and then uses the json.loads method to deserialize the server's string response, which contains JSON data, into a Python object.

Until this particular case, the above code worked just fine, but now I was getting the following error:

json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

The error was raised by json.loads(r.text), so I examined the value of r.text, which looked like this:

\ufeff\n{retrieved data from the api call}

The server's response contained the data from the API, but it also had a strange \ufeff Unicode character at the beginning. It turns out that the Unicode character U+FEFF (encoded as the three bytes \xef\xbb\xbf in UTF-8) is the byte order mark (BOM).
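To see the problem without calling any API, here is a minimal sketch that reproduces the error locally. The JSON payload is a made-up stand-in for the real response, which the article does not show.

```python
import json

# U+FEFF is one code point; UTF-8 encodes it as three bytes.
bom_bytes = "\ufeff".encode("utf-8")
print(bom_bytes)  # b'\xef\xbb\xbf'

# Hypothetical response body: BOM, newline, then the JSON document.
text = "\ufeff\n" + '{"status": "ok"}'

try:
    json.loads(text)
    error_message = None
except json.JSONDecodeError as exc:
    # json.loads refuses a str that starts with U+FEFF.
    error_message = str(exc)

print(error_message)
```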

What is BOM

According to Wikipedia, the BOM is an optional value at the beginning of a text stream, and its presence can mean different things. With UTF-8 text streams, for example, it can be used to signal that the text is encoded in UTF-8, while with UTF-16 and UTF-32, the presence of a BOM signals the byte order of the stream.
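As a small illustration, Python's standard codecs module exposes the BOM byte sequences as constants. The helper name starts_with_utf8_bom below is my own, not something from a library.

```python
import codecs

# The same code point (U+FEFF) serialized in different encodings.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'


def starts_with_utf8_bom(raw: bytes) -> bool:
    """Return True if the raw bytes begin with the UTF-8 BOM."""
    return raw.startswith(codecs.BOM_UTF8)


print(starts_with_utf8_bom(b'\xef\xbb\xbf{"a": 1}'))  # True
print(starts_with_utf8_bom(b'{"a": 1}'))              # False
```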

In my case, the data was in UTF-8 and had already been received, so having that BOM character in r.text seemed unnecessary. Since it was causing the json.loads method to throw the JSONDecodeError, I wanted to get rid of it.

The hint on how to solve this problem can be found in the Python error itself. It mentions "decode using utf-8-sig", so let's examine this next.

What is utf-8-sig?

utf-8-sig is a Python variant of the UTF-8 codec. When used for encoding, it writes the BOM before anything else. When used for decoding, it skips the UTF-8 BOM if one is present, which is exactly what I needed.
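The behaviour described above can be checked in a few lines. This is only a sketch with a made-up payload string:

```python
payload = '{"a": 1}'

# Encoding with utf-8-sig prepends the three BOM bytes.
encoded_sig = payload.encode("utf-8-sig")
encoded_plain = payload.encode("utf-8")
print(encoded_sig)    # b'\xef\xbb\xbf{"a": 1}'
print(encoded_plain)  # b'{"a": 1}'

# Decoding with utf-8-sig drops the BOM if present and is harmless otherwise.
from_sig = encoded_sig.decode("utf-8-sig")
from_plain = encoded_plain.decode("utf-8-sig")

# Decoding with plain utf-8 keeps the BOM as a leading U+FEFF character.
kept_bom = encoded_sig.decode("utf-8")
```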

So the solution is simple. We just need to decode the data using utf-8-sig encoding, which will get rid of the BOM value. There are several ways to accomplish that.

Solution 1 - using codecs module

First, I tried the codecs module, which is part of the Python standard library. It contains encoders and decoders, mostly for converting text. We can use the codecs.decode() function to decode the data using the utf-8-sig encoding. Something like this:

import codecs
decoded_data=codecs.decode(r.text, 'utf-8-sig')

Unfortunately, codecs.decode did not accept a string with this codec and threw the following error:

TypeError: decoding with 'utf-8-sig' codec failed (TypeError: a bytes-like object is required, not 'str')

Next, I tried to convert the string into a bytes object. This can be done using the encode() method available on strings. If no encoding argument is provided, Python 3 uses UTF-8 by default, regardless of platform:

decoded_data=codecs.decode(r.text.encode(), 'utf-8-sig')
data = json.loads(decoded_data)

The decoded_data variable finally contained the data without the BOM character, and I was able to pass it to json.loads.

So, this worked, but I didn't like that I was importing an extra module just to get rid of one BOM character.

Solution 2 - without using the codecs module

It turns out there is a way to encode and decode strings without importing the codecs module. We can simply call the decode() method on the return value of the string's encode() method, so we can just do this:

decoded_data=r.text.encode().decode('utf-8-sig') 
data = json.loads(decoded_data)
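To compare Solutions 1 and 2 without a network call, here is a minimal sketch in which the text variable stands in for r.text. The payload is an assumption of mine, not the real API response.

```python
import codecs
import json

# Stand-in for r.text: BOM, newline, then a made-up JSON document.
text = "\ufeff\n" + '{"id": 7}'

# Solution 1: codecs.decode on the re-encoded bytes.
via_codecs = codecs.decode(text.encode(), "utf-8-sig")

# Solution 2: the same round trip using only str/bytes methods.
via_methods = text.encode().decode("utf-8-sig")

# Both leave the newline in place, which json.loads accepts as whitespace.
data = json.loads(via_methods)
print(data)  # {'id': 7}
```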

Let's try to simplify this further.

Solution 3 - using requests.response content property

So far, the code in this article has used r.text, which contains the response body decoded into a string. We can skip the encoding step altogether by using r.content instead, as this property already contains the response body as bytes. We then simply call decode() on r.content:

decoded_data=r.content.decode('utf-8-sig')
data = json.loads(decoded_data)
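The same idea can be wrapped in a small helper that takes raw bytes such as r.content. The name loads_bom_safe is hypothetical, and the byte string below is a made-up stand-in for a real response body. As far as I can tell, json.loads on Python 3.6+ also accepts bytes directly and detects a UTF-8 BOM itself, which the last line illustrates.

```python
import json


def loads_bom_safe(raw: bytes):
    """Decode raw bytes as UTF-8, dropping a leading BOM if present, then parse JSON."""
    return json.loads(raw.decode("utf-8-sig"))


# Stand-in for r.content (hypothetical payload).
content = b'\xef\xbb\xbf\n{"id": 7}'

with_bom = loads_bom_safe(content)
without_bom = loads_bom_safe(b'{"id": 7}')

# Passing bytes straight to json.loads lets it pick the encoding on its own.
direct = json.loads(content)
```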

Solution 4 - using requests.response encoding property

We can skip the encode() and decode() calls shown in the previous examples altogether and instead use the encoding property of the response object. We just need to make sure we set the value before accessing r.text, as shown below:

r.encoding='utf-8-sig'
data = json.loads(r.text)
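Why the order matters can be shown with a tiny stand-in object. FakeResponse below is a hypothetical class that only mimics the "decode content using the current encoding" step so the snippet runs offline. It is not the requests implementation.

```python
import json


class FakeResponse:
    """Hypothetical stand-in for a response object; NOT the real requests class."""

    def __init__(self, content: bytes, encoding: str = "utf-8"):
        self.content = content
        self.encoding = encoding

    @property
    def text(self) -> str:
        # Decoding happens on access, so changing .encoding first takes effect.
        return self.content.decode(self.encoding)


r = FakeResponse(b'\xef\xbb\xbf\n{"id": 7}')

# With plain utf-8, the BOM survives as the first character of the text.
leading_char = r.text[0]

# Set the encoding before reading the text, as in Solution 4.
r.encoding = "utf-8-sig"
data = json.loads(r.text)
```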

Conclusion

If the json.loads() method throws an Unexpected UTF-8 BOM error, it is due to a BOM value being present in the stream or a file. In this article, we first examined what this BOM is, then we touched a bit about utf-8-sig encoding and finally, we examined 4 ways to solve this problem.
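For the file case mentioned above, the same codec can be passed to open(). This is a minimal sketch that writes a temporary file with a BOM first. The file name and payload are made up for the demonstration.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")

    # Write a file that starts with the UTF-8 BOM, as some editors and tools do.
    with open(path, "wb") as fh:
        fh.write(b'\xef\xbb\xbf{"name": "example"}')

    # Opening in text mode with utf-8-sig skips the BOM while reading.
    with open(path, encoding="utf-8-sig") as fh:
        file_data = json.load(fh)

print(file_data)  # {'name': 'example'}
```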

5 Comments

glbfor
December 20, 2019


I couldn't resolve my problem from the suggestions above, but I finally used the method below to solve the problem successfully.

resp = requests.get(url,params=j,headers=headers_url_encoded,verify=False)
#print(resp.content.decode('utf-8'))
resp.encoding='utf-8-sig'
content = resp.text.encode().decode('utf-8-sig')
return json.loads(content)
Igor
January 25, 2020


Very Good!
Brice
February 3, 2020


Could r.json() work directly ?
As in:
```
r.encoding='utf-8-sig'
data = r.json()
```
Steve
May 15, 2020


I just added the 'b' option to read the file in as byte and then was able to work with the file without any of the above. But this page was helpful in getting me to think through the process:

input_file = open('filename', 'br')
Ashwani Gupta
September 28, 2020


Superb Answer 😀
