With Python, it is really easy to retrieve data from third-party API services, so I made a script for this purpose. The script worked without any issue for many different API URLs, but recently, when I tried to pass the server response from one specific API URL to the json.loads method, it threw an "Unexpected UTF-8 BOM" error. In this article, we will examine what the error means and various ways to solve it.
To retrieve the data from a third-party API service, I use this code in my Python script:
import requests
import json
url="API_ENDPOINT_URL"
r = requests.get(url)
data = json.loads(r.text)
#....do something with the data...
The above code uses the requests library to read the data from the URL and then uses the json.loads method to deserialize the server's string response containing JSON data into an object.
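Until this particular case, the above code worked just fine, but now I was getting the following error:
json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)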
The error was raised by the json.loads(r.text) call, so I examined the value of r.text, which had this:
The content of the server's response contained the data from the API, but it also had that strange \ufeff Unicode character at the beginning. It turns out that the Unicode character U+FEFF (encoded in UTF-8 as the three bytes \xef\xbb\xbf) is the byte order mark (BOM) character.
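The failure can be reproduced without any network call. The following is a minimal sketch that assumes the response body was decoded into a string starting with U+FEFF; the `payload` value is made up for illustration and is not the actual API response.

```python
import json

# Hypothetical response text: valid JSON preceded by the BOM character U+FEFF.
payload = '\ufeff{"status": "ok"}'

try:
    json.loads(payload)
    error_message = None
except json.JSONDecodeError as exc:
    # json.loads refuses str input that starts with the BOM character.
    error_message = str(exc)

print(error_message)
```

Removing the leading character by hand (for example with `payload.lstrip('\ufeff')`) would also make the string parseable, but the approaches discussed in the rest of the article deal with the BOM at the decoding step instead.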
What is BOM?
According to Wikipedia, the BOM is an optional value at the beginning of a text stream, and its presence can mean different things. With UTF-8 text streams, for example, it can be used to signal that the text is encoded in UTF-8, while with UTF-16 and UTF-32, the BOM signals the byte order of the stream.
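To make this concrete, the short sketch below encodes the same U+FEFF character in a few encodings and compares the results with the BOM constants from the standard codecs module. The variable names are chosen only for this illustration.

```python
import codecs

bom_char = "\ufeff"

# The same code point produces different byte sequences per encoding.
utf8_bom = bom_char.encode("utf-8")
utf16_le_bom = bom_char.encode("utf-16-le")
utf16_be_bom = bom_char.encode("utf-16-be")

print(utf8_bom, utf16_le_bom, utf16_be_bom)

# The codecs module exposes the same byte sequences as constants.
print(utf8_bom == codecs.BOM_UTF8)
print(utf16_le_bom == codecs.BOM_UTF16_LE)
print(utf16_be_bom == codecs.BOM_UTF16_BE)
```

In UTF-16 the two possible byte sequences are mirror images of each other, which is what lets a reader work out the byte order. In UTF-8 there is only one possible sequence, so the BOM carries no byte order information there.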
In my case, the data was in UTF-8 and had already been received, so having that BOM character in r.text seemed unnecessary. Since it was causing the json.loads method to throw the JSONDecodeError, I wanted to get rid of it.
The hint on how to solve this problem can be found in the Python error itself. It mentions "decode using utf-8-sig", so let's examine this next.
What is utf-8-sig?
utf-8-sig is a Python codec that is a variant of UTF-8. When used for encoding, it writes the UTF-8 BOM before anything else. When used for decoding, it skips the UTF-8 BOM if one is present at the start of the data, and this is exactly what I needed.
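Both directions can be checked with a small sketch. The sample `text` value below is made up for illustration.

```python
text = '{"status": "ok"}'

# Encoding: utf-8-sig prepends the three BOM bytes, plain utf-8 does not.
encoded_with_bom = text.encode("utf-8-sig")
encoded_plain = text.encode("utf-8")
print(encoded_with_bom[:3])

# Decoding: utf-8-sig drops a leading BOM if present...
decoded_sig = encoded_with_bom.decode("utf-8-sig")
# ...and leaves data without a BOM untouched.
decoded_sig_no_bom = encoded_plain.decode("utf-8-sig")
# Plain utf-8 keeps the BOM as a U+FEFF character in the result.
decoded_utf8 = encoded_with_bom.decode("utf-8")

print(decoded_sig == text, decoded_sig_no_bom == text)
print(decoded_utf8.startswith("\ufeff"))
```

Because decoding with utf-8-sig is harmless when no BOM is present, it should be safe to use for UTF-8 input whether or not the server sends a BOM.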
So the solution is simple. We just need to decode the data using utf-8-sig encoding, which will get rid of the BOM value. There are several ways to accomplish that.
Solution 1 - using the codecs module
First, I tried to use the codecs module, which is part of the Python standard library. It contains encoders and decoders, mostly for converting text. We can use the codecs.decode() function to decode the data using the utf-8-sig encoding. Something like this:
import codecs
decoded_data=codecs.decode(r.text, 'utf-8-sig')
Unfortunately, codecs.decode does not accept strings for this encoding (it expects a bytes-like object), so it threw the following error:
Next, I tried to convert the string into a bytes object. This can be done using the encode() method available for strings. If no encoding argument is provided, it uses UTF-8, which is the default for str.encode() in Python 3 on every platform:
decoded_data=codecs.decode(r.text.encode(), 'utf-8-sig')
data = json.loads(decoded_data)
The decoded_data variable finally contained the data without the BOM character, and I was able to pass it to the json.loads method.
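The two attempts above can be sketched offline as follows. The `response_text` value is a made-up stand-in for r.text, assuming requests decoded the body as plain UTF-8 and therefore left the BOM in place as a U+FEFF character.

```python
import codecs
import json

# Hypothetical stand-in for r.text.
response_text = '\ufeff{"status": "ok"}'

# First attempt: passing a str to the utf-8-sig decoder is rejected.
try:
    codecs.decode(response_text, "utf-8-sig")
    str_rejected = False
except TypeError:
    str_rejected = True

# Second attempt: turn the string back into UTF-8 bytes, then decode
# those bytes with utf-8-sig so the leading BOM is dropped.
raw_bytes = response_text.encode("utf-8")
decoded_data = codecs.decode(raw_bytes, "utf-8-sig")
data = json.loads(decoded_data)

print(str_rejected, data)
```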
So, this worked, but I didn't like that I was using an extra module just to get rid of one BOM character.
Solution 2 - without using the codecs module
It turns out there is a way to encode and decode strings without importing the codecs module. We can simply call the decode() method on the bytes object returned by the string's encode() method, so we can just do this:
decoded_data=r.text.encode().decode('utf-8-sig')
data = json.loads(decoded_data)
Let's try to simplify this further.
Solution 3 - using the requests.Response content property
So far, the code in this article used r.text, which contains the response body as a string. We can skip the encoding step altogether by using r.content instead, as this property already contains the response body as bytes. We then simply call the decode() method on r.content:
decoded_data=r.content.decode('utf-8-sig')
data = json.loads(decoded_data)
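Here is an offline sketch of the same idea, where `raw` is a made-up stand-in for r.content. It also shows a related behavior of the standard library: since Python 3.6, json.loads accepts bytes and detects the encoding itself, including a UTF-8 BOM, so passing the raw bytes directly should work as well. Whether that fits a given script is a judgment call; decoding explicitly makes the intent more visible.

```python
import json

# Hypothetical stand-in for r.content: BOM bytes followed by JSON.
raw = b'\xef\xbb\xbf{"status": "ok"}'

# Explicit decoding, as in the article.
data_explicit = json.loads(raw.decode("utf-8-sig"))

# Alternative: let json.loads detect the encoding of the bytes input.
data_direct = json.loads(raw)

print(data_explicit, data_direct)
```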
Solution 4 - using the requests.Response encoding property
We can skip calling the encode() and decode() methods altogether and instead use the encoding property of the requests.Response object. We just need to make sure we set the value before accessing r.text, as shown below:
r.encoding='utf-8-sig'
data = json.loads(r.text)
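This can be wrapped in a small helper. The sketch below is only an illustration: `load_json` is a hypothetical helper name, and `StubResponse` is a hypothetical stand-in that imitates the way a requests.Response decodes its content using the current encoding, so that the example runs without a network call. With a real response object, only the helper would be needed.

```python
import json


class StubResponse:
    """Hypothetical stand-in for requests.Response, for offline use only."""

    def __init__(self, content, encoding="utf-8"):
        self.content = content
        self.encoding = encoding

    @property
    def text(self):
        # Decode the raw bytes with whatever encoding is currently set.
        return self.content.decode(self.encoding, errors="replace")


def load_json(response):
    """Parse a JSON response body, tolerating a leading UTF-8 BOM."""
    # The encoding must be set before the text property is read.
    response.encoding = "utf-8-sig"
    return json.loads(response.text)


stub = StubResponse(b'\xef\xbb\xbf{"status": "ok"}')
text_before = stub.text
data = load_json(stub)
print(data)
```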
Conclusion
If the json.loads() method throws an "Unexpected UTF-8 BOM" error, it is due to a BOM being present at the start of the stream or file. In this article, we first examined what the BOM is, then touched on the utf-8-sig encoding, and finally examined four ways to solve the problem.
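The same idea applies when the JSON comes from a file rather than an HTTP response. The following is a minimal sketch that writes a temporary file with a BOM and reads it back; the file contents are made up for illustration.

```python
import json
import os
import tempfile

# Create a temporary JSON file that starts with the UTF-8 BOM bytes.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as handle:
    handle.write(b'\xef\xbb\xbf{"status": "ok"}')
    path = handle.name

try:
    # Opening the file with utf-8-sig strips the BOM while reading.
    with open(path, encoding="utf-8-sig") as source:
        data = json.load(source)
finally:
    os.remove(path)

print(data)
```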
glbfor
December 20, 2019
I couldn't resolve my problem with the suggestions above, but I finally solved it using the method below.
resp = requests.get(url,params=j,headers=headers_url_encoded,verify=False)
#print(resp.content.decode('utf-8'))
resp.encoding='utf-8-sig'
content = resp.text.encode().decode('utf-8-sig')
return json.loads(content)
Igor
January 25, 2020
Very Good!
Brice
February 3, 2020
Could r.json() work directly? As in:
```
r.encoding='utf-8-sig'
data = r.json()
```
Steve
May 15, 2020
I just added the 'b' option to read the file in as bytes and was then able to work with the file without any of the above. But this page was helpful in getting me to think through the process:
input_file = open('filename', 'br')
Ashwani Gupta
September 28, 2020
Superb Answer 😀