a3nm's blog

Encode stdin in UTF-8

— updated

I just spent three miserable hours on something that seemed easy enough. I wouldn't wish it on anyone, so I'll explain it here in the hope that it will be useful to someone.

The task is: write a program which reads text on stdin and outputs it in UTF-8 to stdout. The painful job of guessing the encoding of a text stream has already been done, but I couldn't find any program which implemented the seemingly straightforward task of using this to convert text to UTF-8 in a shell pipe. I wasted some time with enca, which can be useful if you know the language of the file (and even then, it doesn't support latin1, which was a problem) before deciding I had to write this very simple thing myself.

If you know that the input encoding is either UTF-8 or latin1, I have a simpler solution which does not depend on chardet.
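For illustration, here is a minimal sketch of what such a chardet-free fallback could look like. It is not necessarily the solution mentioned above, and the function name utf8_or_latin1 is made up for this example. It relies on the fact that every byte sequence is valid latin1, so the fallback can never fail.

```python
def utf8_or_latin1(data: bytes) -> bytes:
    """Return data re-encoded as UTF-8, assuming it is UTF-8 or latin1."""
    try:
        # If the bytes already decode as UTF-8, keep them unchanged.
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Otherwise treat them as latin1, which accepts any byte sequence.
        return data.decode("latin-1").encode("utf-8")
```

In a pipe, you would call this on the bytes read from the binary stdin and write the result to the binary stdout.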

Well, it took more time than expected, and here is the result, using chardet, in Python 3. Note that we assume UTF-8 by default and only fall back to chardet when that doesn't work.

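#!/usr/bin/python3

"""a2utf8: print stdin in UTF-8 using chardet if needed"""

import sys
import chardet

# Detach the text wrappers to get the underlying binary streams,
# so that we read and write raw bytes.
sys.stdin = sys.stdin.detach()
sys.stdout = sys.stdout.detach()

data = sys.stdin.read()
try:
    # Assume UTF-8 by default: if the data decodes, write it back as UTF-8.
    sys.stdout.write(data.decode("utf-8").encode("utf-8"))
except UnicodeDecodeError:
    # Not valid UTF-8: ask chardet to guess the encoding.
    encoding = chardet.detect(data)['encoding']
    if encoding is None:
        # No guess available: output the bytes as-is.
        sys.stdout.write(data)
    else:
        sys.stdout.write(data.decode(encoding).encode("utf-8"))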

Notice the mysterious calls to detach(), which are the main surprise. I thought that you just had to pass the -u option to Python to get this behavior, but it turns out that, unlike in Python 2, it no longer switches stdin to binary mode. By the way, to get binary mode when you open a file, you would use:

f = open("file", mode="rb")
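To see what detach() does without involving the real stdin, here is a small sketch using an in-memory stream. The variable names and the sample bytes are made up for the example. A text stream wraps a binary one, and detach() hands back the binary layer, from which read() returns bytes instead of str.

```python
import io

# An in-memory binary stream holding latin1-encoded bytes (not valid UTF-8).
raw = io.BytesIO(b"caf\xe9\n")

# Wrap it in a text layer, like the one Python 3 puts around stdin.
text_stream = io.TextIOWrapper(raw, encoding="utf-8")

# detach() separates the text layer and returns the binary stream underneath.
# The text_stream object is unusable after this call.
binary_stream = text_stream.detach()

# Reading now yields raw bytes, with no decoding attempted.
data = binary_stream.read()
```

If you would rather leave sys.stdin itself untouched, the same binary layer is also reachable as sys.stdin.buffer (and sys.stdout.buffer for output).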

The rest is easier. We read the data, try to write it as UTF-8, and, if it fails, try to detect the encoding. If one is found, we use it to decode, but it might also happen that none is found, in which case we output as-is.
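The same logic can be sketched as a function on bytes, which makes it easier to try out without a pipe. This is only an illustration: the name convert_to_utf8 and the guess parameter are made up here, and the detector is passed in as an argument so that the sketch does not itself depend on chardet.

```python
from typing import Callable, Optional


def convert_to_utf8(data: bytes,
                    guess: Callable[[bytes], Optional[str]]) -> bytes:
    """Return data as UTF-8, calling guess() only if it is not UTF-8 already.

    guess is a hypothetical hook: it takes the raw bytes and returns an
    encoding name, or None if it cannot tell.
    """
    try:
        # Valid UTF-8 passes through unchanged.
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        pass
    encoding = guess(data)
    if encoding is None:
        # No encoding found: output as-is.
        return data
    return data.decode(encoding).encode("utf-8")
```

With chardet installed, the detector would be something like lambda d: chardet.detect(d)['encoding'], as in the script above.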

Annoyingly enough, Mark Pilgrim disappeared and took down chardet, diveintopython3 and the rest of his projects, so the following remark is outdated.

[On a side note, you might be interested to know that as of this writing, the Python 3 version of Chardet won't detect UTF-16 and UTF-32 correctly because of a bug in the BOM detection. That's quite unlucky, since the porting of Chardet to Python 3 is the subject of a case study in the Dive Into Python 3 book by Mark Pilgrim, the developer of chardet. I'm just writing this in case someone also got confused trying to test the code above on UTF-16 or UTF-32 files.]
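If you want to check a UTF-16 or UTF-32 test file by hand, the byte order mark can be looked up with the constants in the standard codecs module. The sketch below is not how chardet does it, and the function name sniff_bom is made up. Note that the UTF-32 checks have to come first, because the little-endian UTF-32 BOM begins with the little-endian UTF-16 BOM.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a known BOM, else None."""
    # UTF-32 first: its little-endian BOM starts with the UTF-16 one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return None
```

The returned names are the Python codecs that read the BOM themselves, so data.decode(sniff_bom(data)) gives the text without the mark. One known limitation of this sketch: little-endian UTF-16 text whose first character after the BOM is NUL would be mistaken for UTF-32.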

comments welcome at a3nm<REMOVETHIS>@a3nm.net