Introduction

文字化け (Mojibake) is the name given to scrambled characters resulting from incorrect character decoding. With the help of concrete examples of such errors, this post will explore their nature and walk through steps that can be used to identify them and potentially recover from them.

Context

This post assumes a system on which UTF-8 is desired and configured. The system receives files from other systems and assumes those are encoded in UTF-8 as well. If such a system were to receive non-UTF-8 data, it would need to know, by specification or attached metadata, which character encoding was used, in order to apply a transformation step. The post uses a UNIX/Linux system as its example.

In the given context, all the following file examples are expected to be UTF-8 encoded.

The a.txt file is a case where reading the file reveals a decoding issue.

$ cat a.txt 
zürich

The b.txt file shows another kind of decoding issue and will be used for comparison, as it is slightly different.

$ cat b.txt
z�rich

The c.txt file is the control case, to ensure that everything works and to serve as a comparison later.

$ echo -n "zürich" > c.txt
$ cat c.txt
zürich

Validate the system character encoding

It is important to start with a reminder that the shell commands cat and echo are not aware of character encoding; it is the terminal emulator that takes care of it. In the previous example of the c.txt file creation, there should never be any issue, because encoding and decoding happen in a consistent context. However, there is no guarantee that UTF-8 is being used.

Starting with locale is a sensible first step, as any locale-aware application will use those values directly or as a default fall-back. However, graphical terminals (such as GNOME Terminal or KDE Konsole) will likely have their own separate configuration which, when changed, overrides the locale configuration.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
[...]
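A locale-aware program sees the encoding derived from these values. As a complementary check, here is a minimal Python sketch (the printed values are what a correctly configured UTF-8 system is expected to report):

import locale
import sys

# Encoding derived from the locale settings (LANG / LC_CTYPE)
print(locale.getpreferredencoding())  # typically 'UTF-8' on such a system

# Encoding Python uses when writing to the attached terminal
print(sys.stdout.encoding)            # typically 'utf-8'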

In case of doubt, a fail-proof method is to have the terminal read valid UTF-8 encoded data of a string containing non-ASCII characters. Alternatively, if available, a text editor which allows specifying the character encoding (e.g. vim, Visual Studio Code) could be used to create a file.

The following example uses the xxd utility to read a hexadecimal representation of valid UTF-8 data and output it to stdout. If the terminal output contains the expected string, it can be confidently confirmed that UTF-8 is configured.

$ xxd --revert --plain - <<< "7ac3bc726963680a"
zürich

# => Terminal is using UTF-8.

Note: '7ac3bc72696368' is the hexadecimal representation of the string zürich encoded in UTF-8. xxd's reverse mode converts the hexadecimal representation back into binary.
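The same mapping can be double-checked in Python, independently of xxd (a minimal sketch using only the standard library):

>>> bytes.fromhex("7ac3bc72696368")
b'z\xc3\xbcrich'
>>> bytes.fromhex("7ac3bc72696368").decode("utf-8")
'zürich'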

An example of a mismatch when the terminal is configured to something other than UTF-8:

$ xxd --revert --plain - <<< "7ac3bc726963680a"
z√ºrich

# => Terminal is incorrectly configured (using x-mac-roman)

The best confidence is reached by having more occurrences of non-ASCII characters in the encoded UTF-8 data. If your terminal emulator's font supports Unicode emojis, they are well suited for this, as each of them is encoded with more than one byte.

$ xxd --revert --plain - <<< "f09f8d8bf09f8d8cf09f8d930a"
🍋🍌🍓

Note: If UTF-8 is configured but the font doesn't support emojis, then a missing-glyph placeholder will appear (e.g. 􏿮). This is not considered Mojibake, as the issue comes from the font rendering and not from character encoding.

Character encoding detection

The next step is to identify (or, more factually, guess) the character encoding used within the file. This gives a good first hint but, as the case of a.txt will demonstrate, it is not always sufficient. Character encoding detection is an educated guess implemented using heuristics. It can guide the resolution of an issue, or be used in software "after exhausting all other options".

Using file(1) heuristics:

# a.txt (zürich)
$ file --mime a.txt 
a.txt: text/plain; charset=utf-8

# b.txt (z�rich)
$ file --mime b.txt 
b.txt: text/plain; charset=iso-8859-1

Using chardetect heuristics:

# a.txt (zürich)
$ chardetect a.txt 
a.txt: utf-8 with confidence 0.505

# b.txt (z�rich)
$ chardetect b.txt 
b.txt: ISO-8859-1 with confidence 0.73

Note: The low confidence is likely due to the short length of the example text.

Fun fact: Visual Studio Code can guess character encodings using a JavaScript re-implementation of chardetect, which itself is a re-implementation of a Mozilla library.
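chardetect is the command-line front end of the Python chardet package, so the same guess is available programmatically. A minimal sketch, assuming chardet is installed:

import chardet  # pip install chardet

with open("a.txt", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)
# guess is a dict, e.g. {'encoding': 'utf-8', 'confidence': 0.505, ...}
print(guess["encoding"], guess["confidence"])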

In the case of a.txt, the content is surprisingly reported as UTF-8. The underlying reason is that the error was introduced by a previous incorrect character decoding operation, which wrote the resulting Mojibake (ü in place of ü) as valid UTF-8 data.

However, for the b.txt file, the issue now seems obvious, as the file's content is encoded using ISO-8859-1. Using iconv to read the file with the correct character encoding gives the expected string.

$ iconv --from-code=iso88591 --to-code=utf8 b.txt
zürich
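The Python equivalent is simply to open the file with the correct codec (a minimal sketch, using the b.txt file from above):

>>> open("b.txt", encoding="iso-8859-1").read()
'zürich'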

Analysis of encoded Mojibake

As demonstrated previously, the Mojibake of b.txt appeared when reading data with an incorrect character encoding; in comparison, a.txt contains UTF-8 encoded Mojibake produced by previous transformations.

To have a complete picture for the analysis of a.txt, let's introduce the file x.txt, a valid UTF-8 file which displays a Mojibake similar to b.txt.

$ cat x.txt 
z�rich

It looks exactly the same as b.txt, but x.txt is actually closer to a.txt, since its Mojibake is stored as encoded UTF-8 in the file. A diff of the contents' hexadecimal representations clearly distinguishes them:

$ diff <(xxd --plain b.txt) <(xxd --plain x.txt)
1c1
< 7afc72696368               <-- This is ISO-8859-1 data
---
> 7aefbfbd72696368           <-- This is UTF-8 data

Comparing the hexadecimal representations of the UTF-8 files with the control file c.txt shows the differences.

$ xxd --group 1 c.txt 
00000000: 7a c3 bc 72 69 63 68                             z..rich
$ xxd --group 1 a.txt 
00000000: 7a c3 83 c2 bc 72 69 63 68                       z....rich
$ xxd --group 1 x.txt 
00000000: 7a ef bf bd 72 69 63 68                          z...rich

The following tables show the mapping from bytes to their UTF-8 characters. UTF-8 is a variable-length character encoding, and thus one character can be encoded with one to four bytes.

c.txt (zürich)

  hex      UTF-8 char
  7a       z
  c3bc     ü
  72       r
  69       i
  63       c
  68       h

a.txt (zürich)

  hex      UTF-8 char
  7a       z
  c383     Ã
  c2bc     ¼
  72       r
  69       i
  63       c
  68       h

x.txt (z�rich)

  hex      UTF-8 char
  7a       z
  efbfbd   �
  72       r
  69       i
  63       c
  68       h

UTF-8 is backwards compatible with ASCII thanks to its variable-length design. In UTF-8, bytes starting with a leading 0 bit encode ASCII characters using that single byte. Thus z is 0x7a = 01111010 in UTF-8, as in ASCII. Since the ü character does not exist in ASCII, it is encoded with two bytes, 0xc3bc = 11000011 10111100, in UTF-8.
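The same observation can be made directly in Python (a minimal sketch):

>>> "z".encode("utf-8").hex()
'7a'
>>> "ü".encode("utf-8").hex()
'c3bc'
>>> "zürich".encode("utf-8").hex()
'7ac3bc72696368'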

In both cases this confirms the character encoding detection, which reported UTF-8. The above tables give us pieces of information on what went wrong:

  • Ã (0xc383) and ¼ (0xc2bc) are valid UTF-8 characters; the combination of the two appears in place of the expected ü.
  • � (0xefbfbd) is the UTF-8 encoding of the Unicode replacement character, which is emitted when decoding with UTF-8 and a sequence of bytes is not recognized.

Let's first discuss the case of x.txt. The presence of the Unicode replacement character (�) indicates that non-UTF-8 data was decoded as UTF-8. The replacement character overwrites the original unrecognized data; no information remains about what those unknown bytes were. Without the original content, it is tedious to reliably recover from this. The only remaining information is the surrounding characters, which hint at what the missing character might be. For a longer text, such a recovery would require a human, or a language processing algorithm using dictionaries, to replace each Mojibake occurrence.
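The x.txt situation can be reproduced in Python by decoding ISO-8859-1 bytes as UTF-8 with substitution enabled (a minimal sketch; 0xfc is 'ü' in ISO-8859-1):

>>> iso_bytes = "zürich".encode("iso-8859-1")   # b'z\xfcrich'
>>> print(iso_bytes.decode("utf-8", errors="replace"))
z�rich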

Not all decoding errors create Mojibake of the same kind:

  • some can be recovered from easily by changing the character encoding used when reading (as seen previously with b.txt),
  • some are in fact holes in the original text (as above with x.txt),
  • and for others, with a bit of analysis (and luck), a reverse operation that recovers the intended characters can be found (as below with a.txt).

Recover from encoded Mojibake

There are two unknowns in the pipeline that created a.txt: the character encoding used to encode the data originally (<charset1>) and the character encoding with which it was incorrectly decoded (<charset2>).

The assumed incorrect pipeline is:

$ echo -n "zürich" |
  iconv -f utf8 -t <charset1> |
  iconv -f <charset2> -t utf8 | # (1) Decoding error
  xxd -p
7ac383c2bc72696368

Note: UTF-8 is used to match the terminal's configuration of this example.

A pseudo-code with separate functions for encoding and decoding:

encode(decode(encode('zürich', <charset1>), <charset2>), "UTF-8") 
= 0x7ac383c2bc72696368 (zürich)

Step by step with focus on 'ü':

encode('ü', <charset1>) = [B1,B2]

decode([B1,B2], <charset2>) = 'ü'

encode('ü', 'UTF-8') = 0xc383c2bc

  • <charset1> encoded 'ü' with two bytes [B1,B2]
  • In <charset2> the byte B1 maps to the character Ã
  • In <charset2> the byte B2 maps to the character ¼
  • The bytes B1 and B2 are unknown

If the unknown charsets can be identified, the original data can be recovered by reversing the invalid operation. The reverse of decoding <charset1> data with <charset2> is to encode the resulting data using <charset2> and decode it using <charset1>.

$ echo -n "7ac383c2bc72696368" |  # <--- zürich in UTF-8
  xxd -r -p |
  iconv -f utf8 -t <charset2> |   # (2) Reverse of the error
  iconv -f <charset1> -t utf8     # (2) Reverse of the error
zürich

Note: UTF-8 is used to match the terminal's configuration of this example.

decode(encode(decode(0x7ac383c2bc72696368, "UTF8"), <charset2>), <charset1>) 
= 'zürich'

decode(0xc383c2bc, 'UTF-8') = 'ü'

encode('ü', <charset2>) = [B1,B2]

decode([B1,B2], <charset1>) = 'ü'

The first encode operation transforms the string ü back into the bytes [B1,B2], which are valid bytes representing ü in <charset1>. The same reverse transformation pipeline can be applied to a longer text to recover the original content.

Not so obvious within this example are the conditions under which such a recovery operation is possible, which involve the luck mentioned before:

  • Every byte sequence produced by <charset1> encoding has to map to some character when decoded with <charset2>. This is crucial to avoid holes in the data resulting from decoding errors. It doesn't have to be the correct character, but the byte sequence has to be valid. If <charset2> decoded the <charset1> data with errors, skipping or replacing some input data, it would create unrecoverable holes, similar to the previous case of x.txt with the Unicode replacement character (�).
  • <charset1> and <charset2> support and encode some characters identically (e.g. the ASCII Latin alphabet).
  • The original text uses enough of these compatible characters.

The first condition guarantees there is no loss of information. The last two are not essential for a recovery, but they make it possible to infer the expected character in place of ü from the surrounding letters, and subsequently the character encodings involved. See the appendix Harder cases of Mojibake for further examples where one of those conditions is not met.

Unfortunately, there is not enough information within the file's content to reliably infer the identity of <charset1> or <charset2>. Discovering both will require one or more iterations of guesses, heuristics or brute force.

Let's summarize all gathered clues:

  • <charset1> is a multi-byte character encoding, as it uses two bytes to encode 'ü'.
  • <charset2> is likely a single-byte charset, as it decodes the two bytes separately.
  • <charset1> and <charset2> seem to encode ASCII characters identically, as the other characters are preserved (i.e. 'zrich').

In most circumstances, the regional and geographic context defines the languages used by the systems in which the Mojibake appears, which helps reduce the scope to good candidates. For this example, let's reduce the scope to the most popular character encodings on the Web. With the above clues, the candidates could be, in order of popularity (a brute-force sweep over these candidates is sketched after the list):

  • <charset1>, two or more bytes: UTF-8, GB2312, Shift JIS
  • <charset2>, single byte: ISO-8859-1, CP1251, CP1252
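Such a small candidate set can be swept by brute force; a minimal Python sketch (the candidate lists are only the assumption made above):

mojibake = "zürich"
charset1_candidates = ["utf-8", "gb2312", "shift_jis"]    # two or more bytes
charset2_candidates = ["iso-8859-1", "cp1251", "cp1252"]  # single byte

for cs2 in charset2_candidates:
    for cs1 in charset1_candidates:
        try:
            # Reverse of the error: encode with <charset2>, decode with <charset1>
            candidate = mojibake.encode(cs2).decode(cs1)
        except UnicodeError:
            continue  # this pair cannot reproduce valid bytes, discard it
        print(f"{cs2} -> {cs1}: {candidate}")

Both the iso-8859-1/UTF-8 and cp1252/UTF-8 pairs should print the expected zürich, since Ã and ¼ map to the same bytes (0xc3, 0xbc) in both single-byte encodings; the remaining combinations either fail or produce further garbled text.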

Testing the first candidate of each gives:

$ echo -n "7ac383c2bc72696368" |  # <--- zürich in UTF-8
  xxd -r -p |
  iconv -f utf8 -t iso88591 |     # -t <charset2>, (2) Reverse of the error
  iconv -f utf8 -t utf8           # -f <charset1>, (2) Reverse of the error
zürich

Note: The last iconv becomes superfluous but is kept to match the previously shown pipeline.

This reverse operation gives the expected string as output. Given the historical prevalence of ISO-8859-1, CP1251 and CP1252, it is not surprising to find them creeping into today's Mojibake.

Decoding errors resulting in UTF-8 encoded Mojibake are so common that ftfy exists to identify them using a collection of heuristics and apply the correct transformations.

$ echo -n "7ac383c2bc72696368" | # <--- zürich in UTF-8
  xxd -r -p |
  ftfy
zürich

The command-line version has some limitations compared to the full set of options of the Python library. For example, the fix_encoding_and_explain function gives more details:

>>> from ftfy import fix_encoding_and_explain
>>> fix_encoding_and_explain("zürich")
ExplainedText(text='zürich', explanation=[
  ('encode', 'latin-1'), 
  ('decode', 'utf-8')])

While the example of this post involved a single pass of decoding error, ftfy searches for Mojibake recursively and applies transformations until no heuristic identifies a Mojibake in the resulting string.

>>> from ftfy import fix_encoding_and_explain
>>> fix_encoding_and_explain("zürich")
ExplainedText(text='zürich', explanation=[
  ('encode', 'sloppy-windows-1252'), 
  ('decode', 'utf-8'), 
  ('encode', 'latin-1'), 
  ('decode', 'utf-8')])

Please note that ftfy is not meant to recover from all possible mixes of character encodings but only "to cover the most common encoding mix-ups while keeping false positives to a very low rate".

Another example of a very unlikely case where ftfy might not help is when the original text legitimately contains combinations of characters that its heuristics look for. In the following example, the ¼ string is legitimately typed in the original text.

$ echo -n "zürich ¼" |
  iconv -f iso88591 -t utf8 |
  ftfy
zürich ¼

While using the manual reverse operation gives the expected string:

$ echo -n "zürich ¼" |
  iconv -f iso88591 -t utf8 |
  iconv -f utf8 -t iso88591
zürich ¼

Note: The last reverse step is performed by the terminal, which decodes the (nominally ISO-8859-1) output as UTF-8.

Conclusion

In this post, different examples of Mojibake were explored to demonstrate that they come in different kinds, and to show how to distinguish them.

When trying to resolve Mojibake, here is a list of steps that can be followed:

  • Verify with which character encoding the data is read.
  • Use character encoding detection to get quick hints, but always doubt its results.
  • Try to reduce the scope of the character encodings considered for recovery.
  • Use ftfy, especially when the original text is assumed to have been UTF-8.
  • Explore the binary data for further analysis with the help of character encoding tables.

References

The following is a collection of the references already mentioned, as well as other resources used while writing this post.

ftfy:

Unicode tools:

Articles/docs:

Wikis:

xkcd:

A Mojibake puzzle from "The Simpsons":

Appendices

Cases creation

The pipelines that created the different Mojibake examples of this post.

a.txt, incorrectly decoding UTF-8 data with ISO-8859-1:

$ echo -n "7ac3bc72696368" | # this is valid UTF-8 for 'zürich'
  xxd -r -p |
  iconv -f iso88591 -t utf8 > a.txt

b.txt, creating a valid ISO-8859-1 file:

$ echo -n "7ac3bc72696368" | # this is valid UTF-8 for 'zürich'
  xxd -r -p |
  iconv -f utf8 -t iso88591 > b.txt

x.txt, incorrectly decoding ISO-8859-1 data with UTF-8:

$ echo -n "7ac3bc72696368" | # this is valid UTF-8 for 'zürich'
  xxd -r -p |
  iconv -f utf8 -t iso88591 |
  uconv -f utf8 --from-callback substitute -t utf8 > x.txt

Note: uconv is used here for its --from-callback substitute option, which does not interrupt on decoding errors and inserts the replacement character (�) instead.
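The same three files can also be produced with a short Python script instead of the iconv/uconv pipelines (a sketch, assuming write access to the current directory):

original = "zürich"
utf8_bytes = original.encode("utf-8")  # b'z\xc3\xbcrich'

# a.txt: UTF-8 data wrongly decoded as ISO-8859-1, then re-encoded as UTF-8
with open("a.txt", "wb") as f:
    f.write(utf8_bytes.decode("iso-8859-1").encode("utf-8"))

# b.txt: the original string validly encoded as ISO-8859-1
with open("b.txt", "wb") as f:
    f.write(original.encode("iso-8859-1"))

# x.txt: ISO-8859-1 data wrongly decoded as UTF-8, substituting U+FFFD on error
with open("x.txt", "wb") as f:
    f.write(original.encode("iso-8859-1").decode("utf-8", errors="replace").encode("utf-8"))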

Harder cases of Mojibake

Easy Mojibake cases have the following in common:

  1. No information loss.
  2. Some compatible characters between involved character encodings.
  3. The original text is using enough compatible characters.

The absence of information loss is required for recovery with reverse operations. The last two are optional and mainly useful for identifying the character encodings involved.

Information loss

By default, iconv and uconv stop when a decoding error is encountered. This behavior prevents information loss from slipping through an incorrect decoding. Thus, in order to create information loss, an explicit option has to be used (i.e. --from-callback skip).

$ echo -n "7ac3bc72696368" | # this is valid UTF-8 for 'zürich'
  xxd -r -p |
  iconv -f utf8 -t cp1252 |
  uconv -f shift-jis --from-callback skip -t utf8 # (1) Decoding error
zich

The resulting string is missing two characters, with nothing in their place. There is no information left to apply a recovery operation on such a string. Stopping on error is the expected and wanted behavior; however, in the wild, some systems might silently fail on errors and create this kind of issue.

The presence of the Unicode replacement character (�, 0xefbfbd) already encoded in the bytes also indicates information loss (see the appendix Cases creation).
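Python's codec error handlers mirror these behaviors: the default 'strict' stops like iconv/uconv, 'replace' substitutes U+FFFD like the substitute callback, and 'ignore' silently drops data, producing exactly this kind of unrecoverable loss. A minimal sketch:

>>> iso_bytes = "zürich".encode("iso-8859-1")   # b'z\xfcrich'
>>> iso_bytes.decode("utf-8", errors="ignore")  # silent information loss
'zrich'
>>> iso_bytes.decode("utf-8")                   # default 'strict': stop on error
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 1: invalid start byte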

UTF-8 using Japanese decoded with ISO-8859-1

This example shows that, while recovery is possible, the encoded Mojibake leaves fewer clues about what the original content is, making the identification of the involved character encodings harder. In fact, all of the original content is turned into Mojibake.

Mojibake creation:

$ echo -n  "e69687e5ad97e58c96e38191" | # This valid UTF-8 for '文字化け'
  xxd -r -p |
  iconv -f iso88591 -t utf8    # (1) Decoding error
æå­ åã

In the above, the shown output string is an approximation: the actual encoded data contains several control characters which interfere with the terminal and will not be preserved when copying to the clipboard. The previous examples of this post were chosen not to contain any control characters, to avoid this restriction.

Inspecting the hexadecimal representation:

$ echo -n  "e69687e5ad97e58c96e38191" | # This valid UTF-8 for '文字化け'
  xxd -r -p |
  iconv -f iso88591 -t utf8 |           # (1) Decoding error
  xxd -p
c3a6c296c287c3a5c2adc297c3a5c28cc296c3a3c281c291

Using a UTF-8 decoder utility that prints per-character information on this hexadecimal representation gives the following:

U+00E6 LATIN SMALL LETTER AE character (&#x00E6;)
U+0096 <control> character (&#x0096;)
U+0087 <control> character (&#x0087;)
U+00E5 LATIN SMALL LETTER A WITH RING ABOVE character (&#x00E5;)
U+00AD SOFT HYPHEN character (&#x00AD;)
U+0097 <control> character (&#x0097;)
U+00E5 LATIN SMALL LETTER A WITH RING ABOVE character (&#x00E5;)
U+008C <control> character (&#x008C;)
U+0096 <control> character (&#x0096;)
U+00E3 LATIN SMALL LETTER A WITH TILDE character (&#x00E3;)
U+0081 <control> character (&#x0081;)
U+0091 <control> character (&#x0091;)

Such Mojibake requires feeding the binary data directly into the recovery step. Trying to copy-paste the approximate string will not work.

Pipeline with recovery step:

$ echo -n  "e69687e5ad97e58c96e38191" | # This valid UTF-8 for '文字化け'
  xxd -r -p |
  iconv -f iso88591 -t utf8 |           # (1) Decoding error
  iconv -f utf8  -t iso88591 |          # (2) Reverse of the error
  iconv -f utf8 -t utf8                 # (2) Reverse of the error
文字化け

Pipeline with ftfy recovery:

$ echo -n  "e69687e5ad97e58c96e38191" | # This valid UTF-8 for '文字化け'
  xxd -r -p |
  iconv -f iso88591 -t utf8 |           # (1) Decoding error
  ftfy                                  # (2') ftfy recovery
文字化け
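Since the intermediate Mojibake cannot be copy-pasted reliably, it is convenient to perform the whole round trip in memory. A minimal Python sketch of the same error and recovery:

>>> original = "文字化け"
>>> mojibake = original.encode("utf-8").decode("iso-8859-1")   # (1) Decoding error
>>> mojibake.encode("iso-8859-1").decode("utf-8")              # (2) Reverse of the error
'文字化け'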