Transliteration of Emojis

Encoding

15 Nov

The wide variety of spoken languages on Earth (widely estimated to be on the order of 7,000) is a significant addition to the diversity, culture and rich complexity of society. Those languages that can be written down use many different alphabets and symbols. To record in binary the letters and symbols that make up those alphabets, many different encoding standards have been created. Some standards cover just the characters of a particular language, while others cover a wide range of characters from a region or from the entire world. Apparently there are 258 encoding standards (according to iana.org).

Thankfully, the problem of having so many incompatible standards has been addressed with the creation of Unicode, whose website defines it as follows: “[t]he Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages.” UTF-8 is a common encoding of the Unicode Standard.
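As a small illustration of why the choice of encoding standard matters, the sketch below stores the same character under UTF-8 and under Latin-1 using Python's built-in codecs. The variable names are illustrative only and not taken from any particular system.

```python
# The same character stored under two different encoding standards.
text = "\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # one byte: 0xE9

# Reading UTF-8 bytes as if they were Latin-1 raises no error,
# but silently produces the wrong text.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, misread)
```

The bytes differ even though the character is identical, which is why both sides of a transfer need to agree on the encoding.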

If Unicode has already fixed the problem, then what’s the issue?

There are two main issues:

The Unicode standard is regularly updated to add new characters (version 14.0 was released on September 14, 2021). The version 13 update added 5,930 new characters, including 55 new emoji characters such as Military Helmet (symbol: 🪖, hex code point: 1FA96) and Ladder (symbol: 🪜, hex code point: 1FA9C). The issue is that if the application or operating system you are using (e.g. Windows 10, Android, iOS, the Chrome browser, Microsoft Word, Outlook) does not yet support the current version, you will not see the correct symbol. You may or may not see the symbols above, depending on your OS updates; at the time of writing, the Unicode 13 emoji are rendered correctly on Android but not on Windows 10. While this is a visual issue, it is not a huge problem for processing or storage, because the stored code points do not change. When this blog is viewed in the future, after Windows has had the release with Unicode 13.0 support (Version 21H1), the symbols will display correctly. Examples from Unicode 14.0 are Melting Face (symbol: 🫠, hex code point: 1FAE0) and Troll (symbol: 🧌, hex code point: 1F9CC); these will probably be added to Windows in late 2022.
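The same version dependence can be seen in code. The sketch below uses Python's standard `unicodedata` module, which ships with a fixed version of the Unicode character database; `describe` is a hypothetical helper name. Whether a name is found for each emoji depends on the Python release you run it on, so no particular output is assumed here.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return the hex code point and the character name, if this runtime knows it."""
    code_point = f"U+{ord(ch):04X}"
    # A runtime with an older Unicode database has no name for newer characters.
    name = unicodedata.name(ch, "<not in this Unicode database>")
    return code_point, name

print("Unicode database version:", unicodedata.unidata_version)
# Military Helmet, Ladder (Unicode 13.0); Melting Face, Troll (Unicode 14.0)
for ch in ("\U0001FA96", "\U0001FA9C", "\U0001FAE0", "\U0001F9CC"):
    print(describe(ch))
```

The code point is always available because it is just a number; only the knowledge of what it represents depends on the version.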

Legacy systems not using Unicode. Many operational systems do not use Unicode and instead use one of the alternative encoding standards. This becomes a problem when data is copied or transferred between systems that use different encodings. Normally the difference is not a problem, because most conversion between encoding standards is handled as part of the transfer mechanism. Most characters have a direct equivalent across the common encoding standards used in western countries and can therefore be transferred without issues. The issue is with characters that have no equivalent. This can occur when converting from UTF-8 to Latin-1 (ISO-8859-1), a previously common encoding standard, and is even more common when converting from UTF-8 to ASCII. There are a few ways to handle this situation; the most appropriate will depend on the data and the legacy systems. Some possible ways to deal with this are:
- Transliteration - A process of converting characters from another writing system into their closest equivalent in a western alphabet. An issue with this is that it does not work for emoji, which have no such equivalent.
- Replacement - Nominate a replacement character that is used for all characters that cannot be handled, e.g. all emoji converted to “?”.
- Conversion to code representation - Convert every character that cannot be handled to a text string that represents it, for example its hex code point or its decimal code point. An issue with this is that it increases the length of the original data value, which can lead to other system issues. Extreme examples come from emoji modifiers, zero-width joiners and groups such as the family emoji. The family example 👨‍👩‍👧‍👦 is already a sequence of 7 code points (four people joined by three zero-width joiners), and it grows to 11 code points if each person is given a skin tone modifier (EMOJI MODIFIER FITZPATRICK): 👨🏼‍👩🏽‍👧🏿‍👦🏽 (<u0001F468><u0001F3FC><u200D><u0001F469><u0001F3FD><u200D><u0001F467><u0001F3FF><u200D><u0001F466><u0001F3FD>).
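The three approaches above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production converter. The function names are hypothetical. `strip_accents` is only a rough stand-in for transliteration (it removes accents from Latin letters and leaves emoji untouched). The `<u…>` output format simply mirrors the notation used in the family example.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Rough transliteration stand-in: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def replace_unsupported(text: str, encoding: str = "ascii") -> str:
    """Replacement: each character the target encoding lacks becomes '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

def to_code_points(text: str, encoding: str = "ascii") -> str:
    """Code representation: unsupported characters become <uXXXX> or <uXXXXXXXX>."""
    parts = []
    for ch in text:
        try:
            ch.encode(encoding)
            parts.append(ch)
        except UnicodeEncodeError:
            # 4 hex digits inside the Basic Multilingual Plane, 8 outside it
            width = 4 if ord(ch) <= 0xFFFF else 8
            parts.append(f"<u{ord(ch):0{width}X}>")
    return "".join(parts)

# Family emoji: four people joined by three zero-width joiners (U+200D)
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
sample = "caf\u00e9 \U0001FA96"  # "café" followed by Military Helmet

print(strip_accents(sample))        # accents removed, emoji unchanged
print(replace_unsupported(sample))  # unsupported characters become "?"
print(to_code_points(sample))       # unsupported characters spelled out
print(len(FAMILY), len(to_code_points(FAMILY)))
```

The last line shows the length problem: the family emoji is 7 code points, but its code representation is 65 characters long, which could overflow a fixed-width field in a legacy system.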

From these issues we can see that even strong solutions still leave some weaknesses in encoding. Whilst transliteration will in most cases provide a human-understandable output, it by no means provides perfect coverage. The conclusion is that there is no simple one-size-fits-all solution, and any approach needs to be adopted on a case-, system- or feed-specific basis.

Data Transformation, Data Quality

Robert Smith
