How to hack a user using Unicode characters?

Unicode is exceptionally complicated. Few people know all its tricks: from invisible characters and control characters to surrogate pairs and combined emoji (where joining two characters produces a third). The standard defines 17 planes of 2^16 code points each, 1,114,112 in total. In fact, studying Unicode can be compared to learning a programming language.
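As a rough illustration of those numbers, here is a minimal Python sketch. The variable names are made up for this example; the emoji and the flag are arbitrary choices, not taken from the article.

```python
import sys

# 17 planes of 2**16 code points each.
TOTAL_CODE_POINTS = 17 * 2**16
assert TOTAL_CODE_POINTS == sys.maxunicode + 1 == 1_114_112

# A character outside the first plane needs a surrogate pair in UTF-16,
# that is, two 16-bit code units instead of one.
grin = "\U0001F600"  # GRINNING FACE
utf16_units = len(grin.encode("utf-16-le")) // 2

# Two REGIONAL INDICATOR symbols (D and E) are usually drawn as a single
# flag, yet the string still holds two code points.
flag = "\U0001F1E9\U0001F1EA"
flag_code_points = len(flag)
```

What looks like one character on screen can therefore be several code points, and one code point can be several code units.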

No wonder web developers overlook some of the nuances. Attackers, on the other hand, can use the features of Unicode for their own purposes.

Unicode-related bugs have one notable property: they can turn up in any application that processes text entered by the user. Such vulnerabilities exist in web applications and in native applications on Android and iOS. One of the most famous was the iOS bug from 2015, when a few Unicode characters in a text message crashed the operating system. Last year a similar bug, known as the “black dot”, was found in iOS 11.3. A similar failure occurred in the WhatsApp application for Android when the user tapped an emoji.

Unicode is a character encoding standard that covers the characters of nearly all written languages in the world. It came into use once it became clear that keeping a separate encoding for each language was unworkable and that they needed to be brought together. An encoding is a way of representing digits, letters, and other characters in computer memory, in a form the machine understands. There are many encodings, for example cp1251 or ISO-8859-1, but over time they became inconvenient. First, displaying characters from different languages correctly requires switching between different encodings.

Second, the same numeric value can stand for different letters in different encodings. For example, the byte 0b11011111 is the letter “Я” in cp1251, but the German Eszett (ß) in ISO-8859-1. With the advent of Unicode the situation improved: the letters and symbols of all languages now live in one huge table. Unicode is the standard that assigns each character a numeric value, and several Unicode encodings were developed to represent those numbers, the most common being UTF-8 and UTF-16.
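The ambiguity is easy to reproduce. A small Python sketch, using only the standard codecs (the variable names are illustrative):

```python
# One and the same byte, 0b11011111 (0xDF), read under two legacy encodings.
raw = bytes([0b11011111])

as_cp1251 = raw.decode("cp1251")      # Cyrillic capital letter Ya
as_latin1 = raw.decode("iso-8859-1")  # German Eszett

# In Unicode each letter has its own code point, so there is no clash.
code_points = (hex(ord(as_cp1251)), hex(ord(as_latin1)))

# The same code point is then serialized differently by UTF-8 and UTF-16.
ya_utf8 = as_cp1251.encode("utf-8")
ya_utf16 = as_cp1251.encode("utf-16-be")
```

Here `code_points` comes out as `('0x42f', '0xdf')`: two distinct entries in the one big table.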

This material is provided for informational purposes only. Do not break the law!


What is it?

Along with ease of use, Unicode brought new opportunities for criminals. Many of you know, or have heard, about ARP/IP/DNS spoofing. Unicode spoofing works on the same principle, except that the original characters are replaced with identical or very similar ones from other alphabets.

For example, the Latin letter "a" in an address can be replaced with the Russian "а", and the two look identical. The problem is not new: users could be misled with plain ASCII as well. For example, in the address example.com attackers would replace the letter "l" with a capital "I", which is visually indistinguishable in some fonts.

This is called a homograph attack, after homographs: words that are spelled the same but pronounced differently. A similar story happened with PayPal. These spoofing methods target users exclusively: it is hard to make a mistake when typing an address on the keyboard, but people like to open links sent by e-mail or by other means. And a user may simply not notice which URL actually opens after clicking a link.
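One simple defensive heuristic is to flag labels that mix alphabets. The sketch below is an assumption-laden illustration, not a complete detector: the helper names `scripts_in` and `looks_suspicious` are invented here, and the script is guessed from the first word of the official character name rather than from the Unicode Script property.

```python
import unicodedata

def scripts_in(label):
    """Return the rough set of scripts used by the letters of a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_suspicious(label):
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in(label)) > 1

honest = "pentestit"
spoofed = "p\u0435nt\u0435stit"  # both "e" letters are Cyrillic U+0435

honest_scripts = scripts_in(honest)    # {'LATIN'}
spoofed_scripts = scripts_in(spoofed)  # {'LATIN', 'CYRILLIC'}
```

A check like this will not catch pure-ASCII lookalikes such as "l" and "I", and it can flag legitimate mixed-script names, so treat it as one signal among several.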

The second method is somewhat similar to the previous one: the use of Punycode. DNS host names may contain only Latin letters, digits, and the hyphen. If a domain name needs characters from another language, for example the Russian-language domain пример.рф, Punycode is required.

For instance, compare pentestit.ru and pеntеstit.ru: the second one uses the Russian letter "е", so in DNS it exists only in its Punycode-encoded form.

This is what criminals exploit to lure visitors to malicious websites: the domain has no obvious typos, because the replaced letter looks almost exactly like the original. It also works in the opposite direction, when the Russian letter "о" is replaced with the Latin one, or even with the Greek omicron, which looks very similar.
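To see what such a domain looks like on the wire, you can convert it to its ASCII form. This is a minimal sketch using Python's built-in "idna" codec (which implements the older IDNA 2003 rules); the helper names `to_ascii` and `to_unicode` are made up for the example.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to the ASCII form stored in DNS."""
    return domain.encode("idna").decode("ascii")

def to_unicode(domain):
    """Convert an ASCII (xn--...) domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")

# A fully Cyrillic domain: every label gets the "xn--" prefix.
cyrillic = to_ascii("\u043f\u0440\u0438\u043c\u0435\u0440.\u0440\u0444")  # пример.рф

# A pure ASCII domain is left untouched.
plain = to_ascii("pentestit.ru")

# The lookalike with Cyrillic "е" becomes a very different ASCII name.
lookalike = to_ascii("p\u0435nt\u0435stit.ru")
is_disguised = lookalike != plain and lookalike.startswith("xn--")
```

Showing the user the `xn--` form instead of the decoded name is one way browsers and mail clients can make such a substitution visible.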

Turned on its head

A common technique used by attackers is “flipped” text. Unicode supports all languages, and some of them are written not left to right but the other way around. For this purpose Unicode includes U+202E RIGHT-TO-LEFT OVERRIDE, which simply reverses the displayed text. Attackers actively use it to turn, for example, gpj.exe into exe.jpg in a file name. The user sees the .jpg extension and launches a file with malicious code inside.

This can be done in the following way:

  • Create a file with the .exe extension, for example gpj.exe.
  • Find and copy the U+202E RIGHT-TO-LEFT OVERRIDE character.
  • Rename the file and insert the character at the beginning of the file name. The file gpj.exe will then be displayed as exe.jpg.
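On the defensive side, a name disguised by the steps above is easy to spot programmatically, because the control character is still present in the string. A minimal sketch, assuming you want to reject or clean the explicit bidirectional controls U+202A to U+202E and U+2066 to U+2069 (the helper names are illustrative):

```python
# Explicit bidirectional formatting characters:
# U+202A..U+202E (embeddings and overrides) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = {chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}

def has_bidi_controls(name):
    """True if the name contains any explicit bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in name)

def strip_bidi_controls(name):
    """Return the name with all explicit bidi control characters removed."""
    return "".join(ch for ch in name if ch not in BIDI_CONTROLS)

disguised = "\u202Egpj.exe"  # displayed as "exe.jpg" by bidi-aware software

flagged = has_bidi_controls(disguised)
real_name = strip_bidi_controls(disguised)
real_extension = real_name.rsplit(".", 1)[-1]
```

The stored name always ends in "exe"; only the rendering is reversed, so checks should be made on the raw string, never on what is displayed.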

Besides social engineering, these Unicode features are used to bypass protection against attacks, for example a WAF.

Other pleasures

At the level of applications or operating systems, bugs show up in poorly designed conversion algorithms: bad normalization, overlong UTF-8 sequences, dropped or swallowed characters, incorrect character conversion, and so on. All of this leads to a wide range of attacks, from XSS to remote code execution.
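Two of these cases can be sketched in a few lines of Python. The filter below is a deliberately naive, hypothetical stand-in for a real WAF rule; the point is only the order of operations (filter first, normalize later) and how a strict decoder treats an overlong sequence.

```python
import unicodedata

def naive_filter(text):
    """Hypothetical filter: accept input unless it contains "<script"."""
    return "<script" not in text.lower()

# Fullwidth "<" (U+FF1C) and ">" (U+FF1E) instead of the ASCII ones.
payload = "\uFF1Cscript\uFF1Ealert(1)\uFF1C/script\uFF1E"

passes_before = naive_filter(payload)

# If the application later applies NFKC, the fullwidth forms fold to ASCII.
normalized = unicodedata.normalize("NFKC", payload)
passes_after = naive_filter(normalized)

# Overlong UTF-8: 0xC0 0xAF is a two-byte spelling of "/" (U+002F).
# A conforming decoder must reject it; lenient ones have enabled path
# traversal and filter bypasses.
overlong_slash = b"\xc0\xaf"
try:
    overlong_slash.decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
```

The usual remedy is to normalize and decode strictly first, and only then validate.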

In general, Unicode does not limit your imagination; if anything, it encourages it. Many of the attacks above are often combined, pairing a filter bypass with an attack on a specific target. Combining business with pleasure, so to speak. Moreover, the standard does not stand still, and who knows what new extensions will bring: some were later removed because of security concerns.

Happy end?!

So, as you can see, Unicode problems remain problem number one and the cause of very diverse attacks. But the root of the evil is a single one: misunderstanding or ignoring the standard. Of course, even the best-known vendors are guilty of this, but that is no reason to relax. On the contrary, it shows the scale of the problem. You have already seen that Unicode is quite tricky and will catch you out if you let your guard down and do not consult the standard. By the way, the standard is updated regularly, so do not rely on old books or articles: outdated information is worse than none. I hope this article has not left you indifferent to the problem.
