# Unicode and JavaScript explained
Last month I gave a talk that introduced the Unicode character set and the JavaScript language's support for it in detail. What follows are the notes from that talk.

## 1. What is Unicode?
Unicode grew out of a very simple idea: gather all of the world's characters into one set. As long as a computer supports this character set, it can display every character, and garbled text (mojibake) becomes a thing of the past.

**Unicode starts from 0 and assigns a number to each symbol, called a “code point”.** For example, the symbol at code point 0 is null (meaning all of its binary bits are 0).
U+0000 = null
In the above formula, U+ indicates that the hexadecimal number immediately following is a Unicode code point.

At present, the latest version of Unicode is 7.0, which defines 109,449 symbols, 74,500 of them Chinese, Japanese, and Korean (CJK) characters. Roughly speaking, more than two thirds of the world's existing symbols come from East Asian scripts. For example, the code point of the Chinese character “好” is 597D in hexadecimal.
U+597D = 好
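Since this talk is about JavaScript, it is worth noting that the standard ES6 methods `codePointAt` and `String.fromCodePoint` let you inspect code points directly; a quick illustration:

```javascript
// Read a character's code point and print it in hexadecimal.
'好'.codePointAt(0).toString(16); // "597d"

// The reverse direction: build the character from its code point.
String.fromCodePoint(0x597D); // "好"
```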
With so many symbols, Unicode was not defined all at once but in partitions. Each partition holds 65,536 (2^16) characters and is called a plane. There are currently 17 planes in total, which means the entire Unicode code space contains 17 × 2^16 = 1,114,112 code points.
The first 65,536 code points form the Basic Multilingual Plane (abbreviated BMP), with code points running from 0 to 2^16 − 1, written in hexadecimal as U+0000 to U+FFFF. All of the most common characters are placed on this plane, which was the first plane Unicode defined and published.
The remaining characters are placed on the supplementary planes (abbreviated SMP), with code points running from U+010000 to U+10FFFF.
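Since each plane holds 2^16 code points, you can tell which plane a character belongs to with simple integer division; a minimal sketch (the helper name `planeOf` is mine, not a standard API):

```javascript
// Which plane does a code point live on?
// 0 is the BMP; 1 through 16 are the supplementary planes.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

planeOf(0x597D);  // 0 -> 好 is on the basic plane
planeOf(0x1D306); // 1 -> a supplementary plane character
```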

## 2. UTF-32 and UTF-8
Unicode only specifies each character's code point; how that code point is represented as a sequence of bytes is the business of the encoding method.
**The most intuitive encoding represents each code point with exactly four bytes, with the bytes' content corresponding to the code point one to one. This encoding is called UTF-32.** For example, code point 0 is represented by four bytes of 0, and code point 597D is padded with two leading zero bytes.
U+0000 = 0x0000 0000
U+597D = 0x0000 597D
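As an illustration of that one-to-one correspondence, here is a minimal sketch of a UTF-32 (big-endian) encoder; the function name `encodeUTF32BE` is my own:

```javascript
// Write one code point as four big-endian bytes (UTF-32BE).
function encodeUTF32BE(codePoint) {
  return [
    (codePoint >>> 24) & 0xFF,
    (codePoint >>> 16) & 0xFF,
    (codePoint >>> 8) & 0xFF,
    codePoint & 0xFF,
  ];
}

encodeUTF32BE(0x597D); // [0x00, 0x00, 0x59, 0x7D]
```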

The advantage of UTF-32 is that its conversion rule is simple and intuitive and lookup is efficient. The disadvantage is wasted space: English text encoded this way is four times the size of the same text in ASCII. This drawback is fatal enough that in practice nobody uses UTF-32, and the HTML5 standard explicitly forbids encoding web pages as UTF-32.

What people really need is a space-efficient encoding, and that need gave birth to UTF-8. **UTF-8 is a variable-length encoding, with characters occupying from 1 to 4 bytes.** The more common a character, the fewer bytes it uses: the first 128 characters take only 1 byte each, exactly matching ASCII.
| Code point range | Bytes |
| --- | --- |
| 0x0000–0x007F | 1 |
| 0x0080–0x07FF | 2 |
| 0x0800–0xFFFF | 3 |
| 0x010000–0x10FFFF | 4 |
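You can see these byte counts from JavaScript with the standard `TextEncoder` API, which always produces UTF-8:

```javascript
// TextEncoder always emits UTF-8, so byte lengths follow the table above.
const encoder = new TextEncoder();
encoder.encode('A');  // Uint8Array [0x41]             -> 1 byte
encoder.encode('好'); // Uint8Array [0xE5, 0xA5, 0xBD] -> 3 bytes
```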
Thanks to this space-saving design, UTF-8 has become the most common encoding on the web. It has little to do with today's topic, though, so I won't go into detail; for the specific transcoding rules, see the [“Character Encoding Notes”](http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html) I wrote years ago.
## 3. Introduction to UTF-16
UTF-16 sits between UTF-32 and UTF-8, combining the characteristics of fixed-length and variable-length encodings.
Its encoding rule is very simple: characters in the basic plane occupy 2 bytes, and characters in the supplementary planes occupy 4 bytes. **In other words, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).**
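JavaScript strings are measured in UTF-16 code units, so this rule shows up directly in the `length` property (using the supplementary plane character 𝌆, U+1D306, as an example):

```javascript
// length counts 2-byte UTF-16 code units, not characters.
'好'.length; // 1 -> basic plane, one code unit (2 bytes)
'𝌆'.length; // 2 -> supplementary plane, two code units (4 bytes)
```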

This raises a question: when we encounter two bytes, how do we tell whether they stand for a character by themselves or must be interpreted together with the two bytes that follow?
The answer is quite clever (I don't know whether it was a deliberate design): in the basic plane, the range from U+D800 to U+DFFF is an empty segment whose code points correspond to no characters. This empty segment can therefore be used to map the characters of the supplementary planes.
Specifically, the supplementary planes contain a total of 2^20 character positions, so at least 20 bits are needed to represent them. UTF-16 splits these 20 bits in half. The first 10 bits are mapped onto U+D800 to U+DBFF (a space of size 2^10), called the high surrogate (H); the last 10 bits are mapped onto U+DC00 to U+DFFF (also a space of size 2^10), called the low surrogate (L). In other words, one supplementary-plane character is represented by a pair of basic-plane code units.

**So when two bytes have a code point between U+D800 and U+DBFF, we can conclude that the code point of the next two bytes must lie between U+DC00 and U+DFFF, and the four bytes have to be read together.**
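A minimal sketch of that detection and recombination logic (the helper names are mine):

```javascript
// Is this 16-bit code unit the first half of a surrogate pair?
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

// Recombine a surrogate pair into the original code point.
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

decodeSurrogatePair(0xD834, 0xDF06).toString(16); // "1d306"
```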
## 4. The UTF-16 transcoding formula
When converting a Unicode code point to UTF-16, first determine whether it is a basic plane character or a supplementary plane character. If it is the former, simply write the code point out in its corresponding hexadecimal form, two bytes long.
U+597D = 0x597D
If it is a supplementary plane character, Unicode version 3.0 provides the following transcoding formula.
H = Math.floor((c-0x10000) / 0x400)+0xD800
L = (c - 0x10000) % 0x400 + 0xDC00

Take the character 𝌆 as an example. It is a supplementary plane character with code point U+1D306; plugging it into the formula above gives H = 0xD834 and L = 0xDF06, so the UTF-16 encoding of 𝌆 is 0xD834 DF06.
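Wrapping the formula in a runnable function (the name `codePointToUTF16` is my own; the final check assumes a JavaScript environment):

```javascript
// Convert a code point to its UTF-16 code unit(s).
function codePointToUTF16(c) {
  if (c <= 0xFFFF) return [c]; // basic plane: the code point is the encoding
  const H = Math.floor((c - 0x10000) / 0x400) + 0xD800;
  const L = (c - 0x10000) % 0x400 + 0xDC00;
  return [H, L];
}

codePointToUTF16(0x1D306).map(u => u.toString(16)); // ["d834", "df06"]
String.fromCharCode(0xD834, 0xDF06); // "𝌆"
```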