Unicode and JavaScript explained
Last month I gave a talk introducing the Unicode character set and the JavaScript language's support for it in detail. What follows are the notes for that talk.
1. What is Unicode?
Unicode originated from a very simple idea: put every character in the world into a single set. As long as a computer supports this character set, it can display all characters, and garbled text becomes a thing of the past.
Starting from 0, it assigns a number to each symbol; this number is called a "code point". For example, the symbol at code point 0 is null (meaning all binary bits are 0).
U+0000 = null
In the above formula, U+ indicates that the hexadecimal number immediately following is a Unicode code point.
At the time of writing, the latest version of Unicode is 7.0, which contains 109,449 symbols, of which 74,500 are Chinese, Japanese, and Korean ideographs. Roughly speaking, more than two-thirds of the world's existing symbols come from East Asian scripts. For example, the code point of the Chinese character "好" is 597D in hexadecimal.
U+597D = 好
With so many symbols, Unicode was not defined all at once but in partitions. Each partition can hold 65,536 (2^16) characters and is called a plane. There are currently 17 planes in total, so the entire Unicode character set spans 17 × 65,536 = 1,114,112 code points, a range that fits within 21 bits (just under 2^21).
The first 65,536 character positions form the Basic Multilingual Plane (abbreviated BMP). Its code points range from 0 to 2^16 − 1, or U+0000 to U+FFFF in hexadecimal. All the most common characters are placed on this plane, which was the first plane Unicode defined and published.
The remaining characters are placed on the supplementary planes (abbreviated SMP), whose code points range from U+010000 to U+10FFFF.
2. UTF-32 and UTF-8
Unicode only specifies the code point of each character; how that code point is represented as bytes is the job of the encoding method.
The most intuitive encoding represents each code point with four bytes, with the bytes corresponding to the code point one-to-one. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point 597D is padded with two leading zero bytes.
U+0000 = 0x0000 0000
U+597D = 0x0000 597D
The advantage of UTF-32 is that the conversion rules are simple and intuitive and lookup is efficient. The disadvantage is that it wastes space: English text is four times larger than the same content in ASCII. This shortcoming is fatal enough that in practice nobody uses this encoding; the HTML5 standard explicitly forbids encoding web pages as UTF-32.
What people really need is a space-saving encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding, with characters taking from 1 to 4 bytes: the more common the character, the fewer bytes it uses. The first 128 characters take only 1 byte, exactly the same as ASCII.
Code point range | Bytes
0x0000 - 0x007F | 1
0x0080 - 0x07FF | 2
0x0800 - 0xFFFF | 3
0x010000 - 0x10FFFF | 4
Thanks to this space-saving property, UTF-8 has become the most common encoding for web pages on the Internet. It has little to do with today's topic, however, so I won't go into it here; for the specific transcoding method, see the "Character Encoding Notes" I wrote years ago.
3. Introduction to UTF-16
UTF-16 sits between UTF-32 and UTF-8, combining traits of both fixed-length and variable-length encodings.
Its encoding rule is very simple: characters in the basic plane occupy 2 bytes, and characters in the supplementary planes occupy 4 bytes. In other words, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).
This raises a question: when we encounter two bytes, how do we tell whether they are a character by themselves or should be interpreted together with the following two bytes?
The solution is very clever; I don't know whether it was a deliberate design. In the basic plane, the range from U+D800 to U+DFFF is an empty segment: these code points are assigned to no characters. This empty segment can therefore be used to map the characters of the supplementary planes.
Specifically, the supplementary planes contain 2^20 character positions, so at least 20 bits are needed to represent these characters. UTF-16 splits those 20 bits in half. The first 10 bits are mapped into U+D800 to U+DBFF (a space of size 2^10), called the high surrogate (H); the last 10 bits are mapped into U+DC00 to U+DFFF (also of size 2^10), called the low surrogate (L). This means a supplementary plane character is represented by two basic plane character positions.
Therefore, when we encounter two bytes whose code point lies between U+D800 and U+DBFF, we can conclude that the code point of the following two bytes must lie between U+DC00 and U+DFFF, and that all four bytes must be interpreted together.
4. The UTF-16 transcoding formula
When converting a Unicode code point to UTF-16, first determine whether it is a basic plane character or a supplementary plane character. If it is the former, convert the code point directly to the corresponding hexadecimal form, two bytes long.
U+597D = 0x597D
If it is a supplementary plane character, Unicode version 3.0 provides a transcoding formula.
H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00
Take the character 𝌆 as an example. It is a supplementary plane character with code point U+1D306. The calculation for converting it to UTF-16 is as follows.
H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06
Therefore, the UTF-16 encoding of the character 𝌆 is 0xD834 DF06, four bytes long.
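The formula translates directly into JavaScript. Here is a minimal sketch (the names toSurrogatePair and fromSurrogatePair are only illustrative, not built-ins):

// Encode a supplementary plane code point as a UTF-16 surrogate pair
function toSurrogatePair(c) {
  var H = Math.floor((c - 0x10000) / 0x400) + 0xD800;
  var L = (c - 0x10000) % 0x400 + 0xDC00;
  return [H, L];
}

// Decode a surrogate pair back into a code point
function fromSurrogatePair(H, L) {
  return (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000;
}

toSurrogatePair(0x1D306) // [0xD834, 0xDF06]
fromSurrogatePair(0xD834, 0xDF06) === 0x1D306 // true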
5. What kind of encoding does JavaScript use?
The JavaScript language uses the Unicode character set, but only supports one encoding method.
That encoding is neither UTF-16 nor UTF-8 nor UTF-32. JavaScript uses none of the encoding methods above.
JavaScript uses UCS-2!
6. UCS-2 encoding
Where did UCS-2 suddenly come from? That takes a bit of history.
In the days before the Internet, there were two teams that each wanted to build a unified character set: the Unicode team, founded in 1988, and the UCS team, founded in 1989. When they discovered each other's existence, they quickly reached an agreement: the world does not need two unified character sets.
In October 1991, the two teams decided to merge their character sets. From then on, only one character set would be published, namely Unicode, and the previously published character set would be revised so that UCS code points matched Unicode exactly.
UCS developed faster than Unicode. In 1990, it announced its first encoding method, UCS-2, which represents each code point with 2 bytes. (At that time there was only one plane, the basic plane, so 2 bytes were enough.) UTF-16 was not announced until July 1996, and it was explicitly declared a superset of UCS-2: basic plane characters keep the UCS-2 encoding, while a 4-byte representation is defined for supplementary plane characters.
The relationship between the two, simply put, is that UTF-16 superseded UCS-2, or that UCS-2 was absorbed into UTF-16. So today there is only UTF-16; UCS-2 no longer exists.
7. The background of JavaScript's birth
So why didn't JavaScript choose the more advanced UTF-16, and instead used UCS-2, which has since been retired?
The answer is simple: it is not that it didn't want to, it couldn't. When the JavaScript language appeared, the UTF-16 encoding did not yet exist.
In May 1995, Brendan Eich designed the JavaScript language in 10 days; in October, the first interpreter was ready; in November of the following year, Netscape formally submitted the language standard to ECMA (for the whole story, see "The Birth of JavaScript"). Compare that with the release date of UTF-16 (July 1996) and you will see that Netscape had no other choice at the time: only the UCS-2 encoding method was available!
8. Limitations of JavaScript character functions
Since JavaScript can only handle UCS-2 encoding, all characters in the language are 2 bytes; a 4-byte character is treated as two 2-byte characters. All of JavaScript's character functions are affected by this and cannot return correct results.
Take the character 𝌆 as an example again: its UTF-16 encoding is the 4 bytes 0xD834 DF06. The problem is that this 4-byte encoding does not belong to UCS-2. JavaScript does not recognize it and treats it as the two separate characters U+D834 and U+DF06. As mentioned earlier, those two code points are empty, so JavaScript concludes that 𝌆 is a string composed of two empty characters!
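You can verify this in any JavaScript console (a minimal reconstruction of the test in question):

'𝌆'.length // 2
'𝌆'.charAt(0) // a lone surrogate, rendered as an empty or garbage glyph
'𝌆'.charCodeAt(0).toString(16) // 'd834'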
The above code shows that JavaScript considers the length of the character 𝌆 to be 2, that the first "character" retrieved is an empty character, and that the code point of that first "character" is 0xD834. None of these results is correct!
To solve this problem, you must check the code point and adjust manually. The following is the correct way to traverse a string.
var string = '𝌆a';
var output = [];
var length = string.length;
var index = -1;
var character, charCode;
while (++index < length) {
  character = string.charAt(index);
  charCode = character.charCodeAt(0);
  // A high surrogate: consume the next 2-byte unit as part of the same character
  if (charCode >= 0xD800 && charCode <= 0xDBFF) {
    output.push(character + string.charAt(++index));
  } else {
    output.push(character);
  }
}
output // ['𝌆', 'a']
The above code shows that when traversing a string you must check each code unit: whenever it falls in the range 0xD800 to 0xDBFF, it must be read together with the following 2 bytes.
Similar problems exist in all JavaScript character manipulation functions.
- String.prototype.replace()
- String.prototype.substring()
- String.prototype.slice()
- …
The functions above are only valid for 2-byte code units. To handle 4-byte characters correctly, you must deploy your own version of each, checking the code unit range of the current character.
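As an illustration, here is a minimal sketch of a surrogate-aware character count (the name countCharacters is hypothetical, not a built-in):

// Count characters, treating each surrogate pair as a single character
function countCharacters(string) {
  var count = 0;
  for (var i = 0; i < string.length; i++) {
    var charCode = string.charCodeAt(i);
    // A high surrogate means the next 2-byte unit belongs to the same character
    if (charCode >= 0xD800 && charCode <= 0xDBFF) {
      i++;
    }
    count++;
  }
  return count;
}

countCharacters('𝌆好') // 2, even though '𝌆好'.length is 3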
9. ECMAScript 6
The next version of JavaScript, ECMAScript 6 (ES6 for short), greatly enhances Unicode support and basically solves this problem.
(1) Correctly recognize characters
ES6 can automatically recognize 4-byte code points. Therefore, iterating over the string is much simpler.
for (let s of string) {
  // ...
}
However, to maintain compatibility, the length property keeps its original behavior. To get the correct character count of a string, you can use the following method.
Array.from(string).length
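For example, with the 4-byte character from earlier:

'𝌆'.length // 2, still counts 2-byte units
Array.from('𝌆').length // 1, counts actual characters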
(2) Code point representation
JavaScript allows direct use of code points to represent Unicode characters, written as “backslash + u + code point”.
'好' === '\u597D' // true
However, this notation does not work for 4-byte code points. ES6 fixes the problem: as long as the code point is placed inside braces, it is recognized correctly.
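For example:

'𝌆' === '\uD834\uDF06' // true, the old notation needs the surrogate pair
'𝌆' === '\u{1D306}' // true, the ES6 braces accept the code point directly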
(3) String processing functions
ES6 has added several functions specifically for handling 4-byte code points.
- String.fromCodePoint() : returns the character corresponding to a given Unicode code point
- String.prototype.codePointAt() : returns the code point of the character at a given position
- String.prototype.at() : returns the character at a given position in the string (see the example after this list)
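A quick illustration of the first two:

String.fromCodePoint(0x1D306) // '𝌆'
'𝌆'.codePointAt(0) === 0x1D306 // true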
(4) Regular expressions
ES6 provides the u modifier to add 4-byte code point support to regular expressions.
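For example:

/^.$/.test('𝌆') // false, '.' matches only a single 2-byte unit
/^.$/u.test('𝌆') // true, with the u flag '.' matches the whole character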
(5) Unicode normalization
Besides the base letters, some characters carry additional marks. For example, in the pinyin character Ǒ, the tone mark above the letter is an additional symbol. For many European languages, such diacritical marks are essential.
Unicode provides two ways to represent them. One is a precomposed character with the mark attached, i.e. one code point per character; for example, the code point of Ǒ is U+01D1. The other treats the mark as a separate code point, displayed in combination with the base character, i.e. two code points per character; for example, Ǒ can be written as O (U+004F) + ˇ (U+030C).
// Method 1
'\u01D1' // 'Ǒ'

// Method 2
'\u004F\u030C' // 'Ǒ'
These two representations are identical both visually and semantically and should be treated as equivalent. However, JavaScript cannot tell them apart.
'\u01D1' === '\u004F\u030C' // false
ES6 provides the normalize method for "Unicode normalization", i.e. converting both forms into the same sequence.
'\u01D1'.normalize() === '\u004F\u030C'.normalize() // true
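Without an argument, normalize() defaults to the composed form (NFC), so the two-code-point version collapses into a single code point:

'\u004F\u030C'.normalize().length // 1, the composed character U+01D1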
For more on ES6, see "Introduction to ECMAScript 6".
==========================
That is the end of my talk; the slides from that day can be found here.
(End)