Damn, why is "𠮷𠮷𠮷".length !== 3??
2022-07-25 18:54:00 【Java technology stack】
source :juejin.cn/post/7025400771982131236
During development I occasionally ran into problems with encodings, Unicode, and emoji, and realized I had not fully grasped the basics. After some searching and study, I put together a few easy-to-follow articles to share.
You may have run into this puzzle: when a form needs to validate length, different characters turn out to have different length values. For example, the "𠮷" in the title has a length of 2 (note that this is not the Chinese character 吉!).
'吉'.length
// 1
'𠮷'.length
// 2
'❤'.length
// 1
'💩'.length
// 2
To explain this, we need to start with UTF-16.
UTF-16
The ECMAScript 2015 specification shows that ECMAScript strings use UTF-16 encoding.
UTF-16 is both fixed and variable in width. Fixed: the smallest code unit is always two bytes, and both bytes are occupied even when the first one is 0. Variable: a character in the Basic Multilingual Plane (BMP, U+0000 ~ U+FFFF) needs only two bytes, while a character in the supplementary planes (U+010000 ~ U+10FFFF) needs four bytes.
The previous article covered the details of UTF-8, which takes 1 to 4 bytes per character, whereas UTF-16 takes 2 or 4 bytes. Let's look at how UTF-16 encodes a character.
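To make the byte counts concrete, here is a small sketch that compares the encoded size of a few characters. The helper names utf8ByteLength and utf16ByteLength are made up for this illustration, and the sketch assumes an environment where TextEncoder is available as a global (modern browsers and recent Node.js versions).

```javascript
// Hypothetical helpers for comparing encoded sizes (names are illustrative only).
function utf16ByteLength(str) {
  // Each element of a JS string is one 16-bit (2-byte) code unit.
  return str.length * 2;
}

function utf8ByteLength(str) {
  // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
  return new TextEncoder().encode(str).length;
}

for (const ch of ['A', '吉', '𠮷']) {
  console.log(ch, utf8ByteLength(ch), utf16ByteLength(ch));
}
// A 1 2
// 吉 3 2
// 𠮷 4 4
```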
UTF-16 Coding logic
UTF-16 encoding is simple. For a given Unicode code point cp (the code point is the unique number of the character in Unicode):
- If the code point is less than or equal to U+FFFF (that is, any character in the basic plane), no processing is needed and it is used directly.
- Otherwise, it is split into two parts, floor((cp - 65536) / 1024) + 0xD800 and ((cp - 65536) % 1024) + 0xDC00, which are stored as two code units.
The Unicode standard specifies that the values U+D800...U+DFFF do not correspond to any character, so they can be used as markers.
Take a concrete example: the code point of the character A is U+0041, so it can be represented directly by a single code unit.
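The two rules above can be sketched as a small function. The name toUtf16CodeUnits is hypothetical and only serves this illustration; it is not a built-in API.

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
function toUtf16CodeUnits(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + cp);
  }
  if (cp <= 0xFFFF) {
    // Basic plane: a single code unit, used as-is.
    return [cp];
  }
  const offset = cp - 0x10000;                      // same as cp - 65536
  const high = Math.floor(offset / 1024) + 0xD800;  // first code unit
  const low = (offset % 1024) + 0xDC00;             // second code unit
  return [high, low];
}

console.log(toUtf16CodeUnits(0x41));    // [ 65 ]
console.log(toUtf16CodeUnits(0x1F4A9)); // [ 55357, 56489 ]
```

Passing the two resulting numbers to String.fromCharCode should give back the original character, which is a quick way to check the arithmetic.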
'\u0041'
// -> A
'A' === '\u0041'
// -> true
In JavaScript, \u introduces a Unicode escape and is followed by four hexadecimal digits.
The character 💩 has code point U+1f4a9 and lies in a supplementary plane. The formula yields two code units, 55357 and 56489, which are d83d and dca9 in hexadecimal. Together these two values form a surrogate pair.
'\ud83d\udca9'
// -> '💩'
'💩' === '\ud83d\udca9'
// -> true
Because JavaScript strings use UTF-16, the surrogate pair \ud83d\udca9 is decoded correctly and yields the code point U+1f4a9.
You can also write \u{...}, with the code point placed directly inside the braces. The two notations look different, but they represent the same result.
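Decoding is simply the encoding formula run in reverse. The following is a minimal sketch under that assumption; fromSurrogatePair is a hypothetical name, and in real code the built-in codePointAt does this work.

```javascript
// Hypothetical helper: turn a surrogate pair back into a code point.
function fromSurrogatePair(high, low) {
  const isHigh = high >= 0xD800 && high <= 0xDBFF;
  const isLow = low >= 0xDC00 && low <= 0xDFFF;
  if (!isHigh || !isLow) {
    throw new RangeError('not a valid surrogate pair');
  }
  // Undo the two offsets, then add back the 65536 subtracted when encoding.
  return (high - 0xD800) * 1024 + (low - 0xDC00) + 0x10000;
}

console.log(fromSurrogatePair(0xD83D, 0xDCA9).toString(16)); // 1f4a9
```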
'\u0041' === '\u{41}'
// -> true
'\ud83d\udca9' === '\u{1f4a9}'
// -> true
You can open the console panel in DevTools and run the code to verify the results.
So why does length give unexpected results?
To answer this, we can go back to the specification, which says that wherever an ECMAScript operation interprets a string value, each element is interpreted as a single UTF-16 code unit.
Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit.
So a character like 💩 actually occupies two UTF-16 code units, that is, two elements, and its length property is therefore 2. (This goes back to JS originally using UCS-2 encoding, on the assumption that 65536 characters would cover every need.)
But an ordinary user has no way of understanding why, after typing a single '𠮷', the program reports two characters. How can we measure the length of a Unicode string correctly?
In the async-validator package used by the Antd Form component, I found the following code:
const spRegexp = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
if (str) {
val = value.replace(spRegexp, '_').length;
}
When the length of a string needs to be checked, every character whose code point falls in a supplementary plane is first replaced with an underscore, so the measured length matches what is actually displayed.
ES6 support for Unicode
The length problem exists mainly because, when JS was first designed, nobody expected there to be so many characters, and two bytes were assumed to be enough. So it is not just length: several common string operations also misbehave with Unicode.
The following sections introduce some of the affected APIs and show how ES6 handles these cases correctly.
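The same replace-then-measure idea can be wrapped in a standalone function. This is only a sketch built on the regular expression shown above; the name codePointLength is hypothetical and is not part of async-validator.

```javascript
// Matches one high surrogate followed by one low surrogate.
const SURROGATE_PAIR = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper: count a surrogate pair as one character.
function codePointLength(value) {
  return value.replace(SURROGATE_PAIR, '_').length;
}

console.log('𠮷𠮷𠮷'.length);            // 6
console.log(codePointLength('𠮷𠮷𠮷')); // 3
console.log(codePointLength('yo𠮷'));   // 3
```

Note that this counts code points, not what a user perceives as one character: a base letter followed by a combining mark still counts as two.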
for vs for...of
For example, when a for loop prints a string, it walks through every "element" as JS understands it. A supplementary-plane character is treated as two "elements", so garbled output appears.
var str = 'yo𠮷'
for (var i = 0; i < str.length; i++) {
  console.log(str[i])
}
// -> y
// -> o
// -> �
// -> �
The ES6 for...of syntax does not have this problem.
var str = 'yo𠮷'
for (const char of str) {
  console.log(char)
}
// -> y
// -> o
// -> 𠮷
Spread syntax
Earlier we used a regular expression to count characters by replacing the supplementary-plane characters. Spread syntax achieves the same effect.
[...'💩'].length
// -> 1
Methods such as slice, split, and substr have the same problem.
The regular expression u flag
ES6 also added the u flag to regular expressions for Unicode characters.
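String methods that index by code unit can cut a surrogate pair in half. One possible workaround, sketched below, is to split the string into code points with Array.from first; sliceByCodePoint is a hypothetical helper name used only for this example.

```javascript
// Hypothetical helper: slice by code point instead of by UTF-16 code unit.
function sliceByCodePoint(str, start, end) {
  // Array.from uses the string iterator, which yields whole code points.
  return Array.from(str).slice(start, end).join('');
}

const text = 'yo𠮷!';
console.log(text.slice(0, 3).length);        // 3 (ends with half of a surrogate pair)
console.log(sliceByCodePoint(text, 0, 3));   // yo𠮷
console.log(sliceByCodePoint(text, 2));      // 𠮷!
```

The trade-off is that the whole string is copied into an array on every call, which is fine for form input but worth keeping in mind for very long strings.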
/^.$/.test('💩')
// -> false
/^.$/u.test('💩')
// -> true
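The effect of the u flag on the dot can also be seen by counting matches. In this sketch, countDotMatches is a hypothetical helper that builds the pattern /./ with whatever flags are passed in.

```javascript
// Hypothetical helper: count how many times "." matches with the given flags.
function countDotMatches(str, flags) {
  const matches = str.match(new RegExp('.', flags));
  return matches ? matches.length : 0;
}

const sample = 'a\u{20BB7}b'; // 'a𠮷b'
console.log(countDotMatches(sample, 'g'));  // 4, each surrogate is matched separately
console.log(countDotMatches(sample, 'gu')); // 3, the pair is matched as one character
```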
charCodeAt/codePointAt
We often use charCodeAt to get the code point of a character in a string. That works for BMP characters, but for a supplementary-plane character charCodeAt returns only the first code unit of the encoded pair.
'羽'.charCodeAt(0)
// -> 32701
'羽'.codePointAt(0)
// -> 32701
'😸'.charCodeAt(0)
// -> 55357
'😸'.codePointAt(0)
// -> 128568
codePointAt, on the other hand, recognizes the character correctly and returns the right code point.
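One thing to watch when using codePointAt in a loop is that the index still counts code units, so it has to advance by 2 after a supplementary-plane character. A minimal sketch, with codePoints as a hypothetical helper name:

```javascript
// Hypothetical helper: list every code point in a string.
function codePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // A code point above U+FFFF occupies two code units.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return result;
}

console.log(codePoints('羽😸')); // [ 32701, 128568 ]
```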
String.prototype.normalize()
Because JS treats a string as a sequence of two-byte code units, equality is decided by the values in that sequence. So two strings can look exactly the same and still compare as not equal.
'café' === 'café'
// -> false
In the code above, the first café is cafe followed by the combining acute accent \u0301, while the second café is caf followed by the single character é. They look identical, but their code points differ, so JS evaluates the comparison as false.
'cafe\u0301'
// -> 'café'
'cafe\u0301'.length
// -> 5
'café'.length
// -> 4
To compare strings like these, which have different code points but the same meaning, ES6 added the String.prototype.normalize method.
'cafe\u0301'.normalize() === 'café'.normalize()
// -> true
'cafe\u0301'.normalize().length
// -> 4
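A comparison helper built on normalize might look like the sketch below. The name sameText is hypothetical; the form parameter defaults to 'NFC', which is also what normalize() uses when called without arguments.

```javascript
// Hypothetical helper: compare two strings after Unicode normalization.
function sameText(a, b, form = 'NFC') {
  return a.normalize(form) === b.normalize(form);
}

const decomposed = 'cafe\u0301';  // e + combining acute accent
const precomposed = 'caf\u00e9';  // single precomposed é

console.log(decomposed === precomposed);           // false
console.log(sameText(decomposed, precomposed));    // true
console.log(precomposed.normalize('NFD').length);  // 5
```

Normalizing both sides before comparing (or before storing) keeps the check independent of which form the input happened to arrive in.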
Summary
This article is mainly my recent notes from relearning character encoding. Because time was short and my knowledge is limited, it surely contains inaccurate descriptions and possibly outright mistakes. If you spot any, please point them out.