[golang] Delving into strings: from byte, rune, and string to Unicode and UTF-8
2022-06-23 20:06:00 【DDGarfield】
Go source code is encoded in UTF-8, so any character can be represented through Unicode. For this purpose Go introduces a term of its own: rune. rune is a type alias for int32:
// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32
In addition, strings are often converted to []byte. To pin down the relationship between rune, byte, and string, we must start with the relationship between man and the universe. No, scratch that! We must start with character encoding.
1. ASCII code
From digital circuits we know that information is encoded and measured in binary. The first modern computers were invented and used by Americans, so English was naturally the first language to be encoded. ASCII maps English characters to binary values and is still in use today. An ASCII code occupies 1 byte with the highest bit fixed at 0, so only 7 bits are used. That gives 2^7 = 128 characters in total, including the 32 non-printable control characters at codes 0 to 31.
2. Unicode
Computers are no longer exclusive to the United States, and the Internet has made the world more interconnected. But there are many writing systems, and each country had its own encoding rules, so the same binary value is interpreted as different symbols under different encodings. If the encoding is not stated, nobody knows how to decode the text. Is there a way to avoid this guessing? Yes: set aside the country-specific encodings and adopt a single unified one, Unicode.
3. UTF-8
Unicode assigns each character a binary code, but it does not specify how that code is stored. Moreover, characters need different numbers of bytes: many Chinese characters have codes more than ten bits long and may need 2, 3, or even 4 bytes. A character should be stored in as many bytes as its Unicode code needs, rather than in the same fixed number of bytes as every other character. After all, Unicode can hold more than a million symbols, and storing them all at the same width would waste space. That leaves one problem to solve: when should 3 bytes be read as 1 character, and when should 1 byte be read as 1 character?
UTF-8 is one way to store Unicode, but not the only one. UTF-16 and UTF-32 are left for interested readers to explore; here we focus on UTF-8. Let's see how UTF-8 solves the problem above:
When is a character read as 1 byte?
If the first bit of the byte is 0, the remaining 7 bits are the character's Unicode code. Seen this way, the UTF-8 encoding of the English alphabet is identical to ASCII.
When is a character read as multiple bytes?
For a character of n bytes (n > 1), the high n bits of the first byte are all 1. In other words:
- If the first byte starts with 0, read 1 byte
- If the first byte starts with n ones, read n bytes
The bit right after those n ones in the first byte is set to 0, and the first two bits of every subsequent byte are set to 10.
0xxxxxxx # read 1 byte
110xxxxx 10xxxxxx # read 2 bytes
1110xxxx 10xxxxxx 10xxxxxx # read 3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx # read 4 bytes
Unicode code range  | UTF-8 encoding
(hexadecimal)       | (binary)
----------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
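The lead-byte rule in the table can be sketched in Go as follows. This is a minimal illustration, not a validator: the function name seqLen is made up for this example, and it only inspects the bit pattern of the first byte without checking the bytes that follow.

```go
package main

import "fmt"

// seqLen (hypothetical helper) reports how many bytes a UTF-8 sequence
// occupies, judging only by the bit pattern of its first byte. It returns 0
// when the byte matches none of the four lead patterns, for example a
// continuation byte of the form 10xxxxxx.
func seqLen(first byte) int {
	switch {
	case first&0x80 == 0x00: // 0xxxxxxx
		return 1
	case first&0xE0 == 0xC0: // 110xxxxx
		return 2
	case first&0xF0 == 0xE0: // 1110xxxx
		return 3
	case first&0xF8 == 0xF0: // 11110xxx
		return 4
	default:
		return 0
	}
}

func main() {
	fmt.Println(seqLen('a'))  // 1
	fmt.Println(seqLen(0xE5)) // 3
	fmt.Println(seqLen(0xBC)) // 0 (continuation byte)
}
```

For real decoding, the standard library's unicode/utf8 package is the safer choice, since it also validates the continuation bytes.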
How is the final UTF-8 encoding produced?
Knowing how many bytes to read is solved, but another question remains: how are the bits of a Unicode code placed into the UTF-8 bytes?
Take the character 张, whose Unicode code is 5F20. That hexadecimal value falls in the range 0000 0800-0000 FFFF, so it takes 3 bytes.
- 1110xxxx 10xxxxxx 10xxxxxx
The Unicode code of 张 in binary is 101 111100 100000.
- Fill the x positions from back to front, padding the high bits with 0 when the code runs out
- 100000 fills the third byte: 10xxxxxx → 10100000
- 111100 fills the second byte: 10xxxxxx → 10111100
- 101 fills the first byte: 1110xxxx → 1110x101
- Pad the remaining high bit with 0: 1110x101 → 11100101
- Final result: 11100101 10111100 10100000, hexadecimal E5BCA0
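The filling steps for code 5F20 can be reproduced with a few shifts and masks. The sketch below assumes a code in the 3-byte range 0x0800-0xFFFF; encode3 is a name invented for this example, and the result is compared against the standard library's utf8.EncodeRune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 (hypothetical helper) packs a code in the range 0x0800-0xFFFF
// into the 1110xxxx 10xxxxxx 10xxxxxx template. It does no range checking.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),     // 1110 + top 4 bits
		0x80 | byte(r>>6)&0x3F, // 10 + middle 6 bits
		0x80 | byte(r)&0x3F,    // 10 + low 6 bits
	}
}

func main() {
	manual := encode3(0x5F20)
	fmt.Printf("% X\n", manual[:]) // E5 BC A0

	// Cross-check with the standard library encoder.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, 0x5F20)
	fmt.Printf("% X\n", buf[:n]) // E5 BC A0
}
```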
4. Strings in Go
The string is one of the most commonly used basic data types in Go. A string is in fact a contiguous block of memory that stores a sequence of bytes; that whole byte array makes up the string.
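A small sketch of what "a string is a sequence of bytes" means in practice: len counts bytes, indexing yields a byte, and the bytes of a string cannot be assigned to. The helper withFirstByte is hypothetical and exists only to show that []byte(s) produces a modifiable copy.

```go
package main

import "fmt"

// withFirstByte (hypothetical helper) returns a copy of s whose first byte
// is replaced by c. The input string itself is never modified.
func withFirstByte(s string, c byte) string {
	if len(s) == 0 {
		return s
	}
	b := []byte(s) // copies the string's bytes
	b[0] = c
	return string(b)
}

func main() {
	s := "golang"
	fmt.Println(len(s)) // 6: len counts bytes
	fmt.Println(s[0])   // 103: indexing yields a byte

	// s[0] = 'G' would not compile: string bytes are read-only.
	fmt.Println(withFirstByte(s, 'G')) // Golang
	fmt.Println(s)                     // golang (unchanged)
}
```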
5. Using rune and byte
An ASCII character
package main
import (
	"fmt"
	"unsafe"
)
func main() {
	s := 'a' // rune
	fmt.Println(s) // 97
	t := unsafe.Sizeof(s)
	fmt.Println(t) // 4
}
a is an ASCII character. Go treats a character enclosed in single quotes ' ' as a rune, and rune is int32, so it occupies 4 bytes.
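To see the same rune from several angles, the fmt verbs %T, %d, %U, and %c are handy. The sketch below uses a made-up helper named describe; the second sample character, 张 (code 5F20), is chosen only for illustration.

```go
package main

import "fmt"

// describe (hypothetical helper) formats a rune as its Go type, decimal
// value, Unicode code, and printed character.
func describe(r rune) string {
	return fmt.Sprintf("%T %d %U %c", r, r, r, r)
}

func main() {
	fmt.Println(describe('a'))  // int32 97 U+0061 a
	fmt.Println(describe('张')) // int32 24352 U+5F20 张
}
```

The type prints as int32 because rune is an alias, not a distinct type.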
An all-ASCII string
package main
import "fmt"
func main() {
	b := "golang"
	fmt.Println(b)
	s_rune := []rune(b)
	s_byte := []byte(b)
	fmt.Println(s_byte) // [103 111 108 97 110 103]
	fmt.Println(s_rune) // [103 111 108 97 110 103]
}
- []rune() converts a string to a rune slice
- []byte() converts a string to a byte slice
- Because the string is all ASCII, the two outputs contain the same integers
A string containing non-ASCII characters
package main
import (
	"fmt"
	"unicode/utf8"
)
func main() {
	c := "go语言"
	s_rune_c := []rune(c)
	s_byte_c := []byte(c)
	fmt.Println(s_rune_c) // [103 111 35821 35328]
	fmt.Println(s_byte_c) // [103 111 232 175 173 232 168 128]
	fmt.Println(utf8.RuneCountInString(c)) // 4
	fmt.Println(len(c)) // 8
	fmt.Println(len(s_rune_c)) // 4
}
- Each of these Chinese characters takes 3 bytes, so the length after converting to []byte is 8
- After converting to []rune, the length is 4
- utf8.RuneCountInString() returns the number of characters (runes) in the UTF-8 encoded string, so it agrees with the []rune length
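Iterating with for range makes the byte/rune difference visible without any conversion: the loop decodes one UTF-8 sequence per step and reports the byte offset where it starts. This is a sketch using the sample string go语言; runeOffsets is a hypothetical helper.

```go
package main

import "fmt"

// runeOffsets (hypothetical helper) returns the byte offset at which each
// character of s starts.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "go语言"
	for i, r := range s {
		// i is a byte offset, r is the decoded rune.
		fmt.Printf("%d: %c (%U)\n", i, r, r)
	}
	fmt.Println(runeOffsets(s)) // [0 1 2 5]
}
```

The offsets jump from 2 to 5 because the first Chinese character occupies bytes 2, 3, and 4.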
6. Output of a Chinese character
package main
import (
	"fmt"
	"unsafe"
)
func main() {
	f := "张"
	s_byte_f := []byte(f)
	s_rune_f := []rune(f)
	t := unsafe.Sizeof(s_byte_f)
	fmt.Println(s_byte_f) // [229 188 160]
	t = unsafe.Sizeof(s_rune_f)
	fmt.Println(s_rune_f) // [24352]
	e := '张'
	s_byte_e := byte(e)
	t = unsafe.Sizeof(s_byte_e)
	fmt.Println(t) // 1
	fmt.Println(s_byte_e) // 32
}
Why 24352? Why [229 188 160]? Why 32?
- The value 24352 printed for 张 is its Unicode code
  - Hexadecimal: 5F20
  - Decimal: 24352
  - Binary: 101111100100000
- It is stored as UTF-8
  - UTF-8 encoding: 11100101 10111100 10100000
    - 11100101 - 229
    - 10111100 - 188
    - 10100000 - 160
- This explains why the converted []byte is [229 188 160]
In Go, byte is just another name for uint8, and the two are interchangeable. Only integers in the range 0~255 convert to byte without loss; for a value outside that range, Go cuts off the extra high bits during the conversion. Converting a rune to byte has one subtlety: the conversion works on the character's Unicode value, not on its UTF-8 bytes. That Unicode value is still beyond what a byte can represent, so only the low 8 bits are kept and the rest are discarded. Of 101111100100000, the low 8 bits are 00100000, which explains the output 32. (The result can be verified for other characters against a Chinese character code table.)
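The truncation can be checked directly. In this sketch, lowByte is a hypothetical helper that spells out what the byte(r) conversion does, and string(r) is used to obtain the actual UTF-8 bytes for comparison.

```go
package main

import "fmt"

// lowByte (hypothetical helper) mirrors what byte(r) does for a rune
// variable: it keeps only the lowest 8 bits of the Unicode value.
func lowByte(r rune) byte {
	return byte(r & 0xFF)
}

func main() {
	r := '张' // Unicode 5F20, decimal 24352

	fmt.Println(byte(r))           // 32: 0x5F20 truncated to 0x20
	fmt.Println(lowByte(r))        // 32
	fmt.Println([]byte(string(r))) // [229 188 160]: the UTF-8 bytes
}
```

So byte(r) is not a way to get a character's encoded bytes; converting through string (or using unicode/utf8) is.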
7. Summary
- A string in Go is a read-only byte slice
- Go treats any single character declared in single quotes as a rune
- []rune() converts a string to a rune array (that is, an array of Unicode codes)
  - One rune represents one Unicode character
  - Every Unicode character is stored in memory in UTF-8 form
  - Printing a Unicode character as []rune converts each UTF-8 sequence back to its Unicode code before output
- []byte() converts a string to a byte array
  - Printing a Unicode character as []byte outputs each of its UTF-8 bytes individually
  - The []byte output is exactly the form in which the string is stored in memory (UTF-8)
- When a Unicode character is cast to another type, its Unicode value is taken first and then converted
- For an ASCII character, the rune value and the byte value are the same
  - This is because the Unicode code of an ASCII character needs only 1 byte and matches its ASCII code