Remember: never use “utf8” in MySQL
2022-06-24 15:35:00 [PHP Development Engineer]
Source: Open source world
Recently I ran into a bug: I was trying to save a UTF-8 string through Rails into a MariaDB database encoded as “utf8”, and I got a strange error:
Incorrect string value: '\xF0\x9F\x98\x83 <…' for column 'summary' at row 1
My client used UTF-8, the server used UTF-8, and so did the database. Even the string I was trying to save, “😃 <…”, was valid UTF-8.
The crux of the problem is that MySQL's “utf8” is not really UTF-8.
“utf8” supports at most three bytes per character, while real UTF-8 needs up to four bytes per character.
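The three-byte limit is easy to check outside the database. The following Python sketch encodes the character behind the bytes in the error message and tests whether a string stays within three bytes per character. The helper name `fits_in_mysql_utf8` is made up for illustration, and the check only approximates what MySQL's “utf8” accepts.

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: it approximates what MySQL's 3-byte "utf8"
    can store by checking the encoded length of each character.
    """
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


smiley = "\U0001F603"  # the character behind \xF0\x9F\x98\x83
smiley_bytes = smiley.encode("utf-8")

print(smiley_bytes)                   # b'\xf0\x9f\x98\x83'
print(len(smiley_bytes))              # 4
print(fits_in_mysql_utf8("summary"))  # True
print(fits_in_mysql_utf8(smiley))     # False
```

A check like this could run in application code before an insert, although switching the column to “utf8mb4” removes the need for it.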
MySQL never fixed this bug. Instead, in 2010 they released a new character set called “utf8mb4”, which works around the problem.
Of course, they did not advertise the new character set (perhaps because the bug embarrassed them). Even now, developers are still advised to use “utf8”, and that advice is wrong.
In short:
- MySQL's “utf8mb4” is real UTF-8.
- MySQL's “utf8” is a proprietary encoding that cannot encode many Unicode characters.
To be clear: every MySQL and MariaDB user who is currently using “utf8” should switch to “utf8mb4” and never use “utf8” again.
So what is an encoding? What is UTF-8?
As we all know, computers store text as 0s and 1s. The character “C”, for example, is stored as “01000011”, so the computer takes two steps to display it:
- The computer reads “01000011” and gets the number 67, because 67 is encoded as “01000011”.
- The computer looks up 67 in the Unicode character set and finds “C”.
Likewise, in the other direction:
- My computer maps “C” to 67 in the Unicode character set.
- My computer encodes 67 as “01000011” and sends it to the web server.
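Both directions can be reproduced with Python built-ins. This is only a sketch of the mapping described above; the variable names are chosen for illustration.

```python
char = "C"

# Sending: character -> Unicode number -> bits.
code_point = ord(char)              # 67
bits = format(code_point, "08b")    # "01000011"

# Reading: bits -> Unicode number -> character.
decoded_number = int(bits, 2)       # 67
decoded_char = chr(decoded_number)  # "C"

print(code_point, bits, decoded_char)  # 67 01000011 C
```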
Almost all web applications use the Unicode character set, because there is no reason to use any other.
The Unicode character set has room for more than a million characters. The simplest encoding is UTF-32, which uses 32 bits for every character. It is the easiest to work with, because computers have long treated 32 bits as a number, and numbers are what computers handle best. The problem is that it wastes space.
UTF-8 saves space. In UTF-8, the character “C” needs only 8 bits, while some less common characters, such as “😃”, need 32 bits. Other characters take 16 or 24 bits. An article like this one, encoded in UTF-8, takes about a quarter of the space it would take in UTF-32.
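The size difference can be measured directly. The sketch below compares the two encodings for a few sample characters and for a short ASCII-only sentence; the sample characters and the sentence are arbitrary choices for illustration.

```python
samples = ["C", "\u00e9", "\u20ac", "\U0001F603"]  # C, é, €, 😃

for ch in samples:
    utf8_bits = len(ch.encode("utf-8")) * 8
    utf32_bits = len(ch.encode("utf-32-be")) * 8  # "-be" avoids the byte order mark
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_bits} bits, UTF-32 {utf32_bits} bits")

text = "Never use utf8 in MySQL"
ratio = len(text.encode("utf-8")) / len(text.encode("utf-32-be"))
print(ratio)  # 0.25 for ASCII-only text
```

The one-quarter figure holds exactly only for ASCII text; text with many non-ASCII characters saves less.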
MySQL's “utf8” character set is not compatible with other programs: hand it a four-byte character such as “😃” and it fails.
A brief history of MySQL
Why did the MySQL developers make “utf8” broken? The commit log may offer some clues.
MySQL has supported UTF-8 since version 4.1, in 2003. The UTF-8 standard we use today (RFC 3629) was published only after that.
The older UTF-8 standard (RFC 2279) allowed up to 6 bytes per character. On March 28, 2002, the MySQL developers used RFC 2279 in the first preview version of MySQL 4.1.
In September of the same year, they changed the MySQL source code: “UTF8 now only supports 3-byte sequences”.
Who committed this code, and why? There is no way to know. After the migration to Git (MySQL originally used BitKeeper), many committer names in the MySQL code base were lost. The mailing list from September 2003 holds no clue that explains the change either.
But I can try to guess.
In 2002, MySQL made a decision: if users could guarantee that every row of a table used the same number of bytes, MySQL could deliver a big performance gain. To get it, users had to define text columns as “CHAR”. Every value in a “CHAR” column has the same number of characters: if you insert fewer characters than the column defines, MySQL pads the value with trailing spaces, and if you insert more, the excess is truncated.
When the MySQL developers first tried UTF-8, they used 6 bytes per character: CHAR(1) took 6 bytes, CHAR(2) took 12 bytes, and so on.
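The arithmetic behind this can be sketched in a few lines of Python. The helpers `store_char` and `char_column_bytes` are made up for illustration; they mimic the padding, truncation, and per-character reservation described above and are not MySQL's actual storage code.

```python
def store_char(value: str, length: int) -> str:
    """Mimic CHAR(length) as described above: truncate, then pad with spaces."""
    return value[:length].ljust(length)


def char_column_bytes(length: int, max_bytes_per_char: int) -> int:
    """Bytes reserved for CHAR(length) if every character gets the maximum width."""
    return length * max_bytes_per_char


print(repr(store_char("ab", 4)))      # 'ab  '
print(repr(store_char("abcdef", 4)))  # 'abcd'

for n in (1, 2, 255):
    # 6 bytes per character (RFC 2279) versus 3 bytes per character ("utf8")
    print(n, char_column_bytes(n, 6), char_column_bytes(n, 3))
```

Halving the per-character maximum halves the space reserved for every fixed-width column, which is the kind of saving the guess in this section points at.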
Their first attempt, it should be said, was the correct one. Unfortunately, that version was never released. But it was what the documentation described, the documentation spread widely, and everyone who knew UTF-8 agreed with what it said.
But evidently the MySQL developers or vendors worried about users who would do these two things:
- Define columns as CHAR (CHAR looks like a relic today, but back then it made MySQL faster; that has not been true since 2005).
- Set the encoding of those CHAR columns to “utf8”.
My guess is that the MySQL developers wanted to help users who were after both space and speed, and botched the “utf8” encoding in the process.
The result was that nobody won. Users who wanted space and speed found that their “utf8” CHAR columns were bigger and slower than expected. Users who wanted correctness found that “utf8” could not store characters like “😃”.
Once the broken character set had shipped, MySQL could not fix it, because a fix would have forced every user to rebuild their databases. In the end, MySQL released “utf8mb4” in 2010 to support real UTF-8.
Why this is so maddening
This problem drove me mad for a whole week. I was fooled by “utf8”, and it took me a long time to track the bug down. I cannot be the only one: almost every article on the web treats “utf8” as real UTF-8.
“utf8” is merely a proprietary character set. It gave us a new problem, and that problem has never been fixed.
Summary
If you are using MySQL or MariaDB, do not use the “utf8” encoding; use “utf8mb4” instead. This guide ( https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4 ) explains how to convert the character encoding of an existing database from “utf8” to “utf8mb4”.
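As a rough illustration of what such a conversion involves, the Python sketch below only builds the SQL text for a database and a list of tables; it does not connect to anything. The helper names, the database name `app_db`, the table names, and the choice of `utf8mb4_unicode_ci` as collation are all assumptions made for this example. Review the statements and back up your data before running anything like them.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def utf8mb4_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER statements that switch a database and its tables to utf8mb4."""
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical database and table names, for illustration only.
for statement in utf8mb4_statements("app_db", ["articles", "comments"]):
    print(statement)
```

The client connection also has to use “utf8mb4”, and wider characters can affect index length limits on some columns; the linked guide covers those steps.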