How to improve data quality
2022-07-24 00:20:00 【000X000】
One 、 Preface
Data quality assurance rests on five key elements: data quality rules, data quality indicators, data exploration, a data assurance mechanism, and data cleaning. Whether you already work on data quality or are planning to, the details below should help.
This chapter covers data quality fundamentals, data quality rules and indicators (template download attached), data exploration (template download attached), the data assurance mechanism, data cleaning (template download attached), and common quality problems (document download attached).

Two 、 Data quality fundamentals
Data quality management (Data Quality Management) refers to the management activities of identifying, measuring, monitoring, and giving early warning of the data quality problems that may arise at every stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application, and retirement), and to further improving data quality by raising the organization's management level.
The six most critical dimensions of data quality are:
1) Integrity: nothing is missing or omitted during data entry and transmission. This covers entity integrity, attribute integrity, record integrity, and field value integrity.
2) Timeliness: relevant data is recorded and transmitted in time to meet the business's time requirements for obtaining information.
3) Validity (effectiveness): the value, format, and presentation of the data meet the requirements of the data definition and the business definition.
4) Consistency (uniformity): data and information are recorded and transmitted in accordance with unified data standards. This mainly shows in whether the data is recorded in a standard way and whether it is logically coherent.
5) Uniqueness: the same piece of data has exactly one identifier.
6) Accuracy: the original data is recorded truthfully and accurately, with no false data or information.
Three 、 Data quality rules and data quality indicators
Data quality rules are the core of data quality work: whether the rules and indicators are complete and reasonably designed determines the quality of the data. The version below is one I synthesized from "Huawei's Way of Data", "The Way of Digital Transformation of Industrial Enterprises", and my own experience. If these rules are in place, data quality should be largely assured. Because the table has many columns, please get the full version from the official account.
| Object | Quality characteristic | Rule type | Indicator |
|---|---|---|---|
| Single column | Integrity | Not-null constraint | Null rate |
| Single column | Validity | Syntax constraint | 1 - ratio of outlier records in the sample |
| Single column | Validity | Format specification | |
| Single column | Validity | Length constraint | |
| Single column | Validity | Range constraint | |
| Single column | Accuracy | Factual reference standard | Ratio of true records in the sample |
| Cross column | Integrity | Expected-null constraint | |
| Cross column | Timeliness | Timely loading into the warehouse | Ratio of sample records meeting time requirements |
| Cross column | Consistency | Single-table equality consistency constraint | |
| Cross column | Consistency | Single-table logical consistency constraint | |
| Cross row | Uniqueness | Record uniqueness | |
| Cross row | Consistency | Hierarchical consistency constraint | |
| Cross table | Consistency | Foreign-key association constraint | Ratio of sample records whose foreign key has no corresponding primary key |
| Cross table | Consistency | Cross-table equality consistency constraint | |
| Cross table | Consistency | Cross-table logical consistency constraint | |
| Cross system | Consistency | Cross-system record consistency constraint | Matching rate of sample records with other systems |
| Cross system | Timeliness | Timely loading into the warehouse | Ratio of sample records meeting time requirements |
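Three of the indicators in the table above (null rate, record uniqueness, and foreign keys without a matching primary key) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function names, the `orders` and `customers` sample records, and the choice to treat `None` and the empty string as null are all hypothetical and not taken from the full template.

```python
from collections import Counter
from typing import Any, Mapping, Sequence

Row = Mapping[str, Any]


def null_rate(rows: Sequence[Row], column: str) -> float:
    """Single column, integrity: share of records whose value is missing."""
    if not rows:
        return 0.0
    # Assumption: None and the empty string both count as null.
    missing = sum(1 for r in rows if r.get(column) in (None, ""))
    return missing / len(rows)


def duplicate_key_ratio(rows: Sequence[Row], key: str) -> float:
    """Cross row, uniqueness: share of records whose key occurs more than once."""
    if not rows:
        return 0.0
    counts = Counter(r.get(key) for r in rows)
    duplicated = sum(1 for r in rows if counts[r.get(key)] > 1)
    return duplicated / len(rows)


def orphan_ratio(child_rows: Sequence[Row], fk: str,
                 parent_rows: Sequence[Row], pk: str) -> float:
    """Cross table, consistency: share of child records whose foreign key
    has no corresponding primary key in the parent table."""
    if not child_rows:
        return 0.0
    parent_keys = {p.get(pk) for p in parent_rows}
    orphans = sum(1 for c in child_rows if c.get(fk) not in parent_keys)
    return orphans / len(child_rows)


# Hypothetical sample records, for illustration only.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": None},
    {"order_id": 2, "customer_id": 99, "amount": 12.5},
    {"order_id": 4, "customer_id": 10, "amount": 7.0},
]
customers = [{"customer_id": 10}, {"customer_id": 11}]

print(null_rate(orders, "amount"))
print(duplicate_key_ratio(orders, "order_id"))
print(orphan_ratio(orders, "customer_id", customers, "customer_id"))
```

Each function returns a ratio between 0 and 1, so the results can be compared directly against whatever thresholds the quality rules define.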
Four 、 Data exploration
Data exploration is a very important step in data quality assurance. It is the foundation of design: it rules out objective causes of bad data early, so that good design can improve both efficiency and quality. Without data exploration, data work tends to be redone many times, which can lead to staff turnover, difficult handovers, hard maintenance, long project cycles, and similar problems.
Only a few aspects of data exploration are listed here for reference. For specific cases, please get them from the official account.
The common problems identified and their categories are also available from the official account.
| Probe item | Analytical significance | Analysis point | Analysis point interpretation |
|---|---|---|---|
| Integrity analysis | Ensure the reliability of the analysis | Number of null records | Number of records with no value in the probed field at the probe time point |
| Integrity analysis | Ensure the reliability of the analysis | Total number of records | Total number of records for the probed field at the probe time point |
| Integrity analysis | Ensure the reliability of the analysis | Missing rate | Proportion of records with missing information in the total records of the probed field at the probe time point |
| Integrity analysis | Ensure the reliability of the analysis | Null value alert | Give an early warning when the missing rate of the probed field at the probe time point is higher than 10% |
| Integrity analysis | Ensure the reliability of the analysis | Primary key uniqueness | Check whether the primary key field has duplicate records at the probe time point |
| Range analysis | Check whether there is abnormal data | Maximum value | Maximum value of a numeric or date field at the probe time point |
| Range analysis | Check whether there is abnormal data | Minimum value | Minimum value of a numeric or date field at the probe time point |
| Enumeration value analysis | List all enumeration values of the probed field | Enumeration range | Enumeration values defined for the attribute field |
| Enumeration value analysis | List all enumeration values of the probed field | Actual enumeration values | Actual enumeration values of the attribute field and their distribution at the probe time point |
| Enumeration value analysis | List all enumeration values of the probed field | Abnormal proportion | Proportion of records whose value falls outside the defined enumeration range in the total number of records at the probe time point |
| Logic analysis | | Business logic | Check whether the field follows the business logic |
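The probe items in the table above can be approximated with a small profiling helper. The following is a minimal sketch under assumptions: `profile_column`, `has_duplicate_keys`, and the sample `rows` are hypothetical names introduced for illustration, and `None` and the empty string are both treated as null. Only the 10% alert threshold comes from the table.

```python
from collections import Counter
from typing import Any, Dict, Mapping, Optional, Sequence, Set

Row = Mapping[str, Any]

# From the table: warn when the missing rate is higher than 10%.
NULL_ALERT_THRESHOLD = 0.10


def profile_column(rows: Sequence[Row], column: str,
                   allowed_values: Optional[Set[Any]] = None) -> Dict[str, Any]:
    """Integrity, range, and (optionally) enumeration analysis for one field."""
    total = len(rows)
    values = [r.get(column) for r in rows]
    # Assumption: None and the empty string both count as null.
    present = [v for v in values if v is not None and v != ""]
    null_count = total - len(present)
    missing_rate = null_count / total if total else 0.0
    report: Dict[str, Any] = {
        "total_records": total,
        "null_records": null_count,
        "missing_rate": missing_rate,
        "null_alert": missing_rate > NULL_ALERT_THRESHOLD,
        # Range analysis; assumes the non-null values are mutually comparable.
        "max": max(present) if present else None,
        "min": min(present) if present else None,
    }
    if allowed_values is not None:
        distribution = Counter(present)
        outside = sum(n for v, n in distribution.items()
                      if v not in allowed_values)
        report["actual_values"] = dict(distribution)
        # Per the table, the denominator is the total number of records.
        report["abnormal_ratio"] = outside / total if total else 0.0
    return report


def has_duplicate_keys(rows: Sequence[Row], key: str) -> bool:
    """Primary key uniqueness: True if any key value appears more than once."""
    keys = [r.get(key) for r in rows]
    return len(keys) != len(set(keys))


# Hypothetical sample records, for illustration only.
rows = [
    {"id": 1, "status": "open", "amount": 10},
    {"id": 2, "status": "closed", "amount": None},
    {"id": 3, "status": "unknown", "amount": 30},
    {"id": 3, "status": "open", "amount": 5},
]

amount_report = profile_column(rows, "amount")
status_report = profile_column(rows, "status", allowed_values={"open", "closed"})
print(amount_report)
print(status_report)
print(has_duplicate_keys(rows, "id"))
```

Logic analysis is left out on purpose: checking business logic depends on rules specific to each field, so it does not fit a generic helper like this one.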
Five 、 Data quality assurance mechanism
Continuous improvement of data quality depends on an assurance mechanism. Only by monitoring data quality automatically, routinely, and continuously can the quality of data keep improving. Data quality assurance mainly includes the following key steps:
Design quantitative indicators -> design quality scoring rules -> design score-based assessment -> monitor abnormal data -> display indicators -> push reminders to the responsible persons according to the rules

Example: if the null rate exceeds 5%, record 1 point; issue a daily null-rate indicator warning and a daily department-wide notification; the score affects the year-end assessment.
This part needs to be designed in detail according to the company's actual situation.
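The scoring and reminder steps can be sketched as follows, using the example rule above (null rate over 5% records 1 point). This is an illustration under assumptions, not a prescribed design: `ScoringRule`, `evaluate`, the indicator names, and the owner `sales-data-team` are all hypothetical, and a real system would deliver the alerts through the company's own notification channel.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ScoringRule:
    indicator: str    # name of the measured indicator (hypothetical naming)
    threshold: float  # the rule fires when the measured value exceeds this
    points: int       # points recorded against the owner when it fires
    owner: str        # responsible person or team to be reminded


def evaluate(rules: Sequence[ScoringRule],
             measurements: Dict[str, float]) -> Tuple[Dict[str, int], List[str]]:
    """Apply scoring rules to today's measurements.

    Returns the points recorded per owner and the alert messages to push.
    """
    points_by_owner: Dict[str, int] = {}
    alerts: List[str] = []
    for rule in rules:
        value = measurements.get(rule.indicator)
        if value is None:
            continue  # indicator was not measured today
        if value > rule.threshold:
            points_by_owner[rule.owner] = (
                points_by_owner.get(rule.owner, 0) + rule.points
            )
            alerts.append(
                f"{rule.owner}: {rule.indicator}={value:.1%} exceeds "
                f"{rule.threshold:.0%}, +{rule.points} point(s)"
            )
    return points_by_owner, alerts


# Hypothetical rules and measurements, for illustration only.
rules = [
    ScoringRule("orders.amount.null_rate", 0.05, 1, "sales-data-team"),
    ScoringRule("orders.customer_id.null_rate", 0.05, 1, "sales-data-team"),
]
measurements = {
    "orders.amount.null_rate": 0.08,
    "orders.customer_id.null_rate": 0.01,
}

points, alerts = evaluate(rules, measurements)
print(points)
for message in alerts:
    print(message)
```

Running this once a day and accumulating the points per owner gives the kind of score that could feed into a periodic assessment.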
Six 、 Data cleaning
Data cleaning (Data cleaning) is the process of re-examining and verifying data. Its purpose is to remove duplicate information, correct existing errors, and make the data consistent. The data to be cleaned falls into three main categories: incomplete data, erroneous data, and duplicate data.
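The three categories can be separated in a single pass over the records. The sketch below is a simplified illustration under assumptions: `clean`, the `required` and `validators` parameters, and the sample records are hypothetical, and instead of correcting erroneous values it only sets them aside with a reason, since the right correction depends on the business rules.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence, Tuple

Row = Dict[str, Any]


def clean(rows: Sequence[Row],
          required: Sequence[str],
          validators: Mapping[str, Callable[[Any], bool]],
          ) -> Tuple[List[Row], List[Tuple[str, Row]]]:
    """Split records into cleaned rows and rejected rows tagged with a reason."""
    cleaned: List[Row] = []
    rejected: List[Tuple[str, Row]] = []
    seen = set()
    for row in rows:
        # 1) Incomplete data: a required field is missing or empty.
        if any(row.get(col) in (None, "") for col in required):
            rejected.append(("incomplete", row))
            continue
        # 2) Erroneous data: a field fails its validation rule.
        if any(not check(row.get(col)) for col, check in validators.items()):
            rejected.append(("invalid", row))
            continue
        # 3) Duplicate data: an identical record was already kept.
        #    Assumes all field values are hashable.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            rejected.append(("duplicate", row))
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned, rejected


# Hypothetical sample records, for illustration only.
raw = [
    {"id": 1, "email": "a@example.com", "age": 30},
    {"id": 2, "email": "", "age": 41},
    {"id": 3, "email": "c@example.com", "age": -5},
    {"id": 1, "email": "a@example.com", "age": 30},
]

cleaned, rejected = clean(
    raw,
    required=["id", "email", "age"],
    validators={"age": lambda v: isinstance(v, int) and 0 <= v <= 150},
)
print(cleaned)
print([reason for reason, _ in rejected])
```

Keeping the rejected rows together with their reason makes it possible to review them and fix them at the source rather than silently dropping them.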