How to improve data quality
2022-07-24 00:20:00 【000X000】
1. Preface
The key steps of data quality assurance are data quality rules and indicators, data exploration, the data assurance mechanism, and data cleaning. Whether you are already doing data quality work or planning to, studying these steps in detail should help.
This chapter covers data quality fundamentals, data quality rules and indicators (template available for download), data exploration (template available for download), the data assurance mechanism, data cleaning (template available for download), and common quality problems (documentation available for download).

2. Data quality fundamentals
Data Quality Management refers to the management activities of identifying, measuring, monitoring, and warning about the data quality problems that can arise at every stage of the data life cycle, from planning, acquisition, storage, sharing, and maintenance through application and retirement, and to further improving data quality by raising the organization's management level.
The six most critical dimensions of data quality are:
1) Completeness: no data is missing or omitted during entry or transmission, covering entity completeness, attribute completeness, record completeness, and field-value completeness.
2) Timeliness: data is recorded and transmitted promptly, meeting the business's time requirements for information access.
3) Validity: data values, formats, and presentation conform to both the data definition and the business definition.
4) Consistency: data is recorded and transmitted according to unified data standards, mainly reflected in whether records are standardized and logically consistent.
5) Uniqueness: the same data has only one unique identifier.
6) Accuracy: original data is recorded truthfully and accurately, with no false data or information.
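The dimensions above can be expressed as simple checks. Below is a minimal sketch of three of them (completeness, validity, uniqueness) over a list of records; the field names and the phone-number format rule are illustrative, not from the original article.

```python
import re

# Illustrative sample records; the second violates completeness,
# the third violates validity (bad format) and uniqueness (repeated id).
records = [
    {"id": 1, "phone": "13800138000", "name": "Alice"},
    {"id": 2, "phone": "", "name": "Bob"},
    {"id": 2, "phone": "not-a-phone", "name": "Eve"},
]

def completeness(recs, field):
    """Ratio of records where `field` is present and non-empty."""
    return sum(1 for r in recs if r.get(field)) / len(recs)

def validity(recs, field, pattern):
    """Ratio of non-empty values matching a format rule (regex)."""
    values = [r[field] for r in recs if r.get(field)]
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def uniqueness(recs, key):
    """True if `key` uniquely identifies every record."""
    keys = [r[key] for r in recs]
    return len(keys) == len(set(keys))
```

In practice each function would run against sampled records and feed the indicator columns of the rules table in the next section.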
3. Data quality rules and indicators
Data quality rules are the core of data quality work: whether the rules and their indicators are completely and reasonably designed determines the quality of the data. The following is a version I compiled from Huawei's approach to data, digital-transformation practice in industrial enterprises, and my own experience. If these rules are in place, data quality should be well guaranteed. Because the table has many columns, the full version is available from the official account.
| Object | Quality dimension | Rule type | Indicator |
| --- | --- | --- | --- |
| Single column | Completeness | Not-null constraint | Null rate |
| Single column | Validity | Syntax constraint | 1 − ratio of outliers in sampled records |
| Single column | Validity | Format constraint | |
| Single column | Validity | Length constraint | |
| Single column | Validity | Range constraint | |
| Single column | Accuracy | Fact reference standard | Ratio of true records in sampled records |
| Cross-column | Completeness | Conditional null constraint | |
| Cross-column | Timeliness | Timely loading | Ratio of sampled records meeting time requirements |
| Cross-column | Consistency | Single-table equivalence constraint | |
| Cross-column | Consistency | Single-table logical consistency constraint | |
| Cross-row | Uniqueness | Unique record constraint | |
| Cross-row | Consistency | Hierarchical consistency constraint | |
| Cross-table | Consistency | Foreign-key reference constraint | Ratio of sampled records whose foreign key has no corresponding primary key |
| Cross-table | Consistency | Cross-table equivalence constraint | |
| Cross-table | Consistency | Cross-table logical consistency constraint | |
| Cross-system | Consistency | Cross-system record consistency constraint | Match rate of sampled records against other systems |
| Cross-system | Timeliness | Timely loading | Ratio of sampled records meeting time requirements |
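Two of the table's indicators are easy to make concrete: the single-column null rate, and the cross-table ratio of foreign keys with no corresponding primary key. The sketch below uses illustrative order/customer tables, not data from the article.

```python
# Illustrative tables: one order has a null customer_id, one references
# a customer (C9) that does not exist in the customers table.
orders = [
    {"order_id": "O1", "customer_id": "C1", "amount": 100},
    {"order_id": "O2", "customer_id": None, "amount": 80},
    {"order_id": "O3", "customer_id": "C9", "amount": 50},
]
customers = [{"customer_id": "C1"}, {"customer_id": "C2"}]

def null_rate(rows, col):
    """Single-column indicator: share of rows where `col` is null."""
    return sum(1 for r in rows if r[col] is None) / len(rows)

def orphan_fk_rate(child, fk, parent, pk):
    """Cross-table indicator: share of non-null foreign keys in `child`
    that have no corresponding primary key in `parent`."""
    pks = {r[pk] for r in parent}
    non_null = [r for r in child if r[fk] is not None]
    return sum(1 for r in non_null if r[fk] not in pks) / len(non_null)
```

Here `null_rate(orders, "customer_id")` gives 1/3 and `orphan_fk_rate(orders, "customer_id", customers, "customer_id")` gives 1/2, matching the table's "null rate" and "foreign key with no corresponding primary key" indicators.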
4. Data exploration
Data exploration is a critical step in data quality assurance. It is the foundation of design: by surfacing objective problems up front, it lets efficiency and quality be improved at the design stage. Without data exploration, data projects are typically reworked many times and can suffer from personnel turnover, difficult handover, hard maintenance, and long delivery cycles.
Only a few aspects of data exploration are listed here for reference; specific cases, along with the common problems and categories identified, are available from the official account.
| Probe item | Purpose | Analysis point | Interpretation |
| --- | --- | --- | --- |
| Completeness analysis | Ensure the reliability of the analysis | Number of null records | Number of records where the probed field has no value at the probe time point |
| Completeness analysis | | Total number of records | Total records of the probed field at the probe time point |
| Completeness analysis | | Missing rate | Proportion of records missing the probed field out of the total records at the probe time point |
| Completeness analysis | | Null-value alert | Warn when the missing rate of the probed field exceeds 10% at the probe time point |
| Completeness analysis | | Primary-key uniqueness | Check whether the primary-key field has duplicate records at the probe time point |
| Range analysis | Detect abnormal data | Maximum | Maximum value of numeric and date fields at the probe time point |
| Range analysis | | Minimum | Minimum value of numeric and date fields at the probe time point |
| Enumeration analysis | List all enumeration values of the probed field | Defined enumeration range | Enumeration values defined for the attribute field |
| Enumeration analysis | | Actual enumeration values | Actual enumeration values of the attribute field and their distribution at the probe time point |
| Enumeration analysis | | Abnormal ratio | Proportion of enumeration values outside the defined range out of total records at the probe time point |
| Logic probing | Verify business logic | Business-rule check | Check whether the field conforms to the defined business logic |
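The analysis points above can be combined into a single profiling pass over one field. The sketch below is a minimal illustration: it computes null count, total, missing rate (with the 10% alert), min/max, and the enumeration distribution with its abnormal ratio; the field name and enum domain in the example are invented for illustration.

```python
from collections import Counter

def profile(rows, field, enum_domain=None, alert_threshold=0.10):
    """Minimal data-exploration report for one field at one probe time point."""
    total = len(rows)
    values = [r.get(field) for r in rows]
    nulls = sum(1 for v in values if v in (None, ""))
    missing_rate = nulls / total
    present = [v for v in values if v not in (None, "")]
    report = {
        "total": total,
        "nulls": nulls,
        "missing_rate": missing_rate,
        "alert": missing_rate > alert_threshold,   # null-value alert (>10%)
        "min": min(present) if present else None,  # range analysis
        "max": max(present) if present else None,
    }
    if enum_domain is not None:                    # enumeration analysis
        dist = Counter(present)
        out_of_range = sum(n for v, n in dist.items() if v not in enum_domain)
        report["enum_distribution"] = dict(dist)
        report["abnormal_ratio"] = out_of_range / total
    return report

# Illustrative run: one null and one out-of-domain status value.
rows = [{"status": "A"}, {"status": "B"}, {"status": "X"}, {"status": None}]
report = profile(rows, "status", enum_domain={"A", "B"})
```

In a real project this would run per field, per snapshot date, and the output would be stored so trends in missing rate and abnormal ratio can be tracked over time.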
5. Data quality assurance mechanism
Continuous improvement of data quality depends on an assurance mechanism: only by monitoring data quality automatically, regularly, and continuously can quality keep improving. Data quality assurance mainly includes the following key steps:
Design quantitative indicators -> design quality scoring rules -> design score assessment -> monitor abnormal data -> display indicators -> push reminders to the responsible persons according to the rules

Example: if the null rate exceeds 5%, deduct 1 point, trigger a daily null-rate indicator warning with a department-wide daily notification, and factor the score into the year-end assessment.
This part needs to be designed in detail according to the company's actual situation.
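The scoring step of the mechanism can be sketched as a small rule table plus an evaluation function. The rule below mirrors the example above (null rate > 5% deducts 1 point and raises a warning); the rule structure and message text are illustrative assumptions, not a prescribed design.

```python
# Illustrative scoring rules; in practice these would be configured per
# indicator and wired to a notification channel (email, IM, dashboard).
RULES = [
    {"metric": "null_rate", "threshold": 0.05, "penalty": 1,
     "message": "daily null-rate indicator warning"},
]

def score_and_alert(metrics, rules=RULES, base_score=100):
    """Apply scoring rules to measured indicators.

    Returns the quality score and the list of warnings to push to the
    responsible persons."""
    score, alerts = base_score, []
    for rule in rules:
        if metrics.get(rule["metric"], 0) > rule["threshold"]:
            score -= rule["penalty"]
            alerts.append(rule["message"])
    return score, alerts
```

Run daily against the measured indicators, this produces both the score used for assessment and the alert list pushed to owners, covering the last three steps of the pipeline above.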
6. Data cleaning
Data cleaning is the process of re-examining and verifying data in order to remove duplicate information, correct existing errors, and ensure data consistency. Problem data falls mainly into three categories: incomplete data, erroneous data, and duplicate data.
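A minimal cleaning pass that handles the three categories in order might look like the sketch below. The policy shown (drop incomplete rows, drop invalid rows, keep the first of any duplicates) is one simple choice; real pipelines often backfill or correct instead of dropping.

```python
def clean(rows, required, valid):
    """Handle the three problem categories:
    incomplete data (missing required fields), erroneous data (fails the
    `valid` predicate), and duplicate data (exact repeats)."""
    seen, out = set(), []
    for r in rows:
        if any(r.get(f) in (None, "") for f in required):
            continue                       # incomplete: drop (or route for backfill)
        if not valid(r):
            continue                       # erroneous: drop (or correct)
        key = tuple(sorted(r.items()))
        if key in seen:
            continue                       # duplicate: keep first occurrence
        seen.add(key)
        out.append(r)
    return out

# Illustrative run: one incomplete row, one invalid row, one duplicate.
dirty = [
    {"id": 1, "v": 5},
    {"id": None, "v": 3},
    {"id": 3, "v": -1},
    {"id": 1, "v": 5},
    {"id": 4, "v": 2},
]
cleaned = clean(dirty, required=["id"], valid=lambda r: r["v"] >= 0)
```

Which policy to apply per category (drop, correct, or send back to the source system) is itself a design decision that should follow the assurance mechanism above.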