N methods of data deduplication using SQL
2022-06-28 01:02:00 【Ink Sky Wheel】
I remember that many years ago, a tester on the team came to me and asked:
Brother Qiang, the data in my table is duplicated. How do I delete the duplicate rows?
Scenarios like this, where data needs to be deduplicated, are quite common in day-to-day work.
Let's go through the common ways to deduplicate data with SQL statements.
Suppose we have a student table:
create table student(id int,name varchar(50),age int,address varchar(100));
The data in the table is as follows:

Method 1: Use the DISTINCT keyword
The DISTINCT keyword is followed by the fields to deduplicate on, and it guarantees that the values of those fields are not repeated in the result.
For example, to list the distinct address values in the student table, you can use the following SQL statement:
select distinct address
from student;
The result is as follows:

The biggest advantage of this method is that it is easy to use.
But it also has a big drawback: the fields used for deduplication and the fields in the returned result set are always the same. In the SQL statement above, the address field is used for deduplication, so the result contains only the address field.
If you want to deduplicate on address and return other fields at the same time, DISTINCT cannot do it.
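To make this limitation concrete, here is a minimal sketch that runs the DISTINCT query through Python's built-in sqlite3 module. The use of SQLite and the five sample rows are assumptions made for illustration; they are not the article's original data.

```python
import sqlite3

# Hypothetical rows for illustration; they are not the article's data.
ROWS = [
    (1, "anna", 20, "beijing"),
    (2, "zoe", 19, "beijing"),
    (3, "li", 21, "shanghai"),
    (4, "wang", 23, "shanghai"),
    (5, "chen", 19, "guangzhou"),
]


def distinct_addresses():
    """Load the sample rows into an in-memory table and run Method 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table student(id int, name varchar(50), age int, address varchar(100))"
    )
    conn.executemany("insert into student values (?, ?, ?, ?)", ROWS)
    rows = conn.execute("select distinct address from student").fetchall()
    conn.close()
    # Each result row is a 1-tuple: only the address column comes back.
    return sorted(row[0] for row in rows)


print(distinct_addresses())
```

Five rows collapse to three addresses, and no id, name or age is available in the result.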
Method 2: Use the GROUP BY keyword
Like DISTINCT, GROUP BY is a common deduplication approach supported by standard SQL. Unlike DISTINCT, it can return other fields while deduplicating.
Taking deduplication on the address field as the example again, the other fields can be obtained as needed with aggregate functions:
select min(id),max(name),max(age),address
from student
group by address;
The result is as follows:

The statement above not only deduplicates on the address field, it also returns the id, name and age fields.
In this respect it is more useful than DISTINCT.
However, on closer inspection, something is wrong.
The student with id=1 should be zhoujunting, but the returned result shows yangxiaoyu. The returned age field has the same problem.
In other words, because each aggregate function is evaluated independently, the id, name and age in one returned row may not belong to the same student, which makes the data look confusing.
If the returned fields need to come from the same row, use Method 3 below.
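The mixing effect can be reproduced with a small sketch. It assumes SQLite (via Python's sqlite3 module) and hypothetical sample rows, chosen so that the smallest id and the alphabetically largest name at one address belong to different students.

```python
import sqlite3

# Hypothetical rows for illustration; they are not the article's data.
ROWS = [
    (1, "anna", 20, "beijing"),
    (2, "zoe", 19, "beijing"),
    (3, "li", 21, "shanghai"),
    (4, "wang", 23, "shanghai"),
    (5, "chen", 19, "guangzhou"),
]


def group_by_address():
    """Run the Method 2 query and return one row per address."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table student(id int, name varchar(50), age int, address varchar(100))"
    )
    conn.executemany("insert into student values (?, ?, ?, ?)", ROWS)
    rows = conn.execute(
        "select min(id), max(name), max(age), address "
        "from student group by address order by address"
    ).fetchall()
    conn.close()
    return rows


for row in group_by_address():
    # A row that is not in ROWS was stitched together from several students.
    print(row, "real row" if row in ROWS else "mixed row")
```

For the beijing group the query returns id 1 (anna's) together with the name zoe, a combination that does not exist in the table.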
Method 3: Use a window function
There are several window functions and they are used in similar ways, so only ROW_NUMBER() over(partition by ... order by ...) is introduced here.
select id,name,age,address
from (
    select id,name,age,address,
           row_number() over(partition by address order by id asc) as rn
    from student
) a
where a.rn = 1;
ROW_NUMBER() first groups the rows by the partition by fields, then sorts the rows within each group by the order by fields, and numbers them starting from 1.
The SQL above returns:

This result is much better: every returned row is a real row from the table.
Note, however, that some databases do not support window functions. MySQL, for example, only supports them from version 8.0 onward.
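As a rough check of the window-function approach, the sketch below runs the same query in SQLite through Python's sqlite3 module. The sample rows are hypothetical, and SQLite itself needs to be version 3.25 or later for window functions to be available.

```python
import sqlite3

# Hypothetical rows for illustration; they are not the article's data.
ROWS = [
    (1, "anna", 20, "beijing"),
    (2, "zoe", 19, "beijing"),
    (3, "li", 21, "shanghai"),
    (4, "wang", 23, "shanghai"),
    (5, "chen", 19, "guangzhou"),
]

# Method 3: number the rows inside each address group, keep number 1.
QUERY = """
select id, name, age, address
from (
    select id, name, age, address,
           row_number() over (partition by address order by id asc) as rn
    from student
) a
where a.rn = 1
"""


def first_row_per_address():
    """Return the row with the smallest id for each address."""
    conn = sqlite3.connect(":memory:")  # window functions need SQLite >= 3.25
    conn.execute(
        "create table student(id int, name varchar(50), age int, address varchar(100))"
    )
    conn.executemany("insert into student values (?, ?, ?, ?)", ROWS)
    rows = sorted(conn.execute(QUERY).fetchall())
    conn.close()
    return rows


for row in first_row_per_address():
    print(row)
```

Changing `order by id asc` to `order by id desc` would keep the row with the largest id in each group instead.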
Method 4: Use IN
The key to this method is to find a property that identifies exactly one row in each group of duplicates, and then select the rows that have that property.
For example, deduplicate by address and, where rows are duplicated, keep the one with the largest id.
select *
from student
where id in (
    select max(id)
    from student
    group by address
);
The result is as follows:

Of course, you can also keep the row with the smallest id by changing max to min in the statement above.
This method is suitable when the table has a field whose values are never repeated (the id field in the SQL above).
If the table has no such field, the method no longer applies. However, some databases have a similar built-in field that can be used instead.
For example, in an ORACLE database you can use ROWID in place of the id field in the SQL above. Note that ROWID as used here is specific to ORACLE; other databases may offer a different built-in identifier or none at all:
select *
from student
where rowid in (
    select max(rowid)
    from student
    group by address
);
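The sketch below tries both variants in SQLite through Python's sqlite3 module. SQLite happens to expose its own implicit rowid column, which stands in here for ORACLE's ROWID. That substitution and the sample rows are assumptions made for illustration, and the two rowid mechanisms are not the same feature.

```python
import sqlite3

# Hypothetical rows for illustration; they are not the article's data.
ROWS = [
    (1, "anna", 20, "beijing"),
    (2, "zoe", 19, "beijing"),
    (3, "li", 21, "shanghai"),
    (4, "wang", 23, "shanghai"),
    (5, "chen", 19, "guangzhou"),
]

BY_ID = "select * from student where id in (select max(id) from student group by address)"
# SQLite's implicit rowid is used as a stand-in for ORACLE's ROWID.
BY_ROWID = "select * from student where rowid in (select max(rowid) from student group by address)"


def keep_last_per_address(query):
    """Run one of the Method 4 queries against the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table student(id int, name varchar(50), age int, address varchar(100))"
    )
    conn.executemany("insert into student values (?, ?, ?, ?)", ROWS)
    rows = sorted(conn.execute(query).fetchall())
    conn.close()
    return rows


print(keep_last_per_address(BY_ID))
print(keep_last_per_address(BY_ROWID))
```

Both queries return the same rows here only because the sample rows were inserted in id order, so the largest rowid and the largest id in each group coincide.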
Method 5: Use NOT EXISTS
The idea is similar to Method 4, and NOT EXISTS achieves the same effect. The statement below keeps, for each address, the row for which no other row with a smaller id exists:
select *
from student a
where not exists (
    select 1
    from student b
    where a.address = b.address
    and a.id > b.id
);
The result is as follows:

Method 6: Use the ALL keyword
MySQL supports the ALL operator, which compares a value against every value returned by a subquery.
select *
from student a
where a.id <= ALL (
    select b.id
    from student b
    where a.address = b.address
);
The result is as follows:

In the SQL above, the ALL operator means that a.id must be less than or equal to every value returned by the subquery in parentheses, so only the row with the smallest id for each address is kept.
The core idea of this method is therefore similar to Method 4.
Method 7: Use INNER JOIN + GROUP BY
The core idea of this method is also similar to Method 4.
select a.*
from student a
inner join student b
on a.address = b.address
and a.id >= b.id
group by a.id,a.name,a.age,a.address
having count(*)=1;
The result is as follows:

With these 7 deduplication methods at hand, you can solve most data deduplication problems.
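Since Methods 5 and 7 are described as variations on the idea behind Method 4, a quick way to see that is to run them side by side. The sketch below does so in SQLite through Python's sqlite3 module, using the min(id) form of Method 4 so that all three keep the smallest id. The sample rows and the dictionary keys are hypothetical.

```python
import sqlite3

# Hypothetical rows for illustration; they are not the article's data.
ROWS = [
    (1, "anna", 20, "beijing"),
    (2, "zoe", 19, "beijing"),
    (3, "li", 21, "shanghai"),
    (4, "wang", 23, "shanghai"),
    (5, "chen", 19, "guangzhou"),
]

QUERIES = {
    # Method 4, min(id) variant
    "in_min_id": """
        select * from student
        where id in (select min(id) from student group by address)
    """,
    # Method 5
    "not_exists": """
        select * from student a
        where not exists (
            select 1 from student b
            where a.address = b.address and a.id > b.id
        )
    """,
    # Method 7
    "join_group_by": """
        select a.* from student a
        inner join student b
            on a.address = b.address and a.id >= b.id
        group by a.id, a.name, a.age, a.address
        having count(*) = 1
    """,
}


def run_all():
    """Run every query against the same sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table student(id int, name varchar(50), age int, address varchar(100))"
    )
    conn.executemany("insert into student values (?, ?, ?, ?)", ROWS)
    results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in QUERIES.items()}
    conn.close()
    return results


for name, rows in run_all().items():
    print(name, rows)
```

On this sample data all three queries keep the rows with ids 1, 3 and 5.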
Of course, if you know a better way, feel free to leave a comment below.