N Methods of Data Deduplication Using SQL

2022-06-28 01:02:00 【Ink Sky Wheel】

I remember that many years ago, a tester came to me and asked:

Brother Qiang, the data in my table is duplicated. How do I delete the duplicate rows?

Scenarios like this, where data has to be deduplicated, are quite common in practical work.

Let's go through the common ways to deduplicate data with SQL statements.

Suppose we have a student table:

create table student(
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);

The data in the table is as follows: [image: contents of the student table]
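The queries in this article only show an effect when some addresses are repeated. The rows below are hypothetical sample data, invented for illustration, so the names differ from those in the original table. They assume the student table created above is still empty.

```sql
-- Hypothetical sample rows (names, ages and addresses are invented).
-- 'Beijing' appears three times, 'Shanghai' twice, 'Shenzhen' once.
insert into student (id, name, age, address) values (1, 'anna', 20, 'Beijing');
insert into student (id, name, age, address) values (2, 'ben', 22, 'Beijing');
insert into student (id, name, age, address) values (3, 'cara', 21, 'Shanghai');
insert into student (id, name, age, address) values (4, 'dan', 19, 'Beijing');
insert into student (id, name, age, address) values (5, 'eve', 23, 'Shanghai');
insert into student (id, name, age, address) values (6, 'finn', 20, 'Shenzhen');

-- Quick check: count the rows per address.
select address, count(*) as cnt
from student
group by address
order by address;
```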

Method 1: Deduplicate with the DISTINCT keyword

The DISTINCT keyword is followed by the columns to deduplicate on. It guarantees that no combination of values in those columns appears more than once in the result.

For example, to list the distinct address values in the student table, you can use the following SQL statement:

select distinct address 
from student;

The result is as follows: [image: query result]

The biggest advantage of this method is that it is easy to use.

But it also has a big drawback: the columns used for deduplication and the columns in the returned result set are always the same. In other words, the SQL statement above deduplicates on address, so the result contains only the address column.

If you want to deduplicate on address and return other columns at the same time, DISTINCT cannot do it.
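The reason is that DISTINCT applies to the whole select list, not only to the first column. A minimal sketch of this behaviour, using a hypothetical distinct_demo table with made-up rows:

```sql
-- Hypothetical demo table and rows, for illustration only.
create table distinct_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into distinct_demo values (1, 'anna', 20, 'Beijing');
insert into distinct_demo values (2, 'ben', 22, 'Beijing');
insert into distinct_demo values (3, 'cara', 21, 'Shanghai');

-- One column: two rows come back, one per address.
select distinct address
from distinct_demo;

-- Two columns: DISTINCT now applies to the (name, address) pair.
-- All three pairs differ, so three rows come back and 'Beijing' appears twice.
select distinct name, address
from distinct_demo;
```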

Method 2: Deduplicate with the GROUP BY keyword

Like DISTINCT, GROUP BY is a common deduplication method supported by standard SQL. Unlike DISTINCT, it can return other columns while deduplicating.

Taking deduplication on address as the example again, the other columns can be obtained as needed with aggregate functions:

select min(id),
 max(name),
 max(age),
 address
from student 
group by address;

The result is as follows: [image: query result]

The statement above not only deduplicates on address, it also returns the id, name and age columns.

In this respect it is more convenient than DISTINCT.

However, look more closely and something seems wrong.

The student with id=1 should be called zhoujunting, but in the result above the name is yangxiaoyu. The age column has the same problem.

In other words, the id, name and age in one row of the result may not belong to the same student, because each aggregate function is evaluated independently over the group. This makes the data look confusing.
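The mismatch is easy to reproduce with two rows. The sketch below uses a hypothetical groupby_demo table with invented data; the single output row combines values taken from both students.

```sql
-- Hypothetical demo table and rows, for illustration only.
create table groupby_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into groupby_demo values (1, 'zoe', 18, 'Beijing');
insert into groupby_demo values (2, 'adam', 25, 'Beijing');

-- Returns one row: 1, 'zoe', 25, 'Beijing'.
-- id 1 and name 'zoe' come from the first student, age 25 from the second,
-- so the returned row matches no real student.
select min(id) as id,
 max(name) as name,
 max(age) as age,
 address
from groupby_demo
group by address;
```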

If the values in each row need to be consistent, use the third method below.

Method 3: Deduplicate with a window function

There are several window functions and they are used in similar ways, so only ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) is introduced here.

select
 id,name,age,address
from (
 select id,name,age,address,
        row_number() over(
         partition by address 
         order by id asc
        ) as rn
 from student
)a
where a.rn = 1;

The ROW_NUMBER() window function first splits the rows into groups by the PARTITION BY columns, then sorts each group by the ORDER BY columns, and finally numbers the rows in each group starting from 1. Filtering on rn = 1 therefore keeps exactly one row per address.
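To see the numbering itself, it can help to run the inner query without the rn = 1 filter. A small sketch, assuming a database with window function support and using a hypothetical rownum_demo table with made-up rows:

```sql
-- Hypothetical demo table and rows, for illustration only.
create table rownum_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into rownum_demo values (1, 'anna', 20, 'Beijing');
insert into rownum_demo values (2, 'ben', 22, 'Beijing');
insert into rownum_demo values (3, 'cara', 21, 'Shanghai');

-- The inner query on its own, before filtering on rn.
-- Numbering restarts at 1 for every address:
--   1, anna, Beijing,  rn = 1
--   2, ben,  Beijing,  rn = 2
--   3, cara, Shanghai, rn = 1
select id, name, address,
       row_number() over(
        partition by address
        order by id asc
       ) as rn
from rownum_demo
order by address, rn;
```

Changing order by id asc to order by id desc inside the window makes rn = 1 pick the row with the largest id for each address instead.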

The result of the SQL above is: [image: query result]

This result is much better: every row comes from a single student.

Note, however, that some databases do not support window functions, for example MySQL before version 8.0.

Method 4: Deduplicate with IN

The key to this approach is to find a property that identifies exactly one row in each group of duplicates, and then select the rows that have it.

For example: deduplicate by address, and if an address is duplicated, keep the row with the largest id.

select * 
from student
where id in (
 select max(id) 
 from student 
 group by address
);

The result is as follows: [image: query result]

Of course, you can also keep the row with the smallest id: just change max to min in the statement above.

This method suits tables that have a column whose values are never repeated (the id column in the SQL above).

If the table has no such column, this method does not apply. Some databases, however, provide a similar built-in column that can be used instead.

For example, in an Oracle database you can use ROWID in place of the id column in the SQL above. This form is of course limited to Oracle:

select * 
from student
where rowid in (
 select max(rowid) 
 from student 
 group by address
);
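The question at the start of the article was how to delete the duplicates, not only how to hide them in a query. Under the same assumption of a unique id column, the IN subquery can be turned into a DELETE. This is a sketch on a hypothetical delete_demo table with invented rows, not a statement from the original article; try it on a copy of the data first.

```sql
-- Hypothetical demo table and rows, for illustration only.
create table delete_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into delete_demo values (1, 'anna', 20, 'Beijing');
insert into delete_demo values (2, 'ben', 22, 'Beijing');
insert into delete_demo values (3, 'cara', 21, 'Shanghai');

-- Keep the row with the largest id per address and delete the rest.
-- The extra derived table (alias keep_rows) is there because MySQL rejects
-- a DELETE whose subquery selects directly from the table being deleted from.
delete from delete_demo
where id not in (
 select keep_id
 from (
  select max(id) as keep_id
  from delete_demo
  group by address
 ) keep_rows
);

-- Remaining rows: id 2 ('Beijing') and id 3 ('Shanghai').
select id, name, age, address
from delete_demo
order by id;
```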

Method 5: Deduplicate with NOT EXISTS

The idea is similar to Method 4, and NOT EXISTS achieves the same kind of deduplication.

select *
from student a
where not exists(
 select 1 
 from student b 
 where a.address = b.address 
 and a.id > b.id
);

The result is as follows: [image: query result]
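The NOT EXISTS query above returns a row only when no other row with the same address has a smaller id, so it keeps the smallest id per address. Reversing the comparison keeps the largest id instead. A sketch on a hypothetical exists_demo table with made-up rows:

```sql
-- Hypothetical demo table and rows, for illustration only.
create table exists_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into exists_demo values (1, 'anna', 20, 'Beijing');
insert into exists_demo values (2, 'ben', 22, 'Beijing');
insert into exists_demo values (3, 'cara', 21, 'Shanghai');

-- A row survives only if no row at the same address has a larger id.
-- Returns ids 2 and 3.
select a.id, a.name, a.address
from exists_demo a
where not exists(
 select 1
 from exists_demo b
 where a.address = b.address
 and a.id < b.id
)
order by a.id;
```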

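Method 6: Deduplicate with the ALL keyword

ALL is a quantified comparison operator: it compares a value against every value returned by a subquery. It is part of standard SQL and is available in MySQL, among other databases.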

select *
from student a
where a.id <= ALL(
 select b.id
 from student b
 where a.address = b.address
);

The result is as follows: [image: query result]

In the SQL above, the ALL operator requires a.id to be less than or equal to every value returned by the subquery in parentheses.

So the core idea of this method is similar to Method 4.
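Being less than or equal to every id at the same address is the same as being the minimum id at that address, so the ALL form can also be written with a scalar subquery. The sketch below shows both forms on a hypothetical all_demo table with invented rows, and assumes id is never NULL.

```sql
-- Hypothetical demo table and rows, for illustration only.
create table all_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into all_demo values (1, 'anna', 20, 'Beijing');
insert into all_demo values (2, 'ben', 22, 'Beijing');
insert into all_demo values (3, 'cara', 21, 'Shanghai');

-- Quantified comparison: keep the row whose id is <= every id at its address.
select a.id, a.name, a.address
from all_demo a
where a.id <= ALL(
 select b.id
 from all_demo b
 where a.address = b.address
)
order by a.id;

-- Equivalent form with a scalar subquery; returns the same rows (ids 1 and 3).
select a.id, a.name, a.address
from all_demo a
where a.id = (
 select min(b.id)
 from all_demo b
 where a.address = b.address
)
order by a.id;
```

The second form may be useful on databases that do not implement the ALL operator.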

Method 7: Deduplicate with INNER JOIN + GROUP BY

The core idea of this method is also similar to Method 4.

select
 a.*
from student a
inner join student b
on a.address = b.address
and a.id >= b.id
group by a.id,a.name,a.age,a.address
having count(*)=1;

The result is as follows: [image: query result]
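The join pairs every row with the rows at the same address whose id is not larger, so count(*) is 1 only for the row with the smallest id. A sketch that shows the count instead of filtering on it, using a hypothetical join_demo table with made-up rows and assuming id is unique:

```sql
-- Hypothetical demo table and rows, for illustration only.
create table join_demo (
 id int,
 name varchar(50),
 age int,
 address varchar(100)
);
insert into join_demo values (1, 'anna', 20, 'Beijing');
insert into join_demo values (2, 'ben', 22, 'Beijing');
insert into join_demo values (3, 'cara', 21, 'Shanghai');

-- cnt is the number of rows at the same address whose id is <= a.id:
--   id 1 -> cnt 1   (smallest id in 'Beijing')
--   id 2 -> cnt 2
--   id 3 -> cnt 1   (only row in 'Shanghai')
select a.id, a.address, count(*) as cnt
from join_demo a
inner join join_demo b
on a.address = b.address
and a.id >= b.id
group by a.id, a.address
order by a.id;
```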

With these 7 deduplication methods at hand, you can solve most data deduplication problems.

Of course, if you know a better way, feel free to leave a comment below.
