当前位置：网站首页>Pandas vs. SQL 1_ nanyangjx

Pandas vs. SQL 1_ nanyangjx

2022-06-26 08:40:00 【Algorithmic Pilgrim】

The present python It is widely used in data processing and quantitative analysis data mining , This article will focus on python in pandas Under bag DataFrame The use of this data type is described in detail , And its equivalent in data processing SQL The contrast in , Don't talk much , On and off ！！！
https://blog.csdn.net/weixin_39818658/article/details/79944925

keyword ：pandas、DataFrame、SQL

1、 About filtering data table columns

Yes DataFrame（ That is, a two-dimensional data table ）data：

Use

data=DataFrame(np.arange(9).reshape(3,3),columns=[‘a’,‘b’,‘c’])

Statement to generate the DataFrame

data :

a b c

0 0 1 2

1 3 4 5

2 6 7 8

If we only take data Of a Column Use a slicing operation directly （ notes ： The selection slice operation on the column is a two-layer bracket ）：

data[[‘a’]]

--------->

0 0

1 3

2 6

Corresponding SQL Statement for ：

select a from data

If we want to take a and b The two fields are also slicing operations ：

data[[‘a’,‘b’]]

------>

a b

0 0 1

1 3 4

2 6 7

Corresponding SQL Statement for ：

select a,b from data

First, let's create two here DataFrame：

Basic information data sheet of the company

companydata=DataFrame({'ID':['aa','bb','cc'],
 
                    'name':[' company A',' company B',' company C'],
 
                       'establish_year':['2001','2002','2003'],
 
                       'city':['ShangHai','BeiJing','ShenZhen']})

----->

ID city establish_year name

0 aa ShangHai 2001 company A

1 bb BeiJing 2002 company B

2 cc ShenZhen 2003 company C

List of products issued by the company ：

productdata=DataFrame({'ID':['aa','aa','bb','bb','cc'],
 
                      'productname':['stock_A','bond_A','stock_B','bond_B','stock_C']      
 
                            })

--------->

ID productname

0 aa stock_A

1 aa bond_A

2 bb stock_B

3 bb bond_B

4 cc stock_C

2、 Index of data 、 selection 、 Filter

（1） Single condition screening

First select ’ company A’ All the information about , The operation is as follows ：

companydata[companydata[‘name’]==‘ company A’]
---------------------------------------------------------------------------->
ID city establish_year name

0 aa ShangHai 2001 company A
amount to SQL：
select * from companydata where name=’ company A’

（2） Multi criteria screening （ notes ： The multi criteria filter criteria should be written in brackets ）

selection ’ company B’ also ’city’ by ’BeiJing’, The operation is as follows ：

companydata[(companydata[‘name’]‘ company B’)&(companydata[‘city’]‘BeiJing’)]

-------->

ID     city       establish_year     name

1 bb BeiJing 2002 company B

amount to SQL:

select * from companydata where name=’ company A’ and city=’BeiJing’

selection ’ID’ by ’aa’ perhaps ’cc’ All the information about , The operation is as follows ：

companydata[(companydata[‘ID’]‘aa’)|(companydata[‘ID’]‘cc’)]

------>

ID      city              establish_year     name

0 aa ShangHai 2001 company A

2 cc ShenZhen 2003 company C

amount to SQL:

select * from companydata where ID=’aa’ or ID=’cc’

perhaps

companydata[companydata[‘ID’].isin([‘aa’,‘cc’])]

------->

  ID      city          establish_year     name

0 aa ShangHai 2001 company A

2 cc ShenZhen 2003 company C

amount to SQL：

select * from companydata where ID in (‘aa’,’cc’)

（3） Sometimes we don't need all the fields after filtering, but only one or several fields ,

for example ：

View company A also ’ID’ by ’aa’ Of ’ Year of establishment ’ and ’ city ’

We need to apply it to DataFrame An index method of ix, The operation is as follows ：

companydata.ix[(companydata[‘name’]‘ company A’)&(companydata[‘ID’]‘aa’),[‘city’,‘establish_year’]]

---------->

     city              establish_year

0 ShangHai 2001

amount to SQL:

select city,establish_year from companydata where ID=’aa’ and name=’ company A’

3、 Label according to different data , That is to say SQL Medium case when … then … operation

for example ： When the city where the company is located ’city’ ,’ShangHai’ When is ’SH’,’BeiJing’ When is ’BJ’,’ShenZhen’ When is ’SZ’, The operation is as follows ：

fuc=lambda x:'SH' if x=='ShangHai' else 'BJ' if x=='BeiJing' else 'SZ' if x=='ShenZhen' else np.NaN
 
companydata['citylabel']=companydata['city'].apply(fuc)
 
print(companydata)

（ I can only think of such a way of writing about this , If there is a clever way , Please grant me your advice ）

------->

ID city establish_year name citylabel

0 aa ShangHai 2001 company A SH

1 bb BeiJing 2002 company B BJ

2 cc ShenZhen 2003 company C SZ

amount to SQL：

select
 
    ID,
 
    city,
 
    establish_year,
 
    name,
 
    case ‘city’ when ‘'ShangHai' ’  then ‘SH’
 
    when 'BeiJing' then ‘BJ’
 
    when ‘'ShenZhen' ’ then ‘SZ’ else null 
 
    end as citylabel
 
    from companydata

4、 Aggregation of data

example ：

surface productdata Yes ID Grouping , And calculate each ID There are several financial products

productdata[productname].groupby(productdata[‘ID’]).count()

------>

aa 2

bb 2

cc 1

Name: ID, dtype: int64

amount to SQL：

select count(productname) from productdata group by ID

productdata[productname].groupby([productdata[‘ID’],productdata[‘productname’]]).count()

----------->

ID productname

aa bond_A 1

 stock_A                1

bb bond_B 1

 stock_B                1

cc stock_C 1

Name: ID, dtype: int64

amount to SQL：

select count(productname) from productdata group by ID,‘productname’

5、 Multi table association operation

In general, data processing is not a single table operation , Generally, multiple tables are associated .

example :

Make statistics on what financial products each company has , The operation is as follows ：

pd.merge(companydata,productdata,left_on=‘ID’,right_on=‘ID’,how=‘left’)[[‘ID’,‘name’,‘productname’]]

------>

ID       name   productname

0 aa company A stock_A

1 aa company A bond_A

2 bb company B stock_B

3 bb company B bond_B

4 cc company C stock_C

amount to SQL：

Select A.ID,A.name,B.productname from companydata A left join productdata on A.ID=B.ID

Here is a brief description of the method merge Parameters of

（1） The first two parameters that must be passed in are two DataFrame , I won't elaborate here

（2）left_on Is the left table as the associated foreign key

（3）right_on Is the right table as the associated foreign key

（4）how Is the way of association , There were left\right\inner\outer, Corresponding to SQL Medium left\right\inner\full Four kinds of join The way

原网站

版权声明
本文为[Algorithmic Pilgrim]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/02/202202170555218774.html

当前位置：网站首页>Pandas vs. SQL 1_ nanyangjx

Pandas vs. SQL 1_ nanyangjx

边栏推荐

猜你喜欢

随机推荐