How to customize sorting for pandas dataframe
2020-11-06 01:28:00 【Artificial intelligence meets pioneer】
Author: B. Chen | Compiled by: VK | Source: Towards Data Science

The Pandas DataFrame has a built-in method, sort_values(), that sorts values by a given variable. The method itself is fairly straightforward to use, but it does not handle custom sort orders, for example:
- T-shirt sizes: XS, S, M, L, and XL
- Months: January, February, March, April, etc.
- Days of the week: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday
In this article, we will learn how to apply a custom sort to a Pandas DataFrame.
Please check out my GitHub repo for the source code: https://github.com/BindiChen/machine-learning/blob/master/data-analysis/017-pandas-custom-sort/pandas-custom-sort.ipynb
Problem
Suppose we have a dataset about a clothing store:
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

As we can see, each item of clothing has a size value, and the data should be sorted in the following order:
- XS for extra small
- S for small
- M for medium
- L for large
- XL for extra large
However, calling sort_values('size') returns the rows in alphabetical order of size: L, M, S, S, XL, XS.

The output is not what we want, but it is technically correct. Under the hood, sort_values() sorts numerical data in numerical order and object (string) data in alphabetical order.
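To make the default behaviour concrete, here is a minimal sketch that rebuilds the example DataFrame and sorts it by size. The variable name alphabetical is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# The size column holds plain strings, so pandas compares them
# character by character rather than by clothing size.
alphabetical = df.sort_values('size')['size'].tolist()
print(alphabetical)  # ['L', 'M', 'S', 'S', 'XL', 'XS']
```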
Here are two common solutions:
- Create a new column for a custom sort
- Use CategoricalDtype to cast data to an ordered category type
Create a new column for a custom sort
In this solution, a mapping DataFrame is needed to represent the custom sort order. We then create a new column from the mapping and, finally, sort the data by that new column. Let's walk through an example.
First, let's create a mapping DataFrame to represent the custom sort order.
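df_mapping = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
})
# reset_index() exposes the row position as an 'index' column;
# set_index('size') makes the size label the lookup key.
sort_mapping = df_mapping.reset_index().set_index('size')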
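As a quick check of what the mapping holds, the sketch below rebuilds sort_mapping and converts its 'index' column to a plain dictionary. The name lookup is illustrative and does not appear in the original notebook.

```python
import pandas as pd

df_mapping = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
})
sort_mapping = df_mapping.reset_index().set_index('size')

# Each size label now points to its position in the custom order.
lookup = sort_mapping['index'].to_dict()
print(lookup)  # {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}
```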

Next, use sort_mapping to create a new column, size_num, that holds the mapped values.
df['size_num'] = df['size'].map(sort_mapping['index'])
Finally, sort the values by the new column size_num.
df.sort_values('size_num')
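The same idea can be sketched end to end with a plain dictionary in place of the mapping DataFrame. This is a variation on the steps above rather than the code from the original notebook, and the names size_order and result are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Hypothetical helper: size label -> rank in the custom order.
size_order = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}

# Helper column holding the numeric rank of each size.
df['size_num'] = df['size'].map(size_order)

# mergesort is stable, so the two 'S' rows keep their original relative order.
result = df.sort_values('size_num', kind='mergesort')
print(result['size'].tolist())      # ['XS', 'S', 'S', 'M', 'L', 'XL']
print(result['cloth_id'].tolist())  # [1004, 1001, 1006, 1003, 1005, 1002]
```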

This certainly does the job. However, it creates an extra column, which can be inefficient when dealing with a large dataset.
We can solve this problem more efficiently using CategoricalDtype.
Use CategoricalDtype to cast data to an ordered category type
CategoricalDtype is a type for categorical data with categories and orderedness [1]. It is very useful for creating a custom sort [2]. Let's walk through an example.
First, let's import CategoricalDtype.
from pandas.api.types import CategoricalDtype
Then, create a custom category type cat_size_order with:
- The first argument set to ['XS', 'S', 'M', 'L', 'XL'], the unique values of size.
- The second argument ordered=True, so that the variable is treated as ordered.
cat_size_order = CategoricalDtype(
    ['XS', 'S', 'M', 'L', 'XL'],
    ordered=True
)
After that, call astype(cat_size_order) to cast the size data to the custom category type. By running df['size'], we can see that the size column has been cast to a category type with the order [XS < S < M < L < XL].
>>> df['size'] = df['size'].astype(cat_size_order)
>>> df['size']
0     S
1    XL
2     M
3    XS
4     L
5     S
Name: size, dtype: category
Categories (5, object): [XS < S < M < L < XL]
Finally, we can call the same method to sort the values.
df.sort_values('size')
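Putting the steps together, a self-contained sketch of the CategoricalDtype approach might look like this. The names sorted_sizes and unknown are only for illustration. One caveat worth knowing: any value that is not listed in the categories becomes NaN when cast, so the category list should cover every value in the column.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

cat_size_order = CategoricalDtype(
    ['XS', 'S', 'M', 'L', 'XL'],
    ordered=True
)
df['size'] = df['size'].astype(cat_size_order)

# Sorting now follows the category order rather than the alphabet.
sorted_sizes = df.sort_values('size')['size'].tolist()
print(sorted_sizes)  # ['XS', 'S', 'S', 'M', 'L', 'XL']

# 'XXL' is not among the categories, so it is cast to NaN.
unknown = pd.Series(['S', 'XXL']).astype(cat_size_order)
print(unknown.isna().tolist())  # [False, True]
```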

It works much better. Let's look at how this works under the hood.
Access codes using the cat accessor
Now that the size column has been cast to a category type, we can use the .cat accessor to view its categorical properties. Under the hood, it uses the codes attribute to represent the position of each value in the ordered variable.
Let's create a new column, codes, so that we can compare the size and code values side by side.
df['codes'] = df['size'].cat.codes
df

We can see that XS, S, M, L, and XL have the codes 0, 1, 2, 3, and 4, respectively. codes holds the actual values of the category. By running df.info(), we can see that it is actually int8.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   cloth_id  6 non-null      int64
 1   size      6 non-null      category
 2   codes     6 non-null      int8
dtypes: category(1), int64(1), int8(1)
memory usage: 388.0 bytes
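The relationship between categories and codes can be checked directly. In this sketch, the names sizes and code_to_label are illustrative and not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(
    ['XS', 'S', 'M', 'L', 'XL'],
    ordered=True
)
sizes = pd.Series(['S', 'XL', 'M', 'XS', 'L', 'S'], name='size').astype(cat_size_order)

# Each code is the position of the value in the category list.
print(sizes.cat.codes.tolist())  # [1, 4, 2, 0, 3, 1]
print(sizes.cat.codes.dtype)     # int8

# Map each code back to its label.
code_to_label = dict(enumerate(sizes.cat.categories))
print(code_to_label)  # {0: 'XS', 1: 'S', 2: 'M', 3: 'L', 4: 'XL'}
```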
Sort by multiple variables
Next, let's make things a little more complicated. Here, we will sort the DataFrame by multiple variables.
df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})
Similarly, let's create two custom category types, cat_day_of_week and cat_month, and pass them to astype().
cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    ordered=True
)
cat_month = CategoricalDtype(
    ['Jan', 'Feb', 'Mar', 'Apr'],
    ordered=True,
)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)
To sort by multiple variables, we just need to pass a list of column names to sort_values() instead of a single name. For example, sort by month and day_of_week.
df.sort_values(['month', 'day_of_week'])

Sort by customer_id, month, and day_of_week.
df.sort_values(['customer_id', 'month', 'day_of_week'])
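Both sorts can be verified with a self-contained sketch. The names by_month_day and by_customer are introduced here for illustration; the data and category types are the ones defined above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})

cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    ordered=True
)
cat_month = CategoricalDtype(['Jan', 'Feb', 'Mar', 'Apr'], ordered=True)

df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)

# Month first, then day of week within each month.
by_month_day = df.sort_values(['month', 'day_of_week'])
print(by_month_day['order_id'].tolist())
# [1006, 1002, 1003, 1001, 1004, 1007, 1005]

# Customer first, then month, then day of week.
by_customer = df.sort_values(['customer_id', 'month', 'day_of_week'])
print(by_customer['order_id'].tolist())
# [1006, 1001, 1007, 1005, 1002, 1003, 1004]
```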

That's it. Thanks for reading.
Please check out the notebook on my GitHub for the source code: https://github.com/BindiChen/machine-learning/blob/master/data-analysis/017-pandas-custom-sort/pandas-custom-sort.ipynb
References
- [1] pandas.CategoricalDtype API (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalDtype.html)
- [2] pandas user guide, Categorical data: CategoricalDtype (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-categoricaldtype)
Link to the original article: https://towardsdatascience.com/how-to-do-a-custom-sort-on-pandas-dataframe-ac18e7ea5320
Copyright notice
This article was created by [Artificial intelligence meets pioneer]. Please include a link to the original when reposting. Thank you.