Using pycaret for data mining: association rule mining
2022-07-23 12:19:00 【-Send gods-】
Association rules are used to discover interesting relationships between variables in large databases. The goal is to identify strong rules found in the data using measures of interestingness. By mining association rules we can discover regularities among products in the large-scale transaction data of supermarkets or e-commerce sites. The most famous case is the story of "beer and diapers": in American supermarkets in the 1990s, managers analyzing sales data found a puzzling phenomenon: under certain circumstances, "beer" and "diapers", two seemingly unrelated items, often appeared in the same basket. Similarly, supermarket sales data may show that customers who buy bread and milk together also tend to buy butter. Such information can help businesses make decisions about marketing activities.
Let's first explain some basic theory behind association rules. We define the item set

I = {i_1, i_2, ..., i_n}

where I is the set of all items and i_1, i_2, ..., i_n are the 1st, 2nd, ..., nth items. We define the database

D = {t_1, t_2, ..., t_m}

where D is the set of all transaction (order) records in the database and t_1, t_2, ..., t_m are the 1st, 2nd, ..., mth order records. Each transaction in D has a unique transaction ID and contains a subset of the item set I.
An association rule is an if-then combination of strongly related items that appears frequently in the database; such a combination is called a frequent itemset. Finding all frequent itemsets in a database is not easy, because it involves traversing all the data to examine every possible combination of items. The collection of possible itemsets is the power set of I, whose size is 2^n - 1 (excluding the empty set, which is not considered a valid itemset), so it grows exponentially with the number of items n. Fortunately, the downward-closure property of support (also known as anti-monotonicity) makes an efficient search possible: it guarantees that all subsets of a frequent itemset are also frequent, and therefore that no itemset containing an infrequent subset can itself be frequent. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent itemsets.

For example, the power set of the itemset {a, b, c} is the set of all its subsets (including the empty set) and contains 2^3 = 8 subsets; the number of valid (non-empty) itemsets is 2^3 - 1 = 7 (excluding the empty set).
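As a quick sanity check, the non-empty subsets of {a, b, c} can be enumerated in a few lines of Python (a minimal illustration):

from itertools import combinations

items = ['a', 'b', 'c']
# all non-empty subsets: choose k items for every k from 1 to n
subsets = [set(c) for k in range(1, len(items) + 1)
           for c in combinations(items, k)]
print(len(subsets))  # 7 == 2**3 - 1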
An association rule is defined as an implication of the form

X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅

Here X and Y are both subsets of I, and the rule can be read as: if a user buys the one or more products in X, they will also buy the one or more products in Y. This is the general definition of an association rule. In practical applications, however, a rule is usually defined between a subset of products and a single product:

X ⇒ i_j, where X ⊆ I and i_j ∈ I \ X

This rule can be read as: if a customer buys the one or more goods in X, they will also buy the single good i_j. The symbol ⇒ means "if ... then".
We use evaluation metrics such as support, confidence, lift and conviction to measure how strong the association within a frequent itemset is.
support indicates the probability that the two itemsets appear together in the data set:

support(X ⇒ Y) = P(X ∪ Y) = |{t ∈ D : (X ∪ Y) ⊆ t}| / |D|

That is, the frequency of the item combination in the given data set; it can be understood as, for example, the proportion of all orders that contain both bread and milk.
confidence expresses a conditional probability:

confidence(X ⇒ Y) = P(Y | X)

It can be understood as the proportion of orders containing bread that also contain milk. Using the formula for conditional probability, the confidence can be computed as

confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
lift compares the expected confidence with the actual confidence. It is the ratio of the probability of containing Y given X to the unconditional probability of containing Y; in other words, it measures how much the presence of X raises the "appearance rate" of Y, P(Y | X), above Y's baseline probability P(Y):

lift(X ⇒ Y) = P(Y | X) / P(Y) = confidence(X ⇒ Y) / support(Y)

If lift = 1, X and Y are independent of each other and X does not raise the probability of Y occurring; the larger the value (lift > 1), the more X lifts Y, and the stronger the association.
conviction measures how often the association rule makes an incorrect prediction: it reflects the probability that X appears while Y does not:

conviction(X ⇒ Y) = (1 - support(Y)) / (1 - confidence(X ⇒ Y))
Let's take a simple example to deepen our understanding of the support, confidence, lift and conviction metrics. The database below contains 5 transaction records covering 7 items (1 means the order contains the item):

| Order | milk | bread | butter | beer | diapers | eggs | fruit |
|-------|------|-------|--------|------|---------|------|-------|
| 1     | 1    | 1     | 0      | 0    | 0       | 0    | 1     |
| 2     | 0    | 0     | 1      | 0    | 0       | 1    | 1     |
| 3     | 0    | 0     | 0      | 1    | 1       | 0    | 0     |
| 4     | 1    | 1     | 1      | 0    | 0       | 1    | 1     |
| 5     | 0    | 1     | 0      | 0    | 0       | 0    | 0     |
Let's first compute the support of {beer, diapers}:

supp({beer, diapers}) = 1/5 = 0.2: of all 5 orders, only one order (order 3) contains both beer and diapers.

supp({milk, bread, butter}) = 1/5 = 0.2: of all 5 orders, only one order (order 4) contains milk, bread and butter together.
conf({butter, bread} ⇒ {milk}) = P(milk | butter, bread) = P(milk ∩ butter ∩ bread) / P(butter ∩ bread) = 1. This means that every order containing butter and bread also contains milk: from the table we can see that only one order (order 4) contains butter and bread, and that same order also contains milk, so P(milk ∩ butter ∩ bread) / P(butter ∩ bread) = (1/5) / (1/5) = 1.
conf({fruit} ⇒ {eggs}) = P(eggs | fruit) = P(eggs ∩ fruit) / P(fruit) = 2/3 ≈ 0.67. Three orders contain fruit, and of those three orders two also contain eggs, so the probability is 2/3.
lift({milk, bread} ⇒ {butter}) = P(butter | milk, bread) / P(butter) = 0.5 / 0.4 = 1.25. Here lift > 1, which means buying bread and milk promotes buying butter. As noted above: if lift = 1, X and Y are independent and X does not raise the probability of Y occurring; the larger the value (lift > 1), the more X lifts Y, and the stronger the association.
conv({milk, bread} ⇒ {butter}) = (1 - supp(butter)) / (1 - conf({milk, bread} ⇒ {butter})) = (1 - 0.4) / (1 - 0.5) = 0.6 / 0.5 = 1.2

This means that if {milk, bread} and butter were independent, the rule would make incorrect predictions (orders containing milk and bread but not butter) 20% more often than is actually observed.
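As a quick cross-check of these numbers, here is a minimal pure-Python sketch (the transaction table above is hard-coded) that computes the four metrics for a rule X ⇒ Y:

# Minimal sketch: support, confidence, lift and conviction for a rule X => Y,
# evaluated over the 5-transaction example table above.
transactions = [
    {'milk', 'bread', 'fruit'},
    {'butter', 'eggs', 'fruit'},
    {'beer', 'diapers'},
    {'milk', 'bread', 'butter', 'eggs', 'fruit'},
    {'bread'},
]

def support(itemset):
    # fraction of transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def metrics(X, Y):
    conf = support(X | Y) / support(X)
    lift = conf / support(Y)
    conv = (1 - support(Y)) / (1 - conf) if conf < 1 else float('inf')
    return support(X | Y), conf, lift, conv

print(metrics({'milk', 'bread'}, {'butter'}))
# expected output: (0.2, 0.5, 1.25, 1.2)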
Apriori Algorithm
Apriori was proposed by R. Agrawal and R. Srikant in 1994 for frequent itemset mining and association rule learning. It identifies the individual items that appear frequently in the database and extends them to larger and larger itemsets, as long as those itemsets appear sufficiently often. The algorithm is named Apriori because it uses prior (a priori) knowledge of the properties of frequent itemsets.

The Apriori algorithm uses a "bottom-up" approach, in which frequent subsets are extended one item at a time (a step called candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently. It generates candidate itemsets of length k from the frequent itemsets of length k - 1, then prunes any candidate that has an infrequent sub-pattern. By the downward-closure lemma, the candidate set contains all frequent itemsets of length k. Finally, it scans the database to determine which of the candidates are actually frequent.
Below is a sample from a cancer data set in which each letter represents a specific mutation; we want to search these combinations for frequent itemsets:

[Figure: sample transactions of mutation combinations]
First we compute the support (count) of each single letter and prune itemsets using a minimum support threshold of 3:

[Figure: support counts of the single letters]
Since every letter has support greater than or equal to 3, nothing is pruned. The frequent itemsets are {a}, {b}, {c} and {d}. Next, from {a}, {b}, {c} and {d} we compute the support of pairs (every combination of two letters):

[Figure: support counts of the letter pairs]
Now we set the minimum support threshold for pairs to 4, so after pruning only {a, d} and {c, d} remain. Then from {a, d} and {c, d} we compute the support of triples (combinations of three letters):

[Figure: support counts of the letter triples]
Since only one triple remains at the end, the next round of candidate quadruples is empty and the algorithm stops. The frequent itemsets finally retained are: {a}, {b}, {c}, {d}, {a, d}, {c, d} and {a, c, d}.
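To make the level-wise procedure concrete, here is a minimal pure-Python sketch of the Apriori loop (a simplified illustration, not PyCaret's implementation: it omits the hash-tree optimization and assumes the transactions fit in memory):

from itertools import combinations

def apriori(transactions, min_support):
    # Return all frequent itemsets (as frozensets) mapped to their counts.
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]  # level 1: single items
    frequent = {}
    k = 1
    while current:
        # count the support of each candidate by scanning the database
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        pairs = combinations(list(survivors), 2)
        candidates = {a | b for a, b in pairs if len(a | b) == k + 1}
        # pruning: keep only candidates whose k-subsets are all frequent
        current = [c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, k))]
        k += 1
    return frequent

For example, running apriori on the five-transaction table above with min_support=2 keeps {milk, bread} (support 2), while {beer, diapers} (support 1) is never even generated as a candidate, because {beer} is already infrequent at level 1.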
Using pycaret for association rule mining
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that can multiply your productivity. Compared with other open-source machine learning libraries, PyCaret is a low-code framework that can replace hundreds of lines of code with just a few, making development exponentially faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray and so on. Today we will look at a simple association rule mining example to see how powerful pycaret is!
Mining association rules with pycaret generally involves the following steps:
- Getting Data: import simulated data from PyCaret's repository, or read in external real data
- Setting up Environment: set up the algorithm environment in PyCaret and prepare to run association rule mining
- Create Model: create the association rule mining model
- Plot Model: analyze the model's results with various plots
Data
We will use a small sample from the UCI repository called the Online Retail data set. This is a transactional data set containing the transactions that occurred between December 1, 2010 and December 9, 2011. A brief description of the data fields follows:
- InvoiceNo: invoice number, a unique 6-digit integer; if the code starts with the letter "c", the transaction was cancelled.
- StockCode: product code, a unique 5-digit integer for each product.
- Description: description of the product.
- Quantity: the quantity of each product per transaction.
- InvoiceDate: invoice date and time, when each transaction was generated.
- UnitPrice: unit price of the product, in pounds sterling.
- CustomerID: customer number, a unique 5-digit integer assigned to each customer.
- Country: country name, the country where each customer resides.
1.Getting Data
Since this retail data set is built into pycaret, we can fetch it with the get_data method pycaret provides:
from pycaret.datasets import get_data
data = get_data('france')
print('Number of unique invoices:', data.InvoiceNo.nunique())
print('Number of unique products:', data.Description.nunique())
We find that the data set contains 461 unique invoices and 1565 unique products.
2.Setting up Environment
Here we use the setup() function, which initializes the environment in PyCaret and transforms the data set into a shape the Apriori algorithm can accept. setup() takes three mandatory parameters: a pandas dataframe; transaction_id, the name of the column holding the transaction id (in this case InvoiceNo), which will be used to pivot the matrix; and item_id, the name of the column used to create the rules (in this case Description). Usually this is the variable of interest. You can also pass the optional parameter ignore_items to ignore certain values of item_id.
from pycaret.arules import *
exp = setup(data=data, transaction_id='InvoiceNo', item_id='Description')
Once setup() executes successfully, it prints an information grid containing a few important pieces of information:
- Transactions: the number of unique transactions in the data set; in this case, the number of unique InvoiceNo values.
- Items: the number of unique items in the data set.
- Ignore Items: the items ignored during rule mining. Often some frequent itemset combinations are meaningless; for example, many transaction data sets include postage, and many other goods may be strongly associated with it, so to ignore such meaningless frequent itemsets we can use the ignore_items parameter of setup(). We will run setup() again later and ignore some specific meaningless items.
The Transactions and Items counts computed here match the statistics we obtained earlier.
3.Create a Model
Creating an association rule model is simple: create_model() requires no mandatory parameters. It has 4 optional parameters, as shown below:
- metric: the evaluation metric: 'confidence', 'support', 'lift', 'leverage' or 'conviction'; the default is 'confidence'
- threshold: the minimum threshold for the evaluation metric; the default is 0.5
- min_support: a float between 0 and 1, the minimum support for the returned itemsets; the default is 0.05
- round: the number of decimal places to which the scores in the results are rounded
Next we create an association rule model with the default parameters; it will mine all the frequent itemset rules (of the form X ⇒ Y):
model1 = create_model()
print('Number of frequent itemsets:', len(model1))
model1.head()
Here the association rule model mines 141 frequent itemset rules (X ⇒ Y) and computes the 'confidence', 'support', 'lift', 'leverage' and 'conviction' metrics for each.
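The optional parameters described above can be passed explicitly when the defaults are not appropriate. As an illustrative sketch (these particular parameter values are assumptions, not from the original run), a model filtered by lift might be created like this:

# Sketch: mine rules ranked by lift instead of confidence (illustrative values)
model_lift = create_model(metric='lift',     # evaluation metric to filter on
                          threshold=1.2,     # keep rules with lift >= 1.2
                          min_support=0.05)  # minimum support of the itemsets
model_lift.sort_values(by='lift', ascending=False).head()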
4.Setup with ignore_items
We notice that POSTAGE appears in many of the antecedents and consequents of the frequent itemsets above. POSTAGE means shipping cost, so most frequent itemsets that combine other goods with it are meaningless, and we want to remove it. We therefore call setup() again, this time using the ignore_items parameter to exclude POSTAGE.
exp = setup(data=data,
            transaction_id='InvoiceNo',
            item_id='Description',
            ignore_items=['POSTAGE'])
model2 = create_model()
print('Number of frequent itemsets:', len(model2))
model2.head() 
After excluding POSTAGE, the association rule model mines 45 frequent itemsets.
5.Plot Model
Next, we use 2D and 3D visualizations to show how the evaluation scores of each group of frequent itemsets are distributed in the plane and in space:
plot_model(model2)
plot_model(model2, plot='3d')
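For reference, pycaret's association rule module builds on mlxtend under the hood, so an equivalent mining pass can be sketched directly with mlxtend (an illustrative sketch assuming mlxtend is installed; the column names are those of the data set above):

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# pivot the transactions into a one-hot basket matrix: one row per invoice
basket = (data.groupby(['InvoiceNo', 'Description'])['Quantity']
              .sum().unstack().fillna(0) > 0)
itemsets = apriori(basket, min_support=0.05, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.5)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())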