Advertising effect cluster analysis (KMeans)
2022-06-25 15:12:00 【A window full of stars and milky way】
I did a project a while ago for a client in the education industry. Their main means of promotion is placing advertisements on various channels and using the ads to drive users to their website.
There are many advertising channels, however, and some perform well while others do not, so a targeted analysis of advertising effectiveness is needed to measure and optimize the campaigns. This reminded me of the KMeans clustering method I had learned earlier, so I am organizing the method and ideas here for future reference.
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer  # converts dicts of categorical features to vectors
from sklearn.preprocessing import MinMaxScaler  # Min-Max scaling
from sklearn.cluster import KMeans  # KMeans clustering
from sklearn import metrics  # model evaluation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context='notebook',font='simhei',style='whitegrid')
%matplotlib inline
data = pd.read_csv('./ad_data.txt',delimiter='\t')
data.head(3)
| | Channel code | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time | Material type | Type of advertisement | Way of cooperation | Advertising size | Advertising selling points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A203 | 3.69 | 0.0071 | 0.0214 | 2.3071 | 419.77 | 0.0258 | 20.0 | jpg | banner | roi | 140*40 | Discount |
| 1 | A387 | 178.70 | 0.0040 | 0.0324 | 2.0489 | 157.94 | 0.0030 | 19.0 | jpg | banner | cpc | 140*40 | Full reduction |
| 2 | A388 | 91.77 | 0.0022 | 0.0530 | 1.8771 | 357.93 | 0.0026 | 4.0 | jpg | banner | cpc | 140*40 | Full reduction |
Data review
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 13 columns):
Channel code 889 non-null object
average per day UV 889 non-null float64
Average registration rate 889 non-null float64
Average search volume 889 non-null float64
Depth of visit 889 non-null float64
Average residence time 887 non-null float64
Order conversion rate 889 non-null float64
Total launch time 889 non-null float64
Material type 889 non-null object
Type of advertisement 889 non-null object
Way of cooperation 889 non-null object
Advertising size 889 non-null object
Advertising selling points 889 non-null object
dtypes: float64(7), object(6)
memory usage: 90.4+ KB
From the output above we can see each field's data type, and that the field "Average residence time" has two missing values.
With this many fields, missing values are not easy to spot by eye; they can be counted like this:
# Show the number of missing values in each field as a one-row table
pd.DataFrame(data.isnull().sum(),columns=["num"]).T
| | Channel code | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time | Material type | Type of advertisement | Way of cooperation | Advertising size | Advertising selling points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| num | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Replace the missing values with the field mean
data_2 = data.fillna({"Average residence time": data["Average residence time"].mean()})  # only this column has NaNs
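A quick check that no missing values remain after the fill; a minimal sketch:
# all missing values should now be gone
assert data_2.isnull().sum().sum() == 0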
# Descriptive statistics
data_2.describe().round(3)
| | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time |
|---|---|---|---|---|---|---|---|
| count | 889.000 | 889.000 | 889.000 | 889.000 | 889.000 | 889.000 | 889.000 |
| mean | 540.847 | 0.001 | 0.030 | 2.167 | 262.669 | 0.003 | 16.053 |
| std | 1634.410 | 0.003 | 0.106 | 3.801 | 224.112 | 0.012 | 8.509 |
| min | 0.060 | 0.000 | 0.000 | 1.000 | 1.640 | 0.000 | 1.000 |
| 25% | 6.180 | 0.000 | 0.001 | 1.392 | 126.200 | 0.000 | 9.000 |
| 50% | 114.180 | 0.000 | 0.003 | 1.793 | 236.660 | 0.000 | 16.000 |
| 75% | 466.870 | 0.001 | 0.012 | 2.216 | 357.930 | 0.002 | 24.000 |
| max | 25294.770 | 0.039 | 1.037 | 98.980 | 4450.830 | 0.216 | 30.000 |
From the descriptive statistics we can see:
- The UV data fluctuates a lot, which shows that the differences between channels are pronounced. A large spread is not necessarily an outlier: advertising traffic is bursty by nature, so these values are generally not treated as outliers (a quick quantile check follows this list).
- Several statistics of the average registration rate, average search volume, and order conversion rate are 0, but since the maxima themselves are tiny, the values are simply small in absolute terms, which matches reality, so this is normal.
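To confirm that the wide UV spread is a long tail rather than bad records, a quick look at the upper quantiles helps. A minimal sketch, assuming the column name matches the translated header shown above:
# inspect the upper tail of daily UV (column name assumed from the table above)
data_2["average per day UV"].quantile([0.5, 0.9, 0.99, 1.0])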
data_2.corr().round(2)
| | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time |
|---|---|---|---|---|---|---|---|
| average per day UV | 1.00 | -0.05 | -0.07 | -0.02 | 0.04 | -0.05 | -0.04 |
| Average registration rate | -0.05 | 1.00 | 0.24 | 0.11 | 0.22 | 0.32 | -0.01 |
| Average search volume | -0.07 | 0.24 | 1.00 | 0.06 | 0.17 | 0.13 | -0.03 |
| Depth of visit | -0.02 | 0.11 | 0.06 | 1.00 | 0.72 | 0.16 | 0.06 |
| Average residence time | 0.04 | 0.22 | 0.17 | 0.72 | 1.00 | 0.25 | 0.05 |
| Order conversion rate | -0.05 | 0.32 | 0.13 | 0.16 | 0.25 | 1.00 | -0.00 |
| Total launch time | -0.04 | -0.01 | -0.03 | 0.06 | 0.05 | -0.00 | 1.00 |
# Plot pairwise distributions with regression fits
sns.pairplot(data_2,kind='reg')

From the correlation analysis above, only average residence time and depth of visit are noticeably correlated (0.72), and even that relationship is not especially strong; the correlations between the other features are not prominent.
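A heatmap makes the same matrix easier to scan at a glance. A minimal sketch, assuming a pandas version (>= 1.5) that supports numeric_only:
# visualize the correlation matrix (numeric_only assumes pandas >= 1.5)
sns.heatmap(data_2.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()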
Data preprocessing
# Convert the string categories to integers (discretization)
cols = ["Material type", "Type of advertisement", "Way of cooperation", "Advertising size", "Advertising selling points"]
convert_matrix = data_2[cols]
lines = convert_matrix.shape[0]
dict_list = []  # list of dicts mapping column names to integer codes
unique_list = []  # one list of unique values per column
for col_name in cols:
    col_unique_value = data_2[col_name].unique().tolist()  # unique values of this column, in order of appearance
    unique_list.append(col_unique_value)
for line_index in range(lines):
    each_record = convert_matrix.iloc[line_index]  # one row, as a Series
    for each_index, each_data in enumerate(each_record):
        # replace each string value with its index in that column's unique-value list
        list_value = unique_list[each_index]
        each_record.iloc[each_index] = list_value.index(each_data)
    each_dict = dict(zip(cols, each_record))
    dict_list.append(each_dict)
model_transform = DictVectorizer(sparse=False, dtype=np.int64)  # sparse=False returns a dense array directly
data_dicvec = model_transform.fit_transform(dict_list)
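As a cross-check, pandas.factorize produces the same order-of-appearance integer codes in far fewer lines. A minimal sketch; note that DictVectorizer orders its output columns by sorted feature name, so the column order here (which follows cols) may differ:
# equivalent integer coding via pandas.factorize; column order follows cols
codes = np.column_stack([pd.factorize(data_2[c])[0] for c in cols])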
As the data shows, these fields sit on very different orders of magnitude: UV runs into the tens of thousands while the conversion rates are below 1, so the data needs to be normalized. Here we use Min-Max scaling, which maps each feature to [0, 1] via x' = (x - min) / (max - min).
# Min-Max scaling
scaler_matrix = data_2.iloc[:, 1:8]  # the 7 numeric columns
minmax_scaler = MinMaxScaler()
data_scaler = minmax_scaler.fit_transform(scaler_matrix)
# Merge the scaled numeric data with the encoded categorical data
data3 = np.hstack((data_scaler, data_dicvec))  # horizontal merge
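A quick sanity check on the merged matrix: with 7 scaled numeric fields and 5 encoded categorical fields we expect 12 columns, and the scaled block should span [0, 1]. A minimal sketch:
# expect (889, 12): 7 scaled numeric + 5 integer-coded categorical columns
print(data3.shape)
print(data_scaler.min(), data_scaler.max())  # should print 0.0 and 1.0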
Clustering
The key point of the KMeans algorithm is choosing K. As an unsupervised method, KMeans has no single "best" K value; in terms of the data, a good K minimizes the distance within clusters while maximizing the distance between clusters. Metrics such as the average silhouette coefficient, or the ratio of intra-cluster to inter-cluster distance, can be used to evaluate a K value. For a single sample, the silhouette is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. Here we enumerate candidate K values, compute the average silhouette coefficient for each, and choose the K with the largest coefficient.
score_list = []  # stores (K, average silhouette) pairs
score_init = -1  # initial silhouette (the coefficient ranges over [-1, 1])
for n_k in range(2, 11):
    model_kmeans = KMeans(n_clusters=n_k, random_state=0)  # build the model
    cluster_tmp = model_kmeans.fit_predict(data3)  # fit and get cluster labels
    score_tmp = metrics.silhouette_score(data3, cluster_tmp)  # average silhouette for this K
    if score_tmp > score_init:  # keep the best K seen so far
        good_k = n_k  # store the K value
        score_init = score_tmp  # store the silhouette for the next comparison
        good_model = model_kmeans  # store the model
        good_cluster = cluster_tmp  # store the cluster labels
    score_list.append([n_k, score_tmp])
print(score_list)
print('Best K is:{0} with average silhouette of {1}'.format(good_k, score_init.round(4)))
[[2, 0.4669282108253203], [3, 0.5490464644387694], [4, 0.5696854692292723], [5, 0.481866036548318], [6, 0.45477666842362924], [7, 0.4820426124661439], [8, 0.5044722277929435], [9, 0.5269749291473864], [10, 0.5433876151990182]]
Best K is:4 with average silhouette of 0.5697
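Plotting the scores makes the peak easy to see; a minimal sketch:
# average silhouette versus K
ks, scores = zip(*score_list)
plt.plot(ks, scores, 'o-')
plt.xlabel('K')
plt.ylabel('average silhouette')
plt.show()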
As the output shows, the silhouette coefficient is largest at K=4, so we choose 4 as the best K value.
cluster_labels = pd.DataFrame(good_cluster,columns=['cluster'])
merge_data = pd.concat((data_2,cluster_labels),axis=1)
merge_data.head()
| | Channel code | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time | Material type | Type of advertisement | Way of cooperation | Advertising size | Advertising selling points | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A203 | 3.69 | 0.0071 | 0.0214 | 2.3071 | 419.77 | 0.0258 | 20.0 | jpg | banner | roi | 140*40 | Discount | 3 |
| 1 | A387 | 178.70 | 0.0040 | 0.0324 | 2.0489 | 157.94 | 0.0030 | 19.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
| 2 | A388 | 91.77 | 0.0022 | 0.0530 | 1.8771 | 357.93 | 0.0026 | 4.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
| 3 | A389 | 1.09 | 0.0074 | 0.3382 | 4.2426 | 364.07 | 0.0153 | 10.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
| 4 | A390 | 3.37 | 0.0028 | 0.1740 | 2.1934 | 313.34 | 0.0007 | 30.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
Finding the data characteristics of each cluster
# Count the records in each cluster
cluster_count = pd.DataFrame(merge_data["Channel code"].groupby(
    merge_data['cluster']).count()).T.rename({"Channel code": "count"})
# Proportion of each cluster
cluster_ratio = (cluster_count / len(merge_data)).round(4).rename({"count": "per"})
cluster_features = []  # collects one feature summary per cluster
for line in range(good_k):
    label_data = merge_data[merge_data["cluster"] == line]  # rows belonging to this cluster
    part1_data = label_data.iloc[:, 1:8]  # the numeric columns
    part1_desc = part1_data.describe().round(3)
    merge_data_mean = part1_desc.loc["mean"]  # mean of each numeric feature
    part2_data = label_data.iloc[:, 8:-1]  # the string (categorical) columns
    part2_desc = part2_data.describe(include="all")
    merge_data2_mean = part2_desc.loc["top"]  # most frequent value of each categorical feature
    merge_line = pd.concat((merge_data_mean, merge_data2_mean), axis=0)  # combine both parts
    cluster_features.append(merge_line)
cluster_df = pd.DataFrame(cluster_features).T
cluster_all = pd.concat((cluster_count, cluster_ratio, cluster_df), axis=0)
cluster_all
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 411 | 297 | 27 | 154 |
| per | 0.4623 | 0.3341 | 0.0304 | 0.1732 |
| average per day UV | 1369.81 | 1194.69 | 1263.03 | 2718.7 |
| Average registration rate | 0.003 | 0.003 | 0.003 | 0.005 |
| Average search volume | 0.082 | 0.144 | 0.151 | 0.051 |
| Depth of visit | 0.918 | 5.728 | 9.8 | 0.948 |
| Average residence time | 165.094 | 285.992 | 374.689 | 104.14 |
| Order conversion rate | 0.009 | 0.016 | 0.017 | 0.007 |
| Total launch time | 8.462 | 8.57 | 7.996 | 8.569 |
| Material type | swf | jpg | swf | jpg |
| Type of advertisement | Not sure | Not sure | banner | banner |
| Way of cooperation | cpc | cpc | cpc | cpc |
| Advertising size | 600*90 | 600*90 | 900*120 | 308*388 |
| Advertising selling points | Discount | Straight down | Discount | Full reduction |
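The numeric rows of this table can be cross-checked with a plain groupby over the cluster labels; a minimal sketch using the same column positions as the loop above:
# cross-check the per-cluster means of the numeric columns
merge_data.groupby("cluster")[list(merge_data.columns[1:8])].mean().round(3)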
# Radar (polar) chart to visualize the cluster features
num_sets = cluster_df.iloc[:6, :].T.astype(np.float64)  # first 6 numeric features, one row per cluster
num_sets_minmax = minmax_scaler.fit_transform(num_sets)  # rescale so the clusters are comparable
# plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, polar=True)
labels = np.array(merge_data_mean.index[:-1])  # labels of the 6 plotted features
colors = ['r', 'g', 'b', 'y']
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # one angle per feature
angles = np.concatenate((angles, [angles[0]]))  # repeat the first angle to close the polygon
for i in range(len(num_sets)):
    data_tmp = num_sets_minmax[i, :]
    df = np.concatenate((data_tmp, [data_tmp[0]]))  # close the polygon
    ax.plot(angles, df, 'o-', c=colors[i], label=i)
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontsize=12)  # label the 6 axes (drop the duplicated closing angle)
ax.set_title("Comparison of salient features of each cluster", fontsize=20)  # title
ax.set_rlim(-0.2, 1.2)  # radial axis range
plt.legend(loc=0)  # legend position
plt.show()  # show the figure

Brief analysis
- A preliminary reading of the data:
  - Cluster 0 accounts for the largest share, but shows no outstanding strengths in its features; it is mediocre across the board.
  - Cluster 1 accounts for 33% of channels and performs well on average search volume, residence time, depth of visit, and order conversion rate.
  - Cluster 2 is very similar to cluster 1 and performs even better on those typical features, but its share is too low, only about 3%.
  - Cluster 3 is clearly different from the other clusters: it shows the traits of a high-traffic channel, but the traffic quality is poor.
- How the business should select among the different types of channels (a sketch for exporting each cluster's channel list follows this list):
  - Cluster 0 channels are unremarkable in every respect; their delivery value needs to be reconsidered, and they are candidates to cut when budget is tight.
  - Clusters 1 and 2 are channels with high traffic quality, especially cluster 2. The operating strategy should strengthen guidance toward registration and registration incentives, focus the promotions on discount and direct-price-cut selling points, and use the 900*120 ad size. Channels like these should play the role of sustaining traffic quality and be a focus of the delivery mix.
  - Cluster 3 channels are typical traffic channels and the backbone of traffic in marketing campaigns; their pulling power is obvious. The selling point should be full-reduction promotions, and the appropriate ad size is 308*388.
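To put these groupings to use, the channel codes in each cluster can be exported for the delivery team. A minimal sketch; the output file names are illustrative:
# export the channel codes of each cluster (file names are illustrative)
for k in range(good_k):
    channels = merge_data.loc[merge_data["cluster"] == k, "Channel code"]
    channels.to_csv(f"cluster_{k}_channels.csv", index=False)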