Advertising effect cluster analysis (KMeans)
2022-06-25 15:12:00 【A window full of stars and milky way】
I did a project a while ago for a client in the education industry. Their main means of promotion is placing advertisements on various channels and using the ads to drive users to their website.
There are many advertising channels, however, and some perform well while others do not, so a targeted analysis of advertising effectiveness is needed to measure and optimize the campaigns. This reminded me of the KMeans clustering method I had learned earlier, so I am organizing the method and ideas here for future reference.
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer  # converts dicts of categorical features to vectors
from sklearn.preprocessing import MinMaxScaler  # Min-Max scaling
from sklearn.cluster import KMeans  # KMeans clustering
from sklearn import metrics  # model evaluation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context='notebook',font='simhei',style='whitegrid')
%matplotlib inline
data = pd.read_csv('./ad_data.txt',delimiter='\t')
data.head(3)
| | Channel code | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time | Material type | Type of advertisement | Way of cooperation | Advertising size | Advertising selling points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A203 | 3.69 | 0.0071 | 0.0214 | 2.3071 | 419.77 | 0.0258 | 20.0 | jpg | banner | roi | 140*40 | Discount |
| 1 | A387 | 178.70 | 0.0040 | 0.0324 | 2.0489 | 157.94 | 0.0030 | 19.0 | jpg | banner | cpc | 140*40 | Full reduction |
| 2 | A388 | 91.77 | 0.0022 | 0.0530 | 1.8771 | 357.93 | 0.0026 | 4.0 | jpg | banner | cpc | 140*40 | Full reduction |
Data review
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 13 columns):
Channel code 889 non-null object
average per day UV 889 non-null float64
Average registration rate 889 non-null float64
Average search volume 889 non-null float64
Depth of visit 889 non-null float64
Average residence time 887 non-null float64
Order conversion rate 889 non-null float64
Total launch time 889 non-null float64
Material type 889 non-null object
Type of advertisement 889 non-null object
Way of cooperation 889 non-null object
Advertising size 889 non-null object
Advertising selling points 889 non-null object
dtypes: float64(7), object(6)
memory usage: 90.4+ KB
From the output above we can see each field's data type, and that the field "Average residence time" has two missing values.
With this many fields, missing values are not easy to spot by eye; they can be counted like this:
# Show the number of missing values in each field as a one-row table
pd.DataFrame(data.isnull().sum(),columns=["num"]).T
| | Channel code | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time | Material type | Type of advertisement | Way of cooperation | Advertising size | Advertising selling points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| num | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Replace the missing values with the field mean
data_2 = data.fillna({"Average residence time": data["Average residence time"].mean()})  # only this column has NaNs
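A quick check that no missing values remain after the fill; a minimal sketch:
# all missing values should now be gone
assert data_2.isnull().sum().sum() == 0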
# Descriptive statistics
data_2.describe().round(3)
| | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time |
|---|---|---|---|---|---|---|---|
| count | 889.000 | 889.000 | 889.000 | 889.000 | 889.000 | 889.000 | 889.000 |
| mean | 540.847 | 0.001 | 0.030 | 2.167 | 262.669 | 0.003 | 16.053 |
| std | 1634.410 | 0.003 | 0.106 | 3.801 | 224.112 | 0.012 | 8.509 |
| min | 0.060 | 0.000 | 0.000 | 1.000 | 1.640 | 0.000 | 1.000 |
| 25% | 6.180 | 0.000 | 0.001 | 1.392 | 126.200 | 0.000 | 9.000 |
| 50% | 114.180 | 0.000 | 0.003 | 1.793 | 236.660 | 0.000 | 16.000 |
| 75% | 466.870 | 0.001 | 0.012 | 2.216 | 357.930 | 0.002 | 24.000 |
| max | 25294.770 | 0.039 | 1.037 | 98.980 | 4450.830 | 0.216 | 30.000 |
From the descriptive statistics we can see:
- The UV data fluctuates a lot, which shows that the differences between channels are pronounced. A large spread is not necessarily an outlier: advertising traffic is bursty by nature, so these values are generally not treated as outliers (a quick quantile check follows this list).
- Several statistics of the average registration rate, average search volume, and order conversion rate are 0, but since the maxima themselves are tiny, the values are simply small in absolute terms, which matches reality, so this is normal.
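To confirm that the wide UV spread is a long tail rather than bad records, a quick look at the upper quantiles helps. A minimal sketch, assuming the column name matches the translated header shown above:
# inspect the upper tail of daily UV (column name assumed from the table above)
data_2["average per day UV"].quantile([0.5, 0.9, 0.99, 1.0])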
data_2.corr().round(2)
| | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time |
|---|---|---|---|---|---|---|---|
| average per day UV | 1.00 | -0.05 | -0.07 | -0.02 | 0.04 | -0.05 | -0.04 |
| Average registration rate | -0.05 | 1.00 | 0.24 | 0.11 | 0.22 | 0.32 | -0.01 |
| Average search volume | -0.07 | 0.24 | 1.00 | 0.06 | 0.17 | 0.13 | -0.03 |
| Depth of visit | -0.02 | 0.11 | 0.06 | 1.00 | 0.72 | 0.16 | 0.06 |
| Average residence time | 0.04 | 0.22 | 0.17 | 0.72 | 1.00 | 0.25 | 0.05 |
| Order conversion rate | -0.05 | 0.32 | 0.13 | 0.16 | 0.25 | 1.00 | -0.00 |
| Total launch time | -0.04 | -0.01 | -0.03 | 0.06 | 0.05 | -0.00 | 1.00 |
# Plot pairwise distributions with regression fits
sns.pairplot(data_2,kind='reg')

From the correlation analysis above, only average residence time and depth of visit are noticeably correlated (0.72), and even that relationship is not especially strong; the correlations between the other features are not prominent.
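A heatmap makes the same matrix easier to scan at a glance. A minimal sketch, assuming a pandas version (>= 1.5) that supports numeric_only:
# visualize the correlation matrix (numeric_only assumes pandas >= 1.5)
sns.heatmap(data_2.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()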
Data preprocessing
# Convert the string categories to integers (discretization)
cols = ["Material type", "Type of advertisement", "Way of cooperation", "Advertising size", "Advertising selling points"]
convert_matrix = data_2[cols]
lines = convert_matrix.shape[0]
dict_list = []  # list of dicts mapping column names to integer codes
unique_list = []  # one list of unique values per column
for col_name in cols:
    col_unique_value = data_2[col_name].unique().tolist()  # unique values of this column, in order of appearance
    unique_list.append(col_unique_value)
for line_index in range(lines):
    each_record = convert_matrix.iloc[line_index]  # one row, as a Series
    for each_index, each_data in enumerate(each_record):
        # replace each string value with its index in that column's unique-value list
        list_value = unique_list[each_index]
        each_record.iloc[each_index] = list_value.index(each_data)
    each_dict = dict(zip(cols, each_record))
    dict_list.append(each_dict)
model_transform = DictVectorizer(sparse=False, dtype=np.int64)  # sparse=False returns a dense array directly
data_dicvec = model_transform.fit_transform(dict_list)
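As a cross-check, pandas.factorize produces the same order-of-appearance integer codes in far fewer lines. A minimal sketch; note that DictVectorizer orders its output columns by sorted feature name, so the column order here (which follows cols) may differ:
# equivalent integer coding via pandas.factorize; column order follows cols
codes = np.column_stack([pd.factorize(data_2[c])[0] for c in cols])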
As the data shows, these fields sit on very different orders of magnitude: UV runs into the tens of thousands while the conversion rates are below 1, so the data needs to be normalized. Here we use Min-Max scaling, which maps each feature to [0, 1] via x' = (x - min) / (max - min).
# Min-Max scaling
scaler_matrix = data_2.iloc[:, 1:8]  # the 7 numeric columns
minmax_scaler = MinMaxScaler()
data_scaler = minmax_scaler.fit_transform(scaler_matrix)
# Merge the scaled numeric data with the encoded categorical data
data3 = np.hstack((data_scaler, data_dicvec))  # horizontal merge
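A quick sanity check on the merged matrix: with 7 scaled numeric fields and 5 encoded categorical fields we expect 12 columns, and the scaled block should span [0, 1]. A minimal sketch:
# expect (889, 12): 7 scaled numeric + 5 integer-coded categorical columns
print(data3.shape)
print(data_scaler.min(), data_scaler.max())  # should print 0.0 and 1.0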
Clustering
The key point of the KMeans algorithm is choosing K. As an unsupervised method, KMeans has no single "best" K value; in terms of the data, a good K minimizes the distance within clusters while maximizing the distance between clusters. Metrics such as the average silhouette coefficient, or the ratio of intra-cluster to inter-cluster distance, can be used to evaluate a K value. For a single sample, the silhouette is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. Here we enumerate candidate K values, compute the average silhouette coefficient for each, and choose the K with the largest coefficient.
score_list = []  # stores (K, average silhouette) pairs
score_init = -1  # initial silhouette (the coefficient ranges over [-1, 1])
for n_k in range(2, 11):
    model_kmeans = KMeans(n_clusters=n_k, random_state=0)  # build the model
    cluster_tmp = model_kmeans.fit_predict(data3)  # fit and get cluster labels
    score_tmp = metrics.silhouette_score(data3, cluster_tmp)  # average silhouette for this K
    if score_tmp > score_init:  # keep the best K seen so far
        good_k = n_k  # store the K value
        score_init = score_tmp  # store the silhouette for the next comparison
        good_model = model_kmeans  # store the model
        good_cluster = cluster_tmp  # store the cluster labels
    score_list.append([n_k, score_tmp])
print(score_list)
print('Best K is:{0} with average silhouette of {1}'.format(good_k, score_init.round(4)))
[[2, 0.4669282108253203], [3, 0.5490464644387694], [4, 0.5696854692292723], [5, 0.481866036548318], [6, 0.45477666842362924], [7, 0.4820426124661439], [8, 0.5044722277929435], [9, 0.5269749291473864], [10, 0.5433876151990182]]
Best K is:4 with average silhouette of 0.5697
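Plotting the scores makes the peak easy to see; a minimal sketch:
# average silhouette versus K
ks, scores = zip(*score_list)
plt.plot(ks, scores, 'o-')
plt.xlabel('K')
plt.ylabel('average silhouette')
plt.show()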
As the output shows, the silhouette coefficient is largest at K=4, so we choose 4 as the best K value.
cluster_labels = pd.DataFrame(good_cluster,columns=['cluster'])
merge_data = pd.concat((data_2,cluster_labels),axis=1)
merge_data.head()
| | Channel code | average per day UV | Average registration rate | Average search volume | Depth of visit | Average residence time | Order conversion rate | Total launch time | Material type | Type of advertisement | Way of cooperation | Advertising size | Advertising selling points | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A203 | 3.69 | 0.0071 | 0.0214 | 2.3071 | 419.77 | 0.0258 | 20.0 | jpg | banner | roi | 140*40 | Discount | 3 |
| 1 | A387 | 178.70 | 0.0040 | 0.0324 | 2.0489 | 157.94 | 0.0030 | 19.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
| 2 | A388 | 91.77 | 0.0022 | 0.0530 | 1.8771 | 357.93 | 0.0026 | 4.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
| 3 | A389 | 1.09 | 0.0074 | 0.3382 | 4.2426 | 364.07 | 0.0153 | 10.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
| 4 | A390 | 3.37 | 0.0028 | 0.1740 | 2.1934 | 313.34 | 0.0007 | 30.0 | jpg | banner | cpc | 140*40 | Full reduction | 3 |
Finding the data characteristics of each cluster
# Count the records in each cluster
cluster_count = pd.DataFrame(merge_data["Channel code"].groupby(
    merge_data['cluster']).count()).T.rename({"Channel code": "count"})
# Proportion of each cluster
cluster_ratio = (cluster_count / len(merge_data)).round(4).rename({"count": "per"})
cluster_features = []  # collects one feature summary per cluster
for line in range(good_k):
    label_data = merge_data[merge_data["cluster"] == line]  # rows belonging to this cluster
    part1_data = label_data.iloc[:, 1:8]  # the numeric columns
    part1_desc = part1_data.describe().round(3)
    merge_data_mean = part1_desc.loc["mean"]  # mean of each numeric feature
    part2_data = label_data.iloc[:, 8:-1]  # the string (categorical) columns
    part2_desc = part2_data.describe(include="all")
    merge_data2_mean = part2_desc.loc["top"]  # most frequent value of each categorical feature
    merge_line = pd.concat((merge_data_mean, merge_data2_mean), axis=0)  # combine both parts
    cluster_features.append(merge_line)
cluster_df = pd.DataFrame(cluster_features).T
cluster_all = pd.concat((cluster_count, cluster_ratio, cluster_df), axis=0)
cluster_all
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 411 | 297 | 27 | 154 |
| per | 0.4623 | 0.3341 | 0.0304 | 0.1732 |
| average per day UV | 1369.81 | 1194.69 | 1263.03 | 2718.7 |
| Average registration rate | 0.003 | 0.003 | 0.003 | 0.005 |
| Average search volume | 0.082 | 0.144 | 0.151 | 0.051 |
| Depth of visit | 0.918 | 5.728 | 9.8 | 0.948 |
| Average residence time | 165.094 | 285.992 | 374.689 | 104.14 |
| Order conversion rate | 0.009 | 0.016 | 0.017 | 0.007 |
| Total launch time | 8.462 | 8.57 | 7.996 | 8.569 |
| Material type | swf | jpg | swf | jpg |
| Type of advertisement | Not sure | Not sure | banner | banner |
| Way of cooperation | cpc | cpc | cpc | cpc |
| Advertising size | 600*90 | 600*90 | 900*120 | 308*388 |
| Advertising selling points | Discount | Straight down | Discount | Full reduction |
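The numeric rows of this table can be cross-checked with a plain groupby over the cluster labels; a minimal sketch using the same column positions as the loop above:
# cross-check the per-cluster means of the numeric columns
merge_data.groupby("cluster")[list(merge_data.columns[1:8])].mean().round(3)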
# Radar (polar) chart to visualize the cluster features
num_sets = cluster_df.iloc[:6, :].T.astype(np.float64)  # first 6 numeric features, one row per cluster
num_sets_minmax = minmax_scaler.fit_transform(num_sets)  # rescale so the clusters are comparable
# plot
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, polar=True)
labels = np.array(merge_data_mean.index[:-1])  # labels of the 6 plotted features
colors = ['r', 'g', 'b', 'y']
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)  # one angle per feature
angles = np.concatenate((angles, [angles[0]]))  # repeat the first angle to close the polygon
for i in range(len(num_sets)):
    data_tmp = num_sets_minmax[i, :]
    df = np.concatenate((data_tmp, [data_tmp[0]]))  # close the polygon
    ax.plot(angles, df, 'o-', c=colors[i], label=i)
ax.set_thetagrids(angles[:-1] * 180 / np.pi, labels, fontsize=12)  # label the 6 axes (drop the duplicated closing angle)
ax.set_title("Comparison of salient features of each cluster", fontsize=20)  # title
ax.set_rlim(-0.2, 1.2)  # radial axis range
plt.legend(loc=0)  # legend position
plt.show()  # show the figure

Brief analysis
- A preliminary reading of the data:
  - Cluster 0 accounts for the largest share, but shows no outstanding strengths in its features; it is mediocre across the board.
  - Cluster 1 accounts for 33% of channels and performs well on average search volume, residence time, depth of visit, and order conversion rate.
  - Cluster 2 is very similar to cluster 1 and performs even better on those typical features, but its share is too low, only about 3%.
  - Cluster 3 is clearly different from the other clusters: it shows the traits of a high-traffic channel, but the traffic quality is poor.
- How the business should select among the different types of channels (a sketch for exporting each cluster's channel list follows this list):
  - Cluster 0 channels are unremarkable in every respect; their delivery value needs to be reconsidered, and they are candidates to cut when budget is tight.
  - Clusters 1 and 2 are channels with high traffic quality, especially cluster 2. The operating strategy should strengthen guidance toward registration and registration incentives, focus the promotions on discount and direct-price-cut selling points, and use the 900*120 ad size. Channels like these should play the role of sustaining traffic quality and be a focus of the delivery mix.
  - Cluster 3 channels are typical traffic channels and the backbone of traffic in marketing campaigns; their pulling power is obvious. The selling point should be full-reduction promotions, and the appropriate ad size is 308*388.
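To put these groupings to use, the channel codes in each cluster can be exported for the delivery team. A minimal sketch; the output file names are illustrative:
# export the channel codes of each cluster (file names are illustrative)
for k in range(good_k):
    channels = merge_data.loc[merge_data["cluster"] == k, "Channel code"]
    channels.to_csv(f"cluster_{k}_channels.csv", index=False)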