Fishman: Telecom Customer Churn Prediction Competition Solution
2022-06-21 17:45:00 【Datawhale】
Datawhale practical content
Author: Fishman, Master's student at Wuhan University
2022 iFLYTEK: Telecom Customer Churn Prediction Challenge
Competition page (continuously updated):
https://challenge.xfyun.cn/topic/info?type=telecom-customer&ch=ds22-dw-zs01
Competition Introduction
As the market becomes saturated, competition among telecom operators grows increasingly fierce, and operators urgently need to reduce customer churn and extend the customer life cycle. Every 5% increase in the churn rate can cut profits by 25% to 85%, so reducing the loss of telecom customers is crucial.
For this reason, operators usually run customer service departments whose main job is churn analysis: winning back customers who are likely to leave and keeping churn low. A telecom institution has been losing a large number of customers, leading to a sharp decline in its user base. Facing this problem, the institution has opened up part of its customer data and invites you to help build a churn prediction model that identifies customers at risk of leaving.
Competition Task
The data contains customer information from the telecom institution's real business, with 69 customer-related fields. The "Is it lost" field indicates whether a customer churned within two months after the observation date. The task is to train a model on the training set to predict whether customers will churn, and on that basis improve user retention.
Competition Data
The data consists of a training set and a test set with more than 250,000 records in total and 69 feature fields. To keep the competition fair, 150,000 records are drawn as the training set and 30,000 as the test set, and some fields are desensitized.
Feature Fields
Customer ID, geographical area, dual-band or not, refurbished handset or not, current handset price, mobile network capability, marital status, number of adults in the household, information-base match, estimated income, credit card indicator, days the current device has been in use, total months in service, number of unique subscribers in the household, number of active household users, ..., average monthly minutes of use over the past six months, average monthly number of calls over the past six months, average monthly spend over the past six months, churn flag ("Is it lost").
Evaluation Metric
The competition uses AUC as the evaluation metric:
from sklearn import metrics
auc = metrics.roc_auc_score(data['default_score_true'], data['default_score_pred'])
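As a quick sanity check, roc_auc_score can be tried on a small hand-made example (the labels and scores below are made up purely for illustration and are not competition data):

from sklearn.metrics import roc_auc_score

# Toy example: 1 = churned, 0 = retained (made-up values)
y_true = [0, 0, 1, 1, 0, 1]
# Made-up predicted churn probabilities
y_pred = [0.1, 0.4, 0.8, 0.65, 0.2, 0.9]

# Prints 1.0 here, because every churner is scored higher than every non-churner
print(roc_auc_score(y_true, y_pred))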
Competition Baseline
Import Modules
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

Data Preprocessing
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0, ignore_index=True)
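The training and test sets are concatenated so that any feature processing is applied to both in the same way. The baseline passes the raw fields straight to the tree models; if some of the 69 fields arrive as strings (for example marital status or region), a simple label encoding such as the sketch below can be applied first. This step is an assumption about the data types rather than part of the original baseline, and the column names follow this article's translated listing:

from sklearn.preprocessing import LabelEncoder

# Optional: encode object (string) columns so the tree models accept them.
# Skip this if every field is already numeric after desensitization.
for col in data.columns:
    if data[col].dtype == 'object' and col not in [' Is it lost ', ' Customer ID']:
        data[col] = LabelEncoder().fit_transform(data[col].astype(str))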
Training / Test Data Preparation

features = [f for f in data.columns if f not in [' Is it lost ',' Customer ID']]
train = data[data[' Is it lost '].notnull()].reset_index(drop=True)
test = data[data[' Is it lost '].isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train[' Is it lost ']
Build the Model

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions on the training set
    test = np.zeros(test_x.shape[0])    # fold-averaged predictions on the test set
    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.7,
                'bagging_fraction': 0.7,
                'bagging_freq': 10,
                'learning_rate': 0.2,
                'seed': 2022,
                'n_jobs': -1
            }
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            # Top 20 features by gain importance
            print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.2,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=3000)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits  # accumulate the fold average of the test predictions
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
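A small optional tweak: StratifiedKFold is already imported, and since the churn label is binary, stratified folds keep the churn ratio consistent across folds. Swapping it in only takes one line inside cv_model (this is a suggestion, not part of the original baseline):

# Optional: stratify the folds on the binary churn label instead of plain KFold
kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)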
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
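The baseline trains only the LightGBM model, but xgb_model and cat_model are defined as well. A straightforward optional extension is to average the three test predictions; the equal 1/3 weights below are an arbitrary illustration and would normally be chosen based on the cross-validation scores:

# Optional extension: train all three models and blend their test predictions
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)
blend_test = (lgb_test + xgb_test + cat_test) / 3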
Submit Results

test[' Is it lost '] = lgb_test
test[[' Customer ID',' Is it lost ']].to_csv('test_sub.csv', index=False)

Data Mining Competition Communication Group

If the group is already full, follow the Datawhale official WeChat account and reply "data mining", "CV", or "NLP" to be invited into the corresponding team group, where members exchange experience and competition updates are posted.
Like, save, and share; let's learn together ❤️