Fishman: Telecom Customer Churn Prediction Competition Solution
2022-06-21 17:45:00 【Datawhale】
Datawhale practical content
Author: Fishman, Master's student at Wuhan University
2022 iFLYTEK: Telecom Customer Churn Prediction Challenge
Competition page (continuously updated):
https://challenge.xfyun.cn/topic/info?type=telecom-customer&ch=ds22-dw-zs01
Competition Introduction
As the market becomes saturated, competition among telecom operators grows increasingly fierce, and operators urgently need to reduce customer churn and extend the customer life cycle. Every 5% increase in the churn rate can cut profits by 25% to 85%, so reducing the loss of telecom customers is crucial.
For this reason, operators usually run customer service departments whose main job is churn analysis: winning back customers who are likely to leave and keeping churn low. A telecom institution has been losing a large number of customers, leading to a sharp decline in its user base. Facing this problem, the institution has opened up part of its customer data and invites you to help build a churn prediction model that identifies customers at risk of leaving.
Competition Task
The data contains customer information from the telecom institution's real business, with 69 customer-related fields. The "Is it lost" field indicates whether a customer churned within two months after the observation date. The task is to train a model on the training set to predict whether customers will churn, and on that basis improve user retention.
Competition Data
The data consists of a training set and a test set with more than 250,000 records in total and 69 feature fields. To keep the competition fair, 150,000 records are drawn as the training set and 30,000 as the test set, and some fields are desensitized.
Feature Fields
Customer ID, geographical area, dual-band or not, refurbished handset or not, current handset price, mobile network capability, marital status, number of adults in the household, information-base match, estimated income, credit card indicator, days the current device has been in use, total months in service, number of unique subscribers in the household, number of active household users, ..., average monthly minutes of use over the past six months, average monthly number of calls over the past six months, average monthly spend over the past six months, churn flag ("Is it lost").
Evaluation Metric
The competition uses AUC as the evaluation metric:
from sklearn import metrics
auc = metrics.roc_auc_score(data['default_score_true'], data['default_score_pred'])
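As a quick sanity check, roc_auc_score can be tried on a small hand-made example (the labels and scores below are made up purely for illustration and are not competition data):

from sklearn.metrics import roc_auc_score

# Toy example: 1 = churned, 0 = retained (made-up values)
y_true = [0, 0, 1, 1, 0, 1]
# Made-up predicted churn probabilities
y_pred = [0.1, 0.4, 0.8, 0.65, 0.2, 0.9]

# Prints 1.0 here, because every churner is scored higher than every non-churner
print(roc_auc_score(y_true, y_pred))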
Competition Baseline
Import Modules
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from gensim.models import Word2Vec
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

Data Preprocessing
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
data = pd.concat([train, test], axis=0, ignore_index=True)
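The training and test sets are concatenated so that any feature processing is applied to both in the same way. The baseline passes the raw fields straight to the tree models; if some of the 69 fields arrive as strings (for example marital status or region), a simple label encoding such as the sketch below can be applied first. This step is an assumption about the data types rather than part of the original baseline, and the column names follow this article's translated listing:

from sklearn.preprocessing import LabelEncoder

# Optional: encode object (string) columns so the tree models accept them.
# Skip this if every field is already numeric after desensitization.
for col in data.columns:
    if data[col].dtype == 'object' and col not in [' Is it lost ', ' Customer ID']:
        data[col] = LabelEncoder().fit_transform(data[col].astype(str))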
Training / Test Data Preparation

features = [f for f in data.columns if f not in [' Is it lost ',' Customer ID']]
train = data[data[' Is it lost '].notnull()].reset_index(drop=True)
test = data[data[' Is it lost '].isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train[' Is it lost ']
Build the Model

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2022
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])  # out-of-fold predictions on the training set
    test = np.zeros(test_x.shape[0])    # fold-averaged predictions on the test set
    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.7,
                'bagging_fraction': 0.7,
                'bagging_freq': 10,
                'learning_rate': 0.2,
                'seed': 2022,
                'n_jobs': -1
            }
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              categorical_feature=[], verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            # Top 20 features by gain importance
            print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.2,
                      'tree_method': 'exact',
                      'seed': 2020,
                      'nthread': 36,
                      "silent": True,
                      }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=3000, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.2, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=3000)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits  # accumulate the fold average of the test predictions
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)

    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test
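A small optional tweak: StratifiedKFold is already imported, and since the churn label is binary, stratified folds keep the churn ratio consistent across folds. Swapping it in only takes one line inside cv_model (this is a suggestion, not part of the original baseline):

# Optional: stratify the folds on the binary churn label instead of plain KFold
kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)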
def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test
lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
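The baseline trains only the LightGBM model, but xgb_model and cat_model are defined as well. A straightforward optional extension is to average the three test predictions; the equal 1/3 weights below are an arbitrary illustration and would normally be chosen based on the cross-validation scores:

# Optional extension: train all three models and blend their test predictions
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)
blend_test = (lgb_test + xgb_test + cat_test) / 3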
Submit Results

test[' Is it lost '] = lgb_test
test[[' Customer ID',' Is it lost ']].to_csv('test_sub.csv', index=False)

Data Mining Competition Communication Group

If the group is already full, follow the Datawhale official WeChat account and reply "data mining", "CV", or "NLP" to be invited into the corresponding team group, where members exchange experience and competition updates are posted.
Like, save, and share; let's learn together ❤️