KKBox Churn Prediction

12 minute read

Overview

KKBOX is Asia’s leading music streaming service, holding the world’s most comprehensive Asia-Pop music library with over 30 million tracks. They offer a generous, unlimited version of their service to millions of people, supported by advertising and paid subscriptions. This delicate model is dependent on accurately predicting churn of their paid users.

In this project, we will build an algorithm that predicts whether a user will churn after their subscription expires. Here, we will determine that churn has occured if the customer has not renewed their subscription 30 days after it has expired or has been cancelled. Currently, the company uses survival analysis techniques to determine the residual membership life time for each subscriber. By adopting different methods, KKBOX anticipates they’ll discover new insights to why users leave so they can be proactive in keeping users dancing.

You can view the jupter notebook link here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import time


from sklearn.model_selection import cross_val_score

#import models
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
import xgboost as xgb
from xgboost import XGBClassifier

#dimension reduction
from sklearn.decomposition import PCA

# Feature Scaling for modelling purpose by using both min-max-scaling
from sklearn.preprocessing import MinMaxScaler
s=time.time()
train = pd.read_csv('input/train_v2.csv') #contains the true churn values
print("Finished importing train in "+str(round(time.time()-s,2)))
sample_submission = pd.read_csv('input/sample_submission_v2.csv') #only relevent for kaggle submission purposes
print("Finished importing sample_submission in "+str(round(time.time()-s,2)))
transactions = pd.read_csv('input/transactions_v2.csv') #contains transaction data
print("Finished importing transactions in "+str(round(time.time()-s,2)))
user_logs = pd.read_csv('input/user_logs_v2.csv') #contains intra day data related to usage
print("Finished importing user_logs in "+str(round(time.time()-s,2)))
members = pd.read_csv('input/members_v3.csv')
print("Finished importing members in "+str(round(time.time()-s,2)))
Finished importing train in 0.73
Finished importing sample_submission in 1.47
Finished importing transactions in 2.87
Finished importing user_logs in 21.29
Finished importing members in 27.34
train.head()
msno is_churn
0 ugx0CjOMzazClkFzU2xasmDZaoIqOUAZPsH1q0teWCg= 1
1 f/NmvEzHfhINFEYZTR05prUdr+E+3+oewvweYz9cCQE= 1
2 zLo9f73nGGT1p21ltZC3ChiRnAVvgibMyazbCxvWPcg= 1
3 8iF/+8HY8lJKFrTc7iR9ZYGCG2Ecrogbc2Vy5YhsfhQ= 1
4 K6fja4+jmoZ5xG6BypqX80Uw/XKpMgrEMdG2edFOxnA= 1
user_logs.head()
msno date num_25 num_50 num_75 num_985 num_100 num_unq total_secs
0 u9E91QDTvHLq6NXjEaWv8u4QIqhrHk72kE+w31Gnhdg= 20170331 8 4 0 1 21 18 6309.273
1 nTeWW/eOZA/UHKdD5L7DEqKKFTjaAj3ALLPoAWsU8n0= 20170330 2 2 1 0 9 11 2390.699
2 2UqkWXwZbIjs03dHLU9KHJNNEvEkZVzm69f3jCS+uLI= 20170331 52 3 5 3 84 110 23203.337
3 ycwLc+m2O0a85jSLALtr941AaZt9ai8Qwlg9n0Nql5U= 20170331 176 4 2 2 19 191 7100.454
4 EGcbTofOSOkMmQyN1NMLxHEXJ1yV3t/JdhGwQ9wXjnI= 20170331 2 1 0 1 112 93 28401.558
members.head()
msno city bd gender registered_via registration_init_time
0 Rb9UwLQTrxzBVwCB6+bCcSQWZ9JiNLC9dXtM1oEsZA8= 1 0 NaN 11 20110911
1 +tJonkh+O1CA796Fm5X60UMOtB6POHAwPjbTRVl/EuU= 1 0 NaN 7 20110914
2 cV358ssn7a0f7jZOwGNWS07wCKVqxyiImJUX6xcIwKw= 1 0 NaN 11 20110915
3 9bzDeJP6sQodK73K5CBlJ6fgIQzPeLnRl0p5B77XP+g= 1 0 NaN 11 20110915
4 WFLY3s7z4EZsieHCt63XrsdtfTEmJ+2PnnKLH5GY4Tk= 6 32 female 9 20110915
transactions.head()
msno payment_method_id payment_plan_days plan_list_price actual_amount_paid is_auto_renew transaction_date membership_expire_date is_cancel
0 ++6eU4LsQ3UQ20ILS7d99XK8WbiVgbyYL4FUgzZR134= 32 90 298 298 0 20170131 20170504 0
1 ++lvGPJOinuin/8esghpnqdljm6NXS8m8Zwchc7gOeA= 41 30 149 149 1 20150809 20190412 0
2 +/GXNtXWQVfKrEDqYAzcSw2xSPYMKWNj22m+5XkVQZc= 36 30 180 180 1 20170303 20170422 0
3 +/w1UrZwyka4C9oNH3+Q8fUf3fD8R3EwWrx57ODIsqk= 36 30 180 180 1 20170329 20170331 1
4 +00PGzKTYqtnb65mPKPyeHXcZEwqiEzktpQksaaSC3c= 41 30 99 99 1 20170323 20170423 0
# check the number of duplicate accounts in each table
print(train.duplicated('msno').sum())
print(sample_submission.duplicated('msno').sum())
print(transactions.duplicated('msno').sum())
print(user_logs.duplicated('msno').sum())
print(members.duplicated('msno').sum())
0
0
233959
17292468
0
#check number of columns and rows for each df
print(train.shape)
print(sample_submission.shape)
print(transactions.shape)
print(user_logs.shape)
print(members.shape)
(970960, 2)
(907471, 2)
(1431009, 9)
(18396362, 9)
(6769473, 6)

Transactions and user_logs contain duplicate msno values because they provide granular data for each user per day. We will need to group the data by the msno column (user id) in order to merge it with the rest of the data.

Explore feature correlation to target variable (is_churn)

Let’s begin by briefly taking a look at how each of our features are correlated with the target variable. A low correlation does not necessarily mean that we should remove the feauture becuase it is not an accurate metric for classification models which only predict binary values.

#explore members
members.merge(train, on='msno',how='inner').corr()
city bd registered_via registration_init_time is_churn
city 1.000000 0.513068 0.018996 -0.361908 0.081424
bd 0.513068 1.000000 0.067176 -0.393429 0.065719
registered_via 0.018996 0.067176 1.000000 -0.382191 -0.080642
registration_init_time -0.361908 -0.393429 -0.382191 1.000000 -0.000021
is_churn 0.081424 0.065719 -0.080642 -0.000021 1.000000

Since transactions and user_logs contain duplicate user id rows, we need to first group by the msno column (user ids) to synthesize the data into useful results.

# returns the max value of numerical variables and membership_expire_date
# returns the min value of transaction date
# returns the mode of ordinal variable and dummy variables, if multiple values share the same frequency, keep the first one
transactions_v2 = transactions.groupby('msno', as_index = False).agg({'payment_method_id': lambda x:x.value_counts().index[0], 'payment_plan_days': 'max', 'plan_list_price': 'max',
                                       'actual_amount_paid': 'max', 'is_auto_renew': lambda x:x.value_counts().index[0], 'transaction_date': 'min', 'membership_expire_date': 'max',
                                       'is_cancel': lambda x:x.value_counts().index[0]})

transactions_v2.merge(train, on='msno',how='inner').corr()
payment_method_id payment_plan_days plan_list_price actual_amount_paid is_auto_renew transaction_date membership_expire_date is_cancel is_churn
payment_method_id 1.000000 -0.264237 -0.326858 -0.325411 0.326347 -0.043490 -0.127438 -0.020496 -0.203389
payment_plan_days -0.264237 1.000000 0.965855 0.965449 -0.361611 -0.025068 0.656398 0.005977 0.455526
plan_list_price -0.326858 0.965855 1.000000 0.999404 -0.394782 -0.034520 0.647513 0.017064 0.447071
actual_amount_paid -0.325411 0.965449 0.999404 1.000000 -0.394979 -0.034559 0.647361 0.016842 0.444360
is_auto_renew 0.326347 -0.361611 -0.394782 -0.394979 1.000000 -0.009964 -0.184528 0.052175 -0.308546
transaction_date -0.043490 -0.025068 -0.034520 -0.034559 -0.009964 1.000000 -0.415458 0.014457 -0.155364
membership_expire_date -0.127438 0.656398 0.647513 0.647361 -0.184528 -0.415458 1.000000 -0.009109 0.286451
is_cancel -0.020496 0.005977 0.017064 0.016842 0.052175 0.014457 -0.009109 1.000000 0.437393
is_churn -0.203389 0.455526 0.447071 0.444360 -0.308546 -0.155364 0.286451 0.437393 1.000000

The user_logs dataframe contains interesting information related to each users intra day usage. I’ve also created some new features, such as:

  • counts_30: number of days the user used the service this month
  • counts_60: number of days the user used the service the month prior
  • num_unq_30: numbe of unique songs the user listened to this month
  • num_unq_30: numbe of unique songs the user listened to in the month prior
  • total_secs_30: total seconds of music listened to this month
  • total_secs_60: total seconds of music listened to in the month prior
  • days_since_last: how long it’s been since the user last used the service (in days)
# Explore user_logs

#convert dates to datetime obj
user_logs['date'] = pd.to_datetime(user_logs['date'], format = '%Y%m%d')

#print latest date
user_logs['date'].max()
Timestamp('2017-03-31 00:00:00')
# returns the max value of date and number of unique songs
# returns the sum of other variables
user_logs_v2 = user_logs.groupby('msno', as_index = False).agg({'date': 'max', 'num_25': 'sum', 'num_50': 'sum', 'num_75': 'sum',
                                 'num_985': 'sum', 'num_100': 'sum', 'num_unq': 'max', 'total_secs': 'sum'})

#change date to max_date
user_logs_v2.columns = ['msno', 'max_date', 'num_25', 'num_50', 'num_75', 'num_985', 'num_100',
       'num_unq', 'total_secs']

#shift the dates back 30, 60 and 90 days
temp = user_logs.copy()
temp['max_date'] = user_logs_v2['max_date']
temp['last_30_days'] = temp['max_date'] - pd.Timedelta(days=30)
temp['last_60_days'] = temp['max_date'] - pd.Timedelta(days=60)
temp['last_90_days'] = temp['max_date'] - pd.Timedelta(days=90)

#group by msno to get total counts of days, total unique songs, and total seconds for 30/60/90 days
temp1 = temp[temp['date']>=temp['last_30_days']]
temp1_groupby = temp1.groupby(['msno']).agg({'msno': 'count', 'num_unq': 'sum', 'total_secs':'sum'})
temp1_groupby.columns = ['counts_30', 'num_unq_30', 'total_secs_30']
temp1_groupby.reset_index(inplace=True)

temp2 = temp[temp['date']>=temp['last_60_days']]
temp2_groupby = temp2.groupby(['msno']).agg({'msno': 'count', 'num_unq': 'sum', 'total_secs':'sum'})
temp2_groupby.columns = ['counts_60', 'num_unq_60', 'total_secs_60']
temp2_groupby.reset_index(inplace=True)


temp3 = temp[temp['date']>=temp['last_90_days']]
temp3_groupby = temp3.groupby(['msno']).agg({'msno': 'count', 'num_unq': 'sum', 'total_secs':'sum'})
temp3_groupby.columns = ['counts_90', 'num_unq_90', 'total_secs_90']
temp3_groupby.reset_index(inplace=True)


#merge back to user_logs_v2
user_logs_v2 = user_logs_v2.merge(temp1_groupby, on='msno', how='left')
user_logs_v2 = user_logs_v2.merge(temp2_groupby, on='msno', how='left')
user_logs_v2 = user_logs_v2.merge(temp3_groupby, on='msno', how='left')

temp1_groupby.head()
msno counts_30 num_unq_30 total_secs_30
0 +++IZseRRiQS9aaSkH6cMYU6bGDcxUieAi/tH67sC5s= 3 25 4245.614
1 +++hVY1rZox/33YtvDgmKA2Frg/2qhkz12B9ylCvh8o= 1 5 924.747
2 +++l/EXNMLTijfLBa8p2TUVVVp2aFGSuUI/h7mLmthw= 1 7 1395.247
3 ++0+IdHga8fCSioOVpU8K7y4Asw8AveIApVH2r9q9yY= 2 58 10594.449
4 ++0/NopttBsaAn6qHZA2AWWrDg7Me7UOMs1vsyo4tSI= 2 9 1650.342
#how long has it been since the user last used the service?
user_logs_v2['today'] = pd.to_datetime('2017-03-31 00:00:00')
#user_logs_v2['days_since_last'] = pd.Timedelta(user_logs_v2['today'] - user_logs_v2['max_date']).seconds
user_logs_v2['days_since_last'] = (user_logs_v2['today'] - user_logs_v2['max_date']).dt.days

# to make these columns more useful, convert them to percentages
# calculate the percentage of number of songs played within certain period
user_logs_v2['percent_25'] = user_logs_v2['num_25']/(user_logs_v2['num_25']+user_logs_v2['num_50']+user_logs_v2['num_75']+user_logs_v2['num_985']+user_logs_v2['num_100'])
user_logs_v2['percent_50'] = user_logs_v2['num_50']/(user_logs_v2['num_25']+user_logs_v2['num_50']+user_logs_v2['num_75']+user_logs_v2['num_985']+user_logs_v2['num_100'])
user_logs_v2['percent_100'] = (user_logs_v2['num_985']+user_logs_v2['num_100'])/(user_logs_v2['num_25']+user_logs_v2['num_50']+user_logs_v2['num_75']+user_logs_v2['num_985']+user_logs_v2['num_100'])

# drop useless variables
user_logs_v3 = user_logs_v2.drop(columns = ['num_25', 'num_50', 'num_75', 'num_985', 'num_100'])
#explore correlation with target variable
# user_logs_v3.merge(train, on='msno',how='inner').corr()

Building the training data

Create the training data by combining all the data together into one big df. Next train the machine learning model using a variety of different classifer and ensemble models.

I’ve also created one additional feauture:

  • days_since_reg : total number of days the user has been a subscriber
# merge between different tables for modelling purpose
dataset_train = train.merge(members, on = 'msno', how = 'left').merge(transactions_v2, on = 'msno', how = 'left').merge(user_logs_v3, on = 'msno', how = 'left')


#create one more features


#convert to datetime objs
dataset_train['transaction_date'] = pd.to_datetime(dataset_train['transaction_date'], format = '%Y%m%d')
dataset_train['membership_expire_date'] = pd.to_datetime(dataset_train['membership_expire_date'], format = '%Y%m%d')

#length of usage in days (transaction_date is the registration date since groupby looked for the min value)
dataset_train['days_since_reg'] = (dataset_train['max_date'] - dataset_train['transaction_date']).dt.days

#dataset_train.corr()

Next we will remove rows which are not useful in predicting churn, such as msno, bd, transaction_date, etc. The data set contains lot of missing values, so we will need to figure out a method to fill in the missing values with mean/mode.

# remove useless rows AND
#remove gender and age since missing value or incorrect value is over 50%
dataset_train_v2 = dataset_train.drop(columns = ['msno', 'gender', 'bd', 'registration_init_time', 'transaction_date', 'membership_expire_date', 'max_date', 'today'])
dataset_train_v2.dtypes

# check the number of missing values in each column
dataset_train_v2.isna().sum()
is_churn                   0
city                  109993
registered_via        109993
payment_method_id      37382
payment_plan_days      37382
plan_list_price        37382
actual_amount_paid     37382
is_auto_renew          37382
is_cancel              37382
num_unq               216409
total_secs            216409
counts_30             514347
num_unq_30            514347
total_secs_30         514347
counts_60             514347
num_unq_60            514347
total_secs_60         514347
counts_90             514347
num_unq_90            514347
total_secs_90         514347
days_since_last       216409
percent_25            216409
percent_50            216409
percent_100           216409
days_since_reg        245224
dtype: int64
# Handle missing value of part of numeric columns by using mode
def replacemode(i):
    dataset_train_v2[i] = dataset_train_v2[i].fillna(dataset_train_v2[i].value_counts().index[0])
    return 

replacemode('city')
replacemode('registered_via')
replacemode('payment_method_id')
replacemode('payment_plan_days')
replacemode('is_auto_renew')
replacemode('is_cancel')

# Handle missing value of part of numeric columns by using mean
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

def replacemean(i):
    dataset_train_v2[i] = mean_imputer.fit_transform(dataset_train_v2[[i]])
    return 

replacemean('plan_list_price')
replacemean('actual_amount_paid')
replacemean('num_unq')
replacemean('total_secs')
replacemean('percent_25')
replacemean('percent_50')
replacemean('percent_100')
replacemean('days_since_reg')
replacemean('days_since_last')
replacemean('counts_30')
replacemean('counts_60')
replacemean('counts_90')
replacemean('num_unq_30')
replacemean('num_unq_60')
replacemean('num_unq_90')
replacemean('total_secs_30')
replacemean('total_secs_60')
replacemean('total_secs_90')



# Handle outliers by using capping
def replaceoutlier(i):
    mean, std = np.mean(dataset_train_v2[i]), np.std(dataset_train_v2[i])
    cut_off = std*3
    lower, upper = mean - cut_off, mean + cut_off
    dataset_train_v2[i][dataset_train_v2[i] < lower] = lower
    dataset_train_v2[i][dataset_train_v2[i] > upper] = upper
    return

replaceoutlier('plan_list_price')
replaceoutlier('actual_amount_paid')
replaceoutlier('num_unq')
replaceoutlier('total_secs')
replaceoutlier('percent_25')
replaceoutlier('percent_50')
replaceoutlier('percent_100')
replaceoutlier('days_since_reg')
replaceoutlier('days_since_last')
replaceoutlier('counts_30')
replaceoutlier('counts_60')
replaceoutlier('counts_90')
replaceoutlier('num_unq_30')
replaceoutlier('num_unq_60')
replaceoutlier('num_unq_90')
replaceoutlier('total_secs_30')
replaceoutlier('total_secs_60')
replaceoutlier('total_secs_90')

In an effort to make the training time faster, I will be using feature scaling and dimension reduction to vectorize the data.

# Feature Scaling for modelling purpose by using both min-max-scaling
from sklearn.preprocessing import MinMaxScaler
X = dataset_train_v2.drop(columns = ['is_churn'])
Y = dataset_train_v2['is_churn']
nm_X = pd.DataFrame(MinMaxScaler().fit_transform(X))
nm_X.columns = X.columns.values
nm_X.index = X.index.values

#pca reduces dimensionality, and only keeps most important vector
#cons: lose data and accuracy

# Dimension Reduction
from sklearn.decomposition import PCA
pca = PCA()
pca.fit_transform(nm_X)
np.cumsum(pca.explained_variance_ratio_)
mpca=PCA(n_components=10)
nm_X_v2 = mpca.fit_transform(nm_X)

# Split into train and test Set
from sklearn.model_selection import train_test_split
nm_X_train, nm_X_test, nm_Y_train, nm_Y_test = train_test_split(nm_X_v2, Y, test_size = 0.3, random_state = 0)
# Fit training set into different algorithms

#to make things simple, only using logistic regression and xgb classifier

nm_models = []
# nm_models.append(('KNN', KNeighborsClassifier()))
nm_models.append(('LR', LogisticRegression()))
# nm_models.append(('LDA', LinearDiscriminantAnalysis()))
# nm_models.append(('QDA', QuadraticDiscriminantAnalysis()))
# nm_models.append(('CART', DecisionTreeClassifier()))
# nm_models.append(('NB', GaussianNB()))
# # nm_models.append(('Linear SVM', SVC(kernel = 'linear')))
# # nm_models.append(('Kernel SVM', SVC(kernel = 'rbf')))

ensembles = []
# ensembles.append(('BC', BaggingClassifier(base_estimator=LogisticRegression())))
# ensembles.append(('AB', AdaBoostClassifier(base_estimator=LogisticRegression())))
# ensembles.append(('GBM', GradientBoostingClassifier()))
# ensembles.append(('RF', RandomForestClassifier()))
# ensembles.append(('ET', ExtraTreesClassifier()))
ensembles.append(('XGB', XGBClassifier()))


#nm_models
from sklearn.model_selection import cross_val_score
results = []
names = []
for name, model in nm_models:
	nm_cv_results = cross_val_score(model, nm_X_train, nm_Y_train, cv=10, scoring='neg_log_loss', n_jobs = -1)
	results.append(nm_cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, nm_cv_results.mean(), nm_cv_results.std())
	print(msg)
    
#ensembles   
results2 = []
names2 = []
for name2, model2 in ensembles:
	en_cv_results = cross_val_score(model2, nm_X_train, nm_Y_train, cv=10, scoring='neg_log_loss', n_jobs = -1)
	results2.append(en_cv_results)
	names2.append(name2)
	msg2 = "%s: %f (%f)" % (name2, en_cv_results.mean(), en_cv_results.std())
	print(msg2)
LR: -0.214069 (0.001141)
XGB: -0.148262 (0.001632)

The best performing classifier model was the logistic regression model, and the best performing ensemble model was the XGB classifier (the rest have been commented out to speed up re-training).

#evaluate feature importance

from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

# fit model to training data
model = XGBClassifier()
model.fit(nm_X_train, nm_Y_train)
# plot feature importance
plot_importance(model)
pyplot.show()
[20:49:18] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

png

According to the feature importance chart, “is_auto_renew” is the most important feature in determing wether a customer will churn, followed by the “payment_method_id” and “is_cancel”. The top 10 are:

  • is_auto_renew
  • payment_method_id
  • is_cancel
  • num_unq
  • actual_amount_paid
  • registered_via
  • plan_list_price
  • payment_plan_days
  • city
  • total_secs
# Apply Grid Search on XGBoost since it returns the best result on Cross Validation among all models
#find the optimal parameters for the model
from sklearn.model_selection import GridSearchCV
parameters = {"learning_rate": [0.1],
              "min_child_weight": [1],
              "max_depth": [9],
              "gamma": [0.1],
              "subsample": [0.8],
              "colsample_bytree": [0.8],
              "objective": ['binary:logistic']}
grid_search_XGB = GridSearchCV(estimator = XGBClassifier(), param_grid = parameters, scoring = "neg_log_loss", cv = 10, n_jobs = -1)
grid_result_XGB = grid_search_XGB.fit(nm_X_train, nm_Y_train)
print("Best: %f using %s" % (grid_result_XGB.best_score_, grid_result_XGB.best_params_)) 

# Evaluate tuned XGBoost model result on test dataset because it provides the best result
from sklearn.metrics import log_loss
nm_Y_predict = grid_result_XGB.predict_proba(nm_X_test)
logloss = log_loss(nm_Y_test, nm_Y_predict)  
print("Log Loss Score: %s" % (logloss))
#save model so you don't have to retrain
import joblib
joblib.dump(grid_result_XGB.best_estimator_, 'trained-models/xgboost_model_v2.pkl')
['trained-models/xgboost_model_v2.pkl']
#Do predictions
final=nm_X.copy()
final["is_churn"]=model.predict(nm_X_v2)
#change column name to "pred_churn" since it's our predicted results
final.columns = ['city', 'registered_via', 'payment_method_id', 'payment_plan_days',
       'plan_list_price', 'actual_amount_paid', 'is_auto_renew', 'is_cancel',
       'num_unq', 'total_secs', 'counts_30', 'num_unq_30', 'total_secs_30',
       'counts_60', 'num_unq_60', 'total_secs_60', 'counts_90', 'num_unq_90',
       'total_secs_90', 'days_since_last', 'percent_25', 'percent_50',
       'percent_100', 'days_since_reg', 'pred_churn']
#let's add the "is_churn" column (which contains the true churn values) back to the dataframe 
final['is_churn'] = dataset_train['is_churn']
#confusion matrix
from sklearn.metrics import confusion_matrix
y_true = final['is_churn']
y_pred = final['pred_churn']
confusion_matrix(y_true, y_pred)


array([[877477,   6153],
       [ 36325,  51005]])
# outcome values order in sklearn
tp, fn, fp, tn = confusion_matrix(y_true,y_pred).reshape(-1)
print('Outcome values : \n', tp, fn, fp, tn)

print('\n')

# classification report for precision, recall f1-score and accuracy
from sklearn.metrics import classification_report
matrix = classification_report(y_true, y_pred)
print('Classification report : \n',matrix)
Outcome values : 
 877477 6153 36325 51005


Classification report : 
               precision    recall  f1-score   support

           0       0.96      0.99      0.98    883630
           1       0.89      0.58      0.71     87330

    accuracy                           0.96    970960
   macro avg       0.93      0.79      0.84    970960
weighted avg       0.95      0.96      0.95    970960
#accuracy
acc = (tp + tn)/(tp + tn + fp + fn)
acc
0.9562515448628162

So far, we see that model has a high accuracy score of 95% and a low log loss value. The model performed poorly in recall, specifically for predicting churned customers (is_churn = 1). The purpose of this project is to retarget customers that the model is predicting will churn, and try to keep them as customers before they cancel their subscription. Targeting customers who are incorrectly predicted to churn (i.e false positives) are okay; however, missing customers who are incorrectly predicted not to churn (false negatives) would be much worse. We see that the model has a much lower recall when predicting customers who will churn to customers who will not. The model was only able to predict 58% of the actual churned cases, meaning that a lot of customers who will churn will be missed.

To improve the model further, we can try to remove non-important features and compare our scores. From there, we can also do more feature engineering and comapare once more.