HackLive - Guided Community Hackathon
Register for the hackathon to be able to download the dataset and submit your predictions.
Marketing campaigns are characterized by focusing on the customer needs and their overall satisfaction. Nevertheless, there are different variables that determine whether a marketing campaign will be successful or not. Some important aspects of a marketing campaign are as follows:
Segment of the Population: Which segment of the population is the marketing campaign going to address, and why? This aspect is extremely important since it tells us which part of the population should receive the campaign’s message.
Distribution channel to reach the customer’s place: Implementing the most effective strategy in order to get the most out of this marketing campaign. What segment of the population should we address? Which instrument should we use to get our message out? (e.g. telephone, radio, TV, social media)
Promotional Strategy: This is the way the strategy is going to be implemented and how potential clients are going to be addressed. This should be the last part of the marketing campaign analysis, since it requires an in-depth analysis of previous campaigns (if possible) in order to learn from previous mistakes and determine how to make the marketing campaign much more effective.
You are leading the marketing analytics team for a banking institution. There has been a revenue decline for the bank and they would like to know what actions to take. After investigation, it was found that the root cause is that their clients are not depositing as frequently as before. Term deposits allow banks to hold onto a deposit for a specific amount of time, so banks can lend more and thus make more profits. In addition, banks have a better chance of persuading term deposit clients to buy other products such as funds or insurance, further increasing their revenues.
You are provided with a dataset containing details of marketing campaigns done via phone, with various details for customers such as demographics, last campaign details, etc. Can you help the bank accurately predict whether a customer will subscribe to the campaign's focus product, a term deposit, after the campaign?
!pip install catboost
# import useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
# load in data and set seed
BASE = 'https://drive.google.com/uc?export=download&id='
SEED = 2021
train = pd.read_csv(f'{BASE}1fNjtZDxlQwwAE5VY7BBJODw7an-Lbob2')
test = pd.read_csv(f'{BASE}1VJUp6Zuww-OphdWBqI5Q2TRK7o1Xh_xn')
ss = pd.read_csv(f'{BASE}19P8qo-6_sykC6uTJQ60eyfmcbYpu0GtR')
# prepare a few key variables to classify columns into categorical and numeric
ID_COL, TARGET_COL = 'id', 'term_deposit_subscribed'
features = [c for c in train.columns if c not in [ID_COL, TARGET_COL]]
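cat_cols = ['job_type', 'marital', 'education', 'default', 'housing_loan',
            'personal_loan', 'communication_type', 'month', 'prev_campaign_outcome']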
num_cols = [c for c in features if c not in cat_cols]
EDA starts
First, we look at the first few rows of the train dataset.
| | id | customer_age | job_type | marital | education | default | balance | housing_loan | personal_loan | communication_type | day_of_month | month | last_contact_duration | num_contacts_in_campaign | days_since_prev_campaign_contact | num_contacts_prev_campaign | prev_campaign_outcome | term_deposit_subscribed |
| 0 | id_43823 | 28.0 | management | single | tertiary | no | 285.0 | yes | no | unknown | 26 | jun | 303.0 | 4.0 | NaN | 0 | unknown | 0 |
| 1 | id_32289 | 34.0 | blue-collar | married | secondary | no | 934.0 | no | yes | cellular | 18 | nov | 143.0 | 2.0 | 132.0 | 1 | other | 0 |
| 2 | id_10523 | 46.0 | technician | married | secondary | no | 656.0 | no | no | cellular | 5 | feb | 101.0 | 4.0 | NaN | 0 | unknown | 0 |
| | id | term_deposit_subscribed |
| 0 | id_17231 | 0 |
| 1 | id_34508 | 0 |
| 2 | id_44504 | 0 |
# look at distribution of target variable
train[TARGET_COL].value_counts(), train[TARGET_COL].value_counts(normalize=True)
(0 28253
1 3394
Name: term_deposit_subscribed, dtype: int64, 0 0.892754
1 0.107246
Name: term_deposit_subscribed, dtype: float64)
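With only about 11% positives, accuracy is a misleading yardstick here. The sketch below uses the class counts printed above to show what a hypothetical "always predict 0" model would score. The helper name `f1_score_from_counts` is made up for this illustration and is not part of any library.

```python
# Class counts copied from the value_counts() output above.
n_negative, n_positive = 28253, 3394
total = n_negative + n_positive

# A hypothetical "model" that always predicts 0 (no subscription).
true_positives = 0             # it never predicts the positive class
false_positives = 0
false_negatives = n_positive   # every subscriber is missed
true_negatives = n_negative

accuracy = (true_positives + true_negatives) / total


def f1_score_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


f1 = f1_score_from_counts(true_positives, false_positives, false_negatives)

print(f"accuracy = {accuracy:.4f}")
print(f"F1       = {f1:.4f}")
```

The trivial model looks strong on accuracy yet scores zero on F1, which is one reason an F1-based evaluation suits this target.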
# look at which variables are null and if they were parsed correctly
train.info()
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 31647 non-null object
1 customer_age 31028 non-null float64
2 job_type 31647 non-null object
3 marital 31497 non-null object
4 education 31647 non-null object
5 default 31647 non-null object
6 balance 31248 non-null float64
7 housing_loan 31647 non-null object
8 personal_loan 31498 non-null object
9 communication_type 31647 non-null object
10 day_of_month 31647 non-null int64
11 month 31647 non-null object
12 last_contact_duration 31336 non-null float64
13 num_contacts_in_campaign 31535 non-null float64
14 days_since_prev_campaign_contact 5816 non-null float64
15 num_contacts_prev_campaign 31647 non-null int64
16 prev_campaign_outcome 31647 non-null object
17 term_deposit_subscribed 31647 non-null int64
dtypes: float64(5), int64(3), object(10)
memory usage: 4.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 13564 non-null object
1 customer_age 13294 non-null float64
2 job_type 13564 non-null object
3 marital 13483 non-null object
4 education 13564 non-null object
5 default 13564 non-null object
6 balance 13383 non-null float64
7 housing_loan 13564 non-null object
8 personal_loan 13490 non-null object
9 communication_type 13564 non-null object
10 day_of_month 13564 non-null int64
11 month 13564 non-null object
12 last_contact_duration 13442 non-null float64
13 num_contacts_in_campaign 13519 non-null float64
14 days_since_prev_campaign_contact 2441 non-null float64
15 num_contacts_prev_campaign 13564 non-null int64
16 prev_campaign_outcome 13564 non-null object
dtypes: float64(5), int64(2), object(10)
memory usage: 1.8+ MB
Looks like we have a lot of nulls, mostly in days_since_prev_campaign_contact (only 5816 of 31647 train rows are non-null). :/ Otherwise pandas parsed the columns quite well.
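To put numbers on the nulls, here is a small dependency-free sketch that turns the non-null counts from the train summary above into missing shares. The dictionary is typed in by hand from that output, and columns with no nulls are left out.

```python
# Non-null counts copied from the train summary above (31647 rows in total).
N_ROWS = 31647
non_null_counts = {
    "customer_age": 31028,
    "marital": 31497,
    "balance": 31248,
    "personal_loan": 31498,
    "last_contact_duration": 31336,
    "num_contacts_in_campaign": 31535,
    "days_since_prev_campaign_contact": 5816,
}

# Share of missing values per column.
missing_share = {col: 1 - count / N_ROWS for col, count in non_null_counts.items()}

# Print columns from most to least affected.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<35} {share:6.2%}")
```

With the DataFrame at hand, `train.isnull().mean()` should give the same shares directly.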
Looking at categorical columns
Because of all the categorical columns, I decided to set a baseline in CatBoost. Here are the top 5 value counts and countplots for each of them; they prove useful.
# print top 5 values and plot data wrt target variable (term deposit subscribed)
for col in cat_cols:
    print(f'Analysing: {col}\nTrain top 5 counts:')
    print(train[col].value_counts().head(5))
    print('Test top 5 counts:')
    print(test[col].value_counts().head(5))
    sns.countplot(x=col, hue=TARGET_COL, data=train)
    plt.show()
Analysing: job_type
Train top 5 counts:
blue-collar 6816
management 6666
technician 5220
admin. 3627
services 2923
Name: job_type, dtype: int64
Test top 5 counts:
blue-collar 2916
management 2792
technician 2377
admin. 1544
services 1231
Name: job_type, dtype: int64
Analysing: marital
Train top 5 counts:
married 18945
single 8857
divorced 3695
Name: marital, dtype: int64
Test top 5 counts:
married 8123
single 3869
divorced 1491
Name: marital, dtype: int64
Analysing: education
Train top 5 counts:
secondary 16247
tertiary 9321
primary 4787
unknown 1292
Name: education, dtype: int64
Test top 5 counts:
secondary 6955
tertiary 3980
primary 2064
unknown 565
Name: education, dtype: int64
Analysing: default
Train top 5 counts:
no 31094
yes 553
Name: default, dtype: int64
Test top 5 counts:
no 13302
yes 262
Name: default, dtype: int64
Analysing: housing_loan
Train top 5 counts:
yes 17700
no 13947
Name: housing_loan, dtype: int64
Test top 5 counts:
yes 7430
no 6134
Name: housing_loan, dtype: int64
Analysing: personal_loan
Train top 5 counts:
no 26463
yes 5035
Name: personal_loan, dtype: int64
Test top 5 counts:
no 11314
yes 2176
Name: personal_loan, dtype: int64
Analysing: communication_type
Train top 5 counts:
cellular 20480
unknown 9151
telephone 2016
Name: communication_type, dtype: int64
Test top 5 counts:
cellular 8805
unknown 3869
telephone 890
Name: communication_type, dtype: int64
Analysing: month
Train top 5 counts:
may 9685
jul 4786
aug 4308
jun 3746
nov 2801
Name: month, dtype: int64
Test top 5 counts:
may 4081
jul 2109
aug 1939
jun 1595
nov 1169
Name: month, dtype: int64
Analysing: prev_campaign_outcome
Train top 5 counts:
unknown 25833
failure 3472
other 1272
success 1070
Name: prev_campaign_outcome, dtype: int64
Test top 5 counts:
unknown 11126
failure 1429
other 568
success 441
Name: prev_campaign_outcome, dtype: int64
Here I am interested in the ratio of the target variable within each category. If a category's ratio differs a lot from the others, that category carries a useful signal.
Subscribers look to be mostly married managers without a default, with no housing loan and no personal loan, contacted by cell phone.
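The ratio idea above can be made explicit by computing the share of positives per category. This is a minimal sketch on hypothetical toy rows (not taken from the competition data), and the helper `target_rate_by_category` is an illustrative name, not an existing function.

```python
from collections import defaultdict


def target_rate_by_category(rows, category_key, target_key):
    """Return {category: share of rows whose target equals 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[category_key]] += 1
        positives[row[category_key]] += row[target_key]
    return {category: positives[category] / totals[category] for category in totals}


# Hypothetical toy rows, NOT taken from the competition data.
toy_rows = [
    {"prev_campaign_outcome": "success", "term_deposit_subscribed": 1},
    {"prev_campaign_outcome": "success", "term_deposit_subscribed": 1},
    {"prev_campaign_outcome": "success", "term_deposit_subscribed": 0},
    {"prev_campaign_outcome": "failure", "term_deposit_subscribed": 0},
    {"prev_campaign_outcome": "failure", "term_deposit_subscribed": 0},
    {"prev_campaign_outcome": "failure", "term_deposit_subscribed": 1},
    {"prev_campaign_outcome": "failure", "term_deposit_subscribed": 0},
]

rates = target_rate_by_category(toy_rows, "prev_campaign_outcome", "term_deposit_subscribed")
overall = sum(row["term_deposit_subscribed"] for row in toy_rows) / len(toy_rows)

# A lift far from 1.0 marks a category that separates the classes well.
for category, rate in rates.items():
    print(f"{category}: rate={rate:.2f}, lift={rate / overall:.2f}")
```

On the real data, `train.groupby(col)[TARGET_COL].mean()` should produce the same per-category rates in one line.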
Analysis of continuous variables
I plotted kernel density estimates and boxplots by target variable for each continuous variable to look for interesting patterns.
# plot kernel density plot and a boxplot of data wrt target variable (term deposit subscribed)
for col in num_cols:
    print(f'Analysing: {col}')
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,5))
    sns.kdeplot(train[col], ax=ax1)
    sns.boxplot(x = train[TARGET_COL], y = train[col], ax=ax2)
Analysing: customer_age
Analysing: balance
Analysing: day_of_month
Analysing: last_contact_duration
Analysing: num_contacts_in_campaign
Analysing: days_since_prev_campaign_contact
Analysing: num_contacts_prev_campaign
Last contact duration and days since the previous campaign contact seem to have an effect, as does day of month.
Three variables look heavily right-skewed (roughly exponential), so let’s plot them log-transformed to see their relationships properly.
for col in ['balance', 'last_contact_duration', 'num_contacts_prev_campaign']:
    # plot kernel density plot and a boxplot of data wrt target variable (term deposit subscribed)
    print(f'Analysing: {col}')
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,5))
    sns.kdeplot(np.log1p(train[col]), ax=ax1)
    sns.boxplot(x = train[TARGET_COL], y = np.log1p(train[col]), ax=ax2)
Analysing: balance
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: divide by zero encountered in log1p
result = getattr(ufunc, method)(*inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in log1p
result = getattr(ufunc, method)(*inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:306: UserWarning: Dataset has 0 variance; skipping density estimate.
warnings.warn(msg, UserWarning)
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: divide by zero encountered in log1p
result = getattr(ufunc, method)(*inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in log1p
result = getattr(ufunc, method)(*inputs, **kwargs)
Analysing: last_contact_duration
Analysing: num_contacts_prev_campaign
Looks like the balance column has some invalid observations, probably negative balances causing issues: np.log1p returns -inf at -1 and NaN for anything below -1, which matches the warnings above.
with 0 target variable has lots of outliers, quite a strange distribution - worth investigating in the future.
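One possible way around the log1p warnings for negative balances is a signed log transform, sign(x) * log(1 + |x|), which is defined for every real number. This is a swapped-in technique for illustration, not something the notebook itself uses, and the balances below are hypothetical.

```python
import math


def signed_log1p(value):
    """Symmetric log transform: sign(x) * log(1 + |x|), defined for all real x."""
    return math.copysign(math.log1p(abs(value)), value)


# Hypothetical balances, including negative (overdrawn) ones.
for balance in [-2500.0, -1.0, 0.0, 285.0, 7136.0]:
    print(f"{balance:>8.1f} -> {signed_log1p(balance):+.4f}")
```

The vectorised equivalent would be `np.sign(x) * np.log1p(np.abs(x))`. It keeps the ordering of values and compresses both tails, at the cost of no longer being a plain logarithm.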
Let’s try some bivariate analysis.
# correlation heatmap
# not that useful for classification, especially with GBDTs
# since DT-models are not influenced by multi-collinearity
plt.figure(figsize=(22, 8))
sns.heatmap(train[num_cols].corr(), annot=True);
# pairplots => these always take long to render
CPU times: user 11.2 s, sys: 161 ms, total: 11.4 s
Wall time: 11.3 s
<seaborn.axisgrid.PairGrid at 0x7f996f7fe550>
Baseline Model
Alright, after EDA of all variables, it’s time to introduce the CatBoostClassifier model with no tuning as a baseline.
# data preparation
y = train[TARGET_COL].values
X = train.drop([TARGET_COL, ID_COL], axis=1)
| | customer_age | job_type | marital | education | default | balance | housing_loan | personal_loan | communication_type | day_of_month | month | last_contact_duration | num_contacts_in_campaign | days_since_prev_campaign_contact | num_contacts_prev_campaign | prev_campaign_outcome |
| 0 | 28.0 | management | single | tertiary | no | 285.0 | yes | no | unknown | 26 | jun | 303.0 | 4.0 | NaN | 0 | unknown |
| 1 | 34.0 | blue-collar | married | secondary | no | 934.0 | no | yes | cellular | 18 | nov | 143.0 | 2.0 | 132.0 | 1 | other |
| 2 | 46.0 | technician | married | secondary | no | 656.0 | no | no | cellular | 5 | feb | 101.0 | 4.0 | NaN | 0 | unknown |
| 3 | 34.0 | services | single | secondary | no | 2.0 | yes | no | unknown | 20 | may | 127.0 | 3.0 | NaN | 0 | unknown |
| 4 | 41.0 | blue-collar | married | primary | no | 1352.0 | yes | no | cellular | 13 | may | 49.0 | 2.0 | NaN | 0 | unknown |
# categorical features reminder
# fillnas and convert to right data types
X_filled = X.copy()
X_filled['marital'] = X['marital'].fillna('NA')
X_filled['personal_loan'] = X['personal_loan'].fillna('NA')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 job_type 31647 non-null object
1 marital 31497 non-null object
2 education 31647 non-null object
3 default 31647 non-null object
4 housing_loan 31647 non-null object
5 personal_loan 31498 non-null object
6 communication_type 31647 non-null object
7 month 31647 non-null object
8 prev_campaign_outcome 31647 non-null object
dtypes: object(9)
memory usage: 2.2+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 job_type 31647 non-null object
1 marital 31647 non-null object
2 education 31647 non-null object
3 default 31647 non-null object
4 housing_loan 31647 non-null object
5 personal_loan 31647 non-null object
6 communication_type 31647 non-null object
7 month 31647 non-null object
8 prev_campaign_outcome 31647 non-null object
dtypes: object(9)
memory usage: 2.2+ MB
# import train test split, then split the data into train and test set
# cross validation is not included in the baseline => model could overfit
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X_filled, y, train_size=0.8, random_state=SEED, shuffle=True, stratify=y)
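Since a single hold-out split can be lucky or unlucky, a stratified k-fold scheme is the usual next step after a baseline. Below is a minimal, dependency-free sketch of the idea on hypothetical labels. The function `stratified_kfold_indices` is written for this illustration only; in practice scikit-learn's `StratifiedKFold` covers this.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=2021):
    """Yield (train_idx, valid_idx) pairs whose class ratios mirror the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    for k in range(n_splits):
        valid_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, valid_idx


# Hypothetical labels with roughly the 89/11 imbalance seen in the target.
toy_labels = [0] * 89 + [1] * 11
for fold, (train_idx, valid_idx) in enumerate(stratified_kfold_indices(toy_labels)):
    positives = sum(toy_labels[i] for i in valid_idx)
    print(f"fold {fold}: {len(valid_idx)} validation rows, {positives} positives")
```

Each row lands in exactly one validation fold, and every fold keeps roughly the same share of positives, so the F1 scores of the folds are comparable.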
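model = CatBoostClassifier(
    random_seed=SEED,  # set seed for reproducibility
    eval_metric='F1',  # set the same metric as in the competition
    task_type='GPU'  # GPU makes the training a lot faster!
)
model.fit(
    X_train, y_train,
    cat_features=cat_cols,  # assumed: object columns have to be declared as categorical
    eval_set=(X_validation, y_validation),
    verbose=50  # assumed: the log below reports every 50 iterations
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())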
Learning rate set to 0.054105
0: learn: 0.1836798 test: 0.2494226 best: 0.2494226 (0) total: 90.6ms remaining: 1m 30s
50: learn: 0.3425693 test: 0.3567568 best: 0.3567568 (50) total: 2.65s remaining: 49.2s
100: learn: 0.5266774 test: 0.5201794 best: 0.5219731 (99) total: 5.05s remaining: 44.9s
150: learn: 0.5510159 test: 0.5511811 best: 0.5514834 (149) total: 7.47s remaining: 42s
200: learn: 0.5628743 test: 0.5553633 best: 0.5553633 (199) total: 9.9s remaining: 39.4s
250: learn: 0.5724382 test: 0.5559380 best: 0.5593804 (212) total: 12.4s remaining: 36.9s
300: learn: 0.5798634 test: 0.5577417 best: 0.5593804 (212) total: 14.8s remaining: 34.4s
350: learn: 0.5963222 test: 0.5629252 best: 0.5653650 (327) total: 17.2s remaining: 31.8s
400: learn: 0.6023570 test: 0.5677966 best: 0.5711864 (374) total: 19.6s remaining: 29.3s
450: learn: 0.6075619 test: 0.5673158 best: 0.5711864 (374) total: 22s remaining: 26.8s
500: learn: 0.6126867 test: 0.5663567 best: 0.5711864 (374) total: 24.3s remaining: 24.2s
550: learn: 0.6154179 test: 0.5661331 best: 0.5711864 (374) total: 26.5s remaining: 21.6s
600: learn: 0.6176152 test: 0.5682968 best: 0.5728728 (579) total: 28.9s remaining: 19.2s
650: learn: 0.6210777 test: 0.5757576 best: 0.5764706 (643) total: 31.1s remaining: 16.7s
700: learn: 0.6214054 test: 0.5719092 best: 0.5767285 (654) total: 33.3s remaining: 14.2s
750: learn: 0.6238651 test: 0.5755274 best: 0.5767285 (654) total: 35.5s remaining: 11.8s
800: learn: 0.6262408 test: 0.5752961 best: 0.5789030 (792) total: 37.7s remaining: 9.37s
850: learn: 0.6271626 test: 0.5748098 best: 0.5789030 (792) total: 39.9s remaining: 6.99s
900: learn: 0.6293253 test: 0.5765004 best: 0.5789030 (792) total: 42.3s remaining: 4.64s
950: learn: 0.6307592 test: 0.5736041 best: 0.5789030 (792) total: 44.5s remaining: 2.29s
999: learn: 0.6333046 test: 0.5738397 best: 0.5789030 (792) total: 46.6s remaining: 0us
bestTest = 0.5789029536
bestIteration = 792
Shrink model to first 793 iterations.
Model is fitted: True
Model params:
{'task_type': 'GPU', 'eval_metric': 'F1', 'random_seed': 2021}
print('Tree count: ' + str(model.tree_count_))
Tree count: 793
| | Feature Id | Importances |
| 0 | last_contact_duration | 45.840995 |
| 1 | month | 13.903198 |
| 2 | communication_type | 9.759939 |
| 3 | job_type | 9.652283 |
| 4 | prev_campaign_outcome | 5.413135 |
| 5 | housing_loan | 4.969356 |
| 6 | balance | 2.261329 |
| 7 | marital | 1.983482 |
| 8 | customer_age | 1.673719 |
| 9 | education | 1.343537 |
| 10 | day_of_month | 1.009644 |
| 11 | days_since_prev_campaign_contact | 0.970517 |
| 12 | num_contacts_in_campaign | 0.605712 |
| 13 | personal_loan | 0.584974 |
| 14 | num_contacts_prev_campaign | 0.028181 |
| 15 | default | 0.000000 |
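The importances above sum to roughly 100, so they can be read as percentage shares. The sketch below, with values typed in from the table, checks that and counts how many top features are needed to reach a given cumulative share. `features_needed` is a hypothetical helper written for this illustration.

```python
# Importances copied from the table above.
importances = {
    "last_contact_duration": 45.840995,
    "month": 13.903198,
    "communication_type": 9.759939,
    "job_type": 9.652283,
    "prev_campaign_outcome": 5.413135,
    "housing_loan": 4.969356,
    "balance": 2.261329,
    "marital": 1.983482,
    "customer_age": 1.673719,
    "education": 1.343537,
    "day_of_month": 1.009644,
    "days_since_prev_campaign_contact": 0.970517,
    "num_contacts_in_campaign": 0.605712,
    "personal_loan": 0.584974,
    "num_contacts_prev_campaign": 0.028181,
    "default": 0.0,
}


def features_needed(importance_by_feature, target_share):
    """Count the top-ranked features needed to reach target_share (in percent)."""
    cumulative = 0.0
    ranked = sorted(importance_by_feature.values(), reverse=True)
    for count, value in enumerate(ranked, start=1):
        cumulative += value
        if cumulative >= target_share:
            return count
    return len(ranked)


total = sum(importances.values())
print(f"sum of importances: {total:.4f}")
print(f"features needed for 50%: {features_needed(importances, 50)}")
print(f"features needed for 80%: {features_needed(importances, 80)}")
```

Keep in mind that these shares describe this one fitted baseline only; they can shift with a different seed, split, or tuning.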
X_test = test.drop([ID_COL], axis=1)
| | customer_age | job_type | marital | education | default | balance | housing_loan | personal_loan | communication_type | day_of_month | month | last_contact_duration | num_contacts_in_campaign | days_since_prev_campaign_contact | num_contacts_prev_campaign | prev_campaign_outcome |
| 0 | 55.0 | retired | married | tertiary | no | 7136.0 | no | no | cellular | 13 | aug | 90.0 | 2.0 | NaN | 0 | unknown |
| 1 | 24.0 | blue-collar | single | secondary | no | 179.0 | yes | no | cellular | 18 | may | 63.0 | 2.0 | NaN | 0 | unknown |
| 2 | 46.0 | technician | divorced | secondary | no | 143.0 | no | no | cellular | 8 | jul | 208.0 | 1.0 | NaN | 0 | unknown |
| 3 | 56.0 | housemaid | single | unknown | no | 6023.0 | no | no | unknown | 6 | jun | 34.0 | 1.0 | NaN | 0 | unknown |
| 4 | 62.0 | retired | married | secondary | no | 2913.0 | no | no | cellular | 12 | apr | 127.0 | 1.0 | 188.0 | 1 | success |
# fillnas and convert to right data types TEST
X_test_filled = X_test.copy()
X_test_filled['marital'] = X_test['marital'].fillna('NA')
X_test_filled['personal_loan'] = X_test['personal_loan'].fillna('NA')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 job_type 13564 non-null object
1 marital 13483 non-null object
2 education 13564 non-null object
3 default 13564 non-null object
4 housing_loan 13564 non-null object
5 personal_loan 13490 non-null object
6 communication_type 13564 non-null object
7 month 13564 non-null object
8 prev_campaign_outcome 13564 non-null object
dtypes: object(9)
memory usage: 953.8+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 job_type 13564 non-null object
1 marital 13564 non-null object
2 education 13564 non-null object
3 default 13564 non-null object
4 housing_loan 13564 non-null object
5 personal_loan 13564 non-null object
6 communication_type 13564 non-null object
7 month 13564 non-null object
8 prev_campaign_outcome 13564 non-null object
dtypes: object(9)
memory usage: 953.8+ KB
contest_predictions = model.predict(X_test_filled)
[0 0 0 ... 0 0 0]
ss[TARGET_COL] = contest_predictions.astype(np.int16)
| | id | term_deposit_subscribed |
| 0 | id_17231 | 0 |
| 1 | id_34508 | 0 |
| 2 | id_44504 | 0 |
| 3 | id_174 | 0 |
| 4 | id_2115 | 0 |
ss.to_csv("Catboost_Baseline.csv", index=False)
# and we're done!