HackLive - Guided Community Hackathon

13 minute read

Link to competition here!

Go there and register to be able to download the dataset and submit your predictions. Click the button below to open this notebook in Google Colab!

Open In Colab

Marketing campaigns are characterized by a focus on customer needs and overall satisfaction. Nevertheless, several variables determine whether a marketing campaign will be successful. Some important aspects of a marketing campaign are as follows:

  • Segment of the population: Which segment of the population is the marketing campaign going to address, and why? This aspect is extremely important since it determines which part of the population is most likely to receive the campaign's message.

  • Distribution channel to reach the customer: Implementing the most effective strategy to get the most out of the campaign. Which segment of the population should we address, and which channel should we use to get our message out (e.g., telephone, radio, TV, social media)?

  • Promotional strategy: How the strategy is going to be implemented and how potential clients are going to be addressed. This should be the last part of the campaign analysis, since it requires an in-depth analysis of previous campaigns (if available) to learn from past mistakes and make the new campaign more effective.

You are leading the marketing analytics team for a banking institution. There has been a revenue decline for the bank and they would like to know what actions to take. After investigation, it was found that the root cause is that their clients are not depositing as frequently as before. Term deposits allow a bank to hold onto a deposit for a specific amount of time, so the bank can lend more and thus make more profit. In addition, banks have a better chance of persuading term deposit clients to buy other products, such as funds or insurance, to further increase their revenue.

You are provided a dataset of phone-based marketing campaigns with various customer details such as demographics, last campaign details, etc. Can you help the bank accurately predict whether a customer will subscribe to the campaign's focus product, a term deposit, after the campaign?

!pip install catboost
Requirement already satisfied: catboost in /usr/local/lib/python3.6/dist-packages (0.24.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.1.5)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
# import useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

from catboost import CatBoostClassifier
# load in data and set seed
BASE = 'https://drive.google.com/uc?export=download&id='
SEED = 2021

train = pd.read_csv(f'{BASE}1fNjtZDxlQwwAE5VY7BBJODw7an-Lbob2')
test = pd.read_csv(f'{BASE}1VJUp6Zuww-OphdWBqI5Q2TRK7o1Xh_xn')
ss = pd.read_csv(f'{BASE}19P8qo-6_sykC6uTJQ60eyfmcbYpu0GtR')
# prepare a few key variables to classify columns into categorical and numeric
ID_COL, TARGET_COL = 'id', 'term_deposit_subscribed'

features = [c for c in train.columns if c not in [ID_COL, TARGET_COL]]

cat_cols = ['job_type',
            'marital',
            'education',
            'default',
            'housing_loan',
            'personal_loan',
            'communication_type',
            'month',
            'prev_campaign_outcome']

num_cols = [c for c in features if c not in cat_cols]
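
A quick optional sanity check (my addition, a sketch) that the split covers every feature exactly once:

# every feature should land in one of the two groups
assert set(cat_cols) | set(num_cols) == set(features)
print(f'{len(cat_cols)} categorical + {len(num_cols)} numeric = {len(features)} features')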

EDA starts

First, we look at the first few rows of the train dataset.

train.head(3)
id customer_age job_type marital education default balance housing_loan personal_loan communication_type day_of_month month last_contact_duration num_contacts_in_campaign days_since_prev_campaign_contact num_contacts_prev_campaign prev_campaign_outcome term_deposit_subscribed
0 id_43823 28.0 management single tertiary no 285.0 yes no unknown 26 jun 303.0 4.0 NaN 0 unknown 0
1 id_32289 34.0 blue-collar married secondary no 934.0 no yes cellular 18 nov 143.0 2.0 132.0 1 other 0
2 id_10523 46.0 technician married secondary no 656.0 no no cellular 5 feb 101.0 4.0 NaN 0 unknown 0
ss.head(3)
id term_deposit_subscribed
0 id_17231 0
1 id_34508 0
2 id_44504 0
# look at distribution of target variable
train[TARGET_COL].value_counts(), train[TARGET_COL].value_counts(normalize=True)
(0    28253
 1     3394
 Name: term_deposit_subscribed, dtype: int64, 0    0.892754
 1    0.107246
 Name: term_deposit_subscribed, dtype: float64)
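
The target is heavily imbalanced (~10.7% positives). The baseline below does not correct for this, but one hedged option would be to weight the positive class, e.g. via CatBoost's scale_pos_weight parameter (a sketch only, not used later; whether it actually helps F1 here would need validating):

# not used in the baseline: weight the positive class to counter the ~9:1 imbalance
counts = train[TARGET_COL].value_counts()
pos_weight = counts[0] / counts[1]  # roughly 8.3 for this train set
weighted_model = CatBoostClassifier(random_seed=SEED, eval_metric='F1',
                                    scale_pos_weight=pos_weight)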
# look at which variables are null and if they were parsed correctly
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                31647 non-null  object 
 1   customer_age                      31028 non-null  float64
 2   job_type                          31647 non-null  object 
 3   marital                           31497 non-null  object 
 4   education                         31647 non-null  object 
 5   default                           31647 non-null  object 
 6   balance                           31248 non-null  float64
 7   housing_loan                      31647 non-null  object 
 8   personal_loan                     31498 non-null  object 
 9   communication_type                31647 non-null  object 
 10  day_of_month                      31647 non-null  int64  
 11  month                             31647 non-null  object 
 12  last_contact_duration             31336 non-null  float64
 13  num_contacts_in_campaign          31535 non-null  float64
 14  days_since_prev_campaign_contact  5816 non-null   float64
 15  num_contacts_prev_campaign        31647 non-null  int64  
 16  prev_campaign_outcome             31647 non-null  object 
 17  term_deposit_subscribed           31647 non-null  int64  
dtypes: float64(5), int64(3), object(10)
memory usage: 4.3+ MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                                13564 non-null  object 
 1   customer_age                      13294 non-null  float64
 2   job_type                          13564 non-null  object 
 3   marital                           13483 non-null  object 
 4   education                         13564 non-null  object 
 5   default                           13564 non-null  object 
 6   balance                           13383 non-null  float64
 7   housing_loan                      13564 non-null  object 
 8   personal_loan                     13490 non-null  object 
 9   communication_type                13564 non-null  object 
 10  day_of_month                      13564 non-null  int64  
 11  month                             13564 non-null  object 
 12  last_contact_duration             13442 non-null  float64
 13  num_contacts_in_campaign          13519 non-null  float64
 14  days_since_prev_campaign_contact  2441 non-null   float64
 15  num_contacts_prev_campaign        13564 non-null  int64  
 16  prev_campaign_outcome             13564 non-null  object 
dtypes: float64(5), int64(2), object(10)
memory usage: 1.8+ MB

Looks like we have a lot of nulls. :/ Otherwise pandas parsed the columns quite well.
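
To quantify the missingness quickly, a small sketch comparing train and test:

# share of missing values per column, train vs test
null_share = pd.DataFrame({'train': train.isnull().mean(),
                           'test': test.isnull().mean()})
print(null_share.sort_values('train', ascending=False).head(10))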

Looking at categorical columns

Because of all the categorical columns, I decided to set a baseline with CatBoost. Here are the top 5 value counts and a countplot for each of them; they prove useful.

# print top 5 values and plot data wrt target variable (term deposit subscribed)
for col in cat_cols:
  print(f'Analysing: {col}\nTrain top 5 counts:')
  print(train[col].value_counts().head(5))
  print('Test top 5 counts:')
  print(test[col].value_counts().head(5))
  plt.figure(figsize=(20, 5))
  sns.countplot(x=col, hue=TARGET_COL, data=train)
  plt.show()
  print('\n')
Analysing: job_type
Train top 5 counts:
blue-collar    6816
management     6666
technician     5220
admin.         3627
services       2923
Name: job_type, dtype: int64
Test top 5 counts:
blue-collar    2916
management     2792
technician     2377
admin.         1544
services       1231
Name: job_type, dtype: int64

[countplot of job_type split by term_deposit_subscribed]

Analysing: marital
Train top 5 counts:
married     18945
single       8857
divorced     3695
Name: marital, dtype: int64
Test top 5 counts:
married     8123
single      3869
divorced    1491
Name: marital, dtype: int64

[countplot of marital split by term_deposit_subscribed]

Analysing: education
Train top 5 counts:
secondary    16247
tertiary      9321
primary       4787
unknown       1292
Name: education, dtype: int64
Test top 5 counts:
secondary    6955
tertiary     3980
primary      2064
unknown       565
Name: education, dtype: int64

[countplot of education split by term_deposit_subscribed]

Analysing: default
Train top 5 counts:
no     31094
yes      553
Name: default, dtype: int64
Test top 5 counts:
no     13302
yes      262
Name: default, dtype: int64

[countplot of default split by term_deposit_subscribed]

Analysing: housing_loan
Train top 5 counts:
yes    17700
no     13947
Name: housing_loan, dtype: int64
Test top 5 counts:
yes    7430
no     6134
Name: housing_loan, dtype: int64

[countplot of housing_loan split by term_deposit_subscribed]

Analysing: personal_loan
Train top 5 counts:
no     26463
yes     5035
Name: personal_loan, dtype: int64
Test top 5 counts:
no     11314
yes     2176
Name: personal_loan, dtype: int64

[countplot of personal_loan split by term_deposit_subscribed]

Analysing: communication_type
Train top 5 counts:
cellular     20480
unknown       9151
telephone     2016
Name: communication_type, dtype: int64
Test top 5 counts:
cellular     8805
unknown      3869
telephone     890
Name: communication_type, dtype: int64

[countplot of communication_type split by term_deposit_subscribed]

Analysing: month
Train top 5 counts:
may    9685
jul    4786
aug    4308
jun    3746
nov    2801
Name: month, dtype: int64
Test top 5 counts:
may    4081
jul    2109
aug    1939
jun    1595
nov    1169
Name: month, dtype: int64

[countplot of month split by term_deposit_subscribed]

Analysing: prev_campaign_outcome
Train top 5 counts:
unknown    25833
failure     3472
other       1272
success     1070
Name: prev_campaign_outcome, dtype: int64
Test top 5 counts:
unknown    11126
failure     1429
other        568
success      441
Name: prev_campaign_outcome, dtype: int64

[countplot of prev_campaign_outcome split by term_deposit_subscribed]

Observations

Here I am interested in the ratio of the target variable within each category. If it differs a lot from the other categories' ratios, that category conveys a useful signal.

Mostly married customers in blue-collar or management jobs, without a default. Most have a housing loan but no personal loan. Contacted by cell phone.
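
To put numbers on those ratios, a minimal sketch (using the train and cat_cols defined earlier), shown here for two of the columns:

# subscription rate per category value, e.g. for two of the categorical columns
for col in ['prev_campaign_outcome', 'communication_type']:
    print(train.groupby(col)[TARGET_COL].mean().sort_values(ascending=False), '\n')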

Analysis of continuous variables

I plotted a kernel density estimate and a boxplot by target variable for each continuous variable to draw out interesting insights.

# plot kernel density plot and a boxplot of data wrt target variable (term deposit subscribed)
for col in num_cols:
  print(f'Analysing: {col}')
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
  sns.kdeplot(train[col], ax=ax1)
  sns.boxplot(x=train[TARGET_COL], y=train[col], ax=ax2)
  plt.show()
  print('\n')
Analysing: customer_age

[KDE and boxplot of customer_age by term_deposit_subscribed]

Analysing: balance

[KDE and boxplot of balance by term_deposit_subscribed]

Analysing: day_of_month

[KDE and boxplot of day_of_month by term_deposit_subscribed]

Analysing: last_contact_duration

[KDE and boxplot of last_contact_duration by term_deposit_subscribed]

Analysing: num_contacts_in_campaign

[KDE and boxplot of num_contacts_in_campaign by term_deposit_subscribed]

Analysing: days_since_prev_campaign_contact

[KDE and boxplot of days_since_prev_campaign_contact by term_deposit_subscribed]

Analysing: num_contacts_prev_campaign

[KDE and boxplot of num_contacts_prev_campaign by term_deposit_subscribed]

Observations

Last contact duration and days since previous campaign contact seem to have an effect, as does day of month.
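
To back that up with a quick number, a sketch comparing medians by target for those columns:

# median of the 'interesting' numeric columns, split by target
cols_of_interest = ['last_contact_duration', 'days_since_prev_campaign_contact', 'day_of_month']
print(train.groupby(TARGET_COL)[cols_of_interest].median())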

Three variables look roughly exponentially distributed (heavily right-skewed); let's plot them log-transformed to see their relationships more clearly.

for col in ['balance', 'last_contact_duration', 'num_contacts_prev_campaign']:
  # plot kernel density plot and a boxplot of data wrt target variable (term deposit subscribed)
  print(f'Analysing: {col}')
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
  sns.kdeplot(np.log1p(train[col]), ax=ax1)
  sns.boxplot(x=train[TARGET_COL], y=np.log1p(train[col]), ax=ax2)
  plt.show()
  print('\n')
Analysing: balance


/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: divide by zero encountered in log1p
  result = getattr(ufunc, method)(*inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in log1p
  result = getattr(ufunc, method)(*inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:306: UserWarning: Dataset has 0 variance; skipping density estimate.
  warnings.warn(msg, UserWarning)
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: divide by zero encountered in log1p
  result = getattr(ufunc, method)(*inputs, **kwargs)
/usr/local/lib/python3.6/dist-packages/pandas/core/series.py:726: RuntimeWarning: invalid value encountered in log1p
  result = getattr(ufunc, method)(*inputs, **kwargs)

[KDE and boxplot of log1p(balance) by term_deposit_subscribed]

Analysing: last_contact_duration

[KDE and boxplot of log1p(last_contact_duration) by term_deposit_subscribed]

Analysing: num_contacts_prev_campaign

[KDE and boxplot of log1p(num_contacts_prev_campaign) by term_deposit_subscribed]

Observations

Looks like the balance column has some problematic observations => negative balances break log1p, hence the warnings above.

num_contacts_prev_campaign for target 0 has lots of outliers and quite a strange distribution - worth investigating in the future.
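
One hedged workaround for the balance issue is a signed log transform, which stays defined for negative values (a sketch only, not used elsewhere in this notebook):

# signed log: keep the sign, compress the magnitude; well-defined for negative balances
balance = train['balance'].dropna()
signed_log_balance = np.sign(balance) * np.log1p(balance.abs())
sns.kdeplot(signed_log_balance)
plt.show()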

Let’s try some bivariate analysis.

# correlation heatmap
# not that useful for classification, especially with GBDTs
# since DT-models are not influenced by multi-collinearity
plt.figure(figsize=(22, 8))
sns.heatmap(train[num_cols].corr(), annot=True);

[correlation heatmap of numeric columns]

%%time
# pairplots => these always take long to render
sns.pairplot(train[num_cols]);
CPU times: user 11.2 s, sys: 161 ms, total: 11.4 s
Wall time: 11.3 s





<seaborn.axisgrid.PairGrid at 0x7f996f7fe550>

[pairplot of numeric columns]

Baseline Model

Alright, after EDA of all variables, it's time to introduce the CatBoostClassifier model with no tuning as a baseline.

# data preparation
y = train[TARGET_COL].values
X = train.drop([TARGET_COL, ID_COL], axis=1)
X.head()
customer_age job_type marital education default balance housing_loan personal_loan communication_type day_of_month month last_contact_duration num_contacts_in_campaign days_since_prev_campaign_contact num_contacts_prev_campaign prev_campaign_outcome
0 28.0 management single tertiary no 285.0 yes no unknown 26 jun 303.0 4.0 NaN 0 unknown
1 34.0 blue-collar married secondary no 934.0 no yes cellular 18 nov 143.0 2.0 132.0 1 other
2 46.0 technician married secondary no 656.0 no no cellular 5 feb 101.0 4.0 NaN 0 unknown
3 34.0 services single secondary no 2.0 yes no unknown 20 may 127.0 3.0 NaN 0 unknown
4 41.0 blue-collar married primary no 1352.0 yes no cellular 13 may 49.0 2.0 NaN 0 unknown
# categorical features reminder
cat_cols
['job_type',
 'marital',
 'education',
 'default',
 'housing_loan',
 'personal_loan',
 'communication_type',
 'month',
 'prev_campaign_outcome']
# fill NAs in the categorical columns (CatBoost does not accept NaN in cat features; numeric NaNs are handled natively)
print(X[cat_cols].info())

X_filled = X.copy()
X_filled['marital'] = X['marital'].fillna('NA')
X_filled['personal_loan'] = X['personal_loan'].fillna('NA')

X_filled[cat_cols].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   job_type               31647 non-null  object
 1   marital                31497 non-null  object
 2   education              31647 non-null  object
 3   default                31647 non-null  object
 4   housing_loan           31647 non-null  object
 5   personal_loan          31498 non-null  object
 6   communication_type     31647 non-null  object
 7   month                  31647 non-null  object
 8   prev_campaign_outcome  31647 non-null  object
dtypes: object(9)
memory usage: 2.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31647 entries, 0 to 31646
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   job_type               31647 non-null  object
 1   marital                31647 non-null  object
 2   education              31647 non-null  object
 3   default                31647 non-null  object
 4   housing_loan           31647 non-null  object
 5   personal_loan          31647 non-null  object
 6   communication_type     31647 non-null  object
 7   month                  31647 non-null  object
 8   prev_campaign_outcome  31647 non-null  object
dtypes: object(9)
memory usage: 2.2+ MB
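
The same two fillna calls are repeated for the test set further down; a small hypothetical helper could avoid the duplication (just a sketch, not used below):

# hypothetical helper: fill the categorical NaNs so CatBoost accepts the columns
def fill_cats(df, cols=('marital', 'personal_loan'), token='NA'):
    out = df.copy()
    for c in cols:
        out[c] = out[c].fillna(token)
    return out

# usage would be: X_filled = fill_cats(X) and, later, X_test_filled = fill_cats(X_test)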
# import train_test_split, then split the data into train and validation sets
# cross validation is not included in the baseline => the model could overfit (a k-fold sketch follows the training output below)
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X_filled, y, train_size=0.8, random_state=SEED, shuffle=True, stratify=y)
model = CatBoostClassifier(
    random_seed=SEED, # set seed for reproducibility
    eval_metric='F1', # set the same metric as in the competition
    task_type='GPU'   # GPU makes the training a lot faster!
)
model.fit(
    X_train, y_train,
    cat_features=cat_cols,
    use_best_model=True,
    eval_set=(X_validation, y_validation),
    verbose=50
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())
Learning rate set to 0.054105
0:	learn: 0.1836798	test: 0.2494226	best: 0.2494226 (0)	total: 90.6ms	remaining: 1m 30s
50:	learn: 0.3425693	test: 0.3567568	best: 0.3567568 (50)	total: 2.65s	remaining: 49.2s
100:	learn: 0.5266774	test: 0.5201794	best: 0.5219731 (99)	total: 5.05s	remaining: 44.9s
150:	learn: 0.5510159	test: 0.5511811	best: 0.5514834 (149)	total: 7.47s	remaining: 42s
200:	learn: 0.5628743	test: 0.5553633	best: 0.5553633 (199)	total: 9.9s	remaining: 39.4s
250:	learn: 0.5724382	test: 0.5559380	best: 0.5593804 (212)	total: 12.4s	remaining: 36.9s
300:	learn: 0.5798634	test: 0.5577417	best: 0.5593804 (212)	total: 14.8s	remaining: 34.4s
350:	learn: 0.5963222	test: 0.5629252	best: 0.5653650 (327)	total: 17.2s	remaining: 31.8s
400:	learn: 0.6023570	test: 0.5677966	best: 0.5711864 (374)	total: 19.6s	remaining: 29.3s
450:	learn: 0.6075619	test: 0.5673158	best: 0.5711864 (374)	total: 22s	remaining: 26.8s
500:	learn: 0.6126867	test: 0.5663567	best: 0.5711864 (374)	total: 24.3s	remaining: 24.2s
550:	learn: 0.6154179	test: 0.5661331	best: 0.5711864 (374)	total: 26.5s	remaining: 21.6s
600:	learn: 0.6176152	test: 0.5682968	best: 0.5728728 (579)	total: 28.9s	remaining: 19.2s
650:	learn: 0.6210777	test: 0.5757576	best: 0.5764706 (643)	total: 31.1s	remaining: 16.7s
700:	learn: 0.6214054	test: 0.5719092	best: 0.5767285 (654)	total: 33.3s	remaining: 14.2s
750:	learn: 0.6238651	test: 0.5755274	best: 0.5767285 (654)	total: 35.5s	remaining: 11.8s
800:	learn: 0.6262408	test: 0.5752961	best: 0.5789030 (792)	total: 37.7s	remaining: 9.37s
850:	learn: 0.6271626	test: 0.5748098	best: 0.5789030 (792)	total: 39.9s	remaining: 6.99s
900:	learn: 0.6293253	test: 0.5765004	best: 0.5789030 (792)	total: 42.3s	remaining: 4.64s
950:	learn: 0.6307592	test: 0.5736041	best: 0.5789030 (792)	total: 44.5s	remaining: 2.29s
999:	learn: 0.6333046	test: 0.5738397	best: 0.5789030 (792)	total: 46.6s	remaining: 0us
bestTest = 0.5789029536
bestIteration = 792
Shrink model to first 793 iterations.
Model is fitted: True
Model params:
{'task_type': 'GPU', 'eval_metric': 'F1', 'random_seed': 2021}
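
As noted in the comment above the split, the baseline relies on a single holdout set. A minimal stratified k-fold sketch (my own addition, assuming the same X_filled, y and cat_cols; F1 computed with scikit-learn) would look roughly like this:

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

# rough 5-fold CV sketch, not run as part of the baseline
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = []
for tr_idx, va_idx in skf.split(X_filled, y):
    cv_model = CatBoostClassifier(random_seed=SEED, eval_metric='F1', verbose=0)
    cv_model.fit(X_filled.iloc[tr_idx], y[tr_idx],
                 cat_features=cat_cols,
                 eval_set=(X_filled.iloc[va_idx], y[va_idx]),
                 use_best_model=True)
    scores.append(f1_score(y[va_idx], cv_model.predict(X_filled.iloc[va_idx])))
print('Mean CV F1:', np.mean(scores))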
print('Tree count: ' + str(model.tree_count_))
Tree count: 793
model.get_feature_importance(prettified=True)
Feature Id Importances
0 last_contact_duration 45.840995
1 month 13.903198
2 communication_type 9.759939
3 job_type 9.652283
4 prev_campaign_outcome 5.413135
5 housing_loan 4.969356
6 balance 2.261329
7 marital 1.983482
8 customer_age 1.673719
9 education 1.343537
10 day_of_month 1.009644
11 days_since_prev_campaign_contact 0.970517
12 num_contacts_in_campaign 0.605712
13 personal_loan 0.584974
14 num_contacts_prev_campaign 0.028181
15 default 0.000000
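
A quick way to visualize that table (a sketch using the DataFrame returned above, with the column names shown in the output):

# bar plot of the importances table above
fi = model.get_feature_importance(prettified=True)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importances', y='Feature Id', data=fi)
plt.title('CatBoost feature importance')
plt.show()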
X_test = test.drop([ID_COL], axis=1)
X_test.head()
customer_age job_type marital education default balance housing_loan personal_loan communication_type day_of_month month last_contact_duration num_contacts_in_campaign days_since_prev_campaign_contact num_contacts_prev_campaign prev_campaign_outcome
0 55.0 retired married tertiary no 7136.0 no no cellular 13 aug 90.0 2.0 NaN 0 unknown
1 24.0 blue-collar single secondary no 179.0 yes no cellular 18 may 63.0 2.0 NaN 0 unknown
2 46.0 technician divorced secondary no 143.0 no no cellular 8 jul 208.0 1.0 NaN 0 unknown
3 56.0 housemaid single unknown no 6023.0 no no unknown 6 jun 34.0 1.0 NaN 0 unknown
4 62.0 retired married secondary no 2913.0 no no cellular 12 apr 127.0 1.0 188.0 1 success
# fill NAs in the categorical columns of the TEST set (same treatment as train)
print(X_test[cat_cols].info())

X_test_filled = X_test.copy()
X_test_filled['marital'] = X_test['marital'].fillna('NA')
X_test_filled['personal_loan'] = X_test['personal_loan'].fillna('NA')

X_test_filled[cat_cols].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   job_type               13564 non-null  object
 1   marital                13483 non-null  object
 2   education              13564 non-null  object
 3   default                13564 non-null  object
 4   housing_loan           13564 non-null  object
 5   personal_loan          13490 non-null  object
 6   communication_type     13564 non-null  object
 7   month                  13564 non-null  object
 8   prev_campaign_outcome  13564 non-null  object
dtypes: object(9)
memory usage: 953.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13564 entries, 0 to 13563
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   job_type               13564 non-null  object
 1   marital                13564 non-null  object
 2   education              13564 non-null  object
 3   default                13564 non-null  object
 4   housing_loan           13564 non-null  object
 5   personal_loan          13564 non-null  object
 6   communication_type     13564 non-null  object
 7   month                  13564 non-null  object
 8   prev_campaign_outcome  13564 non-null  object
dtypes: object(9)
memory usage: 953.8+ KB
contest_predictions = model.predict(X_test_filled)
print('Predictions:')
print(contest_predictions)
Predictions:
[0 0 0 ... 0 0 0]
ss[TARGET_COL] = contest_predictions.astype(np.int16)
ss.head()
id term_deposit_subscribed
0 id_17231 0
1 id_34508 0
2 id_44504 0
3 id_174 0
4 id_2115 0
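
Before saving, a quick sanity check (a sketch): the share of predicted positives should be in the same ballpark as the ~10.7% positive rate seen in train.

# fraction of positive predictions vs. the training positive rate
print('Predicted positive rate:', ss[TARGET_COL].mean())
print('Train positive rate:    ', train[TARGET_COL].mean())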
ss.to_csv("Catboost_Baseline.csv", index=False)
# and we're done!

'Done!'
'Done!'