HackLive II - Guided Community Hackathon
Register on the hackathon page to be able to download the dataset and submit your predictions. Click the button below to open this notebook in Google Colab!
As YouTube has become one of the most popular video-sharing platforms, being a YouTuber has emerged as a new kind of career over the past decade. YouTubers earn money through advertising revenue from their videos, sponsorships from companies, merchandise sales, and donations from their fans. To maintain a stable income, the popularity of their videos becomes the top priority. Meanwhile, some of our friends are YouTubers or channel owners on other video-sharing platforms, which sparked our interest in predicting how well a video will perform. If creators can get a preliminary prediction of, and insight into, their videos' performance, they can adjust their videos to gain the most attention from the public.
You have been provided details on videos along with some features as well. Can you accurately predict the number of likes for each video using the set of input variables?
!pip install catboost
Requirement already satisfied: catboost in /usr/local/lib/python3.6/dist-packages (0.24.4)
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: plotly in /usr/local/lib/python3.6/dist-packages (from catboost) (4.4.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.1.5)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.8.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
# import useful libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
from catboost import CatBoostRegressor
# load in data and set seed
BASE = 'https://drive.google.com/uc?export=download&id='
SEED = 2021
train = pd.read_csv(f'{BASE}1twZymRo0KT6IMIL7Q1wYoUSmHE3w3Buc')
test = pd.read_csv(f'{BASE}1Zu57FJCK4XpzX6_CzG_L4vlF9J73B9ke')
ss = pd.read_csv(f'{BASE}1s8iq0VaoTVkE9rQEAh1sfNuNdUgFjxYo')
# prepare a few key variables to classify columns into categorical and numeric
ID_COL, TARGET_COL = 'video_id', 'likes'
num_cols = ['views', 'dislikes', 'comment_count']
cat_cols = ['category_id', 'country_code']
text_cols = ['title', 'channel_title', 'tags', 'description']
date_cols = ['publish_date']
EDA starts
First we look at the first few rows of the train dataset.
train.head(3)
 | video_id | title | channel_title | category_id | publish_date | tags | views | dislikes | comment_count | description | country_code | likes
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 53364 | Alif Allah Aur Insaan Episode 34 HUM TV Drama ... | HUM TV | 24.0 | 2017-12-12 | HUM|"TV"|"Alif Allah Aur Insaan"|"Episode 34"|... | 351430.0 | 298.0 | 900.0 | Alif Allah Aur Insaan Episode 34 Full - 12 Dec... | CA | 2351.0 |
1 | 51040 | It's Showtime Miss Q & A: Bela gets jealous of... | ABS-CBN Entertainment | 24.0 | 2018-03-08 | ABS-CBN Entertainment|"ABS-CBN"|"ABS-CBN Onlin... | 461508.0 | 74.0 | 314.0 | Vice Ganda notices Bela Padilla's sudden chang... | CA | 3264.0 |
2 | 1856 | ದರ್ಶನ್ ಗೆ ಬಾರಿ ಅವಮಾನ ಮಾಡಿದ ಶಿವಣ್ಣ ನಾಯಕಿ \n ಕ್... | SANDALWOOD REVIEWS | 24.0 | 2018-03-26 | challenging star darshan latest news|"challeng... | 40205.0 | 150.0 | 100.0 | ದರ್ಶನ್ ಗೆ ಬಾರಿ ಅವಮಾನ ಮಾಡಿದ ಶಿವಣ್ಣ ನಾಯಕಿ ಕ್ಲ... | IN | 580.0 |
ss.head(3)
 | video_id | likes
---|---|---|
0 | 87185 | 0 |
1 | 9431 | 0 |
2 | 40599 | 0 |
# look at distribution of target variable
train[TARGET_COL].hist();
Lots of values near zero! We definitely need a log transform to be able to analyse this variable. Luckily, GBDT regressors handle such skewed, roughly log-normal targets reasonably well.
# look at distribution of target variable log-transformed
np.log1p(train[TARGET_COL]).hist();
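As an alternative to handling the skew inside the model, one could also fit directly on the log-transformed target and invert the transform at prediction time. A minimal sketch of that idea (not used in this notebook; restricting to the numeric columns and the model settings are illustrative assumptions):
# illustrative sketch: train on log1p(likes) and invert with expm1 at prediction time
log_model = CatBoostRegressor(loss_function='RMSE', random_seed=SEED, verbose=0)
log_model.fit(train[num_cols], np.log1p(train[TARGET_COL]))  # numeric columns only, for brevity

log_preds = np.expm1(log_model.predict(train[num_cols]))  # back to the original scale
log_preds = np.clip(log_preds, 0, None)                   # guard against tiny negative values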
# look at which variables are null and if they were parsed correctly
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26061 entries, 0 to 26060
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 video_id 26061 non-null int64
1 title 26061 non-null object
2 channel_title 26061 non-null object
3 category_id 26061 non-null float64
4 publish_date 26061 non-null object
5 tags 26061 non-null object
6 views 26061 non-null float64
7 dislikes 26061 non-null float64
8 comment_count 26061 non-null float64
9 description 26061 non-null object
10 country_code 26061 non-null object
11 likes 26061 non-null float64
dtypes: float64(5), int64(1), object(6)
memory usage: 2.4+ MB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11170 entries, 0 to 11169
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 video_id 11170 non-null int64
1 title 11170 non-null object
2 channel_title 11170 non-null object
3 category_id 11170 non-null float64
4 publish_date 11170 non-null object
5 tags 11170 non-null object
6 views 11170 non-null float64
7 dislikes 11170 non-null float64
8 comment_count 11170 non-null float64
9 description 11170 non-null object
10 country_code 11170 non-null object
dtypes: float64(4), int64(1), object(6)
memory usage: 960.0+ KB
No nulls, which is great!
Looking at categorical columns
Since there are a few categorical columns, I decided to build the baseline with CatBoost, which handles them natively. Below are the top value counts and target-variable boxplots for each of them; they prove useful.
# print top 5 value counts and plot target boxplots
for col in cat_cols:
    print(f'Analysing: {col}\nTrain top 5 counts:')
    print(train[col].value_counts().head(5))
    print('Test top 5 counts:')
    print(test[col].value_counts().head(5))
    plt.figure(figsize=(20,5))
    sns.boxplot(x=train[col], y=np.log1p(train[TARGET_COL]))
    plt.show();
    print('\n')
Analysing: category_id
Train top 5 counts:
24.0 9614
25.0 3725
22.0 2365
10.0 2099
23.0 1736
Name: category_id, dtype: int64
Test top 5 counts:
24.0 4105
25.0 1516
22.0 995
10.0 891
23.0 723
Name: category_id, dtype: int64
Analysing: country_code
Train top 5 counts:
IN 10401
CA 10326
US 3095
GB 2239
Name: country_code, dtype: int64
Test top 5 counts:
IN 4458
CA 4425
US 1327
GB 960
Name: country_code, dtype: int64
# channel title could be used as a high cardinality categorical variable
train['channel_title'].value_counts()
SAB TV 165
SET India 128
ESPN 122
Study IQ education 118
etvteluguindia 115
...
WGA West 1
PhantomStrider 1
Yarotska 1
KhalidVEVO 1
Christina Aguilera 1
Name: channel_title, Length: 5764, dtype: int64
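CatBoost can consume channel_title directly as a categorical feature, so no encoding is strictly needed here, but for models without native categorical support a simple frequency encoding is one option. A minimal sketch (the derived variables are illustrative and not used later in this notebook):
# illustrative: frequency-encode the high-cardinality channel_title column
channel_freq = train['channel_title'].value_counts()

train_channel_freq = train['channel_title'].map(channel_freq)
test_channel_freq = test['channel_title'].map(channel_freq).fillna(0)  # channels unseen in train get 0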
# same with publish date
train['publish_date'].value_counts()
2018-01-29 199
2017-12-13 185
2018-01-19 181
2018-01-26 180
2018-01-12 179
...
2017-09-26 1
2015-10-31 1
2017-09-09 1
2015-05-21 1
2017-10-20 1
Name: publish_date, Length: 348, dtype: int64
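publish_date is later fed to CatBoost as a raw categorical string; an alternative worth trying is to parse it and expose calendar parts as features. A hedged sketch (the derived column names are assumptions and are not used later in this notebook):
# illustrative: derive calendar features from publish_date instead of using it as a raw category
publish_dt = pd.to_datetime(train['publish_date'])

date_features = pd.DataFrame({
    'publish_year': publish_dt.dt.year,
    'publish_month': publish_dt.dt.month,
    'publish_dayofweek': publish_dt.dt.dayofweek,  # 0 = Monday
})
date_features.head()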
Observations
A somewhat unusual dataset; I will drop the textual features for now (the NLP template shows how to deal with them).
Indian and Canadian videos are the most common, with Canadian ones receiving more likes. There are fewer American and British videos, with British ones having the highest median likes.
Otherwise, there is something of a sine-like pattern in the likes distribution across category_id, so we could perhaps even use it as a numerical feature. Let's stick to treating it as categorical for now; a quick sketch of the numerical alternative follows below.
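A minimal sketch of the numerical treatment (purely illustrative, not used later in this notebook; the variable names are assumptions):
# illustrative: treat category_id as an ordinary numeric feature instead of a categorical one
X_numeric_cat = train[num_cols + ['category_id']].copy()
X_numeric_cat['category_id'] = X_numeric_cat['category_id'].astype(float)  # keep it numeric

quick_model = CatBoostRegressor(random_seed=SEED, verbose=0)
quick_model.fit(X_numeric_cat, np.log1p(train[TARGET_COL]))  # no cat_features passed, so it is split like any number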
Analysis of continuous variables
Let’s plot distributions and correlations for numerical variables.
# plot a histogram and a kernel density plot of each numerical variable (log-transformed), including the target
for col in num_cols+[TARGET_COL]:
    print(f'Analysing: {col}')
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,5))
    np.log1p(train[col]).hist(ax=ax1)
    sns.kdeplot(np.log1p(train[col]), ax=ax2)
    plt.show();
    print('\n')
Analysing: views
Analysing: dislikes
Analysing: comment_count
Analysing: likes
# plot correlation heatmap
plt.figure(figsize=(14, 8))
sns.heatmap(np.log1p(train[num_cols+[TARGET_COL]]).corr(), annot=True);
%%time
# pairplots => these always take long to render
sns.pairplot(np.log1p(train[num_cols+[TARGET_COL]]));
CPU times: user 4.39 s, sys: 108 ms, total: 4.49 s
Wall time: 4.42 s
<seaborn.axisgrid.PairGrid at 0x7f09a5f48208>
Observations
All numerical columns are roughly log-normal, and views and dislikes are highly correlated with each other. The target (likes) is highly correlated with all of the numerical independent variables.
We could probably get a decent prediction using just these columns, so dropping the text columns shouldn't make a massive difference for now.
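To make that claim concrete, a quick numeric-only sanity check could look like the sketch below (a plain Ridge regression on log-transformed features and target; the model choice and split are illustrative assumptions, not part of the baseline):
# illustrative sanity check: numeric columns only, simple linear model on the log scale
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error

X_num = np.log1p(train[num_cols])
y_log = np.log1p(train[TARGET_COL])

X_tr, X_va, y_tr, y_va = train_test_split(X_num, y_log, train_size=0.8, random_state=SEED)

ridge = Ridge().fit(X_tr, y_tr)
preds = np.clip(np.expm1(ridge.predict(X_va)), 0, None)  # back to the original scale, non-negative
print('Numeric-only MSLE:', mean_squared_log_error(np.expm1(y_va), preds))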
Baseline Model
Alright, after basic EDA of all the variables, it's time to introduce a basic CatBoost model with no tuning as the baseline.
# data preparation
y = train[TARGET_COL].values
X = train.drop([TARGET_COL, ID_COL, 'title', 'tags', 'description'], axis=1)
X.head()
 | channel_title | category_id | publish_date | views | dislikes | comment_count | country_code
---|---|---|---|---|---|---|---|
0 | HUM TV | 24.0 | 2017-12-12 | 351430.0 | 298.0 | 900.0 | CA |
1 | ABS-CBN Entertainment | 24.0 | 2018-03-08 | 461508.0 | 74.0 | 314.0 | CA |
2 | SANDALWOOD REVIEWS | 24.0 | 2018-03-26 | 40205.0 | 150.0 | 100.0 | IN |
3 | doddleoddle | 10.0 | 2018-02-21 | 509726.0 | 847.0 | 4536.0 | GB |
4 | Dude Seriously | 23.0 | 2018-05-10 | 74311.0 | 69.0 | 161.0 | IN |
# categorical features declaration
cat_features = cat_cols + ['publish_date', 'channel_title']
cat_features
['category_id', 'country_code', 'publish_date', 'channel_title']
# convert columns to the right data types (there are no NAs to fill)
print(X[cat_features].info())
X_filled = X.copy()
X_filled["category_id"] = X["category_id"].astype(np.int16)
X_filled[cat_features].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26061 entries, 0 to 26060
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category_id 26061 non-null float64
1 country_code 26061 non-null object
2 publish_date 26061 non-null object
3 channel_title 26061 non-null object
dtypes: float64(1), object(3)
memory usage: 814.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26061 entries, 0 to 26060
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category_id 26061 non-null int16
1 country_code 26061 non-null object
2 publish_date 26061 non-null object
3 channel_title 26061 non-null object
dtypes: int16(1), object(3)
memory usage: 661.8+ KB
# import train_test_split, then split the data into train and validation sets
# cross-validation is not included in the baseline => the model could overfit; see the CV sketch below
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X_filled, y, train_size=0.8, random_state=SEED, shuffle=True)
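For completeness, a K-fold cross-validation loop (not used in this baseline) might look roughly like the sketch below; the fold count and model parameters are illustrative assumptions:
# illustrative K-fold CV sketch; averages validation MSLE across folds
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error

kf = KFold(n_splits=5, shuffle=True, random_state=SEED)
fold_scores = []

for fold, (tr_idx, va_idx) in enumerate(kf.split(X_filled)):
    fold_model = CatBoostRegressor(
        loss_function='Tweedie:variance_power=1.9',
        random_seed=SEED,
        verbose=0,
    )
    fold_model.fit(
        X_filled.iloc[tr_idx], y[tr_idx],
        cat_features=cat_features,
        eval_set=(X_filled.iloc[va_idx], y[va_idx]),
        use_best_model=True,
    )
    fold_preds = np.clip(fold_model.predict(X_filled.iloc[va_idx]), 0, None)
    fold_scores.append(mean_squared_log_error(y[va_idx], fold_preds))
    print(f'Fold {fold}: MSLE = {fold_scores[-1]:.4f}')

print('Mean MSLE:', np.mean(fold_scores))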
model = CatBoostRegressor(
    loss_function='Tweedie:variance_power=1.9',
    # Tweedie loss has worked wonders in previous Kaggle competitions modelling strange,
    # Poisson-like distributions; it turns out to work well here as well.
    # More details: https://stats.stackexchange.com/questions/492726/what-is-use-of-tweedie-or-poisson-loss-objective-function-in-xgboost-and-deep-le
    random_seed=SEED,    # set seed for reproducibility
    eval_metric='MSLE',  # the same metric as in the competition
    # task_type='GPU'    # GPU does not work for Tweedie loss :/
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    use_best_model=True,
    eval_set=(X_validation, y_validation),
    verbose=50
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())
0: learn: 63.0488718 test: 63.3900575 best: 63.3900575 (0) total: 83ms remaining: 1m 22s
50: learn: 49.1816641 test: 49.4679398 best: 49.4679398 (50) total: 1.69s remaining: 31.5s
100: learn: 42.9424624 test: 43.2011343 best: 43.2011343 (100) total: 3.25s remaining: 28.9s
150: learn: 39.0388839 test: 39.2794343 best: 39.2794343 (150) total: 4.86s remaining: 27.3s
200: learn: 36.3599589 test: 36.5879924 best: 36.5879924 (200) total: 6.44s remaining: 25.6s
250: learn: 34.6895950 test: 34.9118270 best: 34.9118270 (250) total: 7.96s remaining: 23.7s
300: learn: 34.0791345 test: 34.3005940 best: 34.3005940 (300) total: 9.59s remaining: 22.3s
350: learn: 33.9920518 test: 34.2193143 best: 34.2193143 (350) total: 11.2s remaining: 20.7s
400: learn: 33.9895898 test: 34.2170543 best: 34.2010313 (380) total: 12.8s remaining: 19.2s
450: learn: 33.9986020 test: 34.2287529 best: 34.2010313 (380) total: 14.3s remaining: 17.4s
500: learn: 34.0068053 test: 34.2343207 best: 34.2010313 (380) total: 15.8s remaining: 15.7s
550: learn: 34.0136013 test: 34.2392720 best: 34.2010313 (380) total: 17.3s remaining: 14.1s
600: learn: 34.0192750 test: 34.2437817 best: 34.2010313 (380) total: 18.9s remaining: 12.6s
650: learn: 34.0238506 test: 34.2465844 best: 34.2010313 (380) total: 20.5s remaining: 11s
700: learn: 34.0270986 test: 34.2485531 best: 34.2010313 (380) total: 22.1s remaining: 9.44s
750: learn: 34.0302284 test: 34.2509902 best: 34.2010313 (380) total: 23.6s remaining: 7.84s
800: learn: 34.0324680 test: 34.2520011 best: 34.2010313 (380) total: 25.2s remaining: 6.26s
850: learn: 34.0348004 test: 34.2535832 best: 34.2010313 (380) total: 26.7s remaining: 4.68s
900: learn: 34.0367559 test: 34.2547111 best: 34.2010313 (380) total: 28.3s remaining: 3.11s
950: learn: 34.0383293 test: 34.2560061 best: 34.2010313 (380) total: 29.8s remaining: 1.54s
999: learn: 34.0399886 test: 34.2572126 best: 34.2010313 (380) total: 31.3s remaining: 0us
bestTest = 34.20103126
bestIteration = 380
Shrink model to first 381 iterations.
Model is fitted: True
Model params:
{'eval_metric': 'MSLE', 'random_seed': 2021, 'loss_function': 'Tweedie:variance_power=1.9'}
print('Tree count: ' + str(model.tree_count_))
Tree count: 381
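We can also recompute the validation MSLE directly from the shrunk model's predictions as a quick sanity check (a small sketch using sklearn's metric; not part of the original notebook):
# illustrative sanity check: validation MSLE from the fitted model's predictions
from sklearn.metrics import mean_squared_log_error

val_preds = np.clip(model.predict(X_validation), 0, None)  # MSLE needs non-negative predictions
print('Validation MSLE:', mean_squared_log_error(y_validation, val_preds))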
model.get_feature_importance(prettified=True)
 | Feature Id | Importances
---|---|---|
0 | comment_count | 60.776699 |
1 | views | 20.077523 |
2 | dislikes | 12.546723 |
3 | category_id | 5.634881 |
4 | country_code | 0.964174 |
5 | channel_title | 0.000000 |
6 | publish_date | 0.000000 |
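channel_title and publish_date come out with zero importance here. Before dropping them it may be worth looking at an importance type computed on held-out data; a hedged sketch using CatBoost's loss-based importance (the default table above uses prediction-value change):
# illustrative: feature importance measured as loss change on the validation set
from catboost import Pool

val_pool = Pool(X_validation, y_validation, cat_features=cat_features)
model.get_feature_importance(val_pool, type='LossFunctionChange', prettified=True)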
X_test = test.drop([ID_COL, 'title', 'tags', 'description'], axis=1)
X_test.head()
 | channel_title | category_id | publish_date | views | dislikes | comment_count | country_code
---|---|---|---|---|---|---|---|
0 | CHIRRAVURI FOUNDATION | 22.0 | 2018-01-17 | 80793.0 | 54.0 | 79.0 | IN |
1 | VIRAL IN INDIA | 22.0 | 2017-11-18 | 150317.0 | 230.0 | 311.0 | IN |
2 | Saskatchewan Roughriders | 17.0 | 2017-12-01 | 6558.0 | 10.0 | 7.0 | CA |
3 | Matthias Wandel | 26.0 | 2018-02-06 | 89664.0 | 145.0 | 324.0 | US |
4 | AVA Creative thoughts | 22.0 | 2018-02-06 | 53526.0 | 357.0 | 153.0 | IN |
# convert columns to the right data types for the TEST set (again, no NAs to fill)
print(X_test[cat_features].info())
X_test_filled = X_test.copy()
X_test_filled["category_id"] = X_test["category_id"].astype(np.int16)
X_test_filled[cat_features].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11170 entries, 0 to 11169
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category_id 11170 non-null float64
1 country_code 11170 non-null object
2 publish_date 11170 non-null object
3 channel_title 11170 non-null object
dtypes: float64(1), object(3)
memory usage: 349.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11170 entries, 0 to 11169
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 category_id 11170 non-null int16
1 country_code 11170 non-null object
2 publish_date 11170 non-null object
3 channel_title 11170 non-null object
dtypes: int16(1), object(3)
memory usage: 283.7+ KB
contest_predictions = model.predict(X_test_filled)
print('Predictions:')
print(contest_predictions)
Predictions:
[ 792.32025174 2267.42416482 313.2757626 ... 5558.09271849
11900.57430592 4262.99567486]
ss[TARGET_COL] = contest_predictions.round(0).astype(np.int64)  # int64: predicted likes can exceed the int16 range
ss.head()
 | video_id | likes
---|---|---|
0 | 87185 | 792 |
1 | 9431 | 2267 |
2 | 40599 | 313 |
3 | 494 | 2833 |
4 | 73942 | 1074 |
ss.to_csv("catboost_baseline.csv", index = False)
# and we're done!
'Done!'
'Done!'