Ensembling Implementation
Version 1.0.1
import numpy as np
import pandas as pd
import sklearn
import scipy.sparse
import lightgbm
for p in [np, pd, scipy, sklearn, lightgbm]:
    print(p.__name__, p.__version__)
Important! There is a huge chance that the assignment will be impossible to pass if the versions of lightgbm
and scikit-learn
are wrong. The versions being tested:
numpy 1.13.1
pandas 0.20.3
scipy 0.19.1
sklearn 0.19.0
lightgbm 2.0.6
To install an older version of lightgbm
you may use the following commands:
pip uninstall lightgbm
pip install lightgbm==2.0.6
In this programming assignment you are asked to implement two ensembling schemes: simple linear mix and stacking.
We will spend several cells loading the data and creating the feature matrix; you can scroll past this part or try to understand what's happening.
import pandas as pd
import numpy as np
import gc
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_rows', 600)
pd.set_option('display.max_columns', 50)
import lightgbm as lgb
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from tqdm import tqdm_notebook
from itertools import product
def downcast_dtypes(df):
    '''
        Changes column types in the dataframe:

                `float64` type to `float32`
                `int64`   type to `int32`
    '''

    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols   = [c for c in df if df[c].dtype == "int64"]

    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)

    return df
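As a quick, optional illustration of what `downcast_dtypes` buys us, the small sketch below compares memory usage before and after the conversion. The frame `toy` is hypothetical and not part of the assignment.
# A hypothetical toy frame, just to illustrate the memory savings from downcasting
toy = pd.DataFrame({'a': np.arange(10**6, dtype=np.int64),
                    'b': np.random.rand(10**6)})
print('before: %.1f MB' % (toy.memory_usage().sum() / 2**20))
toy = downcast_dtypes(toy)
print('after:  %.1f MB' % (toy.memory_usage().sum() / 2**20))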
Let's load the data from the hard drive first.
sales = pd.read_csv('../readonly/final_project_data/sales_train.csv.gz')
shops = pd.read_csv('../readonly/final_project_data/shops.csv')
items = pd.read_csv('../readonly/final_project_data/items.csv')
item_cats = pd.read_csv('../readonly/final_project_data/item_categories.csv')
And use only 3 shops for simplicity.
sales = sales[sales['shop_id'].isin([26, 27, 28])]
We now need to prepare the features. This part is all implemented for you.
index_cols = ['shop_id', 'item_id', 'date_block_num']
# For every month we create a grid from all shops/items combinations from that month
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(*[cur_shops, cur_items, [block_num]])), dtype='int32'))
# Turn the grid into a dataframe
grid = pd.DataFrame(np.vstack(grid), columns = index_cols,dtype=np.int32)
# Groupby data to get shop-item-month aggregates
gb = sales.groupby(index_cols,as_index=False).agg({'item_cnt_day':{'target':'sum'}})
# Fix column names
gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]
# Join it to the grid
all_data = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)
# Same as above but with shop-month aggregates
gb = sales.groupby(['shop_id', 'date_block_num'],as_index=False).agg({'item_cnt_day':{'target_shop':'sum'}})
gb.columns = [col[0] if col[-1]=='' else col[-1] for col in gb.columns.values]
all_data = pd.merge(all_data, gb, how='left', on=['shop_id', 'date_block_num']).fillna(0)
# Same as above but with item-month aggregates
gb = sales.groupby(['item_id', 'date_block_num'],as_index=False).agg({'item_cnt_day':{'target_item':'sum'}})
gb.columns = [col[0] if col[-1] == '' else col[-1] for col in gb.columns.values]
all_data = pd.merge(all_data, gb, how='left', on=['item_id', 'date_block_num']).fillna(0)
# Downcast dtypes from 64 to 32 bit to save memory
all_data = downcast_dtypes(all_data)
del grid, gb
gc.collect();
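If the grid construction above looks opaque, here is a tiny example of what `itertools.product` generates for a single month; the shop and item ids are made up for illustration only.
# Hypothetical shop/item ids for one month, just to show what `product` generates
list(product([26, 27], [1001, 1002], [5]))
# -> [(26, 1001, 5), (26, 1002, 5), (27, 1001, 5), (27, 1002, 5)]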
After creating a grid, we can calculate some features. We will use lags from [1, 2, 3, 4, 5, 12] months ago.
cols_to_rename = list(all_data.columns.difference(index_cols))
shift_range = [1, 2, 3, 4, 5, 12]
for month_shift in tqdm_notebook(shift_range):
    train_shift = all_data[index_cols + cols_to_rename].copy()
    train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift

    foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_rename else x
    train_shift = train_shift.rename(columns=foo)

    all_data = pd.merge(all_data, train_shift, on=index_cols, how='left').fillna(0)

del train_shift
# Don't use old data from year 2013
all_data = all_data[all_data['date_block_num'] >= 12]
# List of all lagged features
fit_cols = [col for col in all_data.columns if col[-1] in [str(item) for item in shift_range]]
# We will drop these at fitting stage
to_drop_cols = list(set(list(all_data.columns)) - (set(fit_cols)|set(index_cols))) + ['date_block_num']
# Category for each item
item_category_mapping = items[['item_id','item_category_id']].drop_duplicates()
all_data = pd.merge(all_data, item_category_mapping, how='left', on='item_id')
all_data = downcast_dtypes(all_data)
gc.collect();
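The shift-and-merge trick used for the lags may be easier to see on a tiny, hypothetical example: adding `month_shift` to `date_block_num` and merging back makes each row receive the value observed `month_shift` months earlier. The frame `toy_sales` below is made up for illustration.
# Toy illustration of the lag trick: shift the month index forward by 1 and merge back
toy_sales = pd.DataFrame({'shop_id': [26, 26, 26], 'item_id': [1, 1, 1],
                          'date_block_num': [12, 13, 14], 'target': [5, 7, 3]})
shifted = toy_sales.copy()
shifted['date_block_num'] += 1
shifted = shifted.rename(columns={'target': 'target_lag_1'})
# Each month now carries the previous month's target (missing lags become 0)
pd.merge(toy_sales, shifted, on=['shop_id', 'item_id', 'date_block_num'], how='left').fillna(0)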
At this point we have created the feature matrix. It is stored in the `all_data`
variable. Take a look:
all_data.head()
For the sake of the programming assignment, let's artificially split the data into train and test sets. We will treat the last month's data as the test set.
dates = all_data['date_block_num']
last_block = dates.max()
print('Test `date_block_num` is %d' % last_block)
dates_train = dates[dates < last_block]
dates_test = dates[dates == last_block]
X_train = all_data.loc[dates < last_block].drop(to_drop_cols, axis=1)
X_test = all_data.loc[dates == last_block].drop(to_drop_cols, axis=1)
y_train = all_data.loc[dates < last_block, 'target'].values
y_test = all_data.loc[dates == last_block, 'target'].values
You need to implement a basic stacking scheme. We have a time component here, so we will use scheme f) from the reading material. Recall that we always use the first-level models to build two datasets: test meta-features and second-level train meta-features. Let's see how we get the test meta-features first.
First, we will run a linear regression on the numeric columns and get predictions for the last month.
lr = LinearRegression()
lr.fit(X_train.values, y_train)
pred_lr = lr.predict(X_test.values)
print('Test R-squared for linreg is %f' % r2_score(y_test, pred_lr))
And then we run LightGBM.
lgb_params = {
    'feature_fraction': 0.75,
    'metric': 'rmse',
    'nthread': 1,
    'min_data_in_leaf': 2**7,
    'bagging_fraction': 0.75,
    'learning_rate': 0.03,
    'objective': 'mse',
    'bagging_seed': 2**7,
    'num_leaves': 2**7,
    'bagging_freq': 1,
    'verbose': 0
}
model = lgb.train(lgb_params, lgb.Dataset(X_train, label=y_train), 100)
pred_lgb = model.predict(X_test)
print('Test R-squared for LightGBM is %f' % r2_score(y_test, pred_lgb))
Finally, concatenate test predictions to get test meta-features.
X_test_level2 = np.c_[pred_lr, pred_lgb]
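As a quick, optional sanity check (not graded), you can verify the shape of the test meta-features and see how a naive 50/50 average of the two models performs; the tuned mix is computed further below.
# Optional sanity check: two meta-feature columns, one row per test example
print(X_test_level2.shape)
# A naive 0.5/0.5 blend, just for reference
print('Test R-squared for a 0.5/0.5 blend is %f' % r2_score(y_test, 0.5 * pred_lr + 0.5 * pred_lgb))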
Now it is your turn to write the code. You need to implement scheme f) from the reading material. Here, we will use a duration T equal to one month and M=15.
That is, you need to get predictions (meta-features) from the linear regression and LightGBM models for months 27, 28, 29, 30, 31 and 32. Use the same parameters as in the models above.
dates_train_level2 = dates_train[dates_train.isin([27, 28, 29, 30, 31, 32])]
# That is how we get target for the 2nd level dataset
y_train_level2 = y_train[dates_train.isin([27, 28, 29, 30, 31, 32])]
X_train_level2 = np.zeros([y_train_level2.shape[0], 2])
# Now fill `X_train_level2` with metafeatures
for cur_block_num in [27, 28, 29, 30, 31, 32]:

    print(cur_block_num)

    '''
        1. Split `X_train` into parts
           Remember, that corresponding dates are stored in `dates_train`
        2. Fit linear regression
        3. Fit LightGBM and put predictions
        4. Store predictions from 2. and 3. in the right place of `X_train_level2`.
           You can use `dates_train_level2` for it
           Make sure the order of the meta-features is the same as in `X_test_level2`
    '''

    # YOUR CODE GOES HERE

    # 1. Train on all months strictly before `cur_block_num`, predict for `cur_block_num` itself
    X_train_meta = all_data.loc[dates < cur_block_num].drop(to_drop_cols, axis=1)
    X_test_meta = all_data.loc[dates == cur_block_num].drop(to_drop_cols, axis=1)
    y_train_meta = all_data.loc[dates < cur_block_num, 'target'].values
    y_test_meta = all_data.loc[dates == cur_block_num, 'target'].values  # not needed below, kept for optional checks

    # 2. Linear regression meta-feature goes to column 0 (same order as in `X_test_level2`)
    lr.fit(X_train_meta.values, y_train_meta)
    X_train_level2[dates_train_level2 == cur_block_num, 0] = lr.predict(X_test_meta.values)

    # 3. LightGBM meta-feature goes to column 1
    model = lgb.train(lgb_params, lgb.Dataset(X_train_meta, label=y_train_meta), 100)
    X_train_level2[dates_train_level2 == cur_block_num, 1] = model.predict(X_test_meta)
# Sanity check
assert np.all(np.isclose(X_train_level2.mean(axis=0), [ 1.50148988, 1.38811989]))
Remember, ensembles work best when the first-level models are diverse. We can qualitatively analyze the diversity by examining a scatter plot of the two meta-features. Plot the scatter plot below.
plt.scatter(X_train_level2[:, 0], X_train_level2[:, 1])
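A correlation coefficient gives a rough quantitative complement to the scatter plot: the lower the correlation between the meta-features, the more diverse the first-level models are. This optional check is not part of the graded assignment.
# Pearson correlation between the two meta-features; lower correlation means more diverse models
print('Correlation between the meta-features: %f' %
      np.corrcoef(X_train_level2[:, 0], X_train_level2[:, 1])[0, 1])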
Now that the meta-features are created, we can ensemble our first-level models.
Let's start with a simple linear convex mix:
$$ mix = \alpha \cdot \text{linreg\_prediction} + (1 - \alpha) \cdot \text{lgb\_prediction} $$

We need to find an optimal $\alpha$. This is easy, since a plain grid search over $\alpha$ is feasible here. Find the optimal $\alpha$ from the `alphas_to_try`
array. Remember that you need to use the train meta-features (not the test ones) when searching for $\alpha$.
alphas_to_try = np.linspace(0, 1, 1001)
# YOUR CODE GOES HERE
r2_scores = np.array([r2_score(y_train_level2, np.dot(X_train_level2, [alpha, 1 - alpha])) for alpha in alphas_to_try])
best_alpha = alphas_to_try[r2_scores.argmax()] # YOUR CODE GOES HERE
r2_train_simple_mix = r2_scores.max() # YOUR CODE GOES HERE
print('Best alpha: %f; Corresponding r2 score on train: %f' % (best_alpha, r2_train_simple_mix))
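If you want to see how sensitive the mix is to $\alpha$, an optional plot of the R-squared curve over `alphas_to_try` can be drawn from the scores already computed above:
# Optional: visualize how the train R-squared changes with alpha
plt.plot(alphas_to_try, r2_scores)
plt.axvline(best_alpha, linestyle='--')
plt.xlabel('alpha')
plt.ylabel('R-squared on train meta-features');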
Now use the $\alpha$ you've found to compute predictions for the test set.
test_preds = best_alpha * pred_lr + (1 - best_alpha) * pred_lgb # YOUR CODE GOES HERE
r2_test_simple_mix = r2_score(y_test, test_preds) # YOUR CODE GOES HERE
print('Test R-squared for simple mix is %f' % r2_test_simple_mix)
Now, we will try a more advanced ensembling technique. Fit a linear regression model to the meta-features. Use the same parameters as in the model above.
lr.fit(X_train_level2, y_train_level2)
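It can be instructive to inspect the learned weights: unlike the convex mix, the stacking regression is not constrained to weights that sum to one, and it also fits an intercept. This is an optional, ungraded check.
# Inspect the learned stacking weights; unlike the convex mix they need not sum to 1
print('Coefficients: %s, intercept: %f' % (lr.coef_, lr.intercept_))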
Compute R-squared on the train and test sets.
train_preds = lr.predict(X_train_level2) # YOUR CODE GOES HERE
r2_train_stacking = r2_score(y_train_level2, train_preds) # YOUR CODE GOES HERE
test_preds = lr.predict(X_test_level2) # YOUR CODE GOES HERE
r2_test_stacking = r2_score(y_test, test_preds) # YOUR CODE GOES HERE
print('Train R-squared for stacking is %f' % r2_train_stacking)
print('Test R-squared for stacking is %f' % r2_test_stacking)
Interestingly, the score turned out to be lower than with the previous method. Although the model is very simple (just 3 parameters) and, in fact, also mixes the predictions linearly, it looks like it managed to overfit. Examine and compare the train and test scores for the two methods.
And of course, this particular case does not mean that a simple mix is always better than stacking.
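One simple way to make the comparison is to print the four scores side by side (an optional, ungraded snippet):
# Side-by-side comparison of the two ensembling schemes
print('%-12s %10s %10s' % ('', 'train', 'test'))
print('%-12s %10.6f %10.6f' % ('simple mix', r2_train_simple_mix, r2_test_simple_mix))
print('%-12s %10.6f %10.6f' % ('stacking', r2_train_stacking, r2_test_stacking))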
We are all done! Now submit everything we need to the grader.
from grader import Grader
grader = Grader()
grader.submit_tag('best_alpha', best_alpha)
grader.submit_tag('r2_train_simple_mix', r2_train_simple_mix)
grader.submit_tag('r2_test_simple_mix', r2_test_simple_mix)
grader.submit_tag('r2_train_stacking', r2_train_stacking)
grader.submit_tag('r2_test_stacking', r2_test_stacking)
STUDENT_EMAIL = "EMAIL HERE"  # EMAIL HERE
STUDENT_TOKEN = "TOKEN HERE"  # TOKEN HERE
grader.status()
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)