Kaggle-Bike Sharing Demand
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. Because each rental is recorded, bike sharing systems function as a sensor network, which can be used for studying mobility in a city.
- Goal: predict the total count of bikes rented during each hour covered by the test set.
- Metric: Root Mean Squared Logarithmic Error (RMSLE)
- Hourly rental data spanning two years
- The training set comprises the first 19 days of each month, while the test set covers the 20th to the end of the month.
- 12 features:
- datetime - hourly date + timestamp (loaded as dtype object, not datetime)
- season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
- holiday - whether the day is considered a holiday (1 = holiday other than a weekend, 0 = not a holiday)
- workingday - whether the day is neither a weekend nor a holiday
- weather
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp - temperature in Celsius
- atemp - "feels like" temperature in Celsius
- humidity - relative humidity
- windspeed - wind speed
- casual - number of non-registered user rentals initiated
- registered - number of registered user rentals initiated
- count - number of total rentals (the target, y_target)
Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
bike_df = pd.read_csv('input/bike_train.csv')
print(bike_df.shape)
print(bike_df.describe())
print(bike_df.info())
print(bike_df.head(3))
(10886, 12)
season holiday workingday weather temp \
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.00000
mean 2.506614 0.028569 0.680875 1.418427 20.23086
std 1.116174 0.166599 0.466159 0.633839 7.79159
min 1.000000 0.000000 0.000000 1.000000 0.82000
25% 2.000000 0.000000 0.000000 1.000000 13.94000
50% 3.000000 0.000000 1.000000 1.000000 20.50000
75% 4.000000 0.000000 1.000000 2.000000 26.24000
max 4.000000 1.000000 1.000000 4.000000 41.00000
atemp humidity windspeed casual registered \
count 10886.000000 10886.000000 10886.000000 10886.000000 10886.000000
mean 23.655084 61.886460 12.799395 36.021955 155.552177
std 8.474601 19.245033 8.164537 49.960477 151.039033
min 0.760000 0.000000 0.000000 0.000000 0.000000
25% 16.665000 47.000000 7.001500 4.000000 36.000000
50% 24.240000 62.000000 12.998000 17.000000 118.000000
75% 31.060000 77.000000 16.997900 49.000000 222.000000
max 45.455000 100.000000 56.996900 367.000000 886.000000
count
count 10886.000000
mean 191.574132
std 181.144454
min 1.000000
25% 42.000000
50% 145.000000
75% 284.000000
max 977.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 10886 non-null object
1 season 10886 non-null int64
2 holiday 10886 non-null int64
3 workingday 10886 non-null int64
4 weather 10886 non-null int64
5 temp 10886 non-null float64
6 atemp 10886 non-null float64
7 humidity 10886 non-null int64
8 windspeed 10886 non-null float64
9 casual 10886 non-null int64
10 registered 10886 non-null int64
11 count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
None
datetime season holiday workingday weather temp atemp \
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635
humidity windspeed casual registered count
0 81 0.0 3 13 16
1 80 0.0 8 32 40
2 80 0.0 5 27 32
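As a quick sanity check on the shape above, the row count can be compared with the number of hours one would expect if every hour of days 1 to 19 were present in each month of 2011 and 2012. This is only a sketch: it assumes the split described earlier holds exactly, and `actual_rows` is simply copied from the `bike_df.shape` output.

```python
import pandas as pd

# Every hour in 2011-2012, the two years the data covers
all_hours = pd.date_range("2011-01-01 00:00", "2012-12-31 23:00",
                          freq=pd.Timedelta(hours=1))

# Keep only days 1-19 of each month, mirroring the training split
train_hours = all_hours[all_hours.day <= 19]

expected_rows = len(train_hours)   # 24 months * 19 days * 24 hours
actual_rows = 10886                # copied from bike_df.shape above
missing_rows = expected_rows - actual_rows
print(expected_rows, missing_rows)
```

Under those assumptions the difference suggests that a small number of hourly records are absent from the training file, which is worth keeping in mind when treating `hour` as a regular sequence.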
Null data check
bike_df.isnull().sum() #No NaN.
datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
year          0
month         0
day           0
hour          0
dtype: int64
The distribution of Target Label
plt.title('Original count Histogram')
sns.distplot(bike_df['count'])
<AxesSubplot:title={'center':'Original count Histogram'}, xlabel='count', ylabel='Density'>
plt.title('Log transformed Count Histogram')
log_count =np.log1p(bike_df['count'])
sns.distplot(log_count)
<AxesSubplot:title={'center':'Log transformed Count Histogram'}, xlabel='count', ylabel='Density'>
bike_df['log_count'] = log_count
bike_df.drop(['count'], axis=1, inplace=True)
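The transform used here is `np.log1p`, that is log(1 + count), and `np.expm1` is its inverse, which is what later converts predictions back to rental counts. A minimal sketch of the round trip, using the first three hourly counts from the `head()` output (16, 40, 32):

```python
import numpy as np

counts = np.array([16, 40, 32])   # first three hourly counts from head()
log_counts = np.log1p(counts)     # log(1 + count), defined even when count is 0
restored = np.expm1(log_counts)   # exp(x) - 1 undoes log1p

print(np.round(log_counts, 6))
print(np.allclose(restored, counts))
```

The rounded values should agree with the `log_count` column displayed for these rows in the Datetime section.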
Feature Engineering/ Model Learning
Datetime
# Convert the object column to datetime, then extract the calendar parts
bike_df['datetime'] = pd.to_datetime(bike_df['datetime'])
bike_df['year'] = bike_df['datetime'].dt.year
bike_df['month'] = bike_df['datetime'].dt.month
bike_df['day'] = bike_df['datetime'].dt.day
bike_df['hour'] = bike_df['datetime'].dt.hour
bike_df.head(3)
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | log_count | year | month | day | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 2.833213 | 2011 | 1 | 1 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 3.713572 | 2011 | 1 | 1 | 1 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 3.496508 | 2011 | 1 | 1 | 2 |
bike_df.drop(['datetime','casual','registered'], axis=1, inplace=True)
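A likely reason for dropping `casual` and `registered` is that together they determine the target: in the sample rows shown earlier, `casual + registered` equals `count` (3 + 13 = 16, and so on). The sketch below checks this on those three rows only; whether it holds for every row is an assumption that could be verified on the full frame before `count` is replaced by `log_count`.

```python
import pandas as pd

# First three rows of the training data, copied from the head() output
sample = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# If this holds for every row, the two columns jointly give away the target
leaks_target = bool((sample["casual"] + sample["registered"] == sample["count"]).all())
print(leaks_target)
```

These two columns are also not something that would be known in advance for the hours being predicted, which is a second reason to leave them out of the features.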
Evaluation
from sklearn.metrics import mean_absolute_error

# RMSLE computed by hand instead of using sklearn's mean_squared_log_error
def rmsle(y, pred):
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)  # log1p rather than log, to avoid the NaN issue
    return np.sqrt(np.mean((log_y - log_pred) ** 2))

def rmse(y, pred):
    return np.sqrt(np.mean((y - pred) ** 2))

# Print RMSLE, RMSE, and MAE
def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    mae_val = mean_absolute_error(y, pred)
    print('rmsle: ', rmsle_val, 'rmse: ', rmse_val, 'mae: ', mae_val)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
y_target = bike_df['log_count']
X_features = bike_df.drop(['log_count'], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.3, random_state=0)
lr = LinearRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)
y_test_exp = np.expm1(y_test)
pred_exp = np.expm1(pred)
evaluate_regr(y_test_exp, pred_exp)
rmsle: 1.016826598200345 rmse: 162.59426809004643 mae: 109.28615860077495
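RMSLE and RMSE can rank the same errors very differently, because RMSLE works on log-scaled values and so reacts to relative rather than absolute differences. The sketch below uses two made-up hours (the numbers are hypothetical, not from the dataset) where the prediction is off by 10 rentals in both cases:

```python
import numpy as np

def rmsle(y, pred):
    # Root mean squared error between log1p-transformed values
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(pred)) ** 2))

def rmse(y, pred):
    return np.sqrt(np.mean((y - pred) ** 2))

# Hypothetical hours: both predictions miss by exactly 10 rentals
quiet_hour = (np.array([10.0]), np.array([20.0]))
busy_hour = (np.array([1000.0]), np.array([1010.0]))

for name, (y, pred) in {"quiet": quiet_hour, "busy": busy_hour}.items():
    print(name, "rmse:", rmse(y, pred), "rmsle:", round(rmsle(y, pred), 4))
```

RMSE is identical for the two hours, while RMSLE is far larger for the quiet hour, where missing by 10 means doubling the true count. This is one reason a model can look reasonable on RMSE or MAE and still score poorly on the competition metric.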
# y_test and pred are still on the log1p scale here; .values avoids 2-D indexing on a Series
result_df = pd.DataFrame(y_test.values, columns=['real_count'])
result_df['pred_count'] = np.round(pred)
result_df['diff'] = np.abs(result_df['real_count']-result_df['pred_count'])
print(result_df.sort_values('diff',ascending=False)[:5])
real_count pred_count diff
1521 0.693147 4.0 3.306853
786 0.693147 4.0 3.306853
1242 0.693147 4.0 3.306853
3168 0.693147 4.0 3.306853
3238 0.693147 4.0 3.306853
coef = pd.Series(lr.coef_, index=X_features.columns)
coef_sort = coef.sort_values(ascending=False)
sns.barplot(x = coef_sort, y=coef_sort.index)
<AxesSubplot:>
- The coefficient of year is disproportionately large, even though year only takes the values 2011 and 2012. The year feature is stored as an integer but is really categorical, so we apply one-hot encoding.
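One-hot encoding replaces an integer-coded categorical column with one indicator column per distinct value, so a linear model no longer treats 2012 as "one unit more" than 2011. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Toy rows, made up for illustration
toy = pd.DataFrame({"year": [2011, 2012, 2011], "hour": [0, 1, 0]})

# Each listed column becomes one indicator column per distinct value,
# named <column>_<value>
toy_ohe = pd.get_dummies(toy, columns=["year", "hour"])
print(list(toy_ohe.columns))
print(toy_ohe.sum(axis=1).tolist())  # one active indicator per encoded column
```

Each row has exactly one active indicator per encoded column, so the model learns a separate coefficient for every year and every hour.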
One-hot encoding
X_features_ohe = pd.get_dummies(X_features, columns=['year','month','day','hour','holiday','workingday','season','weather'])
X_train, X_test, y_train, y_test = train_test_split(X_features_ohe, y_target, test_size=0.3, random_state=0)
def get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=False):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    if is_expm1:
        # Convert the log1p-scaled target and predictions back to counts
        y_test = np.expm1(y_test)
        pred = np.expm1(pred)
    print('###', model.__class__.__name__, '###')
    evaluate_regr(y_test, pred)
# end of function get_model_predict

lr_reg = LinearRegression()
ridge_reg = Ridge(alpha=10)
lasso_reg = Lasso(alpha=0.01)
for model in [lr_reg, ridge_reg, lasso_reg]:
    get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=True)
### LinearRegression ###
rmsle: 0.5896336894543538 rmse: 97.68841046053633 mae: 63.382335548187164
### Ridge ###
rmsle: 0.5901367703437249 rmse: 98.52859077604543 mae: 63.89335277110783
### Lasso ###
rmsle: 0.6347518077052988 rmse: 113.21881019147763 mae: 72.80270669734962
coef = pd.Series(lr_reg.coef_, index=X_features_ohe.columns)
coef_sort = coef.sort_values(ascending=False)[:20]
sns.barplot(x=coef_sort.values, y=coef_sort.index)
<AxesSubplot:>
- The month, year, weather, and day indicator features have the largest coefficients.
- The coefficient values are larger than in the model without one-hot encoding.
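The alpha values used for Ridge (10) and Lasso (0.01) above are fixed by hand. One possible way to choose them is `GridSearchCV`, which is imported earlier in this post but not used. The sketch below assumes cross-validated MSE on the log target is an acceptable selection criterion; the data and the grid are synthetic placeholders, not the author's settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; swap in X_train / y_train from the steps above
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)

# Illustrative grid, not taken from the post
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
print(best_alpha, -search.best_score_)
```

The same pattern would apply to `Lasso` by swapping the estimator; the reported score is the mean cross-validated MSE for the best alpha.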
Tree-based regressor
- Fit Random Forest, GBM, XGBoost, and LightGBM regressors.
- Depending on the XGBoost version, passing a pandas DataFrame can raise an error, so we convert the data to NumPy ndarrays with `.values`.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
rf_reg = RandomForestRegressor(n_estimators=500)
gbm_reg = GradientBoostingRegressor(n_estimators=500)
xgb_reg = XGBRegressor(n_estimators=500)
lgbm_reg = LGBMRegressor(n_estimators=500)
for model in [rf_reg, gbm_reg, xgb_reg, lgbm_reg]:
    get_model_predict(model, X_train.values, X_test.values, y_train.values, y_test.values, is_expm1=True)
### RandomForestRegressor ###
rmsle: 0.3537731884391265 rmse: 50.29169038323934 mae: 31.1784434147597
### GradientBoostingRegressor ###
rmsle: 0.32982335072461555 rmse: 53.32399144802179 mae: 32.73387650916991
### XGBRegressor ###
rmsle: 0.3422048283640822 rmse: 51.73158151762133 mae: 31.25122169453128
### LGBMRegressor ###
rmsle: 0.3188456499157369 rmse: 47.21464677592674 mae: 29.028770412428244
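To compare all models at a glance, the RMSLE values printed in this post can be collected and sorted. The numbers below are copied (rounded to four decimals) from the outputs for the one-hot encoded features. They come from a single random 70/30 split, so small differences between the tree-based models should not be over-interpreted.

```python
import pandas as pd

# RMSLE values copied from the printed outputs (one-hot encoded features)
rmsle_by_model = pd.Series({
    "LinearRegression": 0.5896,
    "Ridge": 0.5901,
    "Lasso": 0.6348,
    "RandomForestRegressor": 0.3538,
    "GradientBoostingRegressor": 0.3298,
    "XGBRegressor": 0.3422,
    "LGBMRegressor": 0.3188,
})

ranking = rmsle_by_model.sort_values()  # lower RMSLE is better
best_model = ranking.index[0]
print(ranking)
```

On this split, all four tree-based regressors score well below the linear models, with LightGBM lowest on the competition metric.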