Kaggle - Bike Sharing Demand

Bike sharing systems are a means of renting bicycles where membership, rental, and bike return are automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike from one location and return it at a different location on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. Because each trip's start, end, and duration are explicitly recorded, bike sharing systems function as a sensor network that can be used for studying mobility in a city.

  • Goal: We predict the total count of bikes rented during each hour covered by the test set.

  • Metric: Root Mean Squared Logarithmic Error (RMSLE)

\[\sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]

where n is the number of hours in the test set, p_i is the predicted count, and a_i is the actual count (see the numeric sketch after this list).
  • Data

    • Hourly rental data spanning two years

    • The training set comprises the first 19 days of each month, while the test set covers the 20th through the end of each month.

    • 12 columns (features plus the target):

      • datetime - hourly date + timestamp (read in as Object dtype, so it needs conversion; see the load-time parsing sketch after this list)

      • season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

      • holiday - whether the day is considered a holiday (1 = holiday other than a weekend, 0 = not a holiday)

      • workingday - whether the day is neither a weekend nor a holiday (1 = working day, 0 = otherwise)

      • weather - 1: Clear, Few clouds, Partly cloudy

        • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

        • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

        • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

      • temp - temperature in Celsius

      • atemp - “feels like” temperature in Celsius

      • humidity - relative humidity

      • windspeed - wind speed

      • casual - number of non-registered user rentals initiated

      • registered - number of registered user rentals initiated

      • count - number of total rentals (the prediction target, y_target)
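
Because errors are taken on the log scale, RMSLE penalizes under-predicting a large count more than over-predicting it by the same absolute amount, and the +1 keeps zero counts finite. A quick numeric sketch (the values are made up):

import numpy as np

# per-sample log error used inside RMSLE
def log_error(pred, actual):
    return abs(np.log1p(pred) - np.log1p(actual))

print(log_error(50, 100))   # ~0.68: under-predicting 100 by 50
print(log_error(150, 100))  # ~0.40: over-predicting 100 by 50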
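
Since datetime is read in as an Object column (flagged above), one option is to parse it at load time instead of converting afterwards; a minimal sketch using the same input path as the code below:

import pandas as pd

# parse_dates makes the column arrive as datetime64 rather than Object
bike_df = pd.read_csv('input/bike_train.csv', parse_dates=['datetime'])
print(bike_df['datetime'].dtype)  # datetime64[ns]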

Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


import warnings
warnings.filterwarnings('ignore')

bike_df = pd.read_csv('input/bike_train.csv')
print(bike_df.shape)
print(bike_df.describe())
bike_df.info()
print(bike_df.head(3))
(10886, 12)
             season       holiday    workingday       weather         temp  \
count  10886.000000  10886.000000  10886.000000  10886.000000  10886.00000   
mean       2.506614      0.028569      0.680875      1.418427     20.23086   
std        1.116174      0.166599      0.466159      0.633839      7.79159   
min        1.000000      0.000000      0.000000      1.000000      0.82000   
25%        2.000000      0.000000      0.000000      1.000000     13.94000   
50%        3.000000      0.000000      1.000000      1.000000     20.50000   
75%        4.000000      0.000000      1.000000      2.000000     26.24000   
max        4.000000      1.000000      1.000000      4.000000     41.00000   

              atemp      humidity     windspeed        casual    registered  \
count  10886.000000  10886.000000  10886.000000  10886.000000  10886.000000   
mean      23.655084     61.886460     12.799395     36.021955    155.552177   
std        8.474601     19.245033      8.164537     49.960477    151.039033   
min        0.760000      0.000000      0.000000      0.000000      0.000000   
25%       16.665000     47.000000      7.001500      4.000000     36.000000   
50%       24.240000     62.000000     12.998000     17.000000    118.000000   
75%       31.060000     77.000000     16.997900     49.000000    222.000000   
max       45.455000    100.000000     56.996900    367.000000    886.000000   

              count  
count  10886.000000  
mean     191.574132  
std      181.144454  
min        1.000000  
25%       42.000000  
50%      145.000000  
75%      284.000000  
max      977.000000  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
              datetime  season  holiday  workingday  weather  temp   atemp  \
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635   

   humidity  windspeed  casual  registered  count  
0        81        0.0       3          13     16  
1        80        0.0       8          32     40  
2        80        0.0       5          27     32  

Null data check

bike_df.isnull().sum()  # no missing values
datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

The distribution of the target label

plt.title('Original count Histogram')
sns.distplot(bike_df['count'])  # distplot is deprecated in recent seaborn; histplot(..., kde=True) is the replacement
<AxesSubplot:title={'center':'Original count Histogram'}, xlabel='count', ylabel='Density'>

plt.title('Log transformed Count Histogram')
log_count = np.log1p(bike_df['count'])
sns.distplot(log_count)
<AxesSubplot:title={'center':'Log transformed Count Histogram'}, xlabel='count', ylabel='Density'>

bike_df['log_count'] = log_count
bike_df.drop(['count'], axis=1, inplace=True)
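
The histograms above show the raw count is strongly right-skewed, while the log1p-transformed target is close to symmetric. A quick numeric check (a sketch):

# recover the raw counts from the stored log to compare skewness;
# skewness near 0 indicates a roughly symmetric distribution
raw_count = np.expm1(bike_df['log_count'])
print('skew before log1p:', raw_count.skew())
print('skew after  log1p:', bike_df['log_count'].skew())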

Feature Engineering / Model Training

Datetime

# Convert Object to datetime64, then extract calendar features
bike_df['datetime'] = pd.to_datetime(bike_df['datetime'])
bike_df['year'] = bike_df['datetime'].dt.year
bike_df['month'] = bike_df['datetime'].dt.month
bike_df['day'] = bike_df['datetime'].dt.day
bike_df['hour'] = bike_df['datetime'].dt.hour
bike_df.head(3)
datetime season holiday workingday weather temp atemp humidity windspeed casual registered log_count year month day hour
0 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0.0 3 13 2.833213 2011 1 1 0
1 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0.0 8 32 3.713572 2011 1 1 1
2 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0.0 5 27 3.496508 2011 1 1 2
# casual + registered sum to the target and are absent from the test set,
# so keeping them would leak the label; drop them along with raw datetime
bike_df.drop(['datetime','casual','registered'], axis=1, inplace=True)

Evaluation

from sklearn.metrics import mean_squared_error, mean_absolute_error

def rmsle(y, pred):
    # log1p = log(1 + x) stays finite at zero counts, unlike plain log;
    # computed directly rather than via sklearn's mean_squared_log_error,
    # which raises when a prediction goes negative
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)
    return np.sqrt(np.mean((log_y - log_pred)**2))

def rmse(y, pred):
    return np.sqrt(np.mean((y-pred)**2))

# RMSLE, RMSE, MAE in one call
def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    mae_val = mean_absolute_error(y, pred)
    print('rmsle: ', rmsle_val, 'rmse: ', rmse_val, 'mae: ', mae_val)
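
A quick sanity check of the helpers on made-up numbers: perfect predictions score zero on every metric, and a constant absolute offset hurts the log metric most where the true values are smallest.

y_true = np.array([1.0, 10.0, 100.0])
evaluate_regr(y_true, y_true)       # rmsle/rmse/mae all 0
evaluate_regr(y_true, y_true + 10)  # rmse = mae = 10; the log error is dominated by the smallest true value
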
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso

y_target = bike_df['log_count']
X_features = bike_df.drop(['log_count'], axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.3, random_state=0)

lr = LinearRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)

y_test_exp = np.expm1(y_test)
pred_exp = np.expm1(pred)

evaluate_regr(y_test_exp, pred_exp)
rmsle:  1.016826598200345 rmse:  162.59426809004643 mae:  109.28615860077495
# .values gives a plain ndarray (a Series no longer accepts [:, np.newaxis]);
# note both columns below are still on the log1p scale
result_df = pd.DataFrame(y_test.values, columns=['real_count'])
result_df['pred_count'] = np.round(pred)
result_df['diff'] = np.abs(result_df['real_count'] - result_df['pred_count'])

print(result_df.sort_values('diff',ascending=False)[:5])
      real_count  pred_count      diff
1521    0.693147         4.0  3.306853
786     0.693147         4.0  3.306853
1242    0.693147         4.0  3.306853
3168    0.693147         4.0  3.306853
3238    0.693147         4.0  3.306853
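
The table above is still on the log1p scale (0.693147 = log1p(1)). To see the worst errors in actual rental counts, invert the transform first; a sketch:

# the same error table, expressed in original rental counts
result_orig = pd.DataFrame({'real_count': np.expm1(y_test.values),
                            'pred_count': np.round(np.expm1(pred))})
result_orig['diff'] = np.abs(result_orig['real_count'] - result_orig['pred_count'])
print(result_orig.sort_values('diff', ascending=False)[:5])
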
coef = pd.Series(lr.coef_, index=X_features.columns)
coef_sort = coef.sort_values(ascending=False)
sns.barplot(x=coef_sort.values, y=coef_sort.index)
<AxesSubplot:>

  • The coefficient for year is disproportionately large even though year only takes the values 2011 and 2012. Features such as year, month, hour, season, and weather are stored as integers but are really categorical, so we one-hot encode them.

One-hot encoding

X_features_ohe = pd.get_dummies(X_features, columns=['year','month','day','hour','holiday','workingday','season','weather'])
X_train, X_test, y_train, y_test = train_test_split(X_features_ohe, y_target, test_size=0.3, random_state=0)
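# Note (sketch, not run here): one-hot encoding the real Kaggle test file
# separately can produce a different column set than the training frame;
# a hypothetical test frame would need realigning to X_features_ohe, e.g.
#   test_ohe = test_ohe.reindex(columns=X_features_ohe.columns, fill_value=0)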
def get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=False):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    if is_expm1 :
        y_test = np.expm1(y_test)
        pred = np.expm1(pred)
    print('###',model.__class__.__name__,'###')
    evaluate_regr(y_test, pred)
#end of function get_model_predict


lr_reg = LinearRegression()
ridge_reg = Ridge(alpha=10)
lasso_reg = Lasso(alpha=0.01)

for model in [lr_reg, ridge_reg, lasso_reg]:
    get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=True)
### LinearRegression ###
rmsle:  0.5896336894543538 rmse:  97.68841046053633 mae:  63.382335548187164
### Ridge ###
rmsle:  0.5901367703437249 rmse:  98.52859077604543 mae:  63.89335277110783
### Lasso ###
rmsle:  0.6347518077052988 rmse:  113.21881019147763 mae:  72.80270669734962
coef = pd.Series(lr_reg.coef_, index=X_features_ohe.columns)
coef_sort = coef.sort_values(ascending=False)[:20]
sns.barplot(x=coef_sort.values, y=coef_sort.index)
<AxesSubplot:>

  • After one-hot encoding, the largest coefficients belong to individual month, year, weather, and day categories.

  • The coefficient magnitudes have increased because each dummy variable now captures the effect of a single category rather than a slope across arbitrary integer codes.

Tree-based regressors

  • Train Random Forest, GBM, XGBoost, and LightGBM regressors.

  • The models are fit on NumPy ndarrays (.values) rather than DataFrames, which sidesteps feature-name handling issues in some XGBoost versions.

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
rf_reg = RandomForestRegressor(n_estimators=500)
gbm_reg = GradientBoostingRegressor(n_estimators=500)
xgb_reg = XGBRegressor(n_estimators=500)
lgbm_reg = LGBMRegressor(n_estimators=500)
for model in [rf_reg, gbm_reg, xgb_reg, lgbm_reg]:
    get_model_predict(model, X_train.values, X_test.values, y_train.values, y_test.values, is_expm1=True)
### RandomForestRegressor ###
rmsle:  0.3537731884391265 rmse:  50.29169038323934 mae:  31.1784434147597
### GradientBoostingRegressor ###
rmsle:  0.32982335072461555 rmse:  53.32399144802179 mae:  32.73387650916991
### XGBRegressor ###
rmsle:  0.3422048283640822 rmse:  51.73158151762133 mae:  31.25122169453128
### LGBMRegressor ###
rmsle:  0.3188456499157369 rmse:  47.21464677592674 mae:  29.028770412428244
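
For parity with the coefficient plots above, tree ensembles expose feature_importances_; a sketch for the fitted LightGBM model (column names are taken from X_features_ohe, since the models were fit on .values):

# top-20 feature importances of the fitted LightGBM model
importances = pd.Series(lgbm_reg.feature_importances_, index=X_features_ohe.columns)
imp_sort = importances.sort_values(ascending=False)[:20]
sns.barplot(x=imp_sort.values, y=imp_sort.index)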
