Kaggle-Bike Sharing Demand

April 25, 2023

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world. Because each rental is recorded, bike sharing systems also function as a sensor network that can be used to study mobility in a city.

Goal: We predict the total count of bikes rented during each hour covered by the test set.
Metric: Root Mean Squared Logarithmic Error (RMSLE)

\[\sqrt{\frac{1}{n} \sum_{i=1}^n (\log(p_i + 1) - \log(a_i+1))^2 }\]
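
In the formula, p_i is the predicted count and a_i is the actual count for hour i. A minimal sketch of the metric in NumPy follows; the function name rmsle_example and the sample numbers are illustrative assumptions, not part of the competition data.

```python
import numpy as np

def rmsle_example(actual, predicted):
    """RMSLE as defined above; np.log1p(x) computes log(x + 1)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Illustrative values only: the same absolute error of 50 bikes in both cases
under = rmsle_example([100], [50])    # under-prediction
over = rmsle_example([100], [150])    # over-prediction
print(under, over)
```

Because the error is measured on a log scale, under-predicting by 50 bikes costs more than over-predicting by 50 bikes in this example.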

Data
- Hourly rental data spanning two years
- The training set consists of the first 19 days of each month, while the test set covers the 20th through the end of each month.
- 12 columns:
  - datetime - hourly date + timestamp (loaded by pandas as dtype object, so it needs converting)
  - season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
  - holiday - whether the day is considered a holiday (1 = holiday that is not a weekend, 0 = not a holiday)
  - workingday - whether the day is neither a weekend nor holiday
  - weather
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  - temp - temperature in Celsius
  - atemp - “feels like” temperature in Celsius
  - humidity - relative humidity
  - windspeed - wind speed
  - casual - number of non-registered user rentals initiated
  - registered - number of registered user rentals initiated
  - count - number of total rentals (the target, y_target)

Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


import warnings
warnings.filterwarnings('ignore')

bike_df = pd.read_csv('input/bike_train.csv')
print(bike_df.shape)
print(bike_df.describe())
print(bike_df.info())
print(bike_df.head(3))

(10886, 12)
             season       holiday    workingday       weather         temp  \
count  10886.000000  10886.000000  10886.000000  10886.000000  10886.00000   
mean       2.506614      0.028569      0.680875      1.418427     20.23086   
std        1.116174      0.166599      0.466159      0.633839      7.79159   
min        1.000000      0.000000      0.000000      1.000000      0.82000   
25%        2.000000      0.000000      0.000000      1.000000     13.94000   
50%        3.000000      0.000000      1.000000      1.000000     20.50000   
75%        4.000000      0.000000      1.000000      2.000000     26.24000   
max        4.000000      1.000000      1.000000      4.000000     41.00000   

              atemp      humidity     windspeed        casual    registered  \
count  10886.000000  10886.000000  10886.000000  10886.000000  10886.000000   
mean      23.655084     61.886460     12.799395     36.021955    155.552177   
std        8.474601     19.245033      8.164537     49.960477    151.039033   
min        0.760000      0.000000      0.000000      0.000000      0.000000   
25%       16.665000     47.000000      7.001500      4.000000     36.000000   
50%       24.240000     62.000000     12.998000     17.000000    118.000000   
75%       31.060000     77.000000     16.997900     49.000000    222.000000   
max       45.455000    100.000000     56.996900    367.000000    886.000000   

              count  
count  10886.000000  
mean     191.574132  
std      181.144454  
min        1.000000  
25%       42.000000  
50%      145.000000  
75%      284.000000  
max      977.000000  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB
None
              datetime  season  holiday  workingday  weather  temp   atemp  \
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635   

   humidity  windspeed  casual  registered  count  
0        81        0.0       3          13     16  
1        80        0.0       8          32     40  
2        80        0.0       5          27     32

Null data check

bike_df.isnull().sum() #No NaN.

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

The distribution of Target Label

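plt.title('Original count Histogram')
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(bike_df['count'], kde=True, stat='density')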

<AxesSubplot:title={'center':'Original count Histogram'}, xlabel='count', ylabel='Density'>

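plt.title('Log transformed Count Histogram')
# log1p(x) = log(x + 1) reduces the right skew of the raw counts
log_count = np.log1p(bike_df['count'])
# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(log_count, kde=True, stat='density')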

<AxesSubplot:title={'center':'Log transformed Count Histogram'}, xlabel='count', ylabel='Density'>

bike_df['log_count'] = log_count
bike_df.drop(['count'], axis=1, inplace=True)
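
Because the models are trained on log1p(count), predictions later have to be mapped back to bike counts with np.expm1. A small sketch of that round trip, using a few count values that appear in the outputs above (1 and 977 are the minimum and maximum; 16 and 40 are the first two rows):

```python
import numpy as np

counts = np.array([1, 16, 40, 977])
log_counts = np.log1p(counts)       # forward transform applied to the target
restored = np.expm1(log_counts)     # inverse transform applied before scoring
print(log_counts)
print(restored)
```

The restored values match the original counts up to floating-point error, which is why the evaluation step can safely undo the transform.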

Feature Engineering/ Model Learning

Datetime

# Convert the datetime column from object (string) to datetime64
bike_df['datetime'] = pd.to_datetime(bike_df['datetime'])

# Extract the calendar parts with the vectorized .dt accessor
bike_df['year'] = bike_df['datetime'].dt.year
bike_df['month'] = bike_df['datetime'].dt.month
bike_df['day'] = bike_df['datetime'].dt.day
bike_df['hour'] = bike_df['datetime'].dt.hour
bike_df.head(3)

              datetime  season  holiday  workingday  weather  temp   atemp  \
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635   

   humidity  windspeed  casual  registered  log_count  year  month  day  hour  
0        81        0.0       3          13   2.833213  2011      1    1     0  
1        80        0.0       8          32   3.713572  2011      1    1     1  
2        80        0.0       5          27   3.496508  2011      1    1     2  

bike_df.drop(['datetime','casual','registered'], axis=1, inplace=True)
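
One likely reason for dropping casual and registered is that they add up to the target, so keeping them would leak the answer into the features. The sketch below checks this on the first three training rows shown earlier; that the identity holds for every row is an assumption, since only these three rows are verified here.

```python
import pandas as pd

# First three training rows, copied from the head() output above
sample = pd.DataFrame({
    'casual': [3, 8, 5],
    'registered': [13, 32, 27],
    'count': [16, 40, 32],
})

# True if casual + registered equals count in every sampled row
leaks_target = (sample['casual'] + sample['registered'] == sample['count']).all()
print(leaks_target)
```

On the full training set, the same comparison could be run against the loaded DataFrame before count is replaced by log_count.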

Evaluation

from sklearn.metrics import mean_absolute_error

def rmsle(y, pred):
    # log1p(x) = log(x + 1) stays finite for zero counts, avoiding -inf/NaN
    log_y = np.log1p(y)
    log_pred = np.log1p(pred)
    # computed by hand instead of using sklearn's mean_squared_log_error
    return np.sqrt(np.mean((log_y - log_pred) ** 2))

def rmse(y, pred):
    return np.sqrt(np.mean((y - pred) ** 2))

# Print RMSLE, RMSE, and MAE for a set of predictions
def evaluate_regr(y, pred):
    rmsle_val = rmsle(y, pred)
    rmse_val = rmse(y, pred)
    mae_val = mean_absolute_error(y, pred)
    print('rmsle: ', rmsle_val, 'rmse: ', rmse_val, 'mae: ', mae_val)

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso

y_target = bike_df['log_count']
X_features = bike_df.drop(['log_count'], axis=1, inplace=False)

X_train, X_test, y_train, y_test = train_test_split(X_features, y_target, test_size=0.3, random_state=0)

lr = LinearRegression()
lr.fit(X_train,y_train)
pred = lr.predict(X_test)

y_test_exp = np.expm1(y_test)
pred_exp = np.expm1(pred)

evaluate_regr(y_test_exp, pred_exp)

rmsle:  1.016826598200345 rmse:  162.59426809004643 mae:  109.28615860077495

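# Note: y_test and pred are still on the log1p scale here, not in bike counts.
# y_test.values replaces y_test[:, np.newaxis], which recent pandas versions no longer allow on a Series.
result_df = pd.DataFrame(y_test.values, columns=['real_count'])
result_df['pred_count'] = np.round(pred)
result_df['diff'] = np.abs(result_df['real_count'] - result_df['pred_count'])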

print(result_df.sort_values('diff',ascending=False)[:5])

      real_count  pred_count      diff
1521    0.693147         4.0  3.306853
786     0.693147         4.0  3.306853
1242    0.693147         4.0  3.306853
3168    0.693147         4.0  3.306853
3238    0.693147         4.0  3.306853
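
The table above compares values on the log1p scale (0.693147 is log1p(1), that is, a single rental). To read the errors in bikes, both columns can be converted back with np.expm1 first. A minimal sketch follows; top_errors is a hypothetical helper and the demo arrays are made-up stand-ins for y_test and pred.

```python
import numpy as np
import pandas as pd

def top_errors(y_true_log, pred_log, n_tops=5):
    """Hypothetical helper: largest absolute errors, measured in bike counts."""
    result = pd.DataFrame({
        'real_count': np.expm1(np.asarray(y_true_log, dtype=float)),
        'pred_count': np.round(np.expm1(np.asarray(pred_log, dtype=float))),
    })
    result['diff'] = np.abs(result['real_count'] - result['pred_count'])
    return result.sort_values('diff', ascending=False).head(n_tops)

# Made-up log-scale values standing in for y_test and pred
y_demo = np.log1p([1, 40, 300])
pred_demo = np.log1p([54, 38, 150])
top = top_errors(y_demo, pred_demo, n_tops=2)
print(top)
```

With the post's variables, the call would be top_errors(y_test, pred), under the assumption that both are still log-transformed at that point.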

coef = pd.Series(lr.coef_, index=X_features.columns)
coef_sort = coef.sort_values(ascending=False)
sns.barplot(x = coef_sort, y=coef_sort.index)

<AxesSubplot:>

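The coefficient of year is far too large. The year feature takes only two values (2011 and 2012): it is stored as an integer, but it is really categorical, and a linear model treats its numeric magnitude as meaningful. We therefore apply one-hot encoding to the categorical features.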

One-hot encoding

X_features_ohe = pd.get_dummies(X_features, columns=['year','month','day','hour','holiday','workingday','season','weather'])

X_train, X_test, y_train, y_test = train_test_split(X_features_ohe, y_target, test_size=0.3, random_state=0)
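
To illustrate what pd.get_dummies does to a column such as year, here is a small sketch on a toy frame. The toy DataFrame is an assumption for illustration; only the column names mirror the post.

```python
import pandas as pd

toy = pd.DataFrame({'year': [2011, 2012, 2011], 'temp': [9.84, 9.02, 20.5]})

# Each distinct year becomes its own indicator column; temp is left untouched
encoded = pd.get_dummies(toy, columns=['year'])
print(encoded.columns.tolist())
```

Each row has exactly one of year_2011 and year_2012 set, so the linear model learns a separate offset per year instead of a slope on the raw value.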

def get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=False):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    if is_expm1 :
        y_test = np.expm1(y_test)
        pred = np.expm1(pred)
    print('###',model.__class__.__name__,'###')
    evaluate_regr(y_test, pred)
#end of function get_model_predict


lr_reg = LinearRegression()
ridge_reg = Ridge(alpha=10)
lasso_reg = Lasso(alpha=0.01)

for model in [lr_reg, ridge_reg, lasso_reg]:
    get_model_predict(model, X_train, X_test, y_train, y_test, is_expm1=True)

### LinearRegression ###
rmsle:  0.5896336894543538 rmse:  97.68841046053633 mae:  63.382335548187164
### Ridge ###
rmsle:  0.5901367703437249 rmse:  98.52859077604543 mae:  63.89335277110783
### Lasso ###
rmsle:  0.6347518077052988 rmse:  113.21881019147763 mae:  72.80270669734962
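
The alpha values above are fixed by hand. GridSearchCV is imported earlier but not used, so one possible follow-up is to search over alpha with cross-validation. The sketch below runs on synthetic stand-in data; the candidate alpha grid, the scoring choice, and the 5-fold setting are all assumptions, not the author's settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the post this would be X_train and y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

param_grid = {'alpha': [0.1, 1, 10, 100]}   # assumed candidate values
search = GridSearchCV(Ridge(), param_grid,
                      scoring='neg_mean_squared_error', cv=5)
search.fit(X_demo, y_demo)

# best_score_ is a negated MSE, so flip the sign to report it
print(search.best_params_, -search.best_score_)
```

Since the post's target is log1p(count), the square root of the cross-validated MSE on that target plays the role of RMSLE.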

coef = pd.Series(lr_reg.coef_, index=X_features_ohe.columns)
coef_sort = coef.sort_values(ascending=False)[:20]
sns.barplot(x=coef_sort.values, y=coef_sort.index)

<AxesSubplot:>

After one-hot encoding, the largest coefficients belong to the month, year, weather, and day columns.
The coefficient values themselves have also increased.

Tree-based regressor

We fit Random Forest, GBM, XGBoost, and LightGBM regressors.
For XGBoost, we convert the DataFrames to NumPy ndarrays (via .values) before fitting.

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

rf_reg = RandomForestRegressor(n_estimators=500)
gbm_reg = GradientBoostingRegressor(n_estimators=500)
xgb_reg = XGBRegressor(n_estimators=500)
lgbm_reg = LGBMRegressor(n_estimators=500)

for model in [rf_reg, gbm_reg, xgb_reg, lgbm_reg]:
    get_model_predict(model, X_train.values, X_test.values, y_train.values, y_test.values, is_expm1=True)

### RandomForestRegressor ###
rmsle:  0.3537731884391265 rmse:  50.29169038323934 mae:  31.1784434147597
### GradientBoostingRegressor ###
rmsle:  0.32982335072461555 rmse:  53.32399144802179 mae:  32.73387650916991
### XGBRegressor ###
rmsle:  0.3422048283640822 rmse:  51.73158151762133 mae:  31.25122169453128
### LGBMRegressor ###
rmsle:  0.3188456499157369 rmse:  47.21464677592674 mae:  29.028770412428244


Haesong Choi
