Encoding numerical features

This section explains the way numerical encoding can be carried out using feature_encoders.

The SplineEncoder takes pandas.DataFrames as input and generates numpy.ndarrays as output.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

%matplotlib inline
[2]:
from feature_encoders.encode import SplineEncoder

We can create some synthetic data:

[3]:
def f(x):
    return 10 + (x * np.sin(x))
[4]:
x_support = np.linspace(0, 15, 100)
y_support = f(x_support)

x_train = np.sort(np.random.choice(x_support[15:-15], size=25, replace=False))
y_train = f(x_train)
[5]:
X_train = pd.DataFrame(data=x_train, columns=['x'])
X_support = pd.DataFrame(data=x_support, columns=['x'])
[6]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
    ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
    ax.legend(loc='upper left')
_images/Tutorial_Numerical_7_0.png

Cubic spline without extrapolation:

[7]:
enc = SplineEncoder(feature='x', n_knots=5, degree=3, strategy='uniform',
                        extrapolation='constant', include_bias=True,)

model = make_pipeline(enc, LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
pred = model.predict(X_support)

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
    ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
    ax.plot(X_support, pred, label='spline approximation')
    ax.legend(loc='upper left')
_images/Tutorial_Numerical_9_0.png

With linear extrapolation:

[8]:
enc = SplineEncoder(feature='x', n_knots=5, degree=3, strategy='uniform',
                        extrapolation='linear', include_bias=True,)

model = make_pipeline(enc, LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
pred = model.predict(X_support)

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
    ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
    ax.plot(X_support, pred, label='spline approximation')
    ax.legend(loc='upper left')
_images/Tutorial_Numerical_11_0.png

An application of the spline encoder

The TOWT model for predicting the energy consumption of a building estimates the temperature effect separately for hours of the week with high and with low energy consumption in order to distinguish between occupied and unoccupied periods.

To this end, a flexible curve is fitted on the consumption~temperature relationship, and if more than the 65% of the data points that correspond to a specific hour-of-week are above the fitted curve, the corresponding hour is flagged as “Occupied”, otherwise it is flagged as “Unoccupied.”

We can apply this approach using feature_encoders functionality.

Load demo data

[9]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
[10]:
dmatrix = SplineEncoder(feature='temperature',
                        degree=1,
                        strategy='uniform'
          ).fit_transform(data)

model = LinearRegression(fit_intercept=False).fit(dmatrix, data['consumption'])
pred = pd.DataFrame(
            data=model.predict(dmatrix),
            index=data.index,
            columns=['consumption']
       )
[11]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(data['temperature'], data['consumption'], 'o', alpha=0.02)
    ax.plot(data['temperature'], pred['consumption'])
_images/Tutorial_Numerical_16_0.png
[12]:
resid = data[['consumption']] - pred[['consumption']]
mask = resid > 0
mask['hourofweek'] = 24 * mask.index.dayofweek + mask.index.hour
occupied = mask.groupby('hourofweek')['consumption'].mean() > 0.65

data['hourofweek'] = 24 * data.index.dayofweek + data.index.hour
data['occupied'] = data['hourofweek'].map(lambda x: occupied[x])
[13]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.scatter(data.loc[data['occupied'], 'temperature'],
               data.loc[data['occupied'], 'consumption'],
               s=2, alpha=0.2, label='Probably occupied')

    ax.scatter(data.loc[~data['occupied'], 'temperature'],
               data.loc[~data['occupied'], 'consumption'],
               s=2, alpha=0.2, label='Probably not occupied')

    ax.legend(fancybox=True, frameon=True, loc='upper left')
_images/Tutorial_Numerical_18_0.png