The functionality for generating new features

The feature-encoders library includes a few feature generators:

  • TrendFeatures: Generates time trend features.

  • DatetimeFeatures: Generates date and time features (such as the month of the year or the hour of the week).

  • CyclicalFeatures: Creates cyclical (seasonal) features as Fourier terms (similarly to the way the Prophet library generates seasonality features).

All feature generators generate pandas DataFrames, and they all have two common parameters:

remainder : str, {'drop', 'passthrough'}, default='passthrough'
    By specifying `remainder='passthrough'`, all the remaining columns of the
    input dataset will be automatically passed through (concatenated with the
    output of the transformer).
replace : bool, default=False
    Specifies whether replacing an existing column with the same name is allowed
    (when `remainder=passthrough`).
[1]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

%matplotlib inline
[2]:
from feature_encoders.generate import CyclicalFeatures, DatetimeFeatures, TrendFeatures

Load demo data

[3]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]

Create time trend features

The ds argument corresponds to the name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index.

[4]:
enc = TrendFeatures(ds=None, name='growth', remainder='drop')
features = enc.fit_transform(data)
[5]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(features['growth'], label='growth')
    ax.legend(loc='upper left')
_images/Tutorial_Feature_Generation_7_0.png

Add date and time features

The subset argument corresponds to the names of the features to generate. If None, all features will be produced: ‘month’, ‘week’, ‘dayofyear’, ‘dayofweek’, ‘hour’, ‘hourofweek’. The last 2 features are generated only if the timestep of the input’s feature (or index if feature is None) is smaller than pandas.Timedelta(days=1).

[6]:
enc = DatetimeFeatures(ds=None, subset=None, remainder='drop')
features = enc.fit_transform(data)
features.columns
[6]:
Index(['month', 'week', 'dayofyear', 'dayofweek', 'hour', 'hourofweek'], dtype='object')
[7]:
enc = DatetimeFeatures(ds=None, remainder='drop', subset=['month', 'hourofweek'])
features = enc.fit_transform(data)
features.columns
[7]:
Index(['month', 'hourofweek'], dtype='object')

Encode cyclical (seasonal) features

The encoder is parameterized by period (number of days in one period) and fourier_order (number of Fourier components to use).

It can provide default values for period and fourier_order if seasonality is one of daily, weekly or yearly.

[8]:
daily_consumption = data[['consumption']].resample('D').sum()
[9]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    daily_consumption.plot(ax=ax, alpha=0.5)
_images/Tutorial_Feature_Generation_13_0.png

The number of seasonality features is always twice the fourier_order.

[10]:
enc = CyclicalFeatures(ds=None, seasonality='yearly', fourier_order=3, remainder='drop')
features = enc.fit_transform(daily_consumption)
features.columns
[10]:
Index(['yearly_delim_0', 'yearly_delim_1', 'yearly_delim_2', 'yearly_delim_3',
       'yearly_delim_4', 'yearly_delim_5'],
      dtype='object')

Now let’s plot the new features:

[11]:
with plt.style.context('seaborn-whitegrid'):
    fig, axs = plt.subplots(2*enc.fourier_order, figsize=(14, 7), dpi=96)

    for i, col in enumerate(features.columns):
        features[col].plot(ax=axs[i])
        axs[i].set_xlabel(None)

fig.tight_layout()
_images/Tutorial_Feature_Generation_17_0.png

Let’s also see how well this transformation works:

[12]:
regr = LinearRegression(fit_intercept=True).fit(features, daily_consumption)
pred = regr.predict(features)
[13]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    daily_consumption.plot(ax=ax, alpha=0.5)
    pd.Series(pred.squeeze(), index=daily_consumption.index).plot(ax=ax)
_images/Tutorial_Feature_Generation_20_0.png