The functionality for composing linear model features

feature-encoders includes a ModelStructure class for aggregating feature generators and encoders into main effect and pairwise interaction terms for linear regression models.

A ModelStructure instance can get information about features and encoders either from YAML files or through its API.

[1]:
import calendar
import json
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
[2]:
from feature_encoders.utils import load_config
from feature_encoders.compose import ModelStructure, FeatureComposer
from feature_encoders.generate import DatetimeFeatures
from feature_encoders.models import SeasonalPredictor

Reading information from YAML files

feature-encoders expects two YAML files:

Feature generator file

A file that provides a mapping between the name of a feature generator and the classes that should be used for the validation of its inputs and for its creation:

trend:
  validate: validate.TrendSchema
  generate: generate.TrendFeatures

datetime:
  validate: validate.DatetimeSchema
  generate: generate.DatetimeFeatures

cyclical:
  validate: validate.CyclicalSchema
  generate: generate.CyclicalFeatures

By default, ModelStructure searches in feature_encoders.config to find the validation and generation classes. Classes from other packages can be used by providing their fully qualified names.
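The lookup described above can be pictured with a short sketch. It is not the library's own code: `resolve_class` is a hypothetical helper, and standard-library modules stand in for the feature-encoders packages so that the example runs on its own. A name is tried first relative to a default package and then as a fully qualified name.

```python
import importlib


def resolve_class(path, default_package):
    """Return the class named by ``path``.

    ``path`` is tried first relative to ``default_package`` and then as a
    fully qualified name. Hypothetical helper, not part of feature-encoders.
    """
    module_path, _, class_name = path.rpartition(".")
    for candidate in (f"{default_package}.{module_path}", module_path):
        try:
            module = importlib.import_module(candidate)
        except ModuleNotFoundError:
            continue  # not found under this prefix, try the next one
        if hasattr(module, class_name):
            return getattr(module, class_name)
    raise ImportError(f"cannot resolve {path!r}")


# A relative name is looked up inside the default package ...
decoder_cls = resolve_class("decoder.JSONDecoder", default_package="json")
# ... while a fully qualified name falls through to a plain import.
ordered_cls = resolve_class("collections.OrderedDict", default_package="json")
```

With this kind of fallback, entries such as `generate.TrendFeatures` and entries that spell out a full package path can live side by side in the same file.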

Model configuration file

A model configuration file has three sections: (a) added features, (b) regressors and (c) interactions.

Added features

The information in this section is passed to one of the feature generators in feature_encoders.generate:

add_features:
  time: # the name of the generator
    ds: null
    type: datetime
    remainder: passthrough
    subset: month, hourofweek

Regressors

Each regressor entry specifies the regressor's name, the feature to encode in order to create it, the type of the encoder (linear, spline or categorical), and the parameters to pass to the corresponding encoder class from feature_encoders.encode:

regressors:
  month:                 # the name of the regressor
    feature: month       # the name of the feature
    type: categorical
    max_n_categories: null
    encode_as: onehot

  tow:                   # the name of the regressor
    feature: hourofweek  # the name of the feature
    type: categorical
    max_n_categories: 60
    encode_as: onehot

  flex_temperature:
    feature: temperature
    type: spline
    n_knots: 5
    degree: 1
    strategy: uniform
    extrapolation: constant
    interaction_only: true  # if True, it will not be included in the main features
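To give a feel for what the spline settings of flex_temperature could mean, here is a minimal sketch of a degree-1 spline basis with uniformly spaced knots and constant extrapolation. It is one reading of the parameters above, not the SplineEncoder implementation: `linear_spline_basis` is a hypothetical function and the temperature range is made up.

```python
def linear_spline_basis(x, lower, upper, n_knots=5):
    """Evaluate a degree-1 (hat-function) basis at x with uniform knots.

    Values outside [lower, upper] are clamped to the boundary, which is
    what makes the extrapolation constant.
    """
    step = (upper - lower) / (n_knots - 1)
    knots = [lower + i * step for i in range(n_knots)]
    x = min(max(x, lower), upper)
    # Each basis function peaks at its knot and decays linearly to zero
    # at the neighbouring knots.
    return [max(0.0, 1.0 - abs(x - knot) / step) for knot in knots]


# Assumed temperature range of 0 to 40 degrees, for illustration only.
inside = linear_spline_basis(12.5, lower=0.0, upper=40.0)
below = linear_spline_basis(-5.0, lower=0.0, upper=40.0)
```

At any point at most two neighbouring basis functions are active and their values sum to one, so a linear model on these columns fits a piecewise-linear curve in temperature.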

Interactions

An interaction can introduce new regressors, reuse regressors already defined in the regressors section, or override the parameters of those regressors within the interaction:

interactions:
  tow, flex_temperature:
    tow:
      max_n_categories: 2
      stratify_by: temperature
      min_samples_leaf: 15

Load configuration files

[3]:
model_conf, feature_conf = load_config(model='towt', features='default')
[4]:
print(json.dumps(model_conf, indent=4))
{
    "add_features": {
        "time": {
            "type": "datetime",
            "subset": "month, hourofweek"
        }
    },
    "regressors": {
        "month": {
            "feature": "month",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "tow": {
            "feature": "hourofweek",
            "type": "categorical",
            "max_n_categories": 60,
            "encode_as": "onehot"
        },
        "lin_temperature": {
            "feature": "temperature",
            "type": "linear"
        },
        "flex_temperature": {
            "feature": "temperature",
            "type": "spline",
            "n_knots": 5,
            "degree": 1,
            "strategy": "uniform",
            "extrapolation": "constant",
            "include_bias": true,
            "interaction_only": true
        }
    },
    "interactions": {
        "tow, flex_temperature": {
            "tow": {
                "max_n_categories": 2,
                "stratify_by": "temperature",
                "min_samples_leaf": 15
            }
        }
    }
}
[5]:
print(json.dumps(feature_conf, indent=4))
{
    "trend": {
        "validate": "validate.TrendSchema",
        "generate": "generate.TrendFeatures"
    },
    "datetime": {
        "validate": "validate.DatetimeSchema",
        "generate": "generate.DatetimeFeatures"
    },
    "cyclical": {
        "validate": "validate.CyclicalSchema",
        "generate": "generate.CyclicalFeatures"
    }
}

Create ModelStructure

[6]:
model_structure = ModelStructure.from_config(model_conf, feature_conf)
[7]:
for key, val in model_structure.components.items():
    print(key, '-->', val.keys())
add_features --> dict_keys(['time'])
main_effects --> dict_keys(['month', 'tow', 'lin_temperature'])
interactions --> dict_keys([('tow', 'flex_temperature')])
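The grouping shown in this output can be reproduced with a small sketch that works on a plain configuration dictionary. `split_components` is a hypothetical helper written for illustration; it assumes only what the configuration comments state, namely that a regressor flagged as interaction_only is left out of the main effects.

```python
def split_components(model_conf):
    """Group a model configuration into the three kinds of components."""
    main_effects = {
        name: props
        for name, props in model_conf.get("regressors", {}).items()
        if not props.get("interaction_only", False)
    }
    interactions = {
        tuple(part.strip() for part in pair.split(",")): props
        for pair, props in model_conf.get("interactions", {}).items()
    }
    return {
        "add_features": model_conf.get("add_features", {}),
        "main_effects": main_effects,
        "interactions": interactions,
    }


# Trimmed copy of the configuration printed earlier.
conf = {
    "add_features": {"time": {"type": "datetime"}},
    "regressors": {
        "month": {"feature": "month", "type": "categorical"},
        "tow": {"feature": "hourofweek", "type": "categorical"},
        "lin_temperature": {"feature": "temperature", "type": "linear"},
        "flex_temperature": {"feature": "temperature", "type": "spline",
                             "interaction_only": True},
    },
    "interactions": {"tow, flex_temperature": {"tow": {"max_n_categories": 2}}},
}

components = split_components(conf)
```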

Create FeatureComposer

Given the model structure, we can create and apply a FeatureComposer:

[8]:
composer = FeatureComposer(model_structure)

Load demo data

[9]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]

Use the FeatureComposer

[10]:
X = data[['temperature']]
y = data['consumption']

composer = composer.fit(X, y)

The fit method of the composer calls two methods: _create_new_features and _create_encoders. The feature generators are applied in the same order that they were declared in the YAML configuration file.

[11]:
for item in composer.added_features_:
    print(item)
DatetimeFeatures(subset=['month', 'hourofweek'])
[12]:
for name, encoder in composer.encoders_['main_effects'].items():
    print('-->', name)
    print(encoder)
--> month
CategoricalEncoder(feature='month')
--> tow
CategoricalEncoder(feature='hourofweek', max_n_categories=60)
--> lin_temperature
IdentityEncoder(feature='temperature')
[13]:
for pair_name, encoder in composer.encoders_['interactions'].items():
    print('-->', pair_name)
    print(encoder)
--> ('tow', 'flex_temperature')
ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek',
                                                 max_n_categories=2,
                                                 min_samples_leaf=15,
                                                 stratify_by=['temperature']),
                  encoder_num=SplineEncoder(degree=1, feature='temperature'))

After fitting, a composer has a component_names_ attribute:

[14]:
composer.component_names_
[14]:
['lin_temperature', 'month', 'tow', 'tow:flex_temperature']

It also has a component_matrix attribute that shows how the different columns of the design matrix correspond to the different components. This allows us to break down a model’s prediction into the additive contribution of each component.

[15]:
composer.component_matrix
[15]:
component lin_temperature month tow tow:flex_temperature
col
0 0 1 0 0
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
... ... ... ... ...
78 0 0 0 1
79 0 0 0 1
80 0 0 0 1
81 0 0 0 1
82 0 0 0 1

83 rows × 4 columns
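The additive breakdown can be illustrated with a toy example. The numbers below are made up and the helper `component_contributions` is hypothetical; the only idea taken from the text is that an indicator matrix assigns every design-matrix column to exactly one component.

```python
# Toy design matrix: 2 observations x 4 columns (values are made up).
design_matrix = [
    [1.0, 0.0, 18.0, 0.5],
    [0.0, 1.0, 21.0, 0.2],
]
coefficients = [2.0, 3.0, 0.5, 4.0]

# One 0/1 mask per component, playing the role of component_matrix.
component_matrix = {
    "month": [1, 1, 0, 0],
    "lin_temperature": [0, 0, 1, 0],
    "tow:flex_temperature": [0, 0, 0, 1],
}


def component_contributions(row, coefficients, component_matrix):
    """Sum coefficient * value over the columns that belong to each component."""
    return {
        component: sum(flag * x * beta
                       for flag, x, beta in zip(mask, row, coefficients))
        for component, mask in component_matrix.items()
    }


breakdown = [component_contributions(row, coefficients, component_matrix)
             for row in design_matrix]
totals = [sum(x * beta for x, beta in zip(row, coefficients))
          for row in design_matrix]
```

Since every column belongs to exactly one component, the contributions in each row of `breakdown` add up to the corresponding entry of `totals`.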

The design matrix is constructed by transforming the data:

[16]:
design_matrix = composer.transform(X)
[17]:
assert design_matrix.shape[0] == X.shape[0]
assert design_matrix.shape[1] == composer.component_matrix.shape[0]
[18]:
n_features = 0

for encoder in composer.encoders_['main_effects'].values():
    n_features += encoder.n_features_out_

for encoder in composer.encoders_['interactions'].values():
    n_features += encoder.n_features_out_

assert design_matrix.shape[1] == n_features

Using the API

An example of using the ModelStructure API can be found in feature_encoders.models.SeasonalPredictor:

def _create_composer(self):
    model_structure = ModelStructure()

    if self.add_trend:
        model_structure = model_structure.add_new_feature(
            name="added_trend",
            fgen_type=TrendFeatures(
                ds=self.ds,
                name="growth",
                remainder="passthrough",
                replace=False,
            ),
        )
        model_structure = model_structure.add_main_effect(
            name="trend",
            enc_type=IdentityEncoder(
                feature="growth",
                as_filter=False,
                include_bias=False,
            ),
        )

    for seasonality, props in self.seasonalities_.items():
        condition_name = props["condition_name"]

        model_structure = model_structure.add_new_feature(
            name=seasonality,
            fgen_type=CyclicalFeatures(
                seasonality=seasonality,
                ds=self.ds,
                period=props.get("period"),
                fourier_order=props.get("fourier_order"),
                remainder="passthrough",
                replace=False,
            ),
        )

        if condition_name is None:
            model_structure = model_structure.add_main_effect(
                name=seasonality,
                enc_type=IdentityEncoder(
                    feature=seasonality,
                    as_filter=True,
                    include_bias=False,
                ),
            )
        else:
            model_structure = model_structure.add_interaction(
                lenc_name=condition_name,
                renc_name=seasonality,
                lenc_type=CategoricalEncoder(
                    feature=condition_name, encode_as="onehot"
                ),
                renc_type=IdentityEncoder(
                    feature=seasonality, as_filter=True, include_bias=False
                ),
            )
    return FeatureComposer(model_structure)
[19]:
model = SeasonalPredictor(
        ds=None,
        add_trend=True,
        yearly_seasonality="auto",
        weekly_seasonality=False,
        daily_seasonality=False,
)

We can add a different daily seasonality per day of week:

[20]:
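# Add the day of week and turn it into its abbreviated name (Mon, Tue, ...)
X = DatetimeFeatures(subset='dayofweek').fit_transform(X)
X['dayofweek'] = X['dayofweek'].map(lambda x: calendar.day_abbr[x])
# One indicator column per day, used below as the seasonality conditions
X = X.merge(
    pd.get_dummies(X['dayofweek']),
    left_index=True,
    right_index=True,
).drop('dayofweek', axis=1)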
[21]:
for i in range(7):
    day = calendar.day_abbr[i]
    model.add_seasonality(
        f"daily_on_{day}", period=1, fourier_order=4, condition_name=day
    )
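The period and fourier_order arguments suggest a Fourier-series representation of each seasonality. The sketch below assumes the usual expansion into sine and cosine pairs and assumes that time is measured in days, so that period=1 is a daily cycle. `fourier_features` is a hypothetical function, not the CyclicalFeatures implementation.

```python
import math


def fourier_features(t, period, fourier_order):
    """Return [sin_1, cos_1, ..., sin_K, cos_K] for a time t.

    t is measured in the same units as period (assumed to be days here).
    """
    features = []
    for k in range(1, fourier_order + 1):
        angle = 2.0 * math.pi * k * t / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


# 0.25 days after the start of the series, i.e. 06:00 on the first day.
row = fourier_features(0.25, period=1, fourier_order=4)
```

Under these assumptions a seasonality with fourier_order=4 contributes eight columns, and the values repeat exactly once per period.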
[22]:
model = model.fit(X, y)
[23]:
for item in model.composer_.added_features_:
    print(item)
TrendFeatures()
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Mon')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Tue')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Wed')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Thu')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Fri')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Sat')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Sun')
CyclicalFeatures(fourier_order=6, period=365.25, seasonality='yearly')
[24]:
for name, encoder in model.composer_.encoders_['main_effects'].items():
    print('-->', name)
    print(encoder)
--> trend
IdentityEncoder(feature='growth')
--> yearly
IdentityEncoder(as_filter=True, feature='yearly')
[25]:
for pair_name, encoder in model.composer_.encoders_['interactions'].items():
    print('-->', pair_name)
    print(encoder)
--> ('Mon', 'daily_on_Mon')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Mon'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Mon'))
--> ('Tue', 'daily_on_Tue')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Tue'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Tue'))
--> ('Wed', 'daily_on_Wed')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Wed'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Wed'))
--> ('Thu', 'daily_on_Thu')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Thu'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Thu'))
--> ('Fri', 'daily_on_Fri')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Fri'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Fri'))
--> ('Sat', 'daily_on_Sat')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Sat'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Sat'))
--> ('Sun', 'daily_on_Sun')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Sun'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Sun'))
[26]:
prediction = model.predict(X)
[27]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    prediction['consumption'][:1344].plot(ax=ax, alpha=0.8)  # 2 weeks of data
    y[:1344].plot(ax=ax, alpha=0.5)
[Figure: predicted and observed consumption for the first 1344 samples (2 weeks of data)]

Consistency checks

[28]:
design_matrix = model.composer_.transform(X)
[29]:
for i in range(7):
    day = calendar.day_abbr[i]

    subset_index = model.composer_.component_matrix[
                    model.composer_.component_matrix[f'{day}:daily_on_{day}'] == 1
                   ].index
    subset = pd.DataFrame(design_matrix[:, subset_index], index=X.index)

    features_on = subset.columns[(subset.loc[X[X[day]==1].index] == 0).all()]
    features_off = subset.columns[(subset.loc[X[X[day]==0].index] == 0).all()]

    assert features_on.intersection(features_off).empty
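Why this check should pass can be seen in a stripped-down sketch of a categorical-by-linear interaction. It assumes that the interaction creates one block of numeric columns per category level and zeroes the blocks of the inactive levels. `interact` is a hypothetical helper, not the ICatLinearEncoder implementation.

```python
def interact(indicator, numeric_row, levels=(0, 1)):
    """Build one block of numeric features per category level.

    Only the block of the active level carries the values; all other
    blocks are zero.
    """
    out = []
    for level in levels:
        out.extend(value if indicator == level else 0.0
                   for value in numeric_row)
    return out


# Two made-up seasonal features, on a Monday and on any other day.
on_monday = interact(1, [0.3, -0.7])
off_monday = interact(0, [0.3, -0.7])
```

The columns that vanish on Mondays and the columns that vanish on the other days form separate blocks, which is the property the assertion above verifies.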

The model works even if we replace:

model_structure = model_structure.add_interaction(
    lenc_name=condition_name,
    renc_name=seasonality,
    lenc_type=CategoricalEncoder(
        feature=condition_name, encode_as="onehot"
    ),
    renc_type=IdentityEncoder(
        feature=seasonality, as_filter=True, include_bias=False
    ),
)

with

model_structure = model_structure.add_interaction(
    lenc_name=condition_name,
    renc_name=seasonality,
    lenc_type="categorical",
    renc_type="linear",
    left_feature=condition_name,
    right_feature=seasonality,
    **{
        condition_name: {"encode_as": "onehot"},
        seasonality: {"as_filter": True, "include_bias": False},
    },
)

because the FeatureComposer maps “categorical” to CategoricalEncoder, “linear” to IdentityEncoder and “spline” to SplineEncoder, and passes all additional keyword arguments to the corresponding initializers.
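The name-to-class mapping can be sketched as a small registry. The two dataclasses below are stand-ins defined only for this example, not the real encoders, and `build_encoder` is a hypothetical helper. The sketch accepts either a ready-made encoder object or a type name plus keyword arguments.

```python
from dataclasses import dataclass


@dataclass
class CategoricalStandIn:
    """Stand-in for a categorical encoder."""
    feature: str
    encode_as: str = "onehot"


@dataclass
class IdentityStandIn:
    """Stand-in for a linear (identity) encoder."""
    feature: str
    as_filter: bool = False
    include_bias: bool = False


ENCODER_REGISTRY = {
    "categorical": CategoricalStandIn,
    "linear": IdentityStandIn,
}


def build_encoder(enc_type, **kwargs):
    """Look enc_type up by name, or return it unchanged if it is an object."""
    if isinstance(enc_type, str):
        return ENCODER_REGISTRY[enc_type](**kwargs)
    return enc_type


by_name = build_encoder("linear", feature="daily_on_Mon", as_filter=True)
by_object = build_encoder(IdentityStandIn(feature="daily_on_Mon", as_filter=True))
```

Both calls produce equivalent encoders, which is why the two ways of declaring the interaction are interchangeable.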