Feature Encoders documentation

Installation

Using pip

python -m pip install feature-encoders

From source

To install feature_encoders from source, first clone the source repository:

git clone https://github.com/hebes-io/feature-encoders.git
cd feature-encoders

Next, install all dependencies using the requirements.txt file in the root of this repository:

python -m pip install -r requirements.txt

Once the dependencies are installed (staying inside the feature-encoders directory), execute:

python -m pip install .

feature_encoders package

Subpackages

feature_encoders.compose package

Module contents
class feature_encoders.compose._compose.FeatureComposer(model_structure: feature_encoders.compose._compose.ModelStructure)[source]

Generate linear features and pairwise interactions.

Parameters

model_structure (ModelStructure) – The structure of a linear regression model.

property component_matrix

Dataframe indicating which columns of the feature matrix correspond to which components.

Returns

feature_cols – A binary indicator dataframe. An entry is 1 if the corresponding column of the feature matrix is used in that component.

Return type

pandas.DataFrame

fit(X, y=None)[source]
transform(X)[source]
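Example (a minimal sketch on synthetic data; the temperature and consumption column names are illustrative only):

import numpy as np
import pandas as pd

from feature_encoders.compose import FeatureComposer, ModelStructure

# Hypothetical hourly data with one numerical feature
idx = pd.date_range("2021-01-01", periods=200, freq="H")
X = pd.DataFrame({"temperature": np.random.uniform(0.0, 30.0, 200)}, index=idx)
y = pd.DataFrame({"consumption": np.random.uniform(0.0, 1.0, 200)}, index=idx)

# A structure with a single linear main effect on `temperature`
structure = ModelStructure().add_main_effect(
    name="lin_temperature", enc_type="linear", feature="temperature"
)

composer = FeatureComposer(structure)
design_matrix = composer.fit(X, y).transform(X)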
class feature_encoders.compose._compose.ModelStructure(structure: Optional[Dict] = None, feature_map: Optional[Dict] = None)[source]

Capture the structure of a linear regression model.

The class validates and stores the details of a linear regression model: features, main effects and interactions.

Parameters
  • structure (Dict, optional) –

    A dictionary that includes information about the model. Example:

{
    'add_features': {
        'time': {
            'ds': None,
            'remainder': 'passthrough',
            'replace': False,
            'subset': ['month', 'hourofweek']
        }
    },
    'main_effects': {
        'month': {
            'feature': 'month',
            'max_n_categories': None,
            'encode_as': 'onehot',
            'interaction_only': False
        },
        'tow': {
            'feature': 'hourofweek',
            'max_n_categories': 60,
            'encode_as': 'onehot',
            'interaction_only': False
        },
        'lin_temperature': {
            'feature': 'temperature',
            'include_bias': False,
            'interaction_only': False
        }
    }
}
    

    Defaults to None.

  • feature_map (Dict, optional) –

    A mapping between a feature generator name and the classes for its validation and creation. Example:

{'datetime':
    {'validate': 'validate.DatetimeSchema',
     'generate': 'generate.DatetimeFeatures'}
}
    

    Defaults to None.

add_interaction(*, lenc_name: str, renc_name: str, lenc_type: Union[str, object], renc_type: Union[str, object], **kwargs)[source]

Add a pairwise interaction.

Parameters
  • lenc_name (str) – A name for the first part of the interaction pair.

  • renc_name (str) – A name for the second part of the interaction pair.

  • lenc_type (str or encoder object) – The type of the feature encoder to apply on the first part of the interaction pair.

  • renc_type (str or encoder object) – The type of the feature encoder to apply on the second part of the interaction pair.

  • **kwargs – Keyword arguments to be passed during the feature encoders’ initialization.

Raises

ValueError – If an interaction with the same name (lenc_name, renc_name) has already been added.

Returns

The updated ModelStructure instance.

Return type

ModelStructure

Example:

model = ModelStructure().add_interaction(
    lenc_name="is_Monday",
    renc_name="daily_seasonality",
    lenc_type="categorical",
    renc_type="linear",
    **{
        "is_Monday": {"feature": "is_Monday", "encode_as": "onehot"},
        "daily_seasonality": {"feature": "daily", "as_filter": True},
    },
)
add_main_effect(*, name: str, enc_type: Union[str, sklearn.base.BaseEstimator], **kwargs)[source]

Add a main effect.

Parameters
  • name (str) – A name for the main effect.

  • enc_type (str or encoder object) – The type of the feature encoder to apply on the main effect.

  • **kwargs – Keyword arguments to be passed during the feature encoder initialization. Ignored if enc_type is not a string.

Raises

ValueError – If an encoder with the same name has already been added.

Returns

The updated ModelStructure instance.

Return type

ModelStructure

add_new_feature(*, name: str, fgen_type: Union[str, sklearn.base.BaseEstimator], **kwargs)[source]

Add a feature generator.

Feature generators are applied to the input dataframe in the same order in which they were added.

Parameters
  • name (str) – A name for the feature generator.

  • fgen_type (str or sklearn-compatible transformer) – The feature generator to add. If it is a string, the corresponding class will be loaded based on the relevant entry in the feature_map dictionary.

  • **kwargs – Keyword arguments to be passed during the feature generator initialization. Ignored if fgen_type is not a string.

Raises

ValueError – If a feature generator with the same name has already been added.

Returns

The updated ModelStructure instance.

Return type

ModelStructure
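Example (a minimal sketch of the builder-style API; since feature_map defaults to None here, the generator is passed as a ready transformer instance rather than resolved by name):

from feature_encoders.compose import ModelStructure
from feature_encoders.generate import DatetimeFeatures

structure = (
    ModelStructure()
    # Generate `month` first, so the main effect below can refer to it
    .add_new_feature(
        name="time",
        fgen_type=DatetimeFeatures(subset=["month", "hourofweek"]),
    )
    .add_main_effect(name="month", enc_type="categorical",
                     feature="month", encode_as="onehot")
)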

property components
classmethod from_config(config: Dict, feature_map: Optional[Dict] = None)[source]

Create a ModelStructure instance from a configuration file.

Parameters
  • config (Dict) – A dictionary that includes information about the model.

  • feature_map (Dict, optional) – A mapping between a feature generator name and the classes for its validation and creation. Defaults to None.

Returns

A populated ModelStructure instance.

Return type

ModelStructure
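Example (a sketch that wires from_config to the bundled configuration loader; 'towt' and 'default' are the documented defaults of feature_encoders.utils.load_config):

from feature_encoders.compose import ModelStructure
from feature_encoders.utils import load_config

model_conf, feature_map = load_config(model="towt", features="default")
structure = ModelStructure.from_config(model_conf, feature_map)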

feature_encoders.encode package

Module contents
class feature_encoders.encode.CategoricalEncoder(*, feature, max_n_categories=None, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=1, max_features='auto', random_state=None, encode_as='onehot')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode categorical features.

If max_n_categories is not None and the number of unique values of the categorical feature (minus the excluded_categories) is larger than max_n_categories, the TargetClusterEncoder will be called.

If encode_as = ‘onehot’, the result comes from a TargetClusterEncoder + SafeOneHotEncoder pipeline, otherwise from a TargetClusterEncoder + SafeOrdinalEncoder one.

Parameters
  • feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.

  • max_n_categories (int, optional) – The maximum number of categories to produce. Defaults to None.

  • stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.

  • excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories are left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.

  • unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.

  • min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 1.

  • max_features (int, float or {"auto", "sqrt", "log2"}, optional) –

    The number of features that the decision tree considers when looking for the best split:

    • If int, then consider max_features features at each split of the decision tree

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split

    • If “auto”, then max_features=n_features

    • If “sqrt”, then max_features=sqrt(n_features)

    • If “log2”, then max_features=log2(n_features)

    • If None, then max_features=n_features

    Defaults to “auto”.

  • random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.

  • encode_as ({'onehot', 'ordinal'}, optional) –

    Method used to encode the transformed result.

    • If “onehot”, encode the transformed result with one-hot encoding and return a dense array

    • If “ordinal”, encode the transformed result as integer values

    Defaults to “onehot”.

fit(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.frame.DataFrame] = None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (pandas.DataFrame of shape (n_samples, 1), optional) – The target dataframe. Defaults to None.

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If the encoder is applied on numerical (float) data.

  • ValueError – If the number of categories minus the excluded_categories is larger than max_n_categories but target values (y) are not provided.

  • ValueError – If any of the values in excluded_categories is not found in the input data.

Returns

Fitted encoder.

Return type

CategoricalEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

The encoded features as a numpy array.

Return type

numpy array
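Example (a minimal sketch; the dayofweek column is hypothetical):

import pandas as pd

from feature_encoders.encode import CategoricalEncoder

X = pd.DataFrame({"dayofweek": list(range(7)) * 20})

# Integer columns are treated as categorical; the result is a dense
# numpy array with one column per category.
enc = CategoricalEncoder(feature="dayofweek", encode_as="onehot")
onehot = enc.fit(X).transform(X)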

class feature_encoders.encode.ICatEncoder(encoder_left: feature_encoders.encode._encoders.CategoricalEncoder, encoder_right: feature_encoders.encode._encoders.CategoricalEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between two categorical features.

Interactions are always pairwise and always between encoders (and not features).

Parameters
  • encoder_left (CategoricalEncoder) – The encoder for the first of the two features.

  • encoder_right (CategoricalEncoder) – The encoder for the second of the two features.

Raises
  • ValueError – If any of the two encoders is not a CategoricalEncoder.

  • ValueError – If the two encoders do not have the same encode_as parameter.

Note

Both encoders should have the same encode_as parameter. If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ICatEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Returns

The matrix of interaction features as a numpy array.

Return type

numpy array

class feature_encoders.encode.ICatLinearEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.IdentityEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between one categorical and one linear numerical feature.

Parameters
  • encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.

  • encoder_num (IdentityEncoder) – The encoder for the numerical feature.

Raises
  • ValueError – If encoder_cat is not a CategoricalEncoder.

  • ValueError – If encoder_num is not an IdentityEncoder.

  • ValueError – If encoder_cat is not encoded as one-hot.

Note

If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ICatLinearEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Returns

The matrix of interaction features as a numpy array.

Return type

numpy array
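Example (a minimal sketch of a categorical-by-linear interaction; the month and temperature columns are synthetic):

import numpy as np
import pandas as pd

from feature_encoders.encode import (
    CategoricalEncoder,
    ICatLinearEncoder,
    IdentityEncoder,
)

idx = pd.date_range("2021-01-01", periods=365, freq="D")
X = pd.DataFrame(
    {"month": idx.month, "temperature": np.random.uniform(0.0, 30.0, 365)},
    index=idx,
)

# One temperature slope per month: the one-hot month matrix is combined
# with the (single) temperature column.
enc = ICatLinearEncoder(
    encoder_cat=CategoricalEncoder(feature="month", encode_as="onehot"),
    encoder_num=IdentityEncoder(feature="temperature"),
)
interaction = enc.fit(X).transform(X)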

class feature_encoders.encode.ICatSplineEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.SplineEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between one categorical and one spline-encoded numerical feature.

Parameters
  • encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.

  • encoder_num (SplineEncoder) – The encoder for the numerical feature.

Raises
  • ValueError – If encoder_cat is not a CategoricalEncoder.

  • ValueError – If encoder_num is not a SplineEncoder.

  • ValueError – If encoder_cat is not encoded as one-hot.

Note

If the categorical encoder is already fitted, it will not be re-fitted during fit or fit_transform. The numerical encoder will always be (re)fitted (one encoder per level of the categorical feature).

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ICatSplineEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Returns

The matrix of interaction features as a numpy array.

Return type

numpy array

class feature_encoders.encode.ISplineEncoder(encoder_left: feature_encoders.encode._encoders.SplineEncoder, encoder_right: feature_encoders.encode._encoders.SplineEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between two spline-encoded numerical features.

Parameters
  • encoder_left (SplineEncoder) – The encoder for the first of the two features.

  • encoder_right (SplineEncoder) – The encoder for the second of the two features.

Raises

ValueError – If any of the two encoders is not a SplineEncoder.

Note

If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ISplineEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Returns

The matrix of interaction features as a numpy array.

Return type

numpy array

class feature_encoders.encode.IdentityEncoder(feature=None, as_filter=False, include_bias=False)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Create an encoder that returns what it is fed.

This encoder can act as a linear feature encoder.

Parameters
  • feature (str or list of str, optional) – The name(s) of the input dataframe’s column(s) to return. If None, the whole input dataframe will be returned. Defaults to None.

  • as_filter (bool, optional) – If True, the encoder will return all feature labels for which “feature in label == True”. Defaults to False.

  • include_bias (bool, optional) – If True, a column of ones is added to the output. Defaults to False.

Raises

ValueError – If as_filter is True and feature includes multiple feature names.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

Fitted encoder.

Return type

IdentityEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If include_bias is True and a column with constant values already exists in the returned columns.

Returns

The selected column subset as a numpy array.

Return type

numpy array
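Example (a minimal sketch of the as_filter behaviour; the column names are hypothetical):

import pandas as pd

from feature_encoders.encode import IdentityEncoder

X = pd.DataFrame(
    {"temp_morning": [1.0, 2.0], "temp_evening": [3.0, 4.0], "load": [5.0, 6.0]}
)

# as_filter=True selects every column whose label contains "temp"
enc = IdentityEncoder(feature="temp", as_filter=True)
selected = enc.fit(X).transform(X)  # columns temp_morning and temp_evening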

class feature_encoders.encode.ProductEncoder(encoder_left: feature_encoders.encode._encoders.IdentityEncoder, encoder_right: feature_encoders.encode._encoders.IdentityEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between two linear numerical features.

Parameters
  • encoder_left (IdentityEncoder) – The encoder for the first of the two features.

  • encoder_right (IdentityEncoder) – The encoder for the second of the two features.

Raises

ValueError – If any of the two encoders is not an IdentityEncoder.

Note

If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Raises

ValueError – If any of the two encoders is not a single-feature encoder.

Returns

Fitted encoder.

Return type

ProductEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Returns

The matrix of interaction features as a numpy array.

Return type

numpy array

class feature_encoders.encode.SafeOneHotEncoder(feature=None, unknown_value=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode categorical features in a one-hot form.

The encoder uses a SafeOrdinalEncoder to first encode the feature as an integer array, and then a sklearn.preprocessing.OneHotEncoder to encode the features as a one-hot array.

Parameters
  • feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.

  • unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

SafeOneHotEncoder

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If the encoder is applied on numerical (float) data.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

The encoded column subset as a numpy array.

Return type

numpy array

class feature_encoders.encode.SafeOrdinalEncoder(feature=None, unknown_value=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode categorical features as an integer array.

The encoder converts the features into ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

Parameters
  • feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.

  • unknown_value (int, optional) – This parameter will set the encoded value for unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

SafeOrdinalEncoder

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

The encoded column subset as a numpy array.

Return type

numpy array
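Example (a minimal sketch of the 'safe' behaviour on categories that were not seen during fit):

import pandas as pd

from feature_encoders.encode import SafeOrdinalEncoder

train = pd.DataFrame({"city": ["a", "b", "b", "c"]})
test = pd.DataFrame({"city": ["a", "d"]})  # "d" was not seen during fit

enc = SafeOrdinalEncoder(feature="city")
enc.fit(train)
# The unknown category "d" is replaced using the most frequent
# training category before encoding.
encoded = enc.transform(test)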

class feature_encoders.encode.SplineEncoder(*, feature, n_knots=5, degree=3, strategy='uniform', extrapolation='constant', include_bias=True, order='C')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Generate univariate B-spline bases for features.

The encoder generates a matrix consisting of n_splines = n_knots + degree - 1 spline basis functions (B-splines) of polynomial degree equal to the degree parameter for the given feature.

Parameters
  • feature (str) – The name of the column to encode.

  • n_knots (int, optional) – Number of knots of the splines if strategy is one of {'uniform', 'quantile'}. Must be larger or equal to 2. Ignored if strategy is array-like. Defaults to 5.

  • degree (int, optional) – The polynomial degree of the spline basis. Must be a non-negative integer. Defaults to 3.

  • strategy ({'uniform', 'quantile'} or array-like of shape (n_knots, n_features), optional) –

    Set knot positions such that first knot <= features <= last knot.

    • If ‘uniform’, n_knots number of knots are distributed uniformly from min to max values of the features (each bin has the same width)

    • If ‘quantile’, they are distributed uniformly along the quantiles of the features (each bin has the same number of observations)

    • If an array-like is given, it directly specifies the sorted knot positions including the boundary knots. Note that, internally, degree number of knots are added before the first knot, the same after the last knot

    Defaults to “uniform”.

  • extrapolation ({'error', 'constant', 'linear', 'continue'}, optional) – If ‘error’, values outside the min and max values of the training features raises a ValueError. If ‘constant’, the value of the splines at minimum and maximum value of the features is used as constant extrapolation. If ‘linear’, a linear extrapolation is used. If ‘continue’, the splines are extrapolated as is, option extrapolate=True in scipy.interpolate.BSpline. Defaults to “constant”.

  • include_bias (bool, optional) – If False, then the last spline element inside the data range of a feature is dropped. As B-splines sum to one over the spline basis functions for each data point, they implicitly include a bias term. Defaults to True.

  • order ({'C', 'F'}, optional) – Order of output array. ‘F’ order is faster to compute, but may slow down subsequent estimators. Defaults to “C”.

fit(X: pandas.core.frame.DataFrame, y=None, sample_weight=None)[source]

Fit the encoder.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The data to fit.

  • y (None, optional) – Ignored. Defaults to None.

  • sample_weight (array-like of shape (n_samples,), optional) – Individual weights for each sample. Used to calculate quantiles if strategy=”quantile”. For strategy=”uniform”, zero weighted observations are ignored for finding the min and max of X. Defaults to None.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

Fitted encoder.

Return type

SplineEncoder

transform(X)[source]

Transform the feature data to B-splines.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The data to transform.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

The B-splines matrix.

Return type

numpy.ndarray
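Example (a minimal sketch; with the defaults n_knots=5 and degree=3 the basis has n_knots + degree - 1 = 7 columns):

import numpy as np
import pandas as pd

from feature_encoders.encode import SplineEncoder

X = pd.DataFrame({"temperature": np.linspace(-5.0, 35.0, 200)})

enc = SplineEncoder(feature="temperature", n_knots=5, degree=3)
basis = enc.fit(X).transform(X)  # shape: (200, 7)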

class feature_encoders.encode.TargetClusterEncoder(*, feature, max_n_categories, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=5, max_features='auto', random_state=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode a categorical feature as clusters of the target’s values.

The purpose of this encoder is to reduce the cardinality of a categorical feature. This encoder does not replace unknown values with the most frequent one during transform. It just assigns them the value of unknown_value.

Parameters
  • feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.

  • max_n_categories (int, optional) – The maximum number of categories to produce. Defaults to None.

  • stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.

  • excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories are left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.

  • unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.

  • min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 5.

  • max_features (int, float or {"auto", "sqrt", "log2"}, optional) –

    The number of features that the decision tree considers when looking for the best split:

    • If int, then consider max_features features at each split of the decision tree

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split

    • If “auto”, then max_features=n_features

    • If “sqrt”, then max_features=sqrt(n_features)

    • If “log2”, then max_features=log2(n_features)

    • If None, then max_features=n_features

    Defaults to “auto”.

  • random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.

fit(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame)[source]

Fit the encoder on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (pandas.DataFrame of shape (n_samples, 1)) – The target dataframe.

Returns

Fitted encoder.

Return type

TargetClusterEncoder

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If the encoder is applied on numerical (float) data.

  • ValueError – If any of the values in excluded_categories is not found in the input data.

  • ValueError – If the number of categories left after removing all in excluded_categories is not larger than max_n_categories.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Returns

The encoded column subset as a numpy array.

Return type

numpy array

Raises

ValueError – If the input data does not pass the checks of utils.check_X.
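Example (a minimal sketch of cardinality reduction on synthetic data):

import numpy as np
import pandas as pd

from feature_encoders.encode import TargetClusterEncoder

rng = np.random.default_rng(0)
X = pd.DataFrame({"zone": rng.integers(0, 50, 1000).astype(str)})
y = pd.DataFrame({"target": rng.normal(size=1000)})

# The 50 raw categories are clustered down to at most 5 integer codes,
# based on the relationship between `zone` and the target.
enc = TargetClusterEncoder(feature="zone", max_n_categories=5, random_state=1)
codes = enc.fit(X, y).transform(X)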

feature_encoders.generate package

Module contents
class feature_encoders.generate.CyclicalFeatures(*, seasonality, ds=None, period=None, fourier_order=None, remainder='passthrough', replace=False)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Create cyclical (seasonal) features as Fourier terms.

Parameters
  • seasonality (str) – The name of the seasonality. The feature generator can provide default values for period and fourier_order if seasonality is one of ‘daily’, ‘weekly’ or ‘yearly’.

  • ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.

  • period (float, optional) – Number of days in one period. Defaults to None.

  • fourier_order (int, optional) – Number of Fourier components to use. Defaults to None.

  • remainder ({'drop', 'passthrough'}, optional) – By specifying remainder='passthrough', all the remaining columns of the input dataset will be automatically passed through (concatenated with the output of the transformer), otherwise, they will be dropped. Defaults to “passthrough”.

  • replace (bool, optional) – Specifies whether replacing an existing column with the same name is allowed (applicable when remainder=passthrough). Defaults to False.

Raises

ValueError – If remainder is neither ‘drop’ nor ‘passthrough’.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the feature generator on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

CyclicalFeatures

Raises

ValueError – If either period or fourier_order is not provided, but seasonality is not one of ‘daily’, ‘weekly’ or ‘yearly’.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the feature generator.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If common columns are found and replace=False.

Returns

The transformed dataframe.

Return type

pandas.DataFrame

class feature_encoders.generate.DatetimeFeatures(ds=None, remainder='passthrough', replace=False, subset=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Generate date and time features.

Parameters
  • ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.

  • remainder ({'drop', 'passthrough'}, optional) – By specifying remainder='passthrough', all the remaining columns of the input dataset will be automatically passed through (concatenated with the output of the transformer), otherwise, they will be dropped. Defaults to “passthrough”.

  • replace (bool, optional) – Specifies whether replacing an existing column with the same name is allowed (applicable when remainder=passthrough). Defaults to False.

  • subset (str or list of str, optional) – The names of the features to generate. If None, all features will be produced: ‘month’, ‘week’, ‘dayofyear’, ‘dayofweek’, ‘hour’, ‘hourofweek’. The last 2 features are generated only if the timestep of the input’s ds (or index if ds is None) is smaller than pandas.Timedelta(days=1). Defaults to None.

Raises

ValueError – If remainder is neither ‘drop’ nor ‘passthrough’.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the feature generator on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

DatetimeFeatures

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the feature generator.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If common columns are found and replace=False.

Returns

The transformed dataframe.

Return type

pandas.DataFrame

class feature_encoders.generate.TrendFeatures(ds=None, name='growth', remainder='passthrough', replace=False)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Generate linear time trend features.

Parameters
  • ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.

  • name (str, optional) – The name of the generated dataframe’s column. Defaults to ‘growth’.

  • remainder ({'drop', 'passthrough'}, optional) – By specifying remainder='passthrough', all the remaining columns of the input dataset will be automatically passed through (concatenated with the output of the transformer), otherwise, they will be dropped. Defaults to “passthrough”.

  • replace (bool, optional) – Specifies whether replacing an existing column with the same name is allowed (applicable when remainder=passthrough). Defaults to False.

Raises

ValueError – If remainder is neither ‘drop’ nor ‘passthrough’.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the feature generator on the available data.

Parameters
  • X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

  • y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

TrendFeatures

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the feature generator.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises
  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If common columns are found and replace=False.

Returns

The transformed dataframe.

Return type

pandas.DataFrame

feature_encoders.models package

Submodules
feature_encoders.models.grouped module
class feature_encoders.models.grouped.GroupedPredictor(*, group_feature: str, model_conf: Dict[str, Dict], feature_conf: Optional[Dict[str, Dict]] = None, estimator_params=(), fallback=False)[source]

Bases: sklearn.base.RegressorMixin, sklearn.base.BaseEstimator

Construct one predictor per data group.

The predictor splits data by the different values of a single column and fits one estimator per group. Since each of the models in the ensemble predicts on a different subset of the input data (an observation cannot belong to more than one cluster), the final prediction is generated by vertically concatenating all the individual models’ predictions.

Parameters
  • group_feature (str) – The name of the column of the input dataframe to use as the grouping set.

  • model_conf (Dict[str, Dict]) – A dictionary that includes information about the base model’s structure.

  • feature_conf (Dict[str, Dict], optional) – A dictionary that maps feature generator names to the classes for the generators’ validation and creation. Defaults to None.

  • estimator_params (dict or tuple of tuples, optional) – The parameters to use when instantiating a new base estimator. If none are given, default parameters are used. Defaults to tuple().

  • fallback (bool, optional) – Whether or not to fall back to a global model in case a group value is not found during predict(). If False, an exception will be raised instead. Defaults to False.

property dof
fit(X: pandas.core.frame.DataFrame, y: Union[pandas.core.frame.DataFrame, pandas.core.series.Series])[source]

Fit the estimator with the available data.

Parameters
  • X (pandas.DataFrame) – Input data.

  • y (pandas.Series or pandas.DataFrame) – Target data.

Raises
  • Exception – If the estimator is re-fitted. An estimator object can only be fitted once.

  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If the target data does not pass the checks of utils.check_y.

Returns

Fitted estimator.

Return type

GroupedPredictor

property n_parameters
predict(X: pandas.core.frame.DataFrame, include_clusters=False, include_components=False)[source]

Predict given new input data.

Parameters
  • X (pandas.DataFrame) – Input data.

  • include_clusters (bool, optional) – Whether to include the added clusters in the returned prediction. Defaults to False.

  • include_components (bool, optional) – Whether to include the contribution of the individual components of the model structure in the returned prediction. Defaults to False.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

The predicted values.

Return type

pandas.DataFrame
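Example (a sketch only; X_train, y_train, X_test and the site_id grouping column are hypothetical, and the configuration comes from feature_encoders.utils.load_config):

from feature_encoders.models.grouped import GroupedPredictor
from feature_encoders.utils import load_config

model_conf, feature_map = load_config(model="towt", features="default")

est = GroupedPredictor(
    group_feature="site_id",  # one model per distinct site_id value
    model_conf=model_conf,
    feature_conf=feature_map,
    fallback=True,            # fall back to a global model for unseen groups
)
est.fit(X_train, y_train)    # dataframes prepared elsewhere (hypothetical)
pred = est.predict(X_test)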

feature_encoders.models.linear module
class feature_encoders.models.linear.LinearPredictor(*, model_structure: feature_encoders.compose._compose.ModelStructure, alpha=0.01, fit_intercept=False)[source]

Bases: sklearn.base.RegressorMixin, sklearn.base.BaseEstimator

A linear regression model with flexible parameterization.

Parameters
  • model_structure (ModelStructure) – The structure of a linear regression model.

  • alpha (float, optional) – Regularization strength of the underlying ridge regression; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Defaults to 0.01.

  • fit_intercept (bool, optional) – Whether to fit the intercept for this model. If set to false, no intercept will be used in calculations. Defaults to False.

property dof
fit(X: pandas.core.frame.DataFrame, y: Union[pandas.core.frame.DataFrame, pandas.core.series.Series])[source]

Fit the estimator with the available data.

Parameters
  • X (pandas.DataFrame) – Input data.

  • y (pandas.Series or pandas.DataFrame) – Target data.

Raises
  • Exception – If the estimator is re-fitted. An estimator object can only be fitted once.

  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If the target data does not pass the checks of utils.check_y.

Returns

Fitted estimator.

Return type

LinearPredictor

property n_parameters
predict(X: pandas.core.frame.DataFrame, include_components=False)[source]

Predict using the given input data.

Parameters
  • X (pandas.DataFrame) – Input data.

  • include_components (bool, optional) – If True, the prediction dataframe will include also the individual components’ contribution to the predicted values. Defaults to False.

Returns

The prediction.

Return type

pandas.DataFrame
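Example (a sketch wiring together the pieces documented above; X and y are assumed to match the 'towt' model configuration, e.g. a datetime-indexed dataframe and a single-column target):

from feature_encoders.compose import ModelStructure
from feature_encoders.models.linear import LinearPredictor
from feature_encoders.utils import load_config

model_conf, feature_map = load_config(model="towt", features="default")
structure = ModelStructure.from_config(model_conf, feature_map)

model = LinearPredictor(model_structure=structure, alpha=0.01)
model.fit(X, y)  # X, y: prepared elsewhere (hypothetical)
pred = model.predict(X, include_components=True)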

feature_encoders.models.seasonal module
class feature_encoders.models.seasonal.SeasonalPredictor(ds: Optional[str] = None, add_trend: bool = False, yearly_seasonality: Union[str, bool, int] = 'auto', weekly_seasonality: Union[str, bool, int] = 'auto', daily_seasonality: Union[str, bool, int] = 'auto', min_samples=0.5, alpha=0.01)[source]

Bases: sklearn.base.BaseEstimator

Time series prediction model based on seasonal decomposition.

Parameters
  • ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.

  • add_trend (bool, optional) – If True, a linear time trend will be added. Defaults to False.

  • yearly_seasonality (Union[str, bool, int], optional) – Fit yearly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate. Defaults to “auto”.

  • weekly_seasonality (Union[str, bool, int], optional) – Fit weekly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate. Defaults to “auto”.

  • daily_seasonality (Union[str, bool, int], optional) – Fit daily seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate. Defaults to “auto”.

  • min_samples (float ([0, 1]), optional) – Minimum number of samples chosen randomly from original data by the RANSAC (RANdom SAmple Consensus) algorithm. Defaults to 0.5.

  • alpha (float, optional) – Parameter for the underlying ridge estimator (base_estimator). It must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Defaults to 0.01.

add_seasonality(name: str, period: Optional[float] = None, fourier_order: Optional[int] = None, condition_name: Optional[str] = None)[source]

Add a seasonal component with specified period and number of Fourier components.

If condition_name is provided, the input dataframe passed to fit and predict should have a column with the specified condition_name containing booleans that indicate when to apply seasonality.

Parameters
  • name (str) – The name of the seasonality component.

  • period (float, optional) – Number of days in one period. Defaults to None.

  • fourier_order (int, optional) – Number of Fourier components to use. Defaults to None.

  • condition_name (str, optional) – The name of the seasonality condition. Defaults to None.

Raises
  • Exception – If the method is called after the estimator is fitted.

  • ValueError – If either period or fourier_order are not provided and the seasonality is not in (‘daily’, ‘weekly’, ‘yearly’).

Returns

The updated estimator object.

Return type

SeasonalPredictor

fit(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame)[source]

Fit the estimator with the available data.

Parameters
  • X (pandas.DataFrame) – Input data.

  • y (pandas.DataFrame) – Target data.

Raises
  • Exception – If the estimator is re-fitted. An estimator object can only be fitted once.

  • ValueError – If the input data does not pass the checks of utils.check_X.

  • ValueError – If the target data does not pass the checks of utils.check_y.

Returns

Fitted estimator.

Return type

SeasonalPredictor

predict(X: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]

Predict using the given input data.

Parameters

X (pandas.DataFrame) – Input data.

Returns

The prediction.

Return type

pandas.DataFrame
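Example (a minimal sketch of a conditional seasonality; the is_weekend column name is hypothetical):

from feature_encoders.models.seasonal import SeasonalPredictor

# A daily pattern that applies only on weekends: `is_weekend` must exist
# as a boolean column in the dataframes passed to fit and predict.
model = SeasonalPredictor(daily_seasonality=False)
model = model.add_seasonality(
    "weekend_daily", period=1.0, fourier_order=4, condition_name="is_weekend"
)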

Module contents

feature_encoders.validate package

Submodules
feature_encoders.validate.schemas module
class feature_encoders.validate.schemas.CategoricalSchema(*, type: str, feature: str, max_n_categories: int = None, stratify_by: Optional[Union[str, List[str]]] = None, excluded_categories: Optional[Union[str, List[str]]] = None, unknown_value: int = None, min_samples_leaf: int = 1, max_features: Union[str, int, float] = 'auto', random_state: int = None, encode_as: str = 'onehot')[source]

Bases: pydantic.main.BaseModel

classmethod check_encode_as(data)[source]
classmethod check_lists(data)[source]
classmethod check_max_features(data)[source]
encode_as: str
excluded_categories: Optional[Union[str, List[str]]]
feature: str
max_features: Union[str, int, float]
max_n_categories: Optional[int]
min_samples_leaf: int
random_state: Optional[int]
stratify_by: Optional[Union[str, List[str]]]
type: str
unknown_value: Optional[int]
class feature_encoders.validate.schemas.CyclicalSchema(*, type: str, seasonality: str, ds: str = None, period: float = None, fourier_order: int = None, remainder: str = 'passthrough', replace: bool = False)[source]

Bases: pydantic.main.BaseModel

classmethod check_remainder(data)[source]
ds: Optional[str]
fourier_order: Optional[int]
period: Optional[float]
remainder: str
replace: bool
seasonality: str
type: str
class feature_encoders.validate.schemas.DatetimeSchema(*, type: str, ds: str = None, remainder: str = 'passthrough', replace: bool = False, subset: Optional[Union[str, List[str]]] = None)[source]

Bases: pydantic.main.BaseModel

classmethod check_remainder(data)[source]
classmethod check_subset(data)[source]
ds: Optional[str]
remainder: str
replace: bool
subset: Optional[Union[str, List[str]]]
type: str
class feature_encoders.validate.schemas.LinearSchema(*, type: str, feature: str, as_filter: bool = False, include_bias: bool = False)[source]

Bases: pydantic.main.BaseModel

as_filter: bool
feature: str
include_bias: bool
type: str
class feature_encoders.validate.schemas.SplineSchema(*, type: str, feature: str, n_knots: int = 5, degree: int = 3, strategy: Optional[Union[str, List]] = 'uniform', extrapolation: str = 'constant', include_bias: bool = False)[source]

Bases: pydantic.main.BaseModel

classmethod check_extrapolation(data)[source]
classmethod check_strategy(data)[source]
degree: Optional[int]
extrapolation: Optional[str]
feature: str
include_bias: bool
n_knots: Optional[int]
strategy: Optional[Union[str, List]]
type: str
class feature_encoders.validate.schemas.TrendSchema(*, type: str, ds: str = None, name: str = 'growth', remainder: str = 'passthrough', replace: bool = False)[source]

Bases: pydantic.main.BaseModel

classmethod check_remainder(data)[source]
ds: Optional[str]
name: str
remainder: str
replace: bool
type: str
Module contents
class feature_encoders.validate.CategoricalSchema(*, type: str, feature: str, max_n_categories: int = None, stratify_by: Optional[Union[str, List[str]]] = None, excluded_categories: Optional[Union[str, List[str]]] = None, unknown_value: int = None, min_samples_leaf: int = 1, max_features: Union[str, int, float] = 'auto', random_state: int = None, encode_as: str = 'onehot')[source]

Bases: pydantic.main.BaseModel

classmethod check_encode_as(data)[source]
classmethod check_lists(data)[source]
classmethod check_max_features(data)[source]
encode_as: str
excluded_categories: Optional[Union[str, List[str]]]
feature: str
max_features: Union[str, int, float]
max_n_categories: Optional[int]
min_samples_leaf: int
random_state: Optional[int]
stratify_by: Optional[Union[str, List[str]]]
type: str
unknown_value: Optional[int]
class feature_encoders.validate.CyclicalSchema(*, type: str, seasonality: str, ds: str = None, period: float = None, fourier_order: int = None, remainder: str = 'passthrough', replace: bool = False)[source]

Bases: pydantic.main.BaseModel

classmethod check_remainder(data)[source]
ds: Optional[str]
fourier_order: Optional[int]
period: Optional[float]
remainder: str
replace: bool
seasonality: str
type: str
class feature_encoders.validate.DatetimeSchema(*, type: str, ds: str = None, remainder: str = 'passthrough', replace: bool = False, subset: Optional[Union[str, List[str]]] = None)[source]

Bases: pydantic.main.BaseModel

classmethod check_remainder(data)[source]
classmethod check_subset(data)[source]
ds: Optional[str]
remainder: str
replace: bool
subset: Optional[Union[str, List[str]]]
type: str
class feature_encoders.validate.LinearSchema(*, type: str, feature: str, as_filter: bool = False, include_bias: bool = False)[source]

Bases: pydantic.main.BaseModel

as_filter: bool
feature: str
include_bias: bool
type: str
class feature_encoders.validate.SplineSchema(*, type: str, feature: str, n_knots: int = 5, degree: int = 3, strategy: Optional[Union[str, List]] = 'uniform', extrapolation: str = 'constant', include_bias: bool = False)[source]

Bases: pydantic.main.BaseModel

classmethod check_extrapolation(data)[source]
classmethod check_strategy(data)[source]
degree: Optional[int]
extrapolation: Optional[str]
feature: str
include_bias: bool
n_knots: Optional[int]
strategy: Optional[Union[str, List]]
type: str
class feature_encoders.validate.TrendSchema(*, type: str, ds: str = None, name: str = 'growth', remainder: str = 'passthrough', replace: bool = False)[source]

Bases: pydantic.main.BaseModel

classmethod check_remainder(data)[source]
ds: Optional[str]
name: str
remainder: str
replace: bool
type: str

Submodules

feature_encoders.settings module

feature_encoders.utils module

feature_encoders.utils.add_constant(data: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame], prepend=True, has_constant='skip')[source]

Add a column of ones to an array.

Parameters
  • data (array-like) – A column-ordered design matrix.

  • prepend (bool, optional) – If true, the constant is in the first column. Else the constant is appended (last column). Defaults to True.

  • has_constant ({'raise', 'add', 'skip'}, optional) – Behavior if data already has a constant. The default will return data without adding another constant. If ‘raise’, will raise an error if any column has a constant value. Using ‘add’ will add a column of 1s if a constant column is present. Defaults to “skip”.

Returns

The original values with a constant (column of ones).

Return type

numpy.ndarray
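Example:

import numpy as np

from feature_encoders.utils import add_constant

X = np.arange(6.0).reshape(3, 2)

add_constant(X)                 # column of ones in the first column
add_constant(X, prepend=False)  # column of ones in the last column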

feature_encoders.utils.as_list(val: Any)[source]

Cast input as list.

Helper function, always returns a list of the input value.

feature_encoders.utils.as_series(x: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame])[source]

Cast an iterable to a Pandas Series object.

feature_encoders.utils.check_X(X: pandas.core.frame.DataFrame, exists=None, int_is_categorical=True, return_col_info=False)[source]

Perform a series of checks on the input dataframe.

Parameters
  • X (pandas.DataFrame) – The input dataframe.

  • exists (str or list of str, optional) – Names of columns that must be present in the input dataframe. Defaults to None.

  • int_is_categorical (bool, optional) – If True, integer types are considered categorical. Defaults to True.

  • return_col_info (bool, optional) – If True, the function will return the names of the categorical and the names of the numerical columns, in addition to the provided dataframe. Defaults to False.

Raises
  • ValueError – If the input is not a pandas DataFrame.

  • ValueError – If any of the column names in exists are not found in the input.

  • ValueError – If NaN or inf values are found in the provided input data.

Returns

pandas.DataFrame if return_col_info is False else (pandas.DataFrame, list, list)
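Example (a minimal sketch):

import pandas as pd

from feature_encoders.utils import check_X

X = pd.DataFrame({"month": [1, 2, 3], "temperature": [10.5, 11.0, 9.8]})
X, cat_cols, num_cols = check_X(X, exists=["month"], return_col_info=True)
# cat_cols -> ['month'] (integer columns count as categorical by default)
# num_cols -> ['temperature']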

feature_encoders.utils.check_y(y: Union[pandas.core.series.Series, pandas.core.frame.DataFrame], index=None)[source]

Perform a series of checks on the input dataframe.

The checks are carried out by sklearn.utils.check_array.

Parameters
  • y (Union[pandas.Series, pandas.DataFrame]) – The input dataframe.

  • index (Union[pandas.Index, pandas.DatetimeIndex], optional) – An index to compare with the input dataframe’s index. Defaults to None.

Raises
  • ValueError – If the input is neither a pandas Series nor a pandas DataFrame with only a single column.

  • ValueError – If the input data has different index than the one that was provided for comparison (if index is not None).

Returns

The validated input data.

Return type

pandas.DataFrame

feature_encoders.utils.get_categorical_cols(X: pandas.core.frame.DataFrame, int_is_categorical=True)[source]

Return the names of the categorical columns in the input DataFrame.

Parameters
  • X (pandas.DataFrame) – Input dataframe.

  • int_is_categorical (bool, optional) – If True, integer types are considered categorical. Defaults to True.

Returns

The names of categorical columns in the input DataFrame.

Return type

list

feature_encoders.utils.get_datetime_data(X: pandas.core.frame.DataFrame, col_name=None)[source]

Get datetime information from the input dataframe.

Parameters
  • X (pandas.DataFrame) – The input dataframe.

  • col_name (str, optional) – The name of the column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.

Returns

The datetime information.

Return type

pandas.Series

feature_encoders.utils.load_config(model='towt', features='default', merge_multiple=False)[source]

Load model configuration and feature generator mapping.

Given model and features, the function searches for files in:

conf_path = str(CONF_PATH)
model_files = glob.glob(f"{conf_path}/models/{model}.*")
feature_files = glob.glob(f"{conf_path}/features/{features}.*")
Parameters
  • model (str, optional) – The name of the model configuration to load. Defaults to “towt”.

  • features (str, optional) – The name of the feature generator mapping to load. Defaults to “default”.

  • merge_multiple (bool, optional) – If True and more than one file is found when searching for either models or features, the contents of the files will be merged. Otherwise, an exception will be raised. Defaults to False.

Returns

The model configuration and feature mapping as dictionaries.

Return type

(dict, dict)

feature_encoders.utils.maybe_reshape_2d(arr: numpy.ndarray)[source]

Reshape an array (if needed) so it’s always 2-d and long.

Parameters

arr (numpy.ndarray) – The input array.

Returns

The reshaped array.

Return type

numpy.ndarray

feature_encoders.utils.tensor_product(a: numpy.ndarray, b: numpy.ndarray, reshape=True)[source]

Compute the tensor product of two matrices.

Parameters
  • a (numpy array of shape (n, m_a)) – The first matrix.

  • b (numpy array of shape (n, m_b)) – The second matrix.

  • reshape (bool, optional) – Whether to reshape the result to be 2D (n, m_a * m_b) or return a 3D tensor (n, m_a, m_b). Defaults to True.

Raises
  • ValueError – If input arrays are not 2-dimensional.

  • ValueError – If both input arrays do not have the same number of samples.

Returns

numpy.ndarray of shape (n, m_a * m_b) if reshape = True else of shape (n, m_a, m_b).
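Example (a minimal sketch of the row-wise product):

import numpy as np

from feature_encoders.utils import tensor_product

a = np.array([[1.0, 2.0]])          # shape (1, 2)
b = np.array([[10.0, 20.0, 30.0]])  # shape (1, 3)

tensor_product(a, b)                 # shape (1, 6)
tensor_product(a, b, reshape=False)  # shape (1, 2, 3)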

Module contents

Tutorials

The functionality for generating new features

The feature-encoders library includes a few feature generators:

  • TrendFeatures: Generates time trend features.

  • DatetimeFeatures: Generates date and time features (such as the month of the year or the hour of the week).

  • CyclicalFeatures: Creates cyclical (seasonal) features as Fourier terms (similarly to the way the Prophet library generates seasonality features).

All feature generators generate pandas DataFrames, and they all have two common parameters:

remainder : str, {'drop', 'passthrough'}, default='passthrough'
    By specifying `remainder='passthrough'`, all the remaining columns of the
    input dataset will be automatically passed through (concatenated with the
    output of the transformer).
replace : bool, default=False
    Specifies whether replacing an existing column with the same name is allowed
    (when `remainder=passthrough`).
[1]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

%matplotlib inline
[2]:
from feature_encoders.generate import CyclicalFeatures, DatetimeFeatures, TrendFeatures

Load demo data

[3]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]

Create time trend features

The ds argument corresponds to the name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index.

[4]:
enc = TrendFeatures(ds=None, name='growth', remainder='drop')
features = enc.fit_transform(data)
[5]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(features['growth'], label='growth')
    ax.legend(loc='upper left')
_images/Tutorial_Feature_Generation_7_0.png

Add date and time features

The subset argument corresponds to the names of the features to generate. If None, all features will be produced: ‘month’, ‘week’, ‘dayofyear’, ‘dayofweek’, ‘hour’, ‘hourofweek’. The last 2 features are generated only if the timestep of the input’s ds column (or index if ds is None) is smaller than pandas.Timedelta(days=1).

[6]:
enc = DatetimeFeatures(ds=None, subset=None, remainder='drop')
features = enc.fit_transform(data)
features.columns
[6]:
Index(['month', 'week', 'dayofyear', 'dayofweek', 'hour', 'hourofweek'], dtype='object')
[7]:
enc = DatetimeFeatures(ds=None, remainder='drop', subset=['month', 'hourofweek'])
features = enc.fit_transform(data)
features.columns
[7]:
Index(['month', 'hourofweek'], dtype='object')

Encode cyclical (seasonal) features

The encoder is parameterized by period (number of days in one period) and fourier_order (number of Fourier components to use).

It can provide default values for period and fourier_order if seasonality is one of daily, weekly or yearly.
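
For any other seasonality name, both parameters would need to be supplied explicitly; a hypothetical 10-day cycle might look like this:

# Hypothetical custom seasonality: a 10-day cycle with 2 Fourier components
enc = CyclicalFeatures(ds=None, seasonality='ten_day', period=10, fourier_order=2, remainder='drop')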

[8]:
daily_consumption = data[['consumption']].resample('D').sum()
[9]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    daily_consumption.plot(ax=ax, alpha=0.5)
_images/Tutorial_Feature_Generation_13_0.png

The number of seasonality features is always twice the fourier_order.

[10]:
enc = CyclicalFeatures(ds=None, seasonality='yearly', fourier_order=3, remainder='drop')
features = enc.fit_transform(daily_consumption)
features.columns
[10]:
Index(['yearly_delim_0', 'yearly_delim_1', 'yearly_delim_2', 'yearly_delim_3',
       'yearly_delim_4', 'yearly_delim_5'],
      dtype='object')

Now let’s plot the new features:

[11]:
with plt.style.context('seaborn-whitegrid'):
    fig, axs = plt.subplots(2*enc.fourier_order, figsize=(14, 7), dpi=96)

    for i, col in enumerate(features.columns):
        features[col].plot(ax=axs[i])
        axs[i].set_xlabel(None)

fig.tight_layout()
_images/Tutorial_Feature_Generation_17_0.png

Let’s also see how well this transformation works:

[12]:
regr = LinearRegression(fit_intercept=True).fit(features, daily_consumption)
pred = regr.predict(features)
[13]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    daily_consumption.plot(ax=ax, alpha=0.5)
    pd.Series(pred.squeeze(), index=daily_consumption.index).plot(ax=ax)
_images/Tutorial_Feature_Generation_20_0.png

Encoding categorical features

This section explains the way categorical encoding can be carried out using feature_encoders.

All encoders take pandas.DataFrames as input and generate numpy.ndarrays as output.

[1]:
import matplotlib.cm

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

from sklearn.model_selection import KFold
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from matplotlib.colors import LinearSegmentedColormap, ListedColormap

%matplotlib inline
[2]:
from feature_encoders.encode import (
    SafeOrdinalEncoder,
    SafeOneHotEncoder,
    TargetClusterEncoder,
    CategoricalEncoder
)

A plotting utility:

[3]:
def get_colors(cmap, N=None, use_index="auto"):
    if isinstance(cmap, str):
        if use_index == "auto":
            if cmap in ['Pastel1', 'Pastel2', 'Paired', 'Accent',
                        'Dark2', 'Set1', 'Set2', 'Set3',
                        'tab10', 'tab20', 'tab20b', 'tab20c']:
                use_index = True
            else:
                use_index = False
        cmap = matplotlib.cm.get_cmap(cmap)
    if not N:
        N = cmap.N
    if use_index == "auto":
        if cmap.N > 100:
            use_index = False
        elif isinstance(cmap, LinearSegmentedColormap):
            use_index = False
        elif isinstance(cmap, ListedColormap):
            use_index = True
    if use_index:
        ind = np.arange(int(N)) % cmap.N
        return cmap(ind)
    else:
        return cmap(np.linspace(0, 1, N))

Load demo data

The demo data represents the energy consumption of a building.

[4]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
data.dtypes
[4]:
consumption            float64
holiday                 object
temperature            float64
consumption_outlier       bool
dtype: object

holiday is a categorical feature. The _novalue_ value corresponds to non-holiday observations.

[5]:
data['holiday'].value_counts()
[5]:
_novalue_                       35943
Immaculate Conception             192
Christmas Day                     192
St Stephen's Day                  192
New year                           96
Epiphany                           96
Easter Monday                      96
Liberation Day                     96
International Workers' Day         96
Republic Day                       96
Assumption of Mary to Heaven       96
All Saints Day                     96
Name: holiday, dtype: int64

SafeOrdinalEncoder

The SafeOrdinalEncoder converts categorical features into ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Unknown categories will be replaced using the most frequent value along each column.

It is implemented as a pipeline:

UNKNOWN_VALUE = -1

Pipeline(
    [
        (
            "select",
            sklearn.compose.ColumnTransformer(
                [("select", "passthrough", self.features_)], remainder="drop"
            ),
        ),
        (
            "encode_ordinal",
            sklearn.preprocessing.OrdinalEncoder(
                handle_unknown="use_encoded_value",
                unknown_value=self.unknown_value or UNKNOWN_VALUE,
                dtype=np.int16,
            ),
        ),
        (
            "impute_unknown",
            sklearn.impute.SimpleImputer(
                missing_values=self.unknown_value or UNKNOWN_VALUE,
                strategy="most_frequent",
            ),
        ),
    ]
)
[6]:
enc = SafeOrdinalEncoder(feature='holiday')

kf = KFold(n_splits=5, shuffle=False)

for train_index, _ in kf.split(data):
    enc = enc.fit(data.iloc[train_index])
    not_seen = np.setdiff1d(
        data["holiday"].unique(),
        data.iloc[train_index]["holiday"].unique(),
    )
    print(f'Holidays not seen during training {not_seen}')

    features = enc.transform(data[data['holiday'].isin(not_seen)])
    print(f'Holidays not seen during training are transformed as {np.unique(features)}')

    features = enc.transform(data[data['holiday'] == '_novalue_'])
    print(f'... and the most common value is also encoded as: {np.unique(features)}')
Holidays not seen during training ['Epiphany' 'New year']
Holidays not seen during training are transformed as [9]
... and the most common value is also encoded as: [9]
Holidays not seen during training ['Easter Monday' "International Workers' Day" 'Liberation Day']
Holidays not seen during training are transformed as [8]
... and the most common value is also encoded as: [8]
Holidays not seen during training ['Republic Day']
Holidays not seen during training are transformed as [10]
... and the most common value is also encoded as: [10]
Holidays not seen during training ['Assumption of Mary to Heaven']
Holidays not seen during training are transformed as [10]
... and the most common value is also encoded as: [10]
Holidays not seen during training ['All Saints Day']
Holidays not seen during training are transformed as [10]
... and the most common value is also encoded as: [10]
[7]:
features = SafeOrdinalEncoder(feature='holiday').fit_transform(data)
assert data['holiday'].nunique() == np.unique(features).size

By default, the SafeOrdinalEncoder treats as categorical all features of type object, int, bool and category:

[8]:
enc = SafeOrdinalEncoder().fit(data)
enc.features_
[8]:
['holiday', 'consumption_outlier']

SafeOneHotEncoder

The SafeOneHotEncoder uses a SafeOrdinalEncoder to first safely encode the feature as an integer array, and then a sklearn.preprocessing.OneHotEncoder to encode the features as a one-hot array:

UNKNOWN_VALUE = -1

Pipeline(
    [
        (
            "encode_ordinal",
            SafeOrdinalEncoder(
                feature=self.features_,
                unknown_value=self.unknown_value or UNKNOWN_VALUE,
            ),
        ),
        ("one_hot", sklearn.preprocessing.OneHotEncoder(drop=None, sparse=False)),
    ]
)
[9]:
enc = SafeOneHotEncoder(feature='holiday')

kf = KFold(n_splits=5, shuffle=False)

for train_index, _ in kf.split(data):
    enc = enc.fit(data.iloc[train_index])
    not_seen = np.setdiff1d(
        data["holiday"].unique(),
        data.iloc[train_index]["holiday"].unique(),
    )
    print(f'Holidays not seen during training {not_seen}')

    features = enc.transform(data[data['holiday'].isin(not_seen)])
    # check that it is a proper one-hot
    assert np.all(features.sum(axis=1) == 1)
    print('Holidays not seen during training have non-zero value at column: '
          f'{np.argmax(features == 1)}')

    features = enc.transform(data[data['holiday'] == '_novalue_'])
    # check that it is a proper one-hot
    assert np.all(features.sum(axis=1) == 1)
    print('... and the most common value also has non-zero value at column: '
          f'{np.argmax(features == 1)}')
Holidays not seen during training ['Epiphany' 'New year']
Holidays not seen during training have non-zero value at column: 9
... and the most common value also has non-zero value at column: 9
Holidays not seen during training ['Easter Monday' "International Workers' Day" 'Liberation Day']
Holidays not seen during training have non-zero value at column: 8
... and the most common value also has non-zero value at column: 8
Holidays not seen during training ['Republic Day']
Holidays not seen during training have non-zero value at column: 10
... and the most common value also has non-zero value at column: 10
Holidays not seen during training ['Assumption of Mary to Heaven']
Holidays not seen during training have non-zero value at column: 10
... and the most common value also has non-zero value at column: 10
Holidays not seen during training ['All Saints Day']
Holidays not seen during training have non-zero value at column: 10
... and the most common value also has non-zero value at column: 10

All encoders have an n_features_out_ property after fitting.

[10]:
enc = SafeOneHotEncoder(feature='holiday').fit(data)
assert data['holiday'].nunique() == enc.n_features_out_

TargetClusterEncoder

Next, let’s suppose that we want to lump all holidays together into only two (2) categories. For instance, we may want to fit a model that predicts energy consumption, but we only have data for one year, and hence not enough information to be confident about the impact of each individual holiday.

We can examine how the target (consumption) changes for each holiday value:

[11]:
to_group = data.loc[data['holiday'] != '_novalue_', ['consumption', 'holiday']]
grouped_mean = to_group.groupby('holiday').mean()

original_idx = grouped_mean.index
[12]:
grouped_mean.index = grouped_mean.index.map(lambda x: (x[:10] + '..') if len(x) > 10 else x)

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(16, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    grouped_mean.plot.bar(ax=ax, rot=0)
_images/Tutorial_Categorical_21_0.png

One approach could be to group holiday values together according to the different levels of the target:

[13]:
disc = KBinsDiscretizer(n_bins=2, encode='ordinal')
bins = disc.fit_transform(grouped_mean)
grouped_mean['bins'] = bins
[14]:
bin_values = [0, 1]
color_list = ['#74a9cf', '#fc8d59']
b2c = dict(zip(bin_values, color_list))

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    grouped_mean['consumption'].plot.bar(ax=ax, rot=0,
                                         color=[b2c[i] for i in grouped_mean['bins']])
_images/Tutorial_Categorical_24_0.png

We can plot the distribution of the consumption values for each category:

[15]:
mapping = pd.Series(data=grouped_mean['bins'].values, index=original_idx).to_dict()
data['bins'] = data['holiday'].map(lambda x: mapping.get(x))

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(16, 3.54), dpi=96)
    layout = (1, 2)
    ax0 = plt.subplot2grid(layout, (0, 0))
    ax1 = plt.subplot2grid(layout, (0, 1))

    subset = data[data['bins'] == 0]
    colors = get_colors('tab10', N=subset['holiday'].nunique())

    for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
        grouped['consumption'].plot.kde(ax=ax0, color=colors[i], bw_method=0.5)

    subset = data[data['bins'] == 1]
    colors = get_colors('tab10', N=subset['holiday'].nunique())

    for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
        grouped['consumption'].plot.kde(ax=ax1, color=colors[i], bw_method=0.5)
_images/Tutorial_Categorical_26_0.png

Going one step further, we could examine not only the mean of the target per holiday value but also other characteristics of its distribution. To take more aspects of the target’s distribution into account, the TargetClusterEncoder clusters the different values of a categorical feature according to the mean, standard deviation, skewness and the Wasserstein distance between the distribution of the corresponding target’s values and the distribution of all target’s values (used as reference).

[16]:
enc = TargetClusterEncoder(
        feature='holiday',
        max_n_categories=2,
        excluded_categories='_novalue_'
)

X = data[['holiday']]
y = data['consumption']
enc = enc.fit(X, y)

We can update the bins column based on the encoder’s mapping between values of holiday and clusters:

[17]:
grouped_mean['bins'] = original_idx.map(lambda x: enc.mapping_[x])

… and plot the new features again:

[18]:
bin_values = [0, 1]
color_list = ['#74a9cf', '#fc8d59']
b2c = dict(zip(bin_values, color_list))

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    grouped_mean['consumption'].plot.bar(
        ax=ax,
        rot=0,
        color=[b2c[i] for i in grouped_mean['bins']]
    )
_images/Tutorial_Categorical_32_0.png

Again, we can plot the target distributions for each category to see what was achieved:

[19]:
data['bins'] = data['holiday'].map(lambda x: enc.mapping_.get(x))

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(16, 3.54), dpi=96)
    layout = (1, 2)
    ax0 = plt.subplot2grid(layout, (0, 0))
    ax1 = plt.subplot2grid(layout, (0, 1))

    subset = data[data['bins'] == 0]
    colors = get_colors('tab10', N=subset['holiday'].nunique())

    for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
        grouped['consumption'].plot.kde(ax=ax0, color=colors[i], bw_method=0.5)

    subset = data[data['bins'] == 1]
    colors = get_colors('tab10', N=subset['holiday'].nunique())

    for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
        grouped['consumption'].plot.kde(ax=ax1, color=colors[i], bw_method=0.5)
_images/Tutorial_Categorical_34_0.png

Not only do the two clusters seem more homogeneous with respect to the distributions they include, but we have also managed to distinguish the holidays according to their consumption profiles:

[20]:
profiles = data.loc[
                data['holiday'] != '_novalue_', ['consumption', 'holiday', 'bins']
           ].copy()
profiles['date'] = profiles.index.date
profiles['time'] = profiles.index.time

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    profiles.pivot(index='time', columns='date', values='consumption').plot(
        ax=ax,
        alpha=0.8,
        legend=None,
        color=[b2c[i] for i in profiles['bins'].resample('D').first().dropna()]
    )
    ax.xaxis.set_major_locator(ticker.MultipleLocator(3600*2))
_images/Tutorial_Categorical_36_0.png
Conditional effect on target

One may be explicitly interested in clustering the holiday feature while taking the hour-of-day feature into account: how similar are the target’s values for two distinct holiday values, given similar values of hour-of-day?

In this case, the encoder first stratifies the categorical feature holiday into groups with similar values of hour-of-day, and then examines the relationship between the categorical feature’s values and the corresponding values of the target.

The stratification is carried out by a sklearn.tree.DecisionTreeRegressor model that is first fitted on the stratify_by features (here hour-of-day) with the target as the dependent variable, and then uses the tree’s leaf nodes as the groups. Only the mean of the target’s values per group is taken into account when deriving the clusters.

The parameter min_samples_leaf defines the minimum number of samples required at a leaf node of the decision tree model. Note that the actual number passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform; for instance, with the 11 holiday values of the demo data (excluding '_novalue_') and min_samples_leaf=5, the tree would require 5 * 11 = 55 samples per leaf.

[21]:
enc = TargetClusterEncoder(
        feature='holiday',
        max_n_categories=2,
        excluded_categories='_novalue_',
        stratify_by='hour',
        min_samples_leaf=5
)
[22]:
data['hour'] = data.index.hour

X = data[['holiday', 'hour']]
y = data['consumption']
enc = enc.fit(X, y)

It is easy to understand the result of this operation if we consider that when the encoder groups holidays stratified by hours, it actually tries to group the daily profiles of the different holidays in the dataset. Since we already achieved this in the previous step, there shouldn’t be any change in the way holidays are grouped:

[23]:
grouped_mean['bins'] = original_idx.map(lambda x: enc.mapping_[x])

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    grouped_mean['consumption'].plot.bar(
        ax=ax,
        rot=0,
        color=[b2c[i] for i in grouped_mean['bins']]
    )
_images/Tutorial_Categorical_41_0.png

CategoricalEncoder

The CategoricalEncoder encodes a categorical feature by encapsulating all the aforementioned categorical encoders. If max_n_categories is not None and the number of unique values of the categorical feature, minus the excluded_categories, is larger than max_n_categories, the TargetClusterEncoder will be called.

If encode_as = 'onehot', the result comes from a TargetClusterEncoder + SafeOneHotEncoder pipeline, otherwise from a TargetClusterEncoder + SafeOrdinalEncoder one:

n_categories = X[self.feature].nunique()
use_target = (self.max_n_categories is not None) and (
    n_categories - len(self.excluded_categories_) > self.max_n_categories
)

if not use_target:
    self.feature_pipeline_ = Pipeline(
        [
            (
                "encode_features",
                SafeOneHotEncoder(
                    feature=self.feature, unknown_value=self.unknown_value
                ),
            )
            if self.encode_as == "onehot"
            else (
                "encode_features",
                SafeOrdinalEncoder(
                    feature=self.feature, unknown_value=self.unknown_value
                ),
            )
        ]
    )
else:
    self.feature_pipeline_ = Pipeline(
        [
            (
                "reduce_dimension",
                TargetClusterEncoder(
                    feature=self.feature,
                    stratify_by=self.stratify_by,
                    max_n_categories=self.max_n_categories,
                    excluded_categories=self.excluded_categories,
                    unknown_value=self.unknown_value,
                    min_samples_leaf=self.min_samples_leaf,
                    max_features=self.max_features,
                    random_state=self.random_state,
                ),
            ),
            (
                "to_pandas",
                FunctionTransformer(self._to_pandas),
            ),
            (
                "encode_features",
                SafeOneHotEncoder(
                    feature=self.feature, unknown_value=self.unknown_value
                ),
            )
            if self.encode_as == "onehot"
            else (
                "encode_features",
                SafeOrdinalEncoder(
                    feature=self.feature, unknown_value=self.unknown_value
                ),
            ),
        ]
    )
[24]:
max_n_categories = data['holiday'].nunique() + 3
[25]:
enc = CategoricalEncoder(feature='holiday',
                         max_n_categories=max_n_categories,
                         encode_as='onehot')

features = enc.fit_transform(X, y)
features[:5]
[25]:
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
[26]:
assert min(data['holiday'].nunique(), max_n_categories) == enc.n_features_out_
[27]:
enc = CategoricalEncoder(feature='holiday',
                         max_n_categories=max_n_categories,
                         encode_as='ordinal')

features = enc.fit_transform(X, y)
features[:5]
[27]:
array([[11],
       [11],
       [11],
       [11],
       [11]], dtype=int16)
[28]:
assert min(data['holiday'].nunique(), max_n_categories) == np.unique(features).size
[29]:
max_n_categories = data['holiday'].nunique() - 3
[30]:
enc = CategoricalEncoder(feature='holiday',
                         max_n_categories=max_n_categories,
                         encode_as='onehot')

features = enc.fit_transform(X, y)
features[:5]
[30]:
array([[0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.]])
[31]:
assert min(data['holiday'].nunique(), max_n_categories) == enc.n_features_out_
[32]:
enc = CategoricalEncoder(feature='holiday',
                         max_n_categories=max_n_categories,
                         encode_as='ordinal')

features = enc.fit_transform(X, y)
features[:5]
[32]:
array([[5],
       [5],
       [5],
       [5],
       [5]], dtype=int16)
[33]:
assert min(data['holiday'].nunique(), max_n_categories) == np.unique(features).size

An application of the categorical encoder

Suppose we want to use the demo data to predict the energy consumption of the building. The simplest model includes only the hour of the week as a feature. The hour of the week is a categorical feature, and it can be encoded in one-hot form:

[34]:
data['hourofweek'] = 24 * data.index.dayofweek + data.index.hour
dmatrix = CategoricalEncoder(feature='hourofweek', encode_as='onehot').fit_transform(data)

We can fit a linear model:

[35]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)

… and evaluate it in-sample:

[36]:
pred = model.predict(dmatrix)
[37]:
print(f"In-sample CV(RMSE) (%): {100*mean_squared_error(y, pred, squared=False)/y.mean()}")
In-sample CV(RMSE) (%): 19.332298975680697

The degrees of freedom of the model are:

[38]:
np.linalg.matrix_rank(dmatrix)
[38]:
168

We can ask the CategoricalEncoder to lump together the 168 hour-of-week values into only 60, and repeat the process:

[39]:
X = data[['hourofweek']]
y = data['consumption']

dmatrix = CategoricalEncoder(
                feature='hourofweek',
                encode_as='onehot',
                max_n_categories=60
          ).fit_transform(X, y)

model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {100*mean_squared_error(y, pred, squared=False)/y.mean()}")
In-sample CV(RMSE) (%): 19.349420155191982

This is practically the same performance with roughly one third of the degrees of freedom:

[40]:
np.linalg.matrix_rank(dmatrix)
[40]:
60

Encoding numerical features

This section explains the way numerical encoding can be carried out using feature_encoders.

The SplineEncoder takes pandas.DataFrames as input and generates numpy.ndarrays as output.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

%matplotlib inline
[2]:
from feature_encoders.encode import SplineEncoder

We can create some synthetic data:

[3]:
def f(x):
    return 10 + (x * np.sin(x))
[4]:
x_support = np.linspace(0, 15, 100)
y_support = f(x_support)

x_train = np.sort(np.random.choice(x_support[15:-15], size=25, replace=False))
y_train = f(x_train)
[5]:
X_train = pd.DataFrame(data=x_train, columns=['x'])
X_support = pd.DataFrame(data=x_support, columns=['x'])
[6]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
    ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
    ax.legend(loc='upper left')
_images/Tutorial_Numerical_7_0.png

Cubic spline without extrapolation:

[7]:
enc = SplineEncoder(feature='x', n_knots=5, degree=3, strategy='uniform',
                        extrapolation='constant', include_bias=True,)

model = make_pipeline(enc, LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
pred = model.predict(X_support)

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
    ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
    ax.plot(X_support, pred, label='spline approximation')
    ax.legend(loc='upper left')
_images/Tutorial_Numerical_9_0.png

With linear extrapolation:

[8]:
enc = SplineEncoder(feature='x', n_knots=5, degree=3, strategy='uniform',
                        extrapolation='linear', include_bias=True,)

model = make_pipeline(enc, LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
pred = model.predict(X_support)

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
    ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
    ax.plot(X_support, pred, label='spline approximation')
    ax.legend(loc='upper left')
_images/Tutorial_Numerical_11_0.png

An application of the spline encoder

The TOWT model for predicting the energy consumption of a building estimates the temperature effect separately for hours of the week with high and with low energy consumption in order to distinguish between occupied and unoccupied periods.

To this end, a flexible curve is fitted on the consumption~temperature relationship, and if more than 65% of the data points that correspond to a specific hour-of-week are above the fitted curve, the corresponding hour is flagged as “Occupied”; otherwise, it is flagged as “Unoccupied”.

We can apply this approach using feature_encoders functionality.

Load demo data
[9]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
[10]:
dmatrix = SplineEncoder(feature='temperature',
                        degree=1,
                        strategy='uniform'
          ).fit_transform(data)

model = LinearRegression(fit_intercept=False).fit(dmatrix, data['consumption'])
pred = pd.DataFrame(
            data=model.predict(dmatrix),
            index=data.index,
            columns=['consumption']
       )
[11]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.plot(data['temperature'], data['consumption'], 'o', alpha=0.02)
    ax.plot(data['temperature'], pred['consumption'])
_images/Tutorial_Numerical_16_0.png
[12]:
resid = data[['consumption']] - pred[['consumption']]
mask = resid > 0
mask['hourofweek'] = 24 * mask.index.dayofweek + mask.index.hour
occupied = mask.groupby('hourofweek')['consumption'].mean() > 0.65

data['hourofweek'] = 24 * data.index.dayofweek + data.index.hour
data['occupied'] = data['hourofweek'].map(lambda x: occupied[x])
[13]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.scatter(data.loc[data['occupied'], 'temperature'],
               data.loc[data['occupied'], 'consumption'],
               s=2, alpha=0.2, label='Probably occupied')

    ax.scatter(data.loc[~data['occupied'], 'temperature'],
               data.loc[~data['occupied'], 'consumption'],
               s=2, alpha=0.2, label='Probably not occupied')

    ax.legend(fancybox=True, frameon=True, loc='upper left')
_images/Tutorial_Numerical_18_0.png

Encoding interactions

Interactions are always pairwise and always between encoders (and not features).

The supported interactions are between: (a) categorical and categorical encoders, (b) categorical and linear encoders, (c) categorical and spline encoders, (d) linear and linear encoders, and (e) spline and spline encoders.

All encoders have an n_features_out_ property after fitting.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split

%matplotlib inline
[2]:
from feature_encoders.utils import tensor_product, add_constant
from feature_encoders.generate import CyclicalFeatures, DatetimeFeatures
from feature_encoders.encode import (
    CategoricalEncoder,
    ICatEncoder,
    SplineEncoder,
    ISplineEncoder,
    IdentityEncoder,
    ProductEncoder,
    ICatLinearEncoder,
    ICatSplineEncoder
)

Load demo data

[3]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]

Pairwise interactions between categorical features

ICatEncoder encodes the interaction between two categorical features. Both encoders should have the same encode_as parameter.

If encode_as = 'onehot', it returns the tensor product of the results of the two encoders. The tensor product combines row-per-row the results from the first and the second encoder as follows:

\[(a \otimes b)_{i,\; j \cdot m_b + k} = a_{i j} \cdot b_{i k}, \qquad j = 0, \ldots, m_a - 1, \quad k = 0, \ldots, m_b - 1\]

A small example of the tensor product function:

[4]:
a = np.array([1, 10]).reshape(1, -1)
b = np.array([10, 20, 30]).reshape(1, -1)

tensor_product(a, b)
[4]:
array([[ 10,  20,  30, 100, 200, 300]])

The easiest way to demonstrate it is by combining hours of day and days of week into hours of week:

[5]:
enc = DatetimeFeatures(subset=['dayofweek', 'hour', 'hourofweek'])
data = enc.fit_transform(data)
[6]:
enc_dow = CategoricalEncoder(feature='dayofweek', encode_as='onehot')
feature_dow = enc_dow.fit_transform(data)
feature_dow.shape
[6]:
(37287, 7)
[7]:
enc_hour = CategoricalEncoder(feature='hour', encode_as='onehot')
feature_hour = enc_hour.fit_transform(data)
feature_hour.shape
[7]:
(37287, 24)
[8]:
enc = ICatEncoder(enc_dow, enc_hour).fit(data)
enc.n_features_out_
[8]:
168
[9]:
assert np.all(enc.transform(data).argmax(axis=1) == data['hourofweek'].values)

If encode_as = 'ordinal', it returns the combinations of the encoders’ results, where each combination is a string with : between the two values:

[10]:
enc_dow = CategoricalEncoder(feature='dayofweek', encode_as='ordinal')
enc_hour = CategoricalEncoder(feature='hour', encode_as='ordinal')
enc = ICatEncoder(enc_dow, enc_hour)

feature_trf = enc.fit_transform(data)
feature_trf
[10]:
array([['0:12'],
       ['0:12'],
       ['0:12'],
       ...,
       ['5:21'],
       ['5:21'],
       ['5:22']], dtype='<U13')
[11]:
assert np.unique(feature_trf).size == 168

Pairwise interactions between numerical features

We can generate data for the “Friedman #1” regression problem:

\[y = 10 \sin(\pi x_0 x_1) + 20 (x_2 - 0.5)^2 + 10 x_3 + 5 x_4 + \text{noise} \cdot \mathcal{N}(0, 1)\]
[12]:
X, y = make_friedman1(n_samples=5000, n_features=5, noise=0.2)
X = pd.DataFrame(data=X, columns=[f'x_{i}' for i in range(5)])
y = pd.Series(data=y, index=X.index)
X.head()
[12]:
x_0 x_1 x_2 x_3 x_4
0 0.491884 0.597237 0.017681 0.753236 0.068667
1 0.281404 0.524407 0.769966 0.689059 0.385223
2 0.329995 0.170124 0.208075 0.390547 0.795747
3 0.772299 0.509458 0.309719 0.172743 0.203104
4 0.838131 0.057251 0.461958 0.006787 0.961568
[13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)
[14]:
shared_params = dict(
    n_knots=5,
    degree=3,
    strategy="quantile",
    extrapolation="constant",
    include_bias=True,
)

# One SplineEncoder per input feature, all with the same settings
enc_0, enc_1, enc_2, enc_3, enc_4 = (
    SplineEncoder(feature=f'x_{i}', **shared_params) for i in range(5)
)

interact = ISplineEncoder(enc_0, enc_1)
[15]:
pipeline = Pipeline([
    ('features', FeatureUnion([
                    ('inter', interact),
                    ('enc_2', enc_2),
                    ('enc_3', enc_3),
                    ('enc_4', enc_4)

                ])
    ),
    ('regression', LinearRegression(fit_intercept=False))
])

pipeline = pipeline.fit(X_train, y_train)

The root mean squared error is very close to the noise that was injected in the data (0.2):

[16]:
print('Root mean squared out-of-sample error: '
      f'{mean_squared_error(np.array(y_test), pipeline.predict(X_test), squared=False)}'
)
Root mean squared out-of-sample error: 0.1980257834415656

Linear interactions are also supported through ProductEncoder. ProductEncoder expects IdentityEncoders, which are utility encoders that return their input unchanged.

[17]:
enc_0 = IdentityEncoder(feature='x_0', include_bias=False,)
enc_1 = IdentityEncoder(feature='x_1', include_bias=False,)

interact = ProductEncoder(enc_0, enc_1)

This interaction is practically an element-wise multiplication of the two features:

[18]:
assert np.all(interact.fit_transform(X).squeeze() == X[['x_0', 'x_1']].prod(axis=1))

Pairwise interactions between categorical and numerical features

Suppose that we want to split the hours of the week in the demo data into two distinct categories (according to the similarities of the consumption data) and then model the impact of the outdoor temperature during each one of these categories: consumption ~ temperature:hour_of_week.

First, we can explore the case where we split all the data by the reduced hour_of_week and fit a consumption ~ temperature model to each group:

[19]:
enc_occ = CategoricalEncoder(
            feature='hourofweek',
            max_n_categories=2,
            stratify_by='temperature',
            min_samples_leaf=15,
            encode_as='ordinal'
)

X = data[['hourofweek', 'temperature']]
y = data['consumption']

data['groups'] = enc_occ.fit_transform(X, y)
[20]:
models = {}

for group, grouped_data in data.groupby('groups'):
    model = LinearRegression(fit_intercept=True).fit(grouped_data[['temperature']],
                                                     grouped_data['consumption'])
    models[group] = model
[21]:
color_list = ['#74a9cf', '#fc8d59']

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 4.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    resid = []
    for i, (group, grouped_data) in enumerate(data.groupby('groups')):
        pred = models[group].predict(grouped_data[['temperature']])
        resid.append(grouped_data['consumption'].values - pred)

        ax.plot(grouped_data['temperature'], pred,
                label=f'group: {group}', c=color_list[i])
        ax.plot(grouped_data['temperature'], grouped_data['consumption'],
                'o', c=color_list[i], alpha=0.01)

    ax.legend(loc='upper left')
_images/Tutorial_Interactions_32_0.png
[22]:
print(f'Mean squared error: {np.mean(np.concatenate(resid)**2)}')
Mean squared error: 351555.77145852696

The same result can be achieved by first encoding the hour_of_week feature in one-hot form and then taking the tensor product between its encoding and the temperature feature. In this case, an intercept must be added directly to the temperature feature, so that it is possible to model a different intercept for each categorical feature’s level:

[23]:
enc_occ = CategoricalEncoder(
            feature='hourofweek',
            max_n_categories=2,
            stratify_by='temperature',
            min_samples_leaf=15,
            encode_as='onehot'
)
feature_cat = enc_occ.fit_transform(X, y)

features = tensor_product(feature_cat, add_constant(X['temperature']))
[24]:
model = LinearRegression(fit_intercept=False).fit(features, y)
pred = model.predict(features)
[25]:
resid = y.values - pred
print(f'Mean squared error: {np.mean(resid**2)}')
Mean squared error: 351555.771458527
[26]:
color_list = ['#74a9cf', '#fc8d59']

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 4.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    for i in range(enc_occ.n_features_out_):
        mask = feature_cat[:, i]==1
        ax.plot(data['temperature'][mask], pred[mask], label=f'group: {i}',
                c=color_list[i])
        ax.plot(data['temperature'][mask], data['consumption'][mask], 'o',
                c=color_list[i], alpha=0.01)

    ax.legend(loc='upper left')
_images/Tutorial_Interactions_38_0.png

The conclusion here is that for the case of one categorical and one linear numerical feature, we can model the interaction by first encoding the categorical feature in one-hot form and then taking the tensor product between this encoding and the numerical feature.

This is supported by ICatLinearEncoder:

[27]:
enc_occ = CategoricalEncoder(
            feature='hourofweek',
            max_n_categories=2,
            stratify_by='temperature',
            min_samples_leaf=15,
            encode_as='onehot'
)
enc_num = IdentityEncoder(feature='temperature', include_bias=True)

enc = ICatLinearEncoder(encoder_cat=enc_occ, encoder_num=enc_num)
[28]:
features = enc.fit_transform(X, y)
model = LinearRegression(fit_intercept=False).fit(features, y)
pred = model.predict(features)

resid = y.values - pred
print(f'Mean squared error: {np.mean(resid**2)}')
Mean squared error: 351555.77145852696

Next, we want to encode the temperature feature with splines so as to capture potential non-linearities.

A *first split, then encode* strategy looks like this:

[29]:
models = {}
encoders = {}

for group, grouped_data in data.groupby('groups'):
    enc = SplineEncoder(feature='temperature',
                        n_knots=3,
                        degree=1,
                        strategy='uniform',
                        extrapolation='constant',
                        include_bias=True,
    )
    features = enc.fit_transform(grouped_data)
    model = LinearRegression(fit_intercept=False).fit(features, grouped_data['consumption'])
    models[group] = model
    encoders[group] = enc
[30]:
color_list = ['#74a9cf', '#fc8d59']

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 4.5), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    resid = []
    for i, (group, grouped_data) in enumerate(data.groupby('groups')):
        features = encoders[group].transform(grouped_data)
        pred = models[group].predict(features)
        resid.append(grouped_data['consumption'].values - pred)

        ax.plot(grouped_data['temperature'], pred, '.', ms=1,
                label=f'group: {group}', c=color_list[i])
        ax.plot(grouped_data['temperature'], grouped_data['consumption'],
                'o', c=color_list[i], alpha=0.01)

    ax.legend(loc='upper left')
_images/Tutorial_Interactions_44_0.png
[31]:
print(f'Mean squared error: {np.mean(np.concatenate(resid)**2)}')
Mean squared error: 329833.6746798132

This *first split, then encode* strategy is implemented by ICatSplineEncoder. Note that:

  • If the categorical encoder is already fitted, it will not be re-fitted during fit or fit_transform.

  • The numerical encoder will always be fitted (one encoder per level of categorical feature).

Since we employ cardinality reduction, the categorical encoder should be fitted using all data.

[32]:
enc_occ = CategoricalEncoder(
            feature='hourofweek',
            max_n_categories=2,
            stratify_by='temperature',
            min_samples_leaf=15,
            encode_as='onehot'
        )

# Fit the categorical encoder at global level
enc_occ = enc_occ.fit(X, y)

enc_num = SplineEncoder(feature='temperature',
                        n_knots=3,
                        degree=1,
                        strategy='uniform',
                        extrapolation='constant',
                        include_bias=True,
          )
[33]:
enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_num)
features = enc.fit_transform(X)
model = LinearRegression(fit_intercept=False).fit(features, y)
pred = model.predict(features)
[34]:
resid = y.values - pred
print(f'Mean squared error: {np.mean(resid**2)}')
Mean squared error: 329833.6746798132

Conditional seasonality

By combining a CategoricalEncoder with a CyclicalFeatures generator, we can create features of conditional seasonalities very similarly to how the Prophet library does it:

https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#seasonalities-that-depend-on-other-factors

[35]:
data = pd.DataFrame(index=pd.date_range(start='1/1/2018', end='31/12/2019', freq='D'))
data['weekday'] = data.index.dayofweek < 5

data.head()
[35]:
weekday
2018-01-01 True
2018-01-02 True
2018-01-03 True
2018-01-04 True
2018-01-05 True
[36]:
data = CyclicalFeatures(seasonality='yearly', fourier_order=3).fit_transform(data)
data.head()
[36]:
weekday yearly_delim_0 yearly_delim_1 yearly_delim_2 yearly_delim_3 yearly_delim_4 yearly_delim_5
2018-01-01 True 0.008601 0.999963 0.017202 0.999852 0.025801 0.999667
2018-01-02 True 0.025801 0.999667 0.051584 0.998669 0.077334 0.997005
2018-01-03 True 0.042993 0.999075 0.085906 0.996303 0.128661 0.991689
2018-01-04 True 0.060172 0.998188 0.120126 0.992759 0.179645 0.983732
2018-01-05 True 0.077334 0.997005 0.154204 0.988039 0.230151 0.973155
[37]:
enc_cat = CategoricalEncoder(feature='weekday', encode_as='onehot')
features_cat = enc_cat.fit_transform(data)
features_cat.shape
[37]:
(730, 2)
[38]:
enc_lin = IdentityEncoder(feature='yearly', as_filter=True)
features_cyc = enc_lin.fit_transform(data)
features_cyc.shape
[38]:
(730, 6)

As a tensor product:

[39]:
features_tp = tensor_product(features_cat, features_cyc)
features_tp = pd.DataFrame(data=features_tp, index=data.index)
features_tp.shape
[39]:
(730, 12)
[40]:
with plt.style.context('seaborn-whitegrid'):
    fig, axs = plt.subplots(features_tp.shape[1], figsize=(14, 10), dpi=96)

    for i in range(features_tp.shape[1]):
        axs[i].plot(features_tp.loc[:, i])

fig.tight_layout()
_images/Tutorial_Interactions_57_0.png
[41]:
assert np.all(features_tp.loc[data['weekday'], [0, 1, 2, 3, 4, 5]] == 0)
assert np.all(features_tp.loc[~data['weekday'], [6, 7, 8, 9, 10, 11]] == 0)

The same thing can be achieved by:

[42]:
enc_cat = CategoricalEncoder(feature='weekday', encode_as='onehot')
enc_num = IdentityEncoder(feature='yearly', as_filter=True)

enc = ICatLinearEncoder(encoder_cat=enc_cat, encoder_num=enc_num)
features_enc = enc.fit_transform(data)
features_enc = pd.DataFrame(data=features_enc, index=data.index)
[43]:
assert np.all(features_tp == features_enc)

Note that for the case of cyclical data, the *first split, then encode* and *first encode, then split* strategies are equivalent, because the encoding uses only the information in each row and not any other value from the same column.


The functionality for composing linear model features

feature-encoders includes a ModelStructure class for aggregating feature generators and encoders into main effect and pairwise interaction terms for linear regression models.

A ModelStructure instance can get information about features and encoders either from YAML files or through its API.

[1]:
import calendar
import json
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
[2]:
from feature_encoders.utils import load_config
from feature_encoders.compose import ModelStructure, FeatureComposer
from feature_encoders.generate import DatetimeFeatures
from feature_encoders.models import SeasonalPredictor

Reading information from YAML files

feature-encoders expects two YAML files:

Feature generator file

A file that provides a mapping between the name of a feature generator and the classes that should be used for the validation of its inputs and for its creation:

trend:
  validate: validate.TrendSchema
  generate: generate.TrendFeatures

datetime:
  validate: validate.DatetimeSchema
  generate: generate.DatetimeFeatures

cyclical:
  validate: validate.CyclicalSchema
  generate: generate.CyclicalFeatures

By default, ModelStructure searches in feature_encoders.config to find the validation and generation classes, but classes from other packages can be used by providing their fully qualified names.
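
For example (hypothetical class names, just to illustrate fully qualified entries):

# Hypothetical entries pointing at classes outside feature_encoders.config
my_feature:
  validate: my_package.validation.MyFeatureSchema
  generate: my_package.features.MyFeatureGenerator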

Model configuration file

This file has three sections: (a) added features, (b) regressors, and (c) interactions.

Added features

The information in this section is passed to one of the feature generators in feature_encoders.generate:

add_features:
  time:                      # the name of the generator
    ds: null
    type: datetime
    remainder: passthrough
    subset: month, hourofweek
Regressors

The information for each regressor includes its name, the name of the feature to encode in order to create the regressor, the type of the encoder (linear, spline or categorical), and the parameters to pass to the corresponding encoder class from feature_encoders.encode:

regressors:
  month:                 # the name of the regressor
    feature: month       # the name of the feature
    type: categorical
    max_n_categories: null
    encode_as: onehot

  tow:                   # the name of the regressor
    feature: hourofweek  # the name of the feature
    type: categorical
    max_n_categories: 60
    encode_as: onehot

  flex_temperature:
    feature: temperature
    type: spline
    n_knots: 5
    degree: 1
    strategy: uniform
    extrapolation: constant
    interaction_only: true  # if True, it will not be included in the main features
Interactions

Interactions can introduce new regressors, reuse regressors already defined in the regressors section, or alter the parameters of already-defined regressors:

interactions:
  tow, flex_temperature:
    tow:
      max_n_categories: 2
      stratify_by: temperature
      min_samples_leaf: 15
Load configuration files
[3]:
model_conf, feature_conf = load_config(model='towt', features='default')
[4]:
print(json.dumps(model_conf, indent=4))
{
    "add_features": {
        "time": {
            "type": "datetime",
            "subset": "month, hourofweek"
        }
    },
    "regressors": {
        "month": {
            "feature": "month",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "tow": {
            "feature": "hourofweek",
            "type": "categorical",
            "max_n_categories": 60,
            "encode_as": "onehot"
        },
        "lin_temperature": {
            "feature": "temperature",
            "type": "linear"
        },
        "flex_temperature": {
            "feature": "temperature",
            "type": "spline",
            "n_knots": 5,
            "degree": 1,
            "strategy": "uniform",
            "extrapolation": "constant",
            "include_bias": true,
            "interaction_only": true
        }
    },
    "interactions": {
        "tow, flex_temperature": {
            "tow": {
                "max_n_categories": 2,
                "stratify_by": "temperature",
                "min_samples_leaf": 15
            }
        }
    }
}
[5]:
print(json.dumps(feature_conf, indent=4))
{
    "trend": {
        "validate": "validate.TrendSchema",
        "generate": "generate.TrendFeatures"
    },
    "datetime": {
        "validate": "validate.DatetimeSchema",
        "generate": "generate.DatetimeFeatures"
    },
    "cyclical": {
        "validate": "validate.CyclicalSchema",
        "generate": "generate.CyclicalFeatures"
    }
}
Create ModelStructure
[6]:
model_structure = ModelStructure.from_config(model_conf, feature_conf)
[7]:
for key, val in model_structure.components.items():
    print(key, '-->', val.keys())
add_features --> dict_keys(['time'])
main_effects --> dict_keys(['month', 'tow', 'lin_temperature'])
interactions --> dict_keys([('tow', 'flex_temperature')])
Create FeatureComposer

Given the model structure, we can create and apply a FeatureComposer:

[8]:
composer = FeatureComposer(model_structure)
Load demo data
[9]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
Use the FeatureComposer
[10]:
X = data[['temperature']]
y = data['consumption']

composer = composer.fit(X, y)

The fit method of the composer calls two methods: _create_new_features and _create_encoders. The feature generators are applied in the same order that they were declared in the YAML configuration file.

[11]:
for item in composer.added_features_:
    print(item)
DatetimeFeatures(subset=['month', 'hourofweek'])
[12]:
for name, encoder in composer.encoders_['main_effects'].items():
    print('-->', name)
    print(encoder)
--> month
CategoricalEncoder(feature='month')
--> tow
CategoricalEncoder(feature='hourofweek', max_n_categories=60)
--> lin_temperature
IdentityEncoder(feature='temperature')
[13]:
for pair_name, encoder in composer.encoders_['interactions'].items():
    print('-->', pair_name)
    print(encoder)
--> ('tow', 'flex_temperature')
ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek',
                                                 max_n_categories=2,
                                                 min_samples_leaf=15,
                                                 stratify_by=['temperature']),
                  encoder_num=SplineEncoder(degree=1, feature='temperature'))

After fitting, a composer has a component_names_ attribute:

[14]:
composer.component_names_
[14]:
['lin_temperature', 'month', 'tow', 'tow:flex_temperature']

It also has a component_matrix attribute that shows how the different columns of the design matrix correspond to the different components. This allows us to break down a model’s prediction into the additive contribution of each component.

[15]:
composer.component_matrix
[15]:
component lin_temperature month tow tow:flex_temperature
col
0 0 1 0 0
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
... ... ... ... ...
78 0 0 0 1
79 0 0 0 1
80 0 0 0 1
81 0 0 0 1
82 0 0 0 1

83 rows × 4 columns

The design matrix is constructed by transforming the data:

[16]:
design_matrix = composer.transform(X)
[17]:
assert design_matrix.shape[0] == X.shape[0]
assert design_matrix.shape[1] == composer.component_matrix.shape[0]
[18]:
n_features = 0

for encoder in composer.encoders_['main_effects'].values():
    n_features += encoder.n_features_out_

for encoder in composer.encoders_['interactions'].values():
    n_features += encoder.n_features_out_

assert design_matrix.shape[1] == n_features
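
As an illustration of the additive breakdown mentioned earlier, here is a minimal sketch (the LinearRegression fit is hypothetical and not part of the tutorial; it assumes each design-matrix column belongs to exactly one component, as in the output above):

import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=False).fit(design_matrix, y)

# Sum each component's columns separately; the per-component
# predictions add up to the model's full prediction.
contributions = {}
for component in composer.component_names_:
    cols = composer.component_matrix.index[
        composer.component_matrix[component] == 1
    ].to_numpy()
    contributions[component] = design_matrix[:, cols] @ model.coef_[cols]

assert np.allclose(sum(contributions.values()), model.predict(design_matrix))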

Using the API

An example of using the ModelStructure API can be found in feature_encoders.models.SeasonalPredictor:

def _create_composer(self):
    model_structure = ModelStructure()

    if self.add_trend:
        model_structure = model_structure.add_new_feature(
            name="added_trend",
            fgen_type=TrendFeatures(
                ds=self.ds,
                name="growth",
                remainder="passthrough",
                replace=False,
            ),
        )
        model_structure = model_structure.add_main_effect(
            name="trend",
            enc_type=IdentityEncoder(
                feature="growth",
                as_filter=False,
                include_bias=False,
            ),
        )

    for seasonality, props in self.seasonalities_.items():
        condition_name = props["condition_name"]

        model_structure = model_structure.add_new_feature(
            name=seasonality,
            fgen_type=CyclicalFeatures(
                seasonality=seasonality,
                ds=self.ds,
                period=props.get("period"),
                fourier_order=props.get("fourier_order"),
                remainder="passthrough",
                replace=False,
            ),
        )

        if condition_name is None:
            model_structure = model_structure.add_main_effect(
                name=seasonality,
                enc_type=IdentityEncoder(
                    feature=seasonality,
                    as_filter=True,
                    include_bias=False,
                ),
            )
        else:
            model_structure = model_structure.add_interaction(
                lenc_name=condition_name,
                renc_name=seasonality,
                lenc_type=CategoricalEncoder(
                    feature=condition_name, encode_as="onehot"
                ),
                renc_type=IdentityEncoder(
                    feature=seasonality, as_filter=True, include_bias=False
                ),
            )
    return FeatureComposer(model_structure)
[19]:
model = SeasonalPredictor(
        ds=None,
        add_trend=True,
        yearly_seasonality="auto",
        weekly_seasonality=False,
        daily_seasonality=False,
)

We can add a different daily seasonality per day of week:

[20]:
X = DatetimeFeatures(subset='dayofweek').fit_transform(X)
X['dayofweek'] = X['dayofweek'].map(lambda x: calendar.day_abbr[x])
X = X.merge(
    pd.get_dummies(X['dayofweek']),
    left_index=True,
    right_index=True,
).drop('dayofweek', axis=1)
[21]:
for i in range(7):
    day = calendar.day_abbr[i]
    model.add_seasonality(
        f"daily_on_{day}", period=1, fourier_order=4, condition_name=day
    )
[22]:
model = model.fit(X, y)
[23]:
for item in model.composer_.added_features_:
    print(item)
TrendFeatures()
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Mon')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Tue')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Wed')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Thu')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Fri')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Sat')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Sun')
CyclicalFeatures(fourier_order=6, period=365.25, seasonality='yearly')
[24]:
for name, encoder in model.composer_.encoders_['main_effects'].items():
    print('-->', name)
    print(encoder)
--> trend
IdentityEncoder(feature='growth')
--> yearly
IdentityEncoder(as_filter=True, feature='yearly')
[25]:
for pair_name, encoder in model.composer_.encoders_['interactions'].items():
    print('-->', pair_name)
    print(encoder)
--> ('Mon', 'daily_on_Mon')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Mon'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Mon'))
--> ('Tue', 'daily_on_Tue')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Tue'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Tue'))
--> ('Wed', 'daily_on_Wed')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Wed'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Wed'))
--> ('Thu', 'daily_on_Thu')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Thu'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Thu'))
--> ('Fri', 'daily_on_Fri')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Fri'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Fri'))
--> ('Sat', 'daily_on_Sat')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Sat'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Sat'))
--> ('Sun', 'daily_on_Sun')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Sun'),
                  encoder_num=IdentityEncoder(as_filter=True,
                                              feature='daily_on_Sun'))
[26]:
prediction = model.predict(X)
[27]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    prediction['consumption'][:1344].plot(ax=ax, alpha=0.8)  # first two weeks of data
    y[:1344].plot(ax=ax, alpha=0.5)
[Figure: two weeks of predicted vs. actual consumption]
Consistency checks
[28]:
design_matrix = model.composer_.transform(X)
[29]:
for i in range(7):
    day = calendar.day_abbr[i]

    # design-matrix columns that belong to this day's interaction component
    subset_index = model.composer_.component_matrix[
        model.composer_.component_matrix[f'{day}:daily_on_{day}'] == 1
    ].index
    subset = pd.DataFrame(design_matrix[:, subset_index], index=X.index)

    # columns that are all zero when the day is active, and when it is not
    features_on = subset.columns[(subset.loc[X[X[day] == 1].index] == 0).all()]
    features_off = subset.columns[(subset.loc[X[X[day] == 0].index] == 0).all()]

    # a column in both sets would be identically zero; assert that every
    # column in this component is nonzero in at least one regime
    assert features_on.intersection(features_off).empty

The model works even if we replace:

model_structure = model_structure.add_interaction(
    lenc_name=condition_name,
    renc_name=seasonality,
    lenc_type=CategoricalEncoder(
        feature=condition_name, encode_as="onehot"
    ),
    renc_type=IdentityEncoder(
        feature=seasonality, as_filter=True, include_bias=False
    ),
)

with

model_structure = model_structure.add_interaction(
    lenc_name=condition_name,
    renc_name=seasonality,
    lenc_type="categorical",
    renc_type="linear",
    **{
        condition_name: {"encode_as": "onehot"},
        seasonality: {"as_filter": True, "include_bias": False},
    },
)

because the FeatureComposer maps “categorical” to CategoricalEncoder, “linear” to IdentityEncoder and “spline” to SplineEncoder, and passes all additional keyword arguments to the corresponding initializers.
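
As a rough illustration of that mapping (a hypothetical sketch, not the library's source; ENCODER_MAP and resolve_encoder are invented names, and the import location of IdentityEncoder is an assumption):

from feature_encoders.encode import CategoricalEncoder, IdentityEncoder, SplineEncoder

# Hypothetical lookup table mirroring the mapping described above.
ENCODER_MAP = {
    "categorical": CategoricalEncoder,
    "linear": IdentityEncoder,
    "spline": SplineEncoder,
}

def resolve_encoder(enc_type, feature, **kwargs):
    """Turn a string alias into an encoder instance; pass encoder objects through."""
    if isinstance(enc_type, str):
        return ENCODER_MAP[enc_type](feature=feature, **kwargs)
    return enc_type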


Applications of feature-encoders

In this section, we present two applications:

  • one for a simple linear regression model, and

  • one for a grouped linear regression model (a model that splits the data by the values of a single column and fits one estimator per group).

[1]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

%matplotlib inline
[2]:
from feature_encoders.utils import load_config
from feature_encoders.generate import DatetimeFeatures
from feature_encoders.encode import CategoricalEncoder, SplineEncoder, ICatSplineEncoder
from feature_encoders.compose import ModelStructure
from feature_encoders.models import LinearPredictor, GroupedPredictor
[3]:
def cvrmse(y_true, y_pred):
    resid = y_true - y_pred
    return float(np.sqrt((resid ** 2).sum() / len(resid)) / np.mean(y_true))


def nmbe(y_true, y_pred):
    resid = y_true - y_pred
    return float(np.mean(resid) / np.mean(y_true))
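
In formula terms, with residuals r_i = y_i − ŷ_i and ȳ the mean of the observed values, these helpers compute

CV(RMSE) = sqrt((1/n) · Σ r_i²) / ȳ        NMBE = ((1/n) · Σ r_i) / ȳ

and both are reported below as percentages by multiplying by 100.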

Load demo data

The data consists of the energy consumption of a building and the outdoor air temperature.

[4]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]

X = data[['temperature']]
y = data['consumption']

Linear regression model

The simplest model to use is one that includes only the hour of the week as a feature. The hour of the week is categorical, so it can be encoded in one-hot form:

[5]:
features = DatetimeFeatures(subset='hourofweek', remainder='drop').fit_transform(X)
dmatrix = CategoricalEncoder(feature='hourofweek', encode_as='onehot').fit_transform(features)

We can fit a linear model:

[6]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)

… and evaluate it in-sample:

[7]:
pred = model.predict(dmatrix)
[8]:
y_true = y.values

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 19.332298975680697
In-sample NMBE (%): 7.804127359819421e-15
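
The NMBE is zero up to floating-point error by construction: the one-hot columns sum to one in every row, so the design matrix spans a constant column and the ordinary least squares residuals average to zero in-sample.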

The degrees of freedom of the model are:

[9]:
np.linalg.matrix_rank(dmatrix)
[9]:
168

The impact of the hour of the week on energy consumption is then:

[10]:
pred = pd.DataFrame(data=pred, index=y.index, columns=['hourofweek_impact'])

date_enc = DatetimeFeatures(remainder='passthrough', subset='hourofweek')
to_plot = date_enc.fit_transform(pred).groupby('hourofweek').mean()


colors = ['#8c510a', '#d8b365', '#f6e8c3', '#f5f5f5', '#c7eae5', '#5ab4ac', '#01665e']

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    intervals = np.split(to_plot.index, 7)
    for i, item in enumerate(intervals):
        ax.axvspan(item[0], item[-1], alpha=0.3, color=colors[i])

    to_plot.plot(ax=ax)
    ax.set_xlabel('Hour of week')
    ax.legend(['Average contribution of hour-of-week feature'], fancybox=True, frameon=True)
[Figure: average contribution of the hour-of-week feature, with the seven days of the week shaded]

We can reduce the number of categories for the hour-of-week feature while retaining as much of the feature’s predictive capability as possible:

[11]:
features = DatetimeFeatures(subset='hourofweek', remainder='drop').fit_transform(X)
enc = CategoricalEncoder(feature='hourofweek', encode_as='onehot', max_n_categories=60)
dmatrix = enc.fit_transform(features, y)
[12]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 19.34946719265366
In-sample NMBE (%): 4.263718362437927e-14

This is practically the same performance with about one third of the degrees of freedom:

[13]:
np.linalg.matrix_rank(dmatrix)
[13]:
60

Another component to include in the model is an interaction term between the hour of the week and the temperature.

The TOWT model estimates the temperature effect separately for periods of the day with high and with low energy consumption in order to distinguish between occupied and unoccupied building periods.

To this end, a flexible curve is fitted on the consumption~temperature relationship and, if more than 65% of the data points that correspond to a specific hour of the week lie above the fitted curve, that hour is flagged as “Occupied”; otherwise it is flagged as “Unoccupied.”

We can apply this approach using feature-encoders functionality:

[14]:
enc = SplineEncoder(feature='temperature', degree=1, strategy='uniform').fit(X)
dmatrix = enc.transform(X)
[15]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)
pred = pd.Series(data=pred, index=y.index)
[16]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    ax.scatter(X['temperature'], y, s=1, alpha=0.2)

    X_sorted = X.sort_values(by='temperature')
    ax.plot(X_sorted, pred.loc[X_sorted.index], c='#cc4c02')
[Figure: consumption vs. temperature scatter with the fitted spline curve]
[17]:
resid = y - pred
mask = resid > 0  # observations above the fitted curve
mask = DatetimeFeatures(subset='hourofweek').fit_transform(mask.to_frame('freq'))
# an hour of the week is occupied if more than 65% of its observations
# lie above the fitted curve
occupied = mask.groupby('hourofweek')['freq'].mean() > 0.65
occupied = occupied.to_dict()
[18]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X)
features['occupied'] = features['hourofweek'].map(lambda x: occupied[x])
features.head()
[18]:
                     temperature  hourofweek  occupied
timestamp
2015-12-07 12:00:00       14.300          12      True
2015-12-07 12:15:00       14.525          12      True
2015-12-07 12:30:00       14.750          12      True
2015-12-07 12:45:00       14.975          12      True
2015-12-07 13:00:00       15.200          13      True
[19]:
enc_temp = SplineEncoder(feature='temperature', degree=1, strategy='uniform')
enc_occ = CategoricalEncoder(feature='occupied', encode_as='onehot')
enc_occ = enc_occ.fit(features)  # fit before passing it to the interaction encoder

enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_temp)
dmatrix = enc.fit_transform(features)
[20]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 18.2731614241806
In-sample NMBE (%): 2.152083291755081e-14
[21]:
np.linalg.matrix_rank(dmatrix)
[21]:
10

Alternatively, we can rely on feature-encoders functionality to categorize the hours of the week into the two most dissimilar categories in terms of energy consumption given temperature information:

[22]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X)

enc_temp = SplineEncoder(
    feature='temperature', degree=1, strategy='uniform'
)
enc_occ = CategoricalEncoder(
    feature='hourofweek',
    max_n_categories=2,
    stratify_by='temperature',
    min_samples_leaf=15
)
enc_occ = enc_occ.fit(features, y)  # fit before passing it to the interaction encoder

enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_temp)
dmatrix = enc.fit_transform(features)
[23]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)

print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 17.4373587893191
In-sample NMBE (%): 8.789160508284433e-14

The prediction results are better while the number of degrees of freedom is the same:

[24]:
np.linalg.matrix_rank(dmatrix)
[24]:
10

The consumption~temperature curves per hour-of-week category are then:

[25]:
date_enc = DatetimeFeatures(remainder='passthrough', subset='hourofweek')

intervals = pd.concat(
    ( pd.cut(X['temperature'], 15, precision=0),
      pd.DataFrame(data=pred, index=X.index, columns=['temperature_impact'])
    ),
    axis=1
)

enc_cat = enc_occ.feature_pipeline_['reduce_dimension']
intervals = date_enc.fit_transform(intervals)
intervals['hourofweek'] = intervals['hourofweek'].map(lambda x: enc_cat.mapping_[x])

to_plot = (
    intervals.groupby(['hourofweek', 'temperature'])['temperature_impact']
             .mean()
             .unstack()
)

colors = ['#8c510a', '#df65b0']

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    for i, (idx, values) in enumerate(to_plot.iterrows()):
        values.plot(ax=ax, lw=2, alpha=0.6, label=f'category {idx}', color=colors[i])

    ax.xaxis.set_major_locator(plt.MaxNLocator(10))
    ax.set_xlabel('Temperature intervals')
    ax.legend(fancybox=True, frameon=True)
[Figure: average temperature impact per temperature interval, one curve per hour-of-week category]

In config/models, there is a YAML file (towt.yaml) that defines a linear regression model with the two components above and two additional ones:

  1. A categorical feature for the different months in the dataset.

  2. A linear term for the temperature as a main effect. The interaction term between temperature and the hour of the week “corrects” the predictions of the temperature’s linear term in the main effects.

[26]:
model_conf, feature_conf = load_config(model='towt', features='default')
[27]:
print(json.dumps(model_conf, indent=4))
{
    "add_features": {
        "time": {
            "type": "datetime",
            "subset": "month, hourofweek"
        }
    },
    "regressors": {
        "month": {
            "feature": "month",
            "type": "categorical",
            "encode_as": "onehot"
        },
        "tow": {
            "feature": "hourofweek",
            "type": "categorical",
            "max_n_categories": 60,
            "encode_as": "onehot"
        },
        "lin_temperature": {
            "feature": "temperature",
            "type": "linear"
        },
        "flex_temperature": {
            "feature": "temperature",
            "type": "spline",
            "n_knots": 5,
            "degree": 1,
            "strategy": "uniform",
            "extrapolation": "constant",
            "include_bias": true,
            "interaction_only": true
        }
    },
    "interactions": {
        "tow, flex_temperature": {
            "tow": {
                "max_n_categories": 2,
                "stratify_by": "temperature",
                "min_samples_leaf": 15
            }
        }
    }
}

The feature_encoders.models.LinearPredictor is a linear/ridge regression model that can be created using model configurations such as the above.

[28]:
model_structure = ModelStructure.from_config(model_conf, feature_conf)
model = LinearPredictor(model_structure=model_structure)

Fit with available data:

[29]:
%%time
model = model.fit(X, y)
Wall time: 2.4 s

Evaluate the model in-sample:

[30]:
%%time
pred = model.predict(X)

print(f"In-sample CV(RMSE) (%): {cvrmse(y, pred['consumption'])*100}")
print(f"In-sample NMBE (%): {nmbe(y, pred['consumption'])*100}")
In-sample CV(RMSE) (%): 15.718435024677973
In-sample NMBE (%): 6.953335826993842e-05
Wall time: 290 ms

The effective number of parameters (i.e. the degrees of freedom) is:

[31]:
model.dof
[31]:
79
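
This is smaller than the 83 columns of the design matrix. A plausible interpretation (an assumption about LinearPredictor’s internals, not something stated by the library) is the conventional ridge definition of effective degrees of freedom,

dof = tr(X (XᵀX + αI)⁻¹ Xᵀ) = Σ σ_i² / (σ_i² + α),

where σ_i are the singular values of the design matrix X; any penalty α > 0 pulls this value below the rank of X.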

This is how the columns of the regression’s design matrix map to each regressor:

[32]:
model.composer_.component_matrix
[32]:
component  lin_temperature  month  tow  tow:flex_temperature
col
0                        0      1    0                     0
1                        0      1    0                     0
2                        0      1    0                     0
3                        0      1    0                     0
4                        0      1    0                     0
...                    ...    ...  ...                   ...
78                       0      0    0                     1
79                       0      0    0                     1
80                       0      0    0                     1
81                       0      0    0                     1
82                       0      0    0                     1

83 rows × 4 columns

This makes it easy to decompose the prediction into components (the regularization strength alpha=0.01 in the LinearPredictor was used primarily so that the individual components take reasonable values):
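
Conceptually, each component’s contribution is its design-matrix columns multiplied by the corresponding fitted coefficients. The sketch below illustrates the idea; model.estimator_.coef_ is an assumed attribute name for the underlying regressor’s coefficients, not a documented API:

import pandas as pd

# Hedged sketch of the component decomposition. Only composer_,
# component_matrix and transform appear in this tutorial; estimator_
# is an assumption.
design = model.composer_.transform(X)
comp = model.composer_.component_matrix

components = {}
for name in comp.columns:
    cols = comp.index[comp[name] == 1]  # design-matrix columns of this component
    components[name] = design[:, cols] @ model.estimator_.coef_[cols]

decomposed = pd.DataFrame(components, index=X.index)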

[33]:
%%time
pred = model.predict(X, include_components=True)
pred.head()
Wall time: 337 ms
[33]:
                     consumption  lin_temperature       month         tow  tow:flex_temperature
timestamp
2015-12-07 12:00:00  4087.304937      1569.095673  593.424166  402.220121           1522.564976
2015-12-07 12:15:00  4084.172916      1593.784242  593.424166  402.220121           1494.744387
2015-12-07 12:30:00  4081.040895      1618.472810  593.424166  402.220121           1466.923798
2015-12-07 12:45:00  4077.908875      1643.161378  593.424166  402.220121           1439.103209
2015-12-07 13:00:00  4144.719000      1667.849947  593.424166  472.162267           1411.282620
[34]:
assert np.allclose(
    pred['consumption'],
    pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)
[35]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    pred['consumption'][:1344].plot(ax=ax, alpha=0.8)  # first two weeks of data
    y[:1344].plot(ax=ax, alpha=0.5)
[Figure: two weeks of predicted vs. actual consumption]

Grouped linear regression model

The feature_encoders.models.GroupedPredictor can be applied on different clusters of a dataset. For this example, we assume that the clusters are created by a KMeans approach applied on daily consumption profiles, but there are smarter methods to distinguish between consumption profiles while ensuring that the cluster of a new observation can still be determined at prediction time, when no consumption data is available (see, for instance, how the eensight tool for automated M&V approaches this problem).

Since each of the models in the ensemble predicts on a different subset of the input data (an observation cannot belong to more than one cluster), the final prediction is generated by vertically concatenating all the individual models’ predictions.
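
Conceptually, the prediction step looks like the following sketch (an assumption-laden illustration, not the library’s code): each fitted estimator predicts on its group’s rows, and the pieces are concatenated and re-sorted by the index:

import pandas as pd

def grouped_predict(estimators, X, group_feature):
    """Sketch of grouped prediction: one fitted estimator per group value,
    with the per-group predictions vertically concatenated and the
    original (chronological) row order restored via the index."""
    parts = []
    for group, est in estimators.items():
        subset = X[X[group_feature] == group].drop(columns=group_feature)
        parts.append(est.predict(subset))
    return pd.concat(parts).sort_index()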


[36]:
data['time'] = data.index.time
data['date'] = data.index.date

to_cluster = data.pivot(index='date', columns='time', values='consumption')
to_cluster = to_cluster.fillna(method='bfill').fillna(method='ffill')
[37]:
kmeans = KMeans(n_clusters=3).fit(to_cluster.values)
groups = pd.Series(data=kmeans.labels_, index=to_cluster.index)
data['group'] = data['date'].map(lambda x: str(groups[x]))
[38]:
colors = ['#8c510a', '#3690c0', '#dd3497']

with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    for i, (_, grouped) in enumerate(data.groupby('group')):
        grouped.pivot(index='time', columns='date', values='consumption').plot(
            ax=ax, legend=False, alpha=0.05, color=colors[i])
[Figure: daily consumption profiles, colored by KMeans cluster]
[39]:
X = data[['temperature', 'group']]
y = data['consumption']
[40]:
model = GroupedPredictor(
    group_feature='group',
    model_conf=model_conf,
    feature_conf=feature_conf
)
[41]:
%%time
model = model.fit(X, y)
Wall time: 2.83 s

The GroupedPredictor applies the feature generation transformers defined in model_conf directly on the dataset before it is split per cluster:

[42]:
model.added_features_
[42]:
[DatetimeFeatures(subset=['month', 'hourofweek'])]

… whereas the cluster predictors do not see or apply any feature generator:

[43]:
for group, est in model.estimators_.items():
    print(group, '-->', est.composer_.added_features_)
0 --> []
1 --> []
2 --> []

In addition, GroupedPredictor fits all categorical encoders in ordinal form, and then passes the encoded data to each cluster predictor:

[44]:
for name, encoder in model.encoders_['main_effects'].items():
    print(name, '-->', encoder)
month --> CategoricalEncoder(encode_as='ordinal', feature='month')
tow --> CategoricalEncoder(encode_as='ordinal', feature='hourofweek',
                   max_n_categories=60)

… and adds the group feature to every stratify_by that is not empty:

[45]:
for pair_name, encoder in model.encoders_['interactions'].items():
    print(pair_name, '-->', encoder)
('tow', 'flex_temperature') --> {'tow': CategoricalEncoder(encode_as='ordinal', feature='hourofweek',
                   max_n_categories=2, min_samples_leaf=15,
                   stratify_by=['group', 'temperature'])}

The cluster predictors get encoders that operate on data that has been transformed by the categorical encoders of the GroupedPredictor. In this way, categorical data is always encoded with full information (while numerical data is encoded at the cluster level):

[46]:
for group, est in model.estimators_.items():
    print('group ', group)
    for name, encoder in est.composer_.encoders_['main_effects'].items():
        print(name, '-->', encoder)
    print('\n')
group  0
month --> CategoricalEncoder(feature='month__for__month')
tow --> CategoricalEncoder(feature='hourofweek__for__tow')
lin_temperature --> IdentityEncoder(feature='temperature')


group  1
month --> CategoricalEncoder(feature='month__for__month')
tow --> CategoricalEncoder(feature='hourofweek__for__tow')
lin_temperature --> IdentityEncoder(feature='temperature')


group  2
month --> CategoricalEncoder(feature='month__for__month')
tow --> CategoricalEncoder(feature='hourofweek__for__tow')
lin_temperature --> IdentityEncoder(feature='temperature')


[47]:
for group, est in model.estimators_.items():
    print('group ', group)
    for name, encoder in est.composer_.encoders_['interactions'].items():
        print(name, '-->', encoder)
    print('\n')
group  0
('tow', 'flex_temperature') --> ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek__for__tow:flex_temperature',
                                                 min_samples_leaf=15),
                  encoder_num=SplineEncoder(degree=1, feature='temperature'))


group  1
('tow', 'flex_temperature') --> ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek__for__tow:flex_temperature',
                                                 min_samples_leaf=15),
                  encoder_num=SplineEncoder(degree=1, feature='temperature'))


group  2
('tow', 'flex_temperature') --> ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek__for__tow:flex_temperature',
                                                 min_samples_leaf=15),
                  encoder_num=SplineEncoder(degree=1, feature='temperature'))


[48]:
%%time
pred = model.predict(X)

print(f"In-sample CV(RMSE) (%): {cvrmse(y, pred['consumption'])*100}")
print(f"In-sample NMBE (%): {nmbe(y, pred['consumption'])*100}")
In-sample CV(RMSE) (%): 13.537423014233884
In-sample NMBE (%): 0.0001238626531768618
Wall time: 567 ms

The number of parameters is:

[49]:
model.n_parameters
[49]:
242

… and the degrees of freedom:

[50]:
model.dof
[50]:
230
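
The effective degrees of freedom fall below the raw parameter count, plausibly because the ridge penalty in each cluster’s LinearPredictor shrinks the effective dimensionality of its fit.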

Since we have fitted one LinearPredictor per cluster, it is still easy to decompose the prediction into components:

[51]:
%%time
pred = model.predict(X, include_components=True)
pred.head()
Wall time: 650 ms
[51]:
                     consumption  lin_temperature       month         tow  tow:flex_temperature
timestamp
2015-12-07 12:00:00  4323.340119      1439.591296  669.917419  535.061043           1678.770360
2015-12-07 12:15:00  4315.882619      1462.242208  669.917419  535.061043           1648.661948
2015-12-07 12:30:00  4308.425119      1484.893120  669.917419  535.061043           1618.553536
2015-12-07 12:45:00  4305.942539      1507.544032  669.917419  535.061043           1593.420044
2015-12-07 13:00:00  4398.279232      1530.194944  669.917419  629.880316           1568.286553
[52]:
assert np.allclose(
    pred['consumption'],
    pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)
[53]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(12, 3), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))

    pred['consumption'][:1344].plot(ax=ax, alpha=0.8)  # first two weeks of data
    y[:1344].plot(ax=ax, alpha=0.5)
[Figure: two weeks of predicted vs. actual consumption]

Getting Help

First, please check the issues on GitHub to see if your question has already been answered there. If no solution is available, feel free to open a new issue and the authors will attempt to respond in a reasonably timely fashion.

Feature Encoders

Functionality

feature-encoders is a library for encoding categorical and numerical features into inputs for linear regression models. In particular, it includes functionality for:

  1. Applying custom feature generators to a dataset. Users can add a feature generator to the existing ones by declaring a class for the validation of their inputs and a class for their creation.

  2. Encoding categorical and numerical features. The categorical encoder provides the option to reduce the cardinality of a categorical feature by lumping together categories for which the corresponding distribution of the target values is similar.

  3. Encoding interactions. Interactions are always pairwise and always between encoders (and not features). The supported interactions are between: (a) categorical and categorical encoders, (b) categorical and linear encoders, (c) categorical and spline encoders, (d) linear and linear encoders, and (e) spline and spline encoders.

  4. Composing features for linear regression. feature-encoders includes a ModelStructure class for aggregating feature generators and encoders into main effect and pairwise interaction terms for linear regression models. A ModelStructure instance can get information about additional features and encoders either from YAML files or through its API; a minimal end-to-end sketch follows this list.
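
The sketch below strings these pieces together, using only classes and methods that appear in the tutorials; the import location of IdentityEncoder and the demo dataset (a 'temperature' column with a datetime index, X and y as before) are assumptions:

from feature_encoders.compose import FeatureComposer, ModelStructure
from feature_encoders.encode import CategoricalEncoder, IdentityEncoder
from feature_encoders.generate import DatetimeFeatures

structure = (
    ModelStructure()
    # generate an hour-of-week feature from the datetime index
    .add_new_feature(
        name="time",
        fgen_type=DatetimeFeatures(subset="hourofweek", remainder="passthrough"),
    )
    # one-hot encoded time-of-week main effect
    .add_main_effect(
        name="tow",
        enc_type=CategoricalEncoder(feature="hourofweek", encode_as="onehot"),
    )
    # linear temperature main effect
    .add_main_effect(
        name="lin_temperature",
        enc_type=IdentityEncoder(feature="temperature"),
    )
    # pairwise interaction between a categorical and a linear encoder;
    # the pair names are just labels for the interaction term
    .add_interaction(
        lenc_name="tow",
        renc_name="flex_temperature",
        lenc_type=CategoricalEncoder(feature="hourofweek", encode_as="onehot"),
        renc_type=IdentityEncoder(feature="temperature"),
    )
)

design_matrix = FeatureComposer(structure).fit(X, y).transform(X)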

How to use feature-encoders

Please see our API documentation for a complete list of available functions, and our tutorials for more comprehensive example use cases.

Python Version

feature-encoders supports Python 3.7+

License

Copyright 2021 Hebes Intelligence. Released under the terms of the Apache License, Version 2.0.


