Feature Encoders documentation
Installation
Using pip
python -m pip install feature-encoders
From source
To install feature_encoders from source, first clone the source repository:
git clone https://github.com/hebes-io/feature-encoders.git
cd feature-encoders
Next, install all dependencies using the requirements.txt file in the root of this repository:
python -m pip install -r requirements.txt
Once the dependencies are installed (while still inside the feature-encoders directory), execute:
python -m pip install .
feature_encoders package
Subpackages
feature_encoders.compose package
Module contents
- class feature_encoders.compose._compose.FeatureComposer(model_structure: feature_encoders.compose._compose.ModelStructure)[source]
Generate linear features and pairwise interactions.
- Parameters
model_structure (ModelStructure) – The structure of a linear regression model.
- property component_matrix
Dataframe indicating which columns of the feature matrix correspond to which components.
- Returns
feature_cols – A binary indicator dataframe. Entry is 1 if that column of the feature matrix is used in that component.
- Return type
pandas.DataFrame
- class feature_encoders.compose._compose.ModelStructure(structure: Optional[Dict] = None, feature_map: Optional[Dict] = None)[source]
Capture the structure of a linear regression model.
The class validates and stores the details of a linear regression model: features, main effects and interactions.
- Parameters
structure (Dict, optional) –
A dictionary that includes information about the model. Example:
{'add_features': {'time': {'ds': None, 'remainder': 'passthrough', 'replace': False,
                           'subset': ['month', 'hourofweek']}},
 'main_effects': {'month': {'feature': 'month', 'max_n_categories': None,
                            'encode_as': 'onehot', 'interaction_only': False},
                  'tow': {'feature': 'hourofweek', 'max_n_categories': 60,
                          'encode_as': 'onehot', 'interaction_only': False},
                  'lin_temperature': {'feature': 'temperature', 'include_bias': False,
                                      'interaction_only': False}},
}
Defaults to None.
feature_map (Dict, optional) –
A mapping between a feature generator name and the classes for its validation and creation. Example:
{'datetime': {'validate': 'validate.DatetimeSchema',
              'generate': 'generate.DatetimeFeatures'}}
Defaults to None.
- add_interaction(*, lenc_name: str, renc_name: str, lenc_type: Union[str, object], renc_type: Union[str, object], **kwargs)[source]
Add a pairwise interaction.
- Parameters
lenc_name (str) – A name for the first part of the interaction pair.
renc_name (str) – A name for the second part of the interaction pair.
lenc_type (str or encoder object) – The type of the feature encoder to apply on the first part of the interaction pair.
renc_type (str or encoder object) – The type of the feature encoder to apply on the second part of the interaction pair.
**kwargs – Keyword arguments to be passed during the feature encoders’ initialization.
- Raises
ValueError – If an interaction with the same name (lenc_name, renc_name) has already been added.
- Returns
The updated ModelStructure instance.
- Return type
ModelStructure
Example:
model = ModelStructure().add_interaction(
    lenc_name="is_Monday",
    renc_name="daily_seasonality",
    lenc_type="categorical",
    renc_type="linear",
    **{
        "is_Monday": {"feature": "is_Monday", "encode_as": "onehot"},
        "daily_seasonality": {"feature": "daily", "as_filter": True},
    },
)
- add_main_effect(*, name: str, enc_type: Union[str, sklearn.base.BaseEstimator], **kwargs)[source]
Add a main effect.
- Parameters
name (str) – A name for the main effect.
enc_type (str or encoder object) – The type of the feature encoder to apply on the main effect.
**kwargs – Keyword arguments to be passed during the feature encoder initialization. Ignored if enc_type is not a string.
- Raises
ValueError – If an encoder with the same name has already been added.
- Returns
The updated ModelStructure instance.
- Return type
ModelStructure
- add_new_feature(*, name: str, fgen_type: Union[str, sklearn.base.BaseEstimator], **kwargs)[source]
Add a feature generator.
Feature generators are applied to the input dataframe in the same order in which they were added.
- Parameters
name (str) – A name for the feature generator.
fgen_type (str or sklearn-compatible transformer) – The feature generator to add. If it is a string, the corresponding class will be loaded based on the relevant entry in the feature_map dictionary.
**kwargs – Keyword arguments to be passed during the feature generator initialization. Ignored if fgen_type is not a string.
- Raises
ValueError – If a feature generator with the same name has already been added.
- Returns
The updated ModelStructure instance.
- Return type
ModelStructure
- property components
- classmethod from_config(config: Dict, feature_map: Optional[Dict] = None)[source]
Create a ModelStructure instance from a configuration file.
- Parameters
config (Dict) – A dictionary that includes information about the model.
feature_map (Dict, optional) – A mapping between a feature generator name and the classes for its validation and creation. Defaults to None.
- Returns
A populated ModelStructure instance.
- Return type
ModelStructure
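For orientation, a minimal sketch of how these pieces fit together. The column names and encoder settings below are hypothetical; the method names and the 'datetime', 'categorical' and 'linear' type strings come from the documentation above, and both classes are assumed to be importable from feature_encoders.compose:

from feature_encoders.compose import FeatureComposer, ModelStructure

# Build a model structure step by step; every method returns the updated
# instance, so the calls can be chained.
structure = (
    ModelStructure()
    .add_new_feature(name="time", fgen_type="datetime", subset=["month", "hourofweek"])
    .add_main_effect(name="month", enc_type="categorical", feature="month", encode_as="onehot")
    .add_main_effect(name="lin_temperature", enc_type="linear", feature="temperature")
)

# The composer generates the linear features and pairwise interactions described
# by the structure; component_matrix (see above) then maps feature-matrix columns
# to model components.
composer = FeatureComposer(structure)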
feature_encoders.encode package
Module contents
- class feature_encoders.encode.CategoricalEncoder(*, feature, max_n_categories=None, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=1, max_features='auto', random_state=None, encode_as='onehot')[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode categorical features.
If max_n_categories is not None and the number of unique values of the categorical feature minus the excluded_categories is larger than max_n_categories, the TargetClusterEncoder will be called.
If encode_as = ‘onehot’, the result comes from a TargetClusterEncoder + SafeOneHotEncoder pipeline, otherwise from a TargetClusterEncoder + SafeOrdinalEncoder one.
- Parameters
feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.
max_n_categories (int, optional) – The maximum number of categories to produce. Defaults to None.
stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.
excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories will be left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.
min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 1.
max_features (int, float or {"auto", "sqrt", "log2"}, optional) –
The number of features that the decision tree considers when looking for the best split:
If int, then consider max_features features at each split of the decision tree
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split
If “auto”, then max_features=n_features
If “sqrt”, then max_features=sqrt(n_features)
If “log2”, then max_features=log2(n_features)
If None, then max_features=n_features
Defaults to “auto”.
random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.
encode_as ({'onehot', 'ordinal'}, optional) – Method used to encode the transformed result.
If “onehot”, encode the transformed result with one-hot encoding and return a dense array
If “ordinal”, encode the transformed result as integer values
Defaults to “onehot”.
- fit(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.frame.DataFrame] = None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (pandas.DataFrame of shape (n_samples, 1), optional) – The target dataframe. Defaults to None.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
ValueError – If the number of categories minus the excluded_categories is larger than max_n_categories but target values (y) are not provided.
ValueError – If any of the values in excluded_categories is not found in the input data.
- Returns
Fitted encoder.
- Return type
CategoricalEncoder
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The encoded features as a numpy array.
- Return type
numpy array
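As a rough illustration with made-up data (the column names and values are hypothetical), fitting and applying the encoder on a single categorical column might look like this:

import pandas as pd
from feature_encoders.encode import CategoricalEncoder

X = pd.DataFrame({"weekday": ["Mon", "Tue", "Wed", "Mon", "Tue", "Wed"]})
y = pd.DataFrame({"consumption": [1.0, 2.2, 3.1, 1.3, 2.4, 3.0]})

# One one-hot column per category; y is only needed when the cardinality
# has to be reduced (i.e. when max_n_categories is not None).
enc = CategoricalEncoder(feature="weekday", encode_as="onehot")
features = enc.fit(X, y).transform(X)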
- class feature_encoders.encode.ICatEncoder(encoder_left: feature_encoders.encode._encoders.CategoricalEncoder, encoder_right: feature_encoders.encode._encoders.CategoricalEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between two categorical features.
Interactions are always pairwise and always between encoders (and not features).
- Parameters
encoder_left (CategoricalEncoder) – The encoder for the first of the two features.
encoder_right (CategoricalEncoder) – The encoder for the second of the two features.
- Raises
ValueError – If any of the two encoders is not a CategoricalEncoder.
ValueError – If the two encoders do not have the same encode_as parameter.
Note
Both encoders should have the same encode_as parameter. If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
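For example, a pairwise interaction between two one-hot encoded categorical features could be set up as follows (the dataframe is made up, and fit_transform is assumed to follow the usual scikit-learn convention):

import pandas as pd
from feature_encoders.encode import CategoricalEncoder, ICatEncoder

X = pd.DataFrame({"month": [1, 1, 2, 2], "dayofweek": [0, 1, 0, 1]})
y = pd.DataFrame({"consumption": [1.0, 1.5, 2.0, 2.5]})

# Both encoders must share the same encode_as setting
enc_left = CategoricalEncoder(feature="month", encode_as="onehot")
enc_right = CategoricalEncoder(feature="dayofweek", encode_as="onehot")

interaction = ICatEncoder(enc_left, enc_right)
interaction_features = interaction.fit_transform(X, y)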
- class feature_encoders.encode.ICatLinearEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.IdentityEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between one categorical and one linear numerical feature.
- Parameters
encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.
encoder_num (IdentityEncoder) – The encoder for the numerical feature.
- Raises
ValueError – If encoder_cat is not a CategoricalEncoder.
ValueError – If encoder_num is not an IdentityEncoder.
ValueError – If encoder_cat is not encoded as one-hot.
Note
If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
- class feature_encoders.encode.ICatSplineEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.SplineEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between one categorical and one spline-encoded numerical feature.
- Parameters
encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.
encoder_num (SplineEncoder) – The encoder for the numerical feature.
- Raises
ValueError – If encoder_cat is not a CategoricalEncoder.
ValueError – If encoder_num is not a SplineEncoder.
ValueError – If encoder_cat is not encoded as one-hot.
Note
If the categorical encoder is already fitted, it will not be re-fitted during fit or fit_transform. The numerical encoder will always be (re)fitted (one encoder per level of the categorical feature).
- class feature_encoders.encode.ISplineEncoder(encoder_left: feature_encoders.encode._encoders.SplineEncoder, encoder_right: feature_encoders.encode._encoders.SplineEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between two spline-encoded numerical features.
- Parameters
encoder_left (SplineEncoder) – The encoder for the first of the two features.
encoder_right (SplineEncoder) – The encoder for the second of the two features.
- Raises
ValueError – If any of the two encoders is not a SplineEncoder.
Note
If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
- class feature_encoders.encode.IdentityEncoder(feature=None, as_filter=False, include_bias=False)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Create an encoder that returns what it is fed.
This encoder can act as a linear feature encoder.
- Parameters
feature (str or list of str, optional) – The name(s) of the input dataframe’s column(s) to return. If None, the whole input dataframe will be returned. Defaults to None.
as_filter (bool, optional) – If True, the encoder will return all feature labels for which “feature in label == True”. Defaults to False.
include_bias (bool, optional) – If True, a column of ones is added to the output. Defaults to False.
- Raises
ValueError – If as_filter is True and feature includes multiple feature names.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
Fitted encoder.
- Return type
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If include_bias is True and a column with constant values already exists in the returned columns.
- Returns
The selected column subset as a numpy array.
- Return type
numpy array
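A small, hypothetical example of the two selection modes:

import pandas as pd
from feature_encoders.encode import IdentityEncoder

X = pd.DataFrame({"temperature": [10.0, 12.5, 9.8],
                  "daily_delim_0": [0.1, 0.2, 0.3],
                  "daily_delim_1": [0.4, 0.5, 0.6]})

# Return a single column as-is
linear_part = IdentityEncoder(feature="temperature").fit_transform(X)

# With as_filter=True, return every column whose label contains 'daily'
seasonal_part = IdentityEncoder(feature="daily", as_filter=True).fit_transform(X)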
- class feature_encoders.encode.ProductEncoder(encoder_left: feature_encoders.encode._encoders.IdentityEncoder, encoder_right: feature_encoders.encode._encoders.IdentityEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between two linear numerical features.
- Parameters
encoder_left (IdentityEncoder) – The encoder for the first of the two features.
encoder_right (IdentityEncoder) – The encoder for the second of the two features.
- Raises
ValueError – If any of the two encoders is not an IdentityEncoder.
Note
If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Raises
ValueError – If any of the two encoders is not a single-feature encoder.
- Returns
Fitted encoder.
- Return type
ProductEncoder
- class feature_encoders.encode.SafeOneHotEncoder(feature=None, unknown_value=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode categorical features in a one-hot form.
The encoder uses a SafeOrdinalEncoder to first encode the feature as an integer array and then a sklearn.preprocessing.OneHotEncoder to encode the features as a one-hot array.
- Parameters
feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
SafeOneHotEncoder
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The encoded column subset as a numpy array.
- Return type
numpy array
- class feature_encoders.encode.SafeOrdinalEncoder(feature=None, unknown_value=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode categorical features as an integer array.
The encoder converts the features into ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
- Parameters
feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value for unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
SafeOrdinalEncoder
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The encoded column subset as a numpy array.
- Return type
numpy array
- class feature_encoders.encode.SplineEncoder(*, feature, n_knots=5, degree=3, strategy='uniform', extrapolation='constant', include_bias=True, order='C')[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Generate univariate B-spline bases for features.
The encoder generates a matrix consisting of n_splines = n_knots + degree - 1 B-spline basis functions of polynomial degree degree for the given feature.
- Parameters
feature (str) – The name of the column to encode.
n_knots (int, optional) – Number of knots of the splines if strategy is one of {‘uniform’, ‘quantile’}. Must be larger than or equal to 2. Ignored if strategy is array-like. Defaults to 5.
degree (int, optional) – The polynomial degree of the spline basis. Must be a non-negative integer. Defaults to 3.
strategy ({'uniform', 'quantile'} or array-like of shape (n_knots, n_features), optional) – Set knot positions such that first knot <= features <= last knot.
If ‘uniform’, n_knots number of knots are distributed uniformly from min to max values of the features (each bin has the same width)
If ‘quantile’, they are distributed uniformly along the quantiles of the features (each bin has the same number of observations)
If an array-like is given, it directly specifies the sorted knot positions including the boundary knots. Note that, internally, degree number of knots are added before the first knot, the same after the last knot
Defaults to “uniform”.
extrapolation ({'error', 'constant', 'linear', 'continue'}, optional) – If ‘error’, values outside the min and max values of the training features raise a ValueError. If ‘constant’, the value of the splines at minimum and maximum value of the features is used as constant extrapolation. If ‘linear’, a linear extrapolation is used. If ‘continue’, the splines are extrapolated as is, i.e. option extrapolate=True in scipy.interpolate.BSpline. Defaults to “constant”.
include_bias (bool, optional) – If False, then the last spline element inside the data range of a feature is dropped. As B-splines sum to one over the spline basis functions for each data point, they implicitly include a bias term. Defaults to True.
order ({'C', 'F'}, optional) – Order of output array. ‘F’ order is faster to compute, but may slow down subsequent estimators. Defaults to “C”.
- fit(X: pandas.core.frame.DataFrame, y=None, sample_weight=None)[source]
Fit the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The data to fit.
y (None, optional) – Ignored. Defaults to None.
sample_weight (array-like of shape (n_samples,), optional) – Individual weights for each sample. Used to calculate quantiles if strategy=”quantile”. For strategy=”uniform”, zero weighted observations are ignored for finding the min and max of X. Defaults to None.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
Fitted encoder.
- Return type
SplineEncoder
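As an illustration with made-up data (and assuming the usual transform behaviour of a scikit-learn transformer), the encoder expands one numerical column into n_knots + degree - 1 basis columns:

import pandas as pd
from feature_encoders.encode import SplineEncoder

X = pd.DataFrame({"temperature": [5.0, 8.0, 12.0, 15.0, 18.0, 22.0, 25.0, 30.0]})

enc = SplineEncoder(feature="temperature", n_knots=5, degree=3)
basis = enc.fit_transform(X)   # numpy array with 5 + 3 - 1 = 7 columns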
- class feature_encoders.encode.TargetClusterEncoder(*, feature, max_n_categories, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=5, max_features='auto', random_state=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode a categorical feature as clusters of the target’s values.
The purpose of this encoder is to reduce the cardinality of a categorical feature. This encoder does not replace unknown values with the most frequent one during transform. It just assigns them the value of unknown_value.
- Parameters
feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.
max_n_categories (int, optional) – The maximum number of categories to produce. Defaults to None.
stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.
excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories will be left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.
min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 5.
max_features (int, float or {"auto", "sqrt", "log2"}, optional) –
The number of features that the decision tree considers when looking for the best split:
If int, then consider max_features features at each split of the decision tree
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split
If “auto”, then max_features=n_features
If “sqrt”, then max_features=sqrt(n_features)
If “log2”, then max_features=log2(n_features)
If None, then max_features=n_features
Defaults to “auto”.
random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.
- fit(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (pandas.DataFrame of shape (n_samples, 1)) – The target dataframe.
- Returns
Fitted encoder.
- Return type
TargetClusterEncoder
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
ValueError – If any of the values in excluded_categories is not found in the input data.
ValueError – If the number of categories left after removing all in excluded_categories is not larger than max_n_categories.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Returns
The encoded column subset as a numpy array.
- Return type
numpy array
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
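A hedged sketch of cardinality reduction with made-up data: the holiday labels are clustered into at most three groups according to their relationship with the target.

import pandas as pd
from feature_encoders.encode import TargetClusterEncoder

X = pd.DataFrame({"holiday": ["Christmas", "Easter", "NewYear", "_novalue_",
                              "_novalue_", "Epiphany", "_novalue_", "Assumption"]})
y = pd.DataFrame({"consumption": [1.0, 1.1, 0.9, 2.5, 2.6, 1.2, 2.4, 1.0]})

enc = TargetClusterEncoder(feature="holiday", max_n_categories=3)
clusters = enc.fit(X, y).transform(X)   # one integer cluster label per row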
feature_encoders.generate package
Module contents
- class feature_encoders.generate.CyclicalFeatures(*, seasonality, ds=None, period=None, fourier_order=None, remainder='passthrough', replace=False)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Create cyclical (seasonal) features as Fourier terms.
- Parameters
seasonality (str) – The name of the seasonality. The feature generator can provide default values for period and fourier_order if seasonality is one of ‘daily’, ‘weekly’ or ‘yearly’.
ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.
period (float, optional) – Number of days in one period. Defaults to None.
fourier_order (int, optional) – Number of Fourier components to use. Defaults to None.
remainder ({'drop', 'passthrough'}, optional) – By specifying remainder='passthrough', all the remaining columns of the input dataset will be automatically passed through (concatenated with the output of the transformer); otherwise, they will be dropped. Defaults to “passthrough”.
replace (bool, optional) – Specifies whether replacing an existing column with the same name is allowed (applicable when remainder='passthrough'). Defaults to False.
- Raises
ValueError – If remainder is neither ‘drop’ nor ‘passthrough’.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the feature generator on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
CyclicalFeatures
- Raises
ValueError – If either period or fourier_order is not provided, but seasonality is not one of ‘daily’, ‘weekly’ or ‘yearly’.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the feature generator.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If common columns are found and replace=False.
- Returns
The transformed dataframe.
- Return type
pandas.DataFrame
- class feature_encoders.generate.DatetimeFeatures(ds=None, remainder='passthrough', replace=False, subset=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Generate date and time features.
- Parameters
ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.
remainder ({'drop', 'passthrough'}, optional) – By specifying remainder='passthrough', all the remaining columns of the input dataset will be automatically passed through (concatenated with the output of the transformer); otherwise, they will be dropped. Defaults to “passthrough”.
replace (bool, optional) – Specifies whether replacing an existing column with the same name is allowed (applicable when remainder='passthrough'). Defaults to False.
subset (str or list of str, optional) – The names of the features to generate. If None, all features will be produced: ‘month’, ‘week’, ‘dayofyear’, ‘dayofweek’, ‘hour’, ‘hourofweek’. The last 2 features are generated only if the timestep of the input’s ds (or index if ds is None) is smaller than pandas.Timedelta(days=1). Defaults to None.
- Raises
ValueError – If remainder is neither ‘drop’ nor ‘passthrough’.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the feature generator on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
DatetimeFeatures
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the feature generator.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If common columns are found and replace=False.
- Returns
The transformed dataframe.
- Return type
pandas.DataFrame
- class feature_encoders.generate.TrendFeatures(ds=None, name='growth', remainder='passthrough', replace=False)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Generate linear time trend features.
- Parameters
ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.
name (str, optional) – The name of the generated dataframe’s column. Defaults to ‘growth’.
remainder ({'drop', 'passthrough'}, optional) – By specifying remainder='passthrough', all the remaining columns of the input dataset will be automatically passed through (concatenated with the output of the transformer); otherwise, they will be dropped. Defaults to “passthrough”.
replace (bool, optional) – Specifies whether replacing an existing column with the same name is allowed (applicable when remainder='passthrough'). Defaults to False.
- Raises
ValueError – If remainder is neither ‘drop’ nor ‘passthrough’.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the feature generator on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
TrendFeatures
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the feature generator.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If common columns are found and replace=False.
- Returns
The transformed dataframe.
- Return type
pandas.DataFrame
feature_encoders.models package
Submodules
feature_encoders.models.grouped module
- class feature_encoders.models.grouped.GroupedPredictor(*, group_feature: str, model_conf: Dict[str, Dict], feature_conf: Optional[Dict[str, Dict]] = None, estimator_params=(), fallback=False)[source]
Bases:
sklearn.base.RegressorMixin, sklearn.base.BaseEstimator
Construct one predictor per data group.
The predictor splits the data by the different values of a single column and fits one estimator per group. Since each of the models in the ensemble predicts on a different subset of the input data (an observation cannot belong to more than one cluster), the final prediction is generated by vertically concatenating all the individual models’ predictions.
- Parameters
group_feature (str) – The name of the column of the input dataframe to use as the grouping set.
model_conf (Dict[str, Dict]) – A dictionary that includes information about the base model’s structure.
feature_conf (Dict[str, Dict], optional) – A dictionary that maps feature generator names to the classes for the generators’ validation and creation. Defaults to None.
estimator_params (dict or tuple of tuples, optional) – The parameters to use when instantiating a new base estimator. If none are given, default parameters are used. Defaults to tuple().
fallback (bool, optional) – Whether to fall back to a global model in case a group value is not found during predict(); otherwise, an exception will be raised. Defaults to False.
- property dof
- fit(X: pandas.core.frame.DataFrame, y: Union[pandas.core.frame.DataFrame, pandas.core.series.Series])[source]
Fit the estimator with the available data.
- Parameters
X (pandas.DataFrame) – Input data.
y (pandas.Series or pandas.DataFrame) – Target data.
- Raises
Exception – If the estimator is re-fitted. An estimator object can only be fitted once.
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the target data does not pass the checks of utils.check_y.
- Returns
Fitted estimator.
- Return type
GroupedPredictor
- property n_parameters
- predict(X: pandas.core.frame.DataFrame, include_clusters=False, include_components=False)[source]
Predict given new input data.
- Parameters
X (pandas.DataFrame) – Input data.
include_clusters (bool, optional) – Whether to include the added clusters in the returned prediction. Defaults to False.
include_components (bool, optional) – Whether to include the contribution of the individual components of the model structure in the returned prediction. Defaults to False.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The predicted values.
- Return type
pandas.DataFrame
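For illustration only: the group column name below is hypothetical, X and y stand in for the user's input and target dataframes, and the configuration dictionaries come from load_config (documented in the utils section).

from feature_encoders.models.grouped import GroupedPredictor
from feature_encoders.utils import load_config

model_conf, feature_map = load_config(model="towt", features="default")

model = GroupedPredictor(
    group_feature="month",      # X is assumed to contain a 'month' column to group by
    model_conf=model_conf,
    feature_conf=feature_map,
    fallback=True,              # fall back to a global model for unseen groups
)
model = model.fit(X, y)         # X, y: user-provided dataframes
prediction = model.predict(X)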
feature_encoders.models.linear module
- class feature_encoders.models.linear.LinearPredictor(*, model_structure: feature_encoders.compose._compose.ModelStructure, alpha=0.01, fit_intercept=False)[source]
Bases:
sklearn.base.RegressorMixin, sklearn.base.BaseEstimator
A linear regression model with flexible parameterization.
- Parameters
model_structure (ModelStructure) – The structure of a linear regression model.
alpha (float, optional) – Regularization strength of the underlying ridge regression; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Defaults to 0.01.
fit_intercept (bool, optional) – Whether to fit the intercept for this model. If set to False, no intercept will be used in calculations. Defaults to False.
- property dof
- fit(X: pandas.core.frame.DataFrame, y: Union[pandas.core.frame.DataFrame, pandas.core.series.Series])[source]
Fit the estimator with the available data.
- Parameters
X (pandas.DataFrame) – Input data.
y (pandas.Series or pandas.DataFrame) – Target data.
- Raises
Exception – If the estimator is re-fitted. An estimator object can only be fitted once.
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the target data does not pass the checks of utils.check_y.
- Returns
Fitted estimator.
- Return type
LinearPredictor
- property n_parameters
- predict(X: pandas.core.frame.DataFrame, include_components=False)[source]
Predict using the given input data.
- Parameters
X (pandas.DataFrame) – Input data.
include_components (bool, optional) – If True, the prediction dataframe will include also the individual components’ contribution to the predicted values. Defaults to False.
- Returns
The prediction.
- Return type
pandas.DataFrame
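A minimal sketch, assuming the configuration shipped with the package and with X and y standing in for the user's input and target dataframes:

from feature_encoders.compose import ModelStructure
from feature_encoders.models.linear import LinearPredictor
from feature_encoders.utils import load_config

model_conf, feature_map = load_config(model="towt", features="default")
structure = ModelStructure.from_config(model_conf, feature_map)

model = LinearPredictor(model_structure=structure, alpha=0.01)
model = model.fit(X, y)                                   # X, y: user-provided dataframes
prediction = model.predict(X, include_components=True)    # adds each component's contribution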
feature_encoders.models.seasonal module
- class feature_encoders.models.seasonal.SeasonalPredictor(ds: Optional[str] = None, add_trend: bool = False, yearly_seasonality: Union[str, bool, int] = 'auto', weekly_seasonality: Union[str, bool, int] = 'auto', daily_seasonality: Union[str, bool, int] = 'auto', min_samples=0.5, alpha=0.01)[source]
Bases:
sklearn.base.BaseEstimator
Time series prediction model based on seasonal decomposition.
- Parameters
ds (str, optional) – The name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.
add_trend (bool, optional) – If True, a linear time trend will be added. Defaults to False.
yearly_seasonality (Union[str, bool, int], optional) – Fit yearly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate. Defaults to “auto”.
weekly_seasonality (Union[str, bool, int], optional) – Fit weekly seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate. Defaults to “auto”.
daily_seasonality (Union[str, bool, int], optional) – Fit daily seasonality. Can be ‘auto’, True, False, or a number of Fourier terms to generate. Defaults to “auto”.
min_samples (float ([0, 1]), optional) – Minimum number of samples chosen randomly from original data by the RANSAC (RANdom SAmple Consensus) algorithm. Defaults to 0.5.
alpha (float, optional) – Parameter for the underlying ridge estimator (base_estimator). It must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Defaults to 0.01.
- add_seasonality(name: str, period: Optional[float] = None, fourier_order: Optional[int] = None, condition_name: Optional[str] = None)[source]
Add a seasonal component with specified period and number of Fourier components.
If condition_name is provided, the input dataframe passed to fit and predict should have a column with the specified condition_name containing booleans that indicate when to apply seasonality.
- Parameters
name (str) – The name of the seasonality component.
period (float, optional) – Number of days in one period. Defaults to None.
fourier_order (int, optional) – Number of Fourier components to use. Defaults to None.
condition_name (str, optional) – The name of the seasonality condition. Defaults to None.
- Raises
Exception – If the method is called after the estimator is fitted.
ValueError – If either period or fourier_order are not provided and the seasonality is not in (‘daily’, ‘weekly’, ‘yearly’).
- Returns
The updated estimator object.
- Return type
SeasonalPredictor
- fit(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame)[source]
Fit the estimator with the available data.
- Parameters
X (pandas.DataFrame) – Input data.
y (pandas.DataFrame) – Target data.
- Raises
Exception – If the estimator is re-fitted. An estimator object can only be fitted once.
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the target data does not pass the checks of utils.check_y.
- Returns
Fitted estimator.
- Return type
SeasonalPredictor
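A hedged sketch of typical usage; X is assumed to carry a datetime index, y is the target dataframe, and predict is assumed to follow the usual scikit-learn convention:

from feature_encoders.models.seasonal import SeasonalPredictor

model = SeasonalPredictor(add_trend=True, yearly_seasonality=True, weekly_seasonality=True)

# A custom seasonal component with an explicit period (in days) and Fourier order
model = model.add_seasonality(name="monthly", period=30.5, fourier_order=5)

model = model.fit(X, y)          # X, y: user-provided dataframes
prediction = model.predict(X)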
Module contents
feature_encoders.validate package
Submodules
feature_encoders.validate.schemas module
- class feature_encoders.validate.schemas.CategoricalSchema(*, type: str, feature: str, max_n_categories: int = None, stratify_by: Optional[Union[str, List[str]]] = None, excluded_categories: Optional[Union[str, List[str]]] = None, unknown_value: int = None, min_samples_leaf: int = 1, max_features: Union[str, int, float] = 'auto', random_state: int = None, encode_as: str = 'onehot')[source]
Bases:
pydantic.main.BaseModel
- encode_as: str
- excluded_categories: Optional[Union[str, List[str]]]
- feature: str
- max_features: Union[str, int, float]
- max_n_categories: Optional[int]
- min_samples_leaf: int
- random_state: Optional[int]
- stratify_by: Optional[Union[str, List[str]]]
- type: str
- unknown_value: Optional[int]
- class feature_encoders.validate.schemas.CyclicalSchema(*, type: str, seasonality: str, ds: str = None, period: float = None, fourier_order: int = None, remainder: str = 'passthrough', replace: bool = False)[source]
Bases:
pydantic.main.BaseModel
- ds: Optional[str]
- fourier_order: Optional[int]
- period: Optional[float]
- remainder: str
- replace: bool
- seasonality: str
- type: str
- class feature_encoders.validate.schemas.DatetimeSchema(*, type: str, ds: str = None, remainder: str = 'passthrough', replace: bool = False, subset: Optional[Union[str, List[str]]] = None)[source]
Bases:
pydantic.main.BaseModel
- ds: Optional[str]
- remainder: str
- replace: bool
- subset: Optional[Union[str, List[str]]]
- type: str
- class feature_encoders.validate.schemas.LinearSchema(*, type: str, feature: str, as_filter: bool = False, include_bias: bool = False)[source]
Bases:
pydantic.main.BaseModel
- as_filter: bool
- feature: str
- include_bias: bool
- type: str
- class feature_encoders.validate.schemas.SplineSchema(*, type: str, feature: str, n_knots: int = 5, degree: int = 3, strategy: Optional[Union[str, List]] = 'uniform', extrapolation: str = 'constant', include_bias: bool = False)[source]
Bases:
pydantic.main.BaseModel
- degree: Optional[int]
- extrapolation: Optional[str]
- feature: str
- include_bias: bool
- n_knots: Optional[int]
- strategy: Optional[Union[str, List]]
- type: str
Module contents
- class feature_encoders.validate.CategoricalSchema(*, type: str, feature: str, max_n_categories: int = None, stratify_by: Optional[Union[str, List[str]]] = None, excluded_categories: Optional[Union[str, List[str]]] = None, unknown_value: int = None, min_samples_leaf: int = 1, max_features: Union[str, int, float] = 'auto', random_state: int = None, encode_as: str = 'onehot')[source]
Bases:
pydantic.main.BaseModel
- encode_as: str
- excluded_categories: Optional[Union[str, List[str]]]
- feature: str
- max_features: Union[str, int, float]
- max_n_categories: Optional[int]
- min_samples_leaf: int
- random_state: Optional[int]
- stratify_by: Optional[Union[str, List[str]]]
- type: str
- unknown_value: Optional[int]
- class feature_encoders.validate.CyclicalSchema(*, type: str, seasonality: str, ds: str = None, period: float = None, fourier_order: int = None, remainder: str = 'passthrough', replace: bool = False)[source]
Bases:
pydantic.main.BaseModel
- ds: Optional[str]
- fourier_order: Optional[int]
- period: Optional[float]
- remainder: str
- replace: bool
- seasonality: str
- type: str
- class feature_encoders.validate.DatetimeSchema(*, type: str, ds: str = None, remainder: str = 'passthrough', replace: bool = False, subset: Optional[Union[str, List[str]]] = None)[source]
Bases:
pydantic.main.BaseModel
- ds: Optional[str]
- remainder: str
- replace: bool
- subset: Optional[Union[str, List[str]]]
- type: str
- class feature_encoders.validate.LinearSchema(*, type: str, feature: str, as_filter: bool = False, include_bias: bool = False)[source]
Bases:
pydantic.main.BaseModel
- as_filter: bool
- feature: str
- include_bias: bool
- type: str
- class feature_encoders.validate.SplineSchema(*, type: str, feature: str, n_knots: int = 5, degree: int = 3, strategy: Optional[Union[str, List]] = 'uniform', extrapolation: str = 'constant', include_bias: bool = False)[source]
Bases:
pydantic.main.BaseModel
- degree: Optional[int]
- extrapolation: Optional[str]
- feature: str
- include_bias: bool
- n_knots: Optional[int]
- strategy: Optional[Union[str, List]]
- type: str
Submodules
feature_encoders.settings module
feature_encoders.utils module
- feature_encoders.utils.add_constant(data: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame], prepend=True, has_constant='skip')[source]
Add a column of ones to an array.
- Parameters
data (array-like) – A column-ordered design matrix.
prepend (bool, optional) – If true, the constant is in the first column. Else the constant is appended (last column). Defaults to True.
has_constant ({'raise', 'add', 'skip'}, optional) – Behavior if data already has a constant. The default will return data without adding another constant. If ‘raise’, will raise an error if any column has a constant value. Using ‘add’ will add a column of 1s if a constant column is present. Defaults to “skip”.
- Returns
The original values with a constant (column of ones).
- Return type
numpy.ndarray
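For example, adding an intercept column to a small design matrix:

import numpy as np
from feature_encoders.utils import add_constant

X = np.arange(6, dtype=float).reshape(3, 2)
add_constant(X)                  # prepends a column of ones -> shape (3, 3)
add_constant(X, prepend=False)   # appends the column of ones instead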
- feature_encoders.utils.as_list(val: Any)[source]
Cast input as list.
Helper function, always returns a list of the input value.
- feature_encoders.utils.as_series(x: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame])[source]
Cast an iterable to a Pandas Series object.
- feature_encoders.utils.check_X(X: pandas.core.frame.DataFrame, exists=None, int_is_categorical=True, return_col_info=False)[source]
Perform a series of checks on the input dataframe.
- Parameters
X (pandas.DataFrame) – The input dataframe.
exists (str or list of str, optional) – Names of columns that must be present in the input dataframe. Defaults to None.
int_is_categorical (bool, optional) – If True, integer types are considered categorical. Defaults to True.
return_col_info (bool, optional) – If True, the function will return the names of the categorical and the names of the numerical columns, in addition to the provided dataframe. Defaults to False.
- Raises
ValueError – If the input is not a pandas DataFrame.
ValueError – If any of the column names in exists are not found in the input.
ValueError – If NaN or inf values are found in the provided input data.
- Returns
pandas.DataFrame if return_col_info is False else (pandas.DataFrame, list, list)
- feature_encoders.utils.check_y(y: Union[pandas.core.series.Series, pandas.core.frame.DataFrame], index=None)[source]
Perform a series of checks on the input dataframe.
The checks are carried out by sklearn.utils.check_array.
- Parameters
y (Union[pandas.Series, pandas.DataFrame]) – The input dataframe.
index (Union[pandas.Index, pandas.DatetimeIndex], optional) – An index to compare with the input dataframe’s index. Defaults to None.
- Raises
ValueError – If the input is neither a pandas Series nor a pandas DataFrame with only a single column.
ValueError – If the input data has a different index than the one that was provided for comparison (if index is not None).
- Returns
The validated input data.
- Return type
pandas.DataFrame
- feature_encoders.utils.get_categorical_cols(X: pandas.core.frame.DataFrame, int_is_categorical=True)[source]
Return the names of the categorical columns in the input DataFrame.
- Parameters
X (pandas.DataFrame) – Input dataframe.
int_is_categorical (bool, optional) – If True, integer types are considered categorical. Defaults to True.
- Returns
The names of categorical columns in the input DataFrame.
- Return type
list
- feature_encoders.utils.get_datetime_data(X: pandas.core.frame.DataFrame, col_name=None)[source]
Get datetime information from the input dataframe.
- Parameters
X (pandas.DataFrame) – The input dataframe.
col_name (str, optional) – The name of the column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index. Defaults to None.
- Returns
The datetime information.
- Return type
pandas.Series
- feature_encoders.utils.load_config(model='towt', features='default', merge_multiple=False)[source]
Load model configuration and feature generator mapping.
Given model and features, the function searches for files in:
conf_path = str(CONF_PATH)
model_files = glob.glob(f"{conf_path}/models/{model}.*")
feature_files = glob.glob(f"{conf_path}/features/{features}.*")
- Parameters
model (str, optional) – The name of the model configuration to load. Defaults to “towt”.
features (str, optional) – The name of the feature generator mapping to load. Defaults to “default”.
merge_multiple (bool, optional) – If True and more than one file is found when searching for either models or features, the contents of the files will be merged. Otherwise, an exception will be raised. Defaults to False.
- Returns
The model configuration and feature mapping as dictionaries.
- Return type
(dict, dict)
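For example, the packaged ‘towt’ model configuration and the ‘default’ feature mapping can be loaded and turned into a ModelStructure (assuming ModelStructure is importable from feature_encoders.compose):

from feature_encoders.compose import ModelStructure
from feature_encoders.utils import load_config

model_conf, feature_map = load_config(model="towt", features="default")
structure = ModelStructure.from_config(model_conf, feature_map)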
- feature_encoders.utils.maybe_reshape_2d(arr: numpy.ndarray)[source]
Reshape an array (if needed) so it’s always 2-d and long.
- Parameters
arr (numpy.ndarray) – The input array.
- Returns
The reshaped array.
- Return type
numpy.ndarray
- feature_encoders.utils.tensor_product(a: numpy.ndarray, b: numpy.ndarray, reshape=True)[source]
Compute the tensor product of two matrices.
- Parameters
a (numpy array of shape (n, m_a)) – The first matrix.
b (numpy array of shape (n, m_b)) – The second matrix.
reshape (bool, optional) – Whether to reshape the result to be 2D (n, m_a * m_b) or return a 3D tensor (n, m_a, m_b). Defaults to True.
- Raises
ValueError – If input arrays are not 2-dimensional.
ValueError – If both input arrays do not have the same number of samples.
- Returns
numpy.ndarray of shape (n, m_a * m_b) if reshape = True else of shape (n, m_a, m_b).
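For example:

import numpy as np
from feature_encoders.utils import tensor_product

a = np.ones((4, 2))
b = np.arange(12.0).reshape(4, 3)

tensor_product(a, b).shape                 # (4, 6)
tensor_product(a, b, reshape=False).shape  # (4, 2, 3)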
Module contents
Tutorials
The functionality for generating new features
The feature-encoders library includes a few feature generators:
TrendFeatures: Generates linear time trend features.
DatetimeFeatures: Generates date and time features (such as the month of the year or the hour of the week).
CyclicalFeatures: Creates cyclical (seasonal) features as Fourier terms (similarly to the way the Prophet library generates seasonality features).
All feature generators generate pandas DataFrames, and they all have two common parameters:
remainder : str, {'drop', 'passthrough'}, default='passthrough'
By specifying `remainder='passthrough'`, all the remaining columns of the
input dataset will be automatically passed through (concatenated with the
output of the transformer).
replace : bool, default=False
Specifies whether replacing an existing column with the same name is allowed
(when `remainder=passthrough`).
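As a quick, made-up illustration of the two parameters: with remainder='passthrough' the original columns are kept next to the generated ones, with remainder='drop' only the generated columns are returned, and replace=True allows an existing column with the same name to be overwritten.

import pandas as pd
from feature_encoders.generate import DatetimeFeatures

X = pd.DataFrame(
    {"consumption": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2021-01-01", periods=4, freq="H"),
)

DatetimeFeatures(remainder='passthrough').fit_transform(X).columns  # 'consumption' + date/time columns
DatetimeFeatures(remainder='drop').fit_transform(X).columns         # date/time columns only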
[1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline
[2]:
from feature_encoders.generate import CyclicalFeatures, DatetimeFeatures, TrendFeatures
Load demo data
[3]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
Create time trend features
The ds argument corresponds to the name of the input dataframe’s column that contains datetime information. If None, it is assumed that the datetime information is provided by the input dataframe’s index.
[4]:
enc = TrendFeatures(ds=None, name='growth', remainder='drop')
features = enc.fit_transform(data)
[5]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    ax.plot(features['growth'], label='growth')
    ax.legend(loc='upper left')
[figure: the generated 'growth' time trend feature]
Add date and time features
The subset argument corresponds to the names of the features to generate. If None, all features will be produced: ‘month’, ‘week’, ‘dayofyear’, ‘dayofweek’, ‘hour’, ‘hourofweek’. The last 2 features are generated only if the timestep of the input’s ds (or index if ds is None) is smaller than pandas.Timedelta(days=1).
[6]:
enc = DatetimeFeatures(ds=None, subset=None, remainder='drop')
features = enc.fit_transform(data)
features.columns
[6]:
Index(['month', 'week', 'dayofyear', 'dayofweek', 'hour', 'hourofweek'], dtype='object')
[7]:
enc = DatetimeFeatures(ds=None, remainder='drop', subset=['month', 'hourofweek'])
features = enc.fit_transform(data)
features.columns
[7]:
Index(['month', 'hourofweek'], dtype='object')
Encode cyclical (seasonal) features
The encoder is parameterized by period (number of days in one period) and fourier_order (number of Fourier components to use).
It can provide default values for period and fourier_order if seasonality is one of ‘daily’, ‘weekly’ or ‘yearly’.
[8]:
daily_consumption = data[['consumption']].resample('D').sum()
[9]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    daily_consumption.plot(ax=ax, alpha=0.5)
[figure: daily energy consumption]
The number of seasonality features is always twice the fourier_order.
[10]:
enc = CyclicalFeatures(ds=None, seasonality='yearly', fourier_order=3, remainder='drop')
features = enc.fit_transform(daily_consumption)
features.columns
[10]:
Index(['yearly_delim_0', 'yearly_delim_1', 'yearly_delim_2', 'yearly_delim_3',
'yearly_delim_4', 'yearly_delim_5'],
dtype='object')
Now let’s plot the new features:
[11]:
with plt.style.context('seaborn-whitegrid'):
    fig, axs = plt.subplots(2*enc.fourier_order, figsize=(14, 7), dpi=96)
    for i, col in enumerate(features.columns):
        features[col].plot(ax=axs[i])
        axs[i].set_xlabel(None)
    fig.tight_layout()
[figure: the six yearly Fourier features]
Let’s also see how well this transformation works:
[12]:
regr = LinearRegression(fit_intercept=True).fit(features, daily_consumption)
pred = regr.predict(features)
[13]:
with plt.style.context('seaborn-whitegrid'):
    fig = plt.figure(figsize=(14, 3.54), dpi=96)
    layout = (1, 1)
    ax = plt.subplot2grid(layout, (0, 0))
    daily_consumption.plot(ax=ax, alpha=0.5)
    pd.Series(pred.squeeze(), index=daily_consumption.index).plot(ax=ax)
[figure: daily consumption and the linear regression fit on the Fourier features]
Encoding categorical features
This section explains the way categorical encoding can be carried out using feature_encoders.
All encoders take pandas.DataFrames as input and generate numpy.ndarrays as output.
[1]:
import matplotlib.cm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn.model_selection import KFold
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from matplotlib.colors import LinearSegmentedColormap, ListedColormap
%matplotlib inline
[2]:
from feature_encoders.encode import (
SafeOrdinalEncoder,
SafeOneHotEncoder,
TargetClusterEncoder,
CategoricalEncoder
)
A plotting utility:
[3]:
def get_colors(cmap, N=None, use_index="auto"):
    if isinstance(cmap, str):
        if use_index == "auto":
            if cmap in ['Pastel1', 'Pastel2', 'Paired', 'Accent',
                        'Dark2', 'Set1', 'Set2', 'Set3',
                        'tab10', 'tab20', 'tab20b', 'tab20c']:
                use_index = True
            else:
                use_index = False
        cmap = matplotlib.cm.get_cmap(cmap)
    if not N:
        N = cmap.N
    if use_index == "auto":
        if cmap.N > 100:
            use_index = False
        elif isinstance(cmap, LinearSegmentedColormap):
            use_index = False
        elif isinstance(cmap, ListedColormap):
            use_index = True
    if use_index:
        ind = np.arange(int(N)) % cmap.N
        return cmap(ind)
    else:
        return cmap(np.linspace(0, 1, N))
Load demo data
The demo data represents the energy consumption of a building.
[4]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
data.dtypes
[4]:
consumption float64
holiday object
temperature float64
consumption_outlier bool
dtype: object
holiday is a categorical feature. The _novalue_ value corresponds to non-holiday observations.
[5]:
data['holiday'].value_counts()
[5]:
_novalue_ 35943
Immaculate Conception 192
Christmas Day 192
St Stephen's Day 192
New year 96
Epiphany 96
Easter Monday 96
Liberation Day 96
International Workers' Day 96
Republic Day 96
Assumption of Mary to Heaven 96
All Saints Day 96
Name: holiday, dtype: int64
SafeOrdinalEncoder
The SafeOrdinalEncoder converts categorical features into ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Unknown categories will be replaced using the most frequent value along each column.
It is implemented as a pipeline:
UNKNOWN_VALUE = -1
Pipeline(
[
(
"select",
sklearn.compose.ColumnTransformer(
[("select", "passthrough", self.features_)], remainder="drop"
),
),
(
"encode_ordinal",
sklearn.preprocessing.OrdinalEncoder(
handle_unknown="use_encoded_value",
unknown_value=self.unknown_value or UNKNOWN_VALUE,
dtype=np.int16,
),
),
(
"impute_unknown",
sklearn.impute.SimpleImputer(
missing_values=self.unknown_value or UNKNOWN_VALUE,
strategy="most_frequent",
),
),
]
)
[6]:
enc = SafeOrdinalEncoder(feature='holiday')
kf = KFold(n_splits=5, shuffle=False)
for train_index, _ in kf.split(data):
    enc = enc.fit(data.iloc[train_index])
    not_seen = np.setdiff1d(data["holiday"].unique(),
                            data.iloc[train_index]["holiday"].unique())
    print(f'Holidays not seen during training {not_seen}')
    features = enc.transform(data[data['holiday'].isin(not_seen)])
    print(f'Holidays not seen during training are transformed as {np.unique(features)}')
    features = enc.transform(data[data['holiday'] == '_novalue_'])
    print(f'... and the most common value is also encoded as: {np.unique(features)}')
Holidays not seen during training ['Epiphany' 'New year']
Holidays not seen during training are transformed as [9]
... and the most common value is also encoded as: [9]
Holidays not seen during training ['Easter Monday' "International Workers' Day" 'Liberation Day']
Holidays not seen during training are transformed as [8]
... and the most common value is also encoded as: [8]
Holidays not seen during training ['Republic Day']
Holidays not seen during training are transformed as [10]
... and the most common value is also encoded as: [10]
Holidays not seen during training ['Assumption of Mary to Heaven']
Holidays not seen during training are transformed as [10]
... and the most common value is also encoded as: [10]
Holidays not seen during training ['All Saints Day']
Holidays not seen during training are transformed as [10]
... and the most common value is also encoded as: [10]
[7]:
features = SafeOrdinalEncoder(feature='holiday').fit_transform(data)
assert data['holiday'].nunique() == np.unique(features).size
By default, the SafeOrdinalEncoder treats features of type object, int, bool and category as categorical:
[8]:
enc = SafeOrdinalEncoder().fit(data)
enc.features_
[8]:
['holiday', 'consumption_outlier']
SafeOneHotEncoder
The SafeOneHotEncoder uses a SafeOrdinalEncoder to first safely encode the feature as an integer array, and then a sklearn.preprocessing.OneHotEncoder to encode the features as a one-hot array:
UNKNOWN_VALUE = -1
Pipeline(
[
(
"encode_ordinal",
SafeOrdinalEncoder(
feature=self.features_,
unknown_value=self.unknown_value or UNKNOWN_VALUE,
),
),
("one_hot", sklearn.preprocessing.OneHotEncoder(drop=None, sparse=False)),
]
)
[9]:
enc = SafeOneHotEncoder(feature='holiday')
kf = KFold(n_splits=5, shuffle=False)
for train_index, _ in kf.split(data):
enc = enc.fit(data.iloc[train_index])
not_seen = np.setdiff1d(data["holiday"].unique(),
data.iloc[train_index]["holiday"].unique()
)
print(f'Holidays not seen during training {not_seen}')
features = enc.transform(data[data['holiday'].isin(not_seen)])
# check that it is a proper one-hot
assert np.all(features.sum(axis=1) == 1)
print('Holidays not seen during training have non-zero value at column: '
f'{np.argmax(features == 1)}')
features = enc.transform(data[data['holiday'] == '_novalue_'])
# check that it is a proper one-hot
assert np.all(features.sum(axis=1) == 1)
print('... and the most common value also has non-zero value at column: '
f'{np.argmax(features == 1)}')
Holidays not seen during training ['Epiphany' 'New year']
Holidays not seen during training have non-zero value at column: 9
... and the most common value also has non-zero value at column: 9
Holidays not seen during training ['Easter Monday' "International Workers' Day" 'Liberation Day']
Holidays not seen during training have non-zero value at column: 8
... and the most common value also has non-zero value at column: 8
Holidays not seen during training ['Republic Day']
Holidays not seen during training have non-zero value at column: 10
... and the most common value also has non-zero value at column: 10
Holidays not seen during training ['Assumption of Mary to Heaven']
Holidays not seen during training have non-zero value at column: 10
... and the most common value also has non-zero value at column: 10
Holidays not seen during training ['All Saints Day']
Holidays not seen during training have non-zero value at column: 10
... and the most common value also has non-zero value at column: 10
All encoders have an n_features_out_ property after fitting.
[10]:
enc = SafeOneHotEncoder(feature='holiday').fit(data)
assert data['holiday'].nunique() == enc.n_features_out_
TargetClusterEncoder
Next, let’s suppose that we want to lump together all holidays into only two (2) categories. Maybe, for instance, we want to fit a model that predicts energy consumption, but we only have data for one year, and hence not enough information to be confident about the impact of each individual holiday.
We can examine how the target (consumption) changes for each holiday value:
[11]:
to_group = data.loc[data['holiday'] != '_novalue_', ['consumption', 'holiday']]
grouped_mean = to_group.groupby('holiday').mean()
original_idx = grouped_mean.index
[12]:
grouped_mean.index = grouped_mean.index.map(lambda x: (x[:10] + '..') if len(x) > 10 else x)
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(16, 3.54), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
grouped_mean.plot.bar(ax=ax, rot=0)

One approach could be to group holiday values together according to the different levels of the target:
[13]:
disc = KBinsDiscretizer(n_bins=2, encode='ordinal')
bins = disc.fit_transform(grouped_mean)
grouped_mean['bins'] = bins
[14]:
bin_values = [0, 1]
color_list = ['#74a9cf', '#fc8d59']
b2c = dict(zip(bin_values, color_list))
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 3.54), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
grouped_mean['consumption'].plot.bar(ax=ax, rot=0,
color=[b2c[i] for i in grouped_mean['bins']])

We can plot the distribution of the consumption values for each category:
[15]:
mapping = pd.Series(data=grouped_mean['bins'].values, index=original_idx).to_dict()
data['bins'] = data['holiday'].map(lambda x: mapping.get(x))
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(16, 3.54), dpi=96)
layout = (1, 2)
ax0 = plt.subplot2grid(layout, (0, 0))
ax1 = plt.subplot2grid(layout, (0, 1))
subset = data[data['bins'] == 0]
colors = get_colors('tab10', N=subset['holiday'].nunique())
for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
grouped['consumption'].plot.kde(ax=ax0, color=colors[i], bw_method=0.5)
subset = data[data['bins'] == 1]
colors = get_colors('tab10', N=subset['holiday'].nunique())
for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
grouped['consumption'].plot.kde(ax=ax1, color=colors[i], bw_method=0.5)

Going one step further, we could examine not only the mean of the target per holiday value, but also other characteristics of its distribution. To take more aspects of the target's distribution into account, the TargetClusterEncoder clusters the different values of a categorical feature according to the mean, standard deviation and skewness of the corresponding target values, as well as the Wasserstein distance between their distribution and the distribution of all target values (used as reference).
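To make the idea more concrete, here is a minimal sketch of this kind of clustering, written directly with pandas, SciPy and scikit-learn. It is only an illustration of the general approach, not the encoder's actual implementation; in particular, the use of KMeans and StandardScaler here is an assumption.
from scipy.stats import wasserstein_distance
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Per-holiday summary statistics of the target ...
holidays = data.loc[data['holiday'] != '_novalue_']
summary = holidays.groupby('holiday')['consumption'].agg(['mean', 'std', 'skew'])

# ... plus the Wasserstein distance between each holiday's consumption values
# and all consumption values (used as the reference distribution).
reference = data['consumption']
summary['wasserstein'] = [
    wasserstein_distance(holidays.loc[holidays['holiday'] == cat, 'consumption'], reference)
    for cat in summary.index
]

# Cluster the standardized summaries into the requested number of categories.
labels = KMeans(n_clusters=2, random_state=0).fit_predict(
    StandardScaler().fit_transform(summary)
)
sketch_mapping = dict(zip(summary.index, labels))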
[16]:
enc = TargetClusterEncoder(
feature='holiday',
max_n_categories=2,
excluded_categories='_novalue_'
)
X = data[['holiday']]
y = data['consumption']
enc = enc.fit(X, y)
We can update the bins column based on the encoder's mapping between values of holiday and clusters:
[17]:
grouped_mean['bins'] = original_idx.map(lambda x: enc.mapping_[x])
… and plot the new features again:
[18]:
bin_values = [0, 1]
color_list = ['#74a9cf', '#fc8d59']
b2c = dict(zip(bin_values, color_list))
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 3.54), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
grouped_mean['consumption'].plot.bar(
ax=ax,
rot=0,
color=[b2c[i] for i in grouped_mean['bins']]
)

Again, we can plot the target distributions for each category to see what was achieved:
[19]:
data['bins'] = data['holiday'].map(lambda x: enc.mapping_.get(x))
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(16, 3.54), dpi=96)
layout = (1, 2)
ax0 = plt.subplot2grid(layout, (0, 0))
ax1 = plt.subplot2grid(layout, (0, 1))
subset = data[data['bins'] == 0]
colors = get_colors('tab10', N=subset['holiday'].nunique())
for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
grouped['consumption'].plot.kde(ax=ax0, color=colors[i], bw_method=0.5)
subset = data[data['bins'] == 1]
colors = get_colors('tab10', N=subset['holiday'].nunique())
for i, (holiday, grouped) in enumerate(subset.groupby('holiday')):
grouped['consumption'].plot.kde(ax=ax1, color=colors[i], bw_method=0.5)

Not only do the two clusters seem more homogeneous with respect to the distributions they include, but we have also managed to distinguish the holidays according to their consumption profiles:
[20]:
profiles = data.loc[
data['holiday'] != '_novalue_', ['consumption', 'holiday', 'bins']
].copy()
profiles['date'] = profiles.index.date
profiles['time'] = profiles.index.time
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3.54), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
profiles.pivot(index='time', columns='date', values='consumption').plot(
ax=ax,
alpha=0.8,
legend=None,
color=[b2c[i] for i in profiles['bins'].resample('D').first().dropna()]
)
ax.xaxis.set_major_locator(ticker.MultipleLocator(3600*2))

Conditional effect on target
One may be explicitly interested in clustering the holiday feature while taking into account the hour-of-day feature: how similar are the target's values for two distinct values of holiday, given similar values for the hour-of-day?
In this case, the encoder first stratifies the categorical feature holiday into groups with similar values of hour-of-day, and then examines the relationship between the categorical feature's values and the corresponding values of the target.
The stratification is carried out by a sklearn.tree.DecisionTreeRegressor model that is fitted on the stratify_by features (here hour-of-day) with the target as the dependent variable; the tree's leaf nodes are then used as groups. Only the mean of the target's values per group is taken into account when deriving the clusters.
The parameter min_samples_leaf defines the minimum number of samples required to be at a leaf node of the decision tree model. Note that the actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform.
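As a quick arithmetic illustration (assuming that the excluded categories are not counted, which is an assumption about the implementation):
# min_samples_leaf=5 with the 11 holiday values left after excluding '_novalue_'
# means the stratification tree effectively requires 5 * 11 = 55 samples per leaf.
min_samples_leaf = 5
n_unique_categories = data.loc[data['holiday'] != '_novalue_', 'holiday'].nunique()  # 11
print(min_samples_leaf * n_unique_categories)  # 55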
[21]:
enc = TargetClusterEncoder(
feature='holiday',
max_n_categories=2,
excluded_categories='_novalue_',
stratify_by='hour',
min_samples_leaf=5
)
[22]:
data['hour'] = data.index.hour
X = data[['holiday', 'hour']]
y = data['consumption']
enc = enc.fit(X, y)
It is easy to understand the result of this operation if we consider that when the encoder groups holidays stratified by hours, it actually tries to group the daily profiles of the different holidays in the dataset. Since we already achieved this in the previous step, there shouldn’t be any change in the way holidays are grouped:
[23]:
grouped_mean['bins'] = original_idx.map(lambda x: enc.mapping_[x])
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 3.54), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
grouped_mean['consumption'].plot.bar(
ax=ax,
rot=0,
color=[b2c[i] for i in grouped_mean['bins']]
)

CategoricalEncoder
The CategoricalEncoder encodes a categorical feature by encapsulating all the categorical encoders presented so far. If max_n_categories is not None and the number of unique values of the categorical feature (excluding the excluded_categories) is larger than max_n_categories, the TargetClusterEncoder will be called.
If encode_as = 'onehot', the result comes from a TargetClusterEncoder + SafeOneHotEncoder pipeline, otherwise from a TargetClusterEncoder + SafeOrdinalEncoder one:
n_categories = X[self.feature].nunique()
use_target = (self.max_n_categories is not None) and (
n_categories - len(self.excluded_categories_) > self.max_n_categories
)
if not use_target:
self.feature_pipeline_ = Pipeline(
[
(
"encode_features",
SafeOneHotEncoder(
feature=self.feature, unknown_value=self.unknown_value
),
)
if self.encode_as == "onehot"
else (
"encode_features",
SafeOrdinalEncoder(
feature=self.feature, unknown_value=self.unknown_value
),
)
]
)
else:
self.feature_pipeline_ = Pipeline(
[
(
"reduce_dimension",
TargetClusterEncoder(
feature=self.feature,
stratify_by=self.stratify_by,
max_n_categories=self.max_n_categories,
excluded_categories=self.excluded_categories,
unknown_value=self.unknown_value,
min_samples_leaf=self.min_samples_leaf,
max_features=self.max_features,
random_state=self.random_state,
),
),
(
"to_pandas",
FunctionTransformer(self._to_pandas),
),
(
"encode_features",
SafeOneHotEncoder(
feature=self.feature, unknown_value=self.unknown_value
),
)
if self.encode_as == "onehot"
else (
"encode_features",
SafeOrdinalEncoder(
feature=self.feature, unknown_value=self.unknown_value
),
),
]
)
[24]:
max_n_categories = data['holiday'].nunique() + 3
[25]:
enc = CategoricalEncoder(feature='holiday',
max_n_categories=max_n_categories,
encode_as='onehot')
features = enc.fit_transform(X, y)
features[:5]
[25]:
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
[26]:
assert min(data['holiday'].nunique(), max_n_categories) == enc.n_features_out_
[27]:
enc = CategoricalEncoder(feature='holiday',
max_n_categories=max_n_categories,
encode_as='ordinal')
features = enc.fit_transform(X, y)
features[:5]
[27]:
array([[11],
[11],
[11],
[11],
[11]], dtype=int16)
[28]:
assert min(data['holiday'].nunique(), max_n_categories) == np.unique(features).size
[29]:
max_n_categories = data['holiday'].nunique() - 3
[30]:
enc = CategoricalEncoder(feature='holiday',
max_n_categories=max_n_categories,
encode_as='onehot')
features = enc.fit_transform(X, y)
features[:5]
[30]:
array([[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0.]])
[31]:
assert min(data['holiday'].nunique(), max_n_categories) == enc.n_features_out_
[32]:
enc = CategoricalEncoder(feature='holiday',
max_n_categories=max_n_categories,
encode_as='ordinal')
features = enc.fit_transform(X, y)
features[:5]
[32]:
array([[5],
[5],
[5],
[5],
[5]], dtype=int16)
[33]:
assert min(data['holiday'].nunique(), max_n_categories) == np.unique(features).size
An application of the categorical encoder
Suppose we want to use the demo data to predict the energy consumption of the building. The simplest model to use is a model that includes only the hour of the week as a feature. The hour of the week is a categorical feature and it can be encoded in one-hot form:
[34]:
data['hourofweek'] = 24 * data.index.dayofweek + data.index.hour
dmatrix = CategoricalEncoder(feature='hourofweek', encode_as='onehot').fit_transform(data)
We can fit a linear model:
[35]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
… and evaluate it in-sample:
[36]:
pred = model.predict(dmatrix)
[37]:
print(f"In-sample CV(RMSE) (%): {100*mean_squared_error(y, pred, squared=False)/y.mean()}")
In-sample CV(RMSE) (%): 19.332298975680697
The degrees of freedom of the model are:
[38]:
np.linalg.matrix_rank(dmatrix)
[38]:
168
We can ask the CategoricalEncoder to lump together the 168 hour-of-week values into only 60, and repeat the process:
[39]:
X = data[['hourofweek']]
y = data['consumption']
dmatrix = CategoricalEncoder(
feature='hourofweek',
encode_as='onehot',
max_n_categories=60
).fit_transform(X, y)
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)
print(f"In-sample CV(RMSE) (%): {100*mean_squared_error(y, pred, squared=False)/y.mean()}")
In-sample CV(RMSE) (%): 19.349420155191982
This is practically the same performance with one third of the degrees of freedom:
[40]:
np.linalg.matrix_rank(dmatrix)
[40]:
60
Encoding numerical features
This section explains the way numerical encoding can be carried out using feature_encoders. The SplineEncoder takes pandas.DataFrames as input and generates numpy.ndarrays as output.
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
%matplotlib inline
[2]:
from feature_encoders.encode import SplineEncoder
We can create some synthetic data:
[3]:
def f(x):
return 10 + (x * np.sin(x))
[4]:
x_support = np.linspace(0, 15, 100)
y_support = f(x_support)
x_train = np.sort(np.random.choice(x_support[15:-15], size=25, replace=False))
y_train = f(x_train)
[5]:
X_train = pd.DataFrame(data=x_train, columns=['x'])
X_support = pd.DataFrame(data=x_support, columns=['x'])
[6]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 3.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
ax.legend(loc='upper left')

Cubic spline without extrapolation:
[7]:
enc = SplineEncoder(feature='x', n_knots=5, degree=3, strategy='uniform',
extrapolation='constant', include_bias=True,)
model = make_pipeline(enc, LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
pred = model.predict(X_support)
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 3.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
ax.plot(X_support, pred, label='spline approximation')
ax.legend(loc='upper left')

With linear extrapolation:
[8]:
enc = SplineEncoder(feature='x', n_knots=5, degree=3, strategy='uniform',
extrapolation='linear', include_bias=True,)
model = make_pipeline(enc, LinearRegression(fit_intercept=False))
model.fit(X_train, y_train)
pred = model.predict(X_support)
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 3.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
ax.plot(X_support, y_support, label='ground truth', c='#fc8d59')
ax.plot(X_train, y_train, 'o', label='training points', c='#fc8d59')
ax.plot(X_support, pred, label='spline approximation')
ax.legend(loc='upper left')

An application of the spline encoder
The TOWT model for predicting the energy consumption of a building estimates the temperature effect separately for hours of the week with high and with low energy consumption in order to distinguish between occupied and unoccupied periods.
To this end, a flexible curve is fitted on the consumption~temperature relationship, and if more than 65% of the data points that correspond to a specific hour-of-week are above the fitted curve, the corresponding hour is flagged as “Occupied”; otherwise it is flagged as “Unoccupied”.
We can apply this approach using feature_encoders functionality.
Load demo data
[9]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
[10]:
dmatrix = SplineEncoder(feature='temperature',
degree=1,
strategy='uniform'
).fit_transform(data)
model = LinearRegression(fit_intercept=False).fit(dmatrix, data['consumption'])
pred = pd.DataFrame(
data=model.predict(dmatrix),
index=data.index,
columns=['consumption']
)
[11]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
ax.plot(data['temperature'], data['consumption'], 'o', alpha=0.02)
ax.plot(data['temperature'], pred['consumption'])

[12]:
resid = data[['consumption']] - pred[['consumption']]
mask = resid > 0
mask['hourofweek'] = 24 * mask.index.dayofweek + mask.index.hour
occupied = mask.groupby('hourofweek')['consumption'].mean() > 0.65
data['hourofweek'] = 24 * data.index.dayofweek + data.index.hour
data['occupied'] = data['hourofweek'].map(lambda x: occupied[x])
[13]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
ax.scatter(data.loc[data['occupied'], 'temperature'],
data.loc[data['occupied'], 'consumption'],
s=2, alpha=0.2, label='Probably occupied')
ax.scatter(data.loc[~data['occupied'], 'temperature'],
data.loc[~data['occupied'], 'consumption'],
s=2, alpha=0.2, label='Probably not occupied')
ax.legend(fancybox=True, frameon=True, loc='upper left')

Encoding interactions
Interactions are always pairwise and always between encoders (and not features).
The supported interactions are between: (a) categorical and categorical encoders (ICatEncoder), (b) categorical and linear encoders (ICatLinearEncoder), (c) categorical and spline encoders (ICatSplineEncoder), (d) linear and linear encoders (ProductEncoder), and (e) spline and spline encoders (ISplineEncoder).
All encoders have an n_features_out_ property after fitting.
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
%matplotlib inline
[2]:
from feature_encoders.utils import tensor_product, add_constant
from feature_encoders.generate import CyclicalFeatures, DatetimeFeatures
from feature_encoders.encode import (
CategoricalEncoder,
ICatEncoder,
SplineEncoder,
ISplineEncoder,
IdentityEncoder,
ProductEncoder,
ICatLinearEncoder,
ICatSplineEncoder
)
Load demo data
[3]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
Pairwise interactions between categorical features
ICatEncoder encodes the interaction between two categorical features. Both encoders should have the same encode_as parameter.
If encode_as = 'onehot', it returns the tensor product of the results of the two encoders; the tensor product combines, row by row, the results from the first and the second encoder. A small example of the tensor product function:
[4]:
a = np.array([1, 10]).reshape(1, -1)
b = np.array([10, 20, 30]).reshape(1, -1)
tensor_product(a, b)
[4]:
array([[ 10, 20, 30, 100, 200, 300]])
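For reference, here is a minimal sketch of how such a row-wise tensor product could be computed with NumPy broadcasting. It is only an illustration; the library ships its own implementation in feature_encoders.utils.tensor_product.
import numpy as np

def tensor_product_sketch(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise tensor product: for each row, every column of `a` is multiplied
    with every column of `b` (a sketch for illustration only)."""
    return (a[:, :, None] * b[:, None, :]).reshape(a.shape[0], -1)

a = np.array([1, 10]).reshape(1, -1)
b = np.array([10, 20, 30]).reshape(1, -1)
print(tensor_product_sketch(a, b))  # [[ 10  20  30 100 200 300]]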
The easiest way to demonstrate it is by combining hours of day and days of week into hours of week:
[5]:
enc = DatetimeFeatures(subset=['dayofweek', 'hour', 'hourofweek'])
data = enc.fit_transform(data)
[6]:
enc_dow = CategoricalEncoder(feature='dayofweek', encode_as='onehot')
feature_dow = enc_dow.fit_transform(data)
feature_dow.shape
[6]:
(37287, 7)
[7]:
enc_hour = CategoricalEncoder(feature='hour', encode_as='onehot')
feature_hour = enc_hour.fit_transform(data)
feature_hour.shape
[7]:
(37287, 24)
[8]:
enc = ICatEncoder(enc_dow, enc_hour).fit(data)
enc.n_features_out_
[8]:
168
[9]:
assert np.all(enc.transform(data).argmax(axis=1) == data['hourofweek'].values)
If encode_as = 'ordinal', it returns the combinations of the encoders' results, where each combination is a string with a : between the two values:
[10]:
enc_dow = CategoricalEncoder(feature='dayofweek', encode_as='ordinal')
enc_hour = CategoricalEncoder(feature='hour', encode_as='ordinal')
enc = ICatEncoder(enc_dow, enc_hour)
feature_trf = enc.fit_transform(data)
feature_trf
[10]:
array([['0:12'],
['0:12'],
['0:12'],
...,
['5:21'],
['5:21'],
['5:22']], dtype='<U13')
[11]:
assert np.unique(feature_trf).size == 168
Pairwise interactions between numerical features
We can generate data for the “Friedman #1” regression problem:
[12]:
X, y = make_friedman1(n_samples=5000, n_features=5, noise=0.2)
X = pd.DataFrame(data=X, columns=[f'x_{i}' for i in range(5)])
y = pd.Series(data=y, index=X.index)
X.head()
[12]:
x_0 | x_1 | x_2 | x_3 | x_4 | |
---|---|---|---|---|---|
0 | 0.491884 | 0.597237 | 0.017681 | 0.753236 | 0.068667 |
1 | 0.281404 | 0.524407 | 0.769966 | 0.689059 | 0.385223 |
2 | 0.329995 | 0.170124 | 0.208075 | 0.390547 | 0.795747 |
3 | 0.772299 | 0.509458 | 0.309719 | 0.172743 | 0.203104 |
4 | 0.838131 | 0.057251 | 0.461958 | 0.006787 | 0.961568 |
[13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True)
[14]:
enc_0 = SplineEncoder(feature='x_0',
n_knots=5,
degree=3,
strategy="quantile",
extrapolation="constant",
include_bias=True,
)
enc_1 = SplineEncoder(feature='x_1',
n_knots=5,
degree=3,
strategy="quantile",
extrapolation="constant",
include_bias=True,
)
enc_2 = SplineEncoder(feature='x_2',
n_knots=5,
degree=3,
strategy="quantile",
extrapolation="constant",
include_bias=True,
)
enc_3 = SplineEncoder(feature='x_3',
n_knots=5,
degree=3,
strategy="quantile",
extrapolation="constant",
include_bias=True,
)
enc_4 = SplineEncoder(feature='x_4',
n_knots=5,
degree=3,
strategy="quantile",
extrapolation="constant",
include_bias=True,
)
interact = ISplineEncoder(enc_0, enc_1)
[15]:
pipeline = Pipeline([
('features', FeatureUnion([
('inter', interact),
('enc_2', enc_2),
('enc_3', enc_3),
('enc_4', enc_4)
])
),
('regression', LinearRegression(fit_intercept=False))
])
pipeline = pipeline.fit(X_train, y_train)
The root mean squared error is very close to the noise that was injected in the data (0.2):
[16]:
print('Root mean squared out-of-sample error: '
f'{mean_squared_error(np.array(y_test), pipeline.predict(X_test), squared=False)}'
)
Root mean squared out-of-sample error: 0.1980257834415656
Linear interactions are also supported through ProductEncoder. ProductEncoder expects IdentityEncoders, which are utility encoders that return what they are fed.
[17]:
enc_0 = IdentityEncoder(feature='x_0', include_bias=False,)
enc_1 = IdentityEncoder(feature='x_1', include_bias=False,)
interact = ProductEncoder(enc_0, enc_1)
This interaction is practically an element-wise multiplication of the two features:
[18]:
assert np.all(interact.fit_transform(X).squeeze() == X[['x_0', 'x_1']].prod(axis=1))
Pairwise interactions between categorical and numerical features
Suppose that we want to split the hours of the week in the demo data into two distinct categories (according to the similarities of the consumption data) and then model the impact of the outdoor temperature during each one of these categories: consumption ~ temperature:hour_of_week.
First, we can explore the case where we split all the data by the reduced hour_of_week and fit a consumption ~ temperature model to each group:
[19]:
enc_occ = CategoricalEncoder(
feature='hourofweek',
max_n_categories=2,
stratify_by='temperature',
min_samples_leaf=15,
encode_as='ordinal'
)
X = data[['hourofweek', 'temperature']]
y = data['consumption']
data['groups'] = enc_occ.fit_transform(X, y)
[20]:
models = {}
for group, grouped_data in data.groupby('groups'):
model = LinearRegression(fit_intercept=True).fit(grouped_data[['temperature']],
grouped_data['consumption'])
models[group] = model
[21]:
color_list = ['#74a9cf', '#fc8d59']
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 4.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
resid = []
for i, (group, grouped_data) in enumerate(data.groupby('groups')):
pred = models[group].predict(grouped_data[['temperature']])
resid.append(grouped_data['consumption'].values - pred)
ax.plot(grouped_data['temperature'], pred,
label=f'group: {group}', c=color_list[i])
ax.plot(grouped_data['temperature'], grouped_data['consumption'],
'o', c=color_list[i], alpha=0.01)
ax.legend(loc='upper left')

[22]:
print(f'Mean squared error: {np.mean(np.concatenate(resid)**2)}')
Mean squared error: 351555.77145852696
The same result can be achieved by first encoding the hour_of_week feature in one-hot form and then taking the tensor product between its encoding and the temperature feature. In this case, an intercept must be added directly to the temperature feature, so that it is possible to model a different intercept for each level of the categorical feature:
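To see why the added intercept column gives each categorical level its own intercept and slope, here is a toy illustration with made-up values (each row of the numerical part is [1, x]):
import numpy as np
from feature_encoders.utils import tensor_product

onehot = np.array([[1.0, 0.0],    # an observation in level 0
                   [0.0, 1.0]])   # an observation in level 1
x_with_const = np.array([[1.0, 2.0],   # [intercept, temperature]
                         [1.0, 3.0]])

# Columns come in blocks of [intercept, slope], one block per categorical level,
# and only the block of the observation's own level is non-zero.
print(tensor_product(onehot, x_with_const))
# [[1. 2. 0. 0.]
#  [0. 0. 1. 3.]]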
[23]:
enc_occ = CategoricalEncoder(
feature='hourofweek',
max_n_categories=2,
stratify_by='temperature',
min_samples_leaf=15,
encode_as='onehot'
)
feature_cat = enc_occ.fit_transform(X, y)
features = tensor_product(feature_cat, add_constant(X['temperature']))
[24]:
model = LinearRegression(fit_intercept=False).fit(features, y)
pred = model.predict(features)
[25]:
resid = y.values - pred
print(f'Mean squared error: {np.mean(resid**2)}')
Mean squared error: 351555.771458527
[26]:
color_list = ['#74a9cf', '#fc8d59']
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 4.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
for i in range(enc_occ.n_features_out_):
mask = feature_cat[:, i]==1
ax.plot(data['temperature'][mask], pred[mask], label=f'group: {i}',
c=color_list[i])
ax.plot(data['temperature'][mask], data['consumption'][mask], 'o',
c=color_list[i], alpha=0.01)
ax.legend(loc='upper left')

The conclusion here is that for the case of one categorical and one linear numerical feature, we can model the interaction by first encoding the categorical feature in one-hot form and then taking the tensor product between this encoding and the numerical feature.
This is supported by ICatLinearEncoder:
[27]:
enc_occ = CategoricalEncoder(
feature='hourofweek',
max_n_categories=2,
stratify_by='temperature',
min_samples_leaf=15,
encode_as='onehot'
)
enc_num = IdentityEncoder(feature='temperature', include_bias=True)
enc = ICatLinearEncoder(encoder_cat=enc_occ, encoder_num=enc_num)
[28]:
features = enc.fit_transform(X, y)
model = LinearRegression(fit_intercept=False).fit(features, y)
pred = model.predict(features)
resid = y.values - pred
print(f'Mean squared error: {np.mean(resid**2)}')
Mean squared error: 351555.77145852696
Next, we want to encode the temperature feature with splines, so as to capture potential non-linearities.
A “first split, then encode” strategy looks like this:
[29]:
models = {}
encoders = {}
for group, grouped_data in data.groupby('groups'):
enc = SplineEncoder(feature='temperature',
n_knots=3,
degree=1,
strategy='uniform',
extrapolation='constant',
include_bias=True,
)
features = enc.fit_transform(grouped_data)
model = LinearRegression(fit_intercept=False).fit(features, grouped_data['consumption'])
models[group] = model
encoders[group] = enc
[30]:
color_list = ['#74a9cf', '#fc8d59']
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(14, 4.5), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
resid = []
for i, (group, grouped_data) in enumerate(data.groupby('groups')):
features = encoders[group].transform(grouped_data)
pred = models[group].predict(features)
resid.append(grouped_data['consumption'].values - pred)
ax.plot(grouped_data['temperature'], pred, '.', ms=1,
label=f'group: {group}', c=color_list[i])
ax.plot(grouped_data['temperature'], grouped_data['consumption'],
'o', c=color_list[i], alpha=0.01)
ax.legend(loc='upper left')

[31]:
print(f'Mean squared error: {np.mean(np.concatenate(resid)**2)}')
Mean squared error: 329833.6746798132
This “first split, then encode” strategy is implemented by ICatSplineEncoder. Note that:
If the categorical encoder is already fitted, it will not be re-fitted during fit or fit_transform.
The numerical encoder will always be fitted (one encoder per level of the categorical feature).
Since we employ cardinality reduction, the categorical encoder should be fitted using all the data.
[32]:
enc_occ = CategoricalEncoder(
feature='hourofweek',
max_n_categories=2,
stratify_by='temperature',
min_samples_leaf=15,
encode_as='onehot'
)
# Fit the categorical encoder at global level
enc_occ = enc_occ.fit(X, y)
enc_num = SplineEncoder(feature='temperature',
n_knots=3,
degree=1,
strategy='uniform',
extrapolation='constant',
include_bias=True,
)
[33]:
enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_num)
features = enc.fit_transform(X)
model = LinearRegression(fit_intercept=False).fit(features, y)
pred = model.predict(features)
[34]:
resid = y.values - pred
print(f'Mean squared error: {np.mean(resid**2)}')
Mean squared error: 329833.6746798132
Conditional seasonality
By combining a CategoricalEncoder with a CyclicalFeatures generator, we can create features of conditional seasonalities, very similarly to how the Prophet library does it:
[35]:
data = pd.DataFrame(index=pd.date_range(start='1/1/2018', end='31/12/2019', freq='D'))
data['weekday'] = data.index.dayofweek < 5
data.head()
[35]:
weekday | |
---|---|
2018-01-01 | True |
2018-01-02 | True |
2018-01-03 | True |
2018-01-04 | True |
2018-01-05 | True |
[36]:
data = CyclicalFeatures(seasonality='yearly', fourier_order=3).fit_transform(data)
data.head()
[36]:
weekday | yearly_delim_0 | yearly_delim_1 | yearly_delim_2 | yearly_delim_3 | yearly_delim_4 | yearly_delim_5 | |
---|---|---|---|---|---|---|---|
2018-01-01 | True | 0.008601 | 0.999963 | 0.017202 | 0.999852 | 0.025801 | 0.999667 |
2018-01-02 | True | 0.025801 | 0.999667 | 0.051584 | 0.998669 | 0.077334 | 0.997005 |
2018-01-03 | True | 0.042993 | 0.999075 | 0.085906 | 0.996303 | 0.128661 | 0.991689 |
2018-01-04 | True | 0.060172 | 0.998188 | 0.120126 | 0.992759 | 0.179645 | 0.983732 |
2018-01-05 | True | 0.077334 | 0.997005 | 0.154204 | 0.988039 | 0.230151 | 0.973155 |
[37]:
enc_cat = CategoricalEncoder(feature='weekday', encode_as='onehot')
features_cat = enc_cat.fit_transform(data)
features_cat.shape
[37]:
(730, 2)
[38]:
enc_lin = IdentityEncoder(feature='yearly', as_filter=True)
features_cyc = enc_lin.fit_transform(data)
features_cyc.shape
[38]:
(730, 6)
As tensor product:
[39]:
features_tp = tensor_product(features_cat, features_cyc)
features_tp = pd.DataFrame(data=features_tp, index=data.index)
features_tp.shape
[39]:
(730, 12)
[40]:
with plt.style.context('seaborn-whitegrid'):
fig, axs = plt.subplots(features_tp.shape[1], figsize=(14, 10), dpi=96)
for i in range(features_tp.shape[1]):
axs[i].plot(features_tp.loc[:, i])
fig.tight_layout()

[41]:
assert np.all(features_tp.loc[data['weekday'], [0, 1, 2, 3, 4, 5]] == 0)
assert np.all(features_tp.loc[~data['weekday'], [6, 7, 8, 9, 10, 11]] == 0)
The same thing can be achieved by:
[42]:
enc_cat = CategoricalEncoder(feature='weekday', encode_as='onehot')
enc_num = IdentityEncoder(feature='yearly', as_filter=True)
enc = ICatLinearEncoder(encoder_cat=enc_cat, encoder_num=enc_num)
features_enc = enc.fit_transform(data)
features_enc = pd.DataFrame(data=features_enc, index=data.index)
[43]:
assert np.all(features_tp == features_enc)
Note that, for the case of cyclical data, the “first split, then encode” and “first encode, then split” strategies are equivalent, because the encoding uses only the information of each row and not any other value from the same column.
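A quick check of this claim, on a fresh copy of the index used above (a sketch; it assumes, as seen above, that CyclicalFeatures passes the original columns through):
import numpy as np
import pandas as pd
from feature_encoders.generate import CyclicalFeatures

check = pd.DataFrame(index=pd.date_range(start='1/1/2018', end='31/12/2019', freq='D'))
check['weekday'] = check.index.dayofweek < 5

# Encode everything first and subset afterwards ...
encoded_all = CyclicalFeatures(seasonality='yearly', fourier_order=3).fit_transform(check)
# ... versus subsetting first and encoding only the weekday rows.
encoded_weekdays = CyclicalFeatures(seasonality='yearly', fourier_order=3).fit_transform(
    check[check['weekday']]
)

yearly_cols = [col for col in encoded_weekdays.columns if col.startswith('yearly')]
assert np.allclose(encoded_all.loc[check['weekday'], yearly_cols],
                   encoded_weekdays[yearly_cols])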
The functionality for composing linear model features
feature-encoders includes a ModelStructure class for aggregating feature generators and encoders into main effect and pairwise interaction terms for linear regression models.
A ModelStructure instance can get information about features and encoders either from YAML files or through its API.
[1]:
import calendar
import json
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
[2]:
from feature_encoders.utils import load_config
from feature_encoders.compose import ModelStructure, FeatureComposer
from feature_encoders.generate import DatetimeFeatures
from feature_encoders.models import SeasonalPredictor
Reading information from YAML files
feature-encoders expects two YAML files:
Feature generator file
A file that provides a mapping between the name of a feature generator and the classes that should be used for the validation of its inputs and for its creation:
trend:
validate: validate.TrendSchema
generate: generate.TrendFeatures
datetime:
validate: validate.DatetimeSchema
generate: generate.DatetimeFeatures
cyclical:
validate: validate.CyclicalSchema
generate: generate.CyclicalFeatures
By default, ModelStructure searches in feature_encoders.config to find the validation and generation classes, but other packages can be used by providing the fully qualified names of the corresponding classes.
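For example, a feature map that mixes the default short names with fully qualified names from an external package could look like the following (the my_package classes are made up for illustration):
custom_feature_map = {
    "datetime": {
        "validate": "validate.DatetimeSchema",    # resolved inside feature_encoders.config
        "generate": "generate.DatetimeFeatures",
    },
    "weather": {                                  # hypothetical external feature generator
        "validate": "my_package.validation.WeatherSchema",
        "generate": "my_package.features.WeatherFeatures",
    },
}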
Model configuration file
These files have three sections: (a) added features, (b) regressors and (c) interactions.
Added features
The information in this section is passed to one of the feature generators in feature_encoders.generate:
add_features:
time: # the name of the generator
ds: null
type: datetime
remainder: passthrough
subset: month, hourofweek
Regressors
The information for each regressor includes its name, the name of the feature to use and encode in order to create this regressor, the type of the encoder (linear, spline or categorical), and the parameters to pass to the corresponding encoder class from feature_encoders.encode:
regressors:
month: # the name of the regressor
feature: month # the name of the feature
type: categorical
max_n_categories: null
encode_as: onehot
tow: # the name of the regressor
feature: hourofweek # the name of the feature
type: categorical
max_n_categories: 60
encode_as: onehot
flex_temperature:
feature: temperature
type: spline
n_knots: 5
degree: 1
strategy: uniform
extrapolation: constant
interaction_only: true # if True, it will not be included in the main features
Interactions
Interactions can introduce new regressors, reuse regressors that are already defined in the regressors section, and override the parameters of regressors that are already defined there:
interactions:
tow, flex_temperature:
tow:
max_n_categories: 2
stratify_by: temperature
min_samples_leaf: 15
Load configuration files
[3]:
model_conf, feature_conf = load_config(model='towt', features='default')
[4]:
print(json.dumps(model_conf, indent=4))
{
"add_features": {
"time": {
"type": "datetime",
"subset": "month, hourofweek"
}
},
"regressors": {
"month": {
"feature": "month",
"type": "categorical",
"encode_as": "onehot"
},
"tow": {
"feature": "hourofweek",
"type": "categorical",
"max_n_categories": 60,
"encode_as": "onehot"
},
"lin_temperature": {
"feature": "temperature",
"type": "linear"
},
"flex_temperature": {
"feature": "temperature",
"type": "spline",
"n_knots": 5,
"degree": 1,
"strategy": "uniform",
"extrapolation": "constant",
"include_bias": true,
"interaction_only": true
}
},
"interactions": {
"tow, flex_temperature": {
"tow": {
"max_n_categories": 2,
"stratify_by": "temperature",
"min_samples_leaf": 15
}
}
}
}
[5]:
print(json.dumps(feature_conf, indent=4))
{
"trend": {
"validate": "validate.TrendSchema",
"generate": "generate.TrendFeatures"
},
"datetime": {
"validate": "validate.DatetimeSchema",
"generate": "generate.DatetimeFeatures"
},
"cyclical": {
"validate": "validate.CyclicalSchema",
"generate": "generate.CyclicalFeatures"
}
}
Create ModelStructure
[6]:
model_structure = ModelStructure.from_config(model_conf, feature_conf)
[7]:
for key, val in model_structure.components.items():
print(key, '-->', val.keys())
add_features --> dict_keys(['time'])
main_effects --> dict_keys(['month', 'tow', 'lin_temperature'])
interactions --> dict_keys([('tow', 'flex_temperature')])
Create FeatureComposer
Given the model structure, we can create and apply a FeatureComposer:
[8]:
composer = FeatureComposer(model_structure)
Load demo data
[9]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
Use the FeatureComposer
[10]:
X = data[['temperature']]
y = data['consumption']
composer = composer.fit(X, y)
The fit method of the composer calls two methods: _create_new_features and _create_encoders. The feature generators are applied in the same order that they were declared in the YAML configuration file.
[11]:
for item in composer.added_features_:
print(item)
DatetimeFeatures(subset=['month', 'hourofweek'])
[12]:
for name, encoder in composer.encoders_['main_effects'].items():
print('-->', name)
print(encoder)
--> month
CategoricalEncoder(feature='month')
--> tow
CategoricalEncoder(feature='hourofweek', max_n_categories=60)
--> lin_temperature
IdentityEncoder(feature='temperature')
[13]:
for pair_name, encoder in composer.encoders_['interactions'].items():
print('-->', pair_name)
print(encoder)
--> ('tow', 'flex_temperature')
ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek',
max_n_categories=2,
min_samples_leaf=15,
stratify_by=['temperature']),
encoder_num=SplineEncoder(degree=1, feature='temperature'))
After fitting, a composer has a component_names_ attribute:
[14]:
composer.component_names_
[14]:
['lin_temperature', 'month', 'tow', 'tow:flex_temperature']
It also has a component_matrix attribute that shows how the different columns of the design matrix correspond to the different components. This allows us to break down a model's prediction into the additive contribution of each component.
[15]:
composer.component_matrix
[15]:
component | lin_temperature | month | tow | tow:flex_temperature |
---|---|---|---|---|
col | ||||
0 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
3 | 0 | 1 | 0 | 0 |
4 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... |
78 | 0 | 0 | 0 | 1 |
79 | 0 | 0 | 0 | 1 |
80 | 0 | 0 | 0 | 1 |
81 | 0 | 0 | 0 | 1 |
82 | 0 | 0 | 0 | 1 |
83 rows × 4 columns
The design matrix is constructed by transforming the data:
[16]:
design_matrix = composer.transform(X)
[17]:
assert design_matrix.shape[0] == X.shape[0]
assert design_matrix.shape[1] == composer.component_matrix.shape[0]
[18]:
n_features = 0
for encoder in composer.encoders_['main_effects'].values():
n_features += encoder.n_features_out_
for encoder in composer.encoders_['interactions'].values():
n_features += encoder.n_features_out_
assert design_matrix.shape[1] == n_features
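To illustrate how the component matrix supports this breakdown, here is a minimal sketch that fits a plain LinearRegression on the design matrix (only for illustration) and splits its prediction into per-component contributions:
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=False).fit(design_matrix, y)

contributions = pd.DataFrame(index=X.index)
for component in composer.component_names_:
    # Columns of the design matrix that belong to this component.
    cols = composer.component_matrix.index[composer.component_matrix[component] == 1]
    contributions[component] = design_matrix[:, cols] @ model.coef_[cols]

# The per-component contributions add up to the full prediction.
assert np.allclose(contributions.sum(axis=1), model.predict(design_matrix))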
Using the API
An example of using the ModelStructure API can be found in feature_encoders.models.SeasonalPredictor:
def _create_composer(self):
model_structure = ModelStructure()
if self.add_trend:
model_structure = model_structure.add_new_feature(
name="added_trend",
fgen_type=TrendFeatures(
ds=self.ds,
name="growth",
remainder="passthrough",
replace=False,
),
)
model_structure = model_structure.add_main_effect(
name="trend",
enc_type=IdentityEncoder(
feature="growth",
as_filter=False,
include_bias=False,
),
)
for seasonality, props in self.seasonalities_.items():
condition_name = props["condition_name"]
model_structure = model_structure.add_new_feature(
name=seasonality,
fgen_type=CyclicalFeatures(
seasonality=seasonality,
ds=self.ds,
period=props.get("period"),
fourier_order=props.get("fourier_order"),
remainder="passthrough",
replace=False,
),
)
if condition_name is None:
model_structure = model_structure.add_main_effect(
name=seasonality,
enc_type=IdentityEncoder(
feature=seasonality,
as_filter=True,
include_bias=False,
),
)
else:
model_structure = model_structure.add_interaction(
lenc_name=condition_name,
renc_name=seasonality,
lenc_type=CategoricalEncoder(
feature=condition_name, encode_as="onehot"
),
renc_type=IdentityEncoder(
feature=seasonality, as_filter=True, include_bias=False
),
)
return FeatureComposer(model_structure)
[19]:
model = SeasonalPredictor(
ds=None,
add_trend=True,
yearly_seasonality="auto",
weekly_seasonality=False,
daily_seasonality=False,
)
We can add a different daily seasonality per day of week:
[20]:
X = DatetimeFeatures(subset='dayofweek').fit_transform(X)
X['dayofweek'] = X['dayofweek'].map(lambda x: calendar.day_abbr[x])
X = X.merge(pd.get_dummies(X['dayofweek']),
left_index=True,
right_index=True).drop('dayofweek', axis=1
)
[21]:
for i in range(7):
day = calendar.day_abbr[i]
model.add_seasonality(
f"daily_on_{day}", period=1, fourier_order=4, condition_name=day
)
[22]:
model = model.fit(X, y)
[23]:
for item in model.composer_.added_features_:
print(item)
TrendFeatures()
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Mon')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Tue')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Wed')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Thu')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Fri')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Sat')
CyclicalFeatures(fourier_order=4, period=1.0, seasonality='daily_on_Sun')
CyclicalFeatures(fourier_order=6, period=365.25, seasonality='yearly')
[24]:
for name, encoder in model.composer_.encoders_['main_effects'].items():
print('-->', name)
print(encoder)
--> trend
IdentityEncoder(feature='growth')
--> yearly
IdentityEncoder(as_filter=True, feature='yearly')
[25]:
for pair_name, encoder in model.composer_.encoders_['interactions'].items():
print('-->', pair_name)
print(encoder)
--> ('Mon', 'daily_on_Mon')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Mon'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Mon'))
--> ('Tue', 'daily_on_Tue')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Tue'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Tue'))
--> ('Wed', 'daily_on_Wed')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Wed'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Wed'))
--> ('Thu', 'daily_on_Thu')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Thu'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Thu'))
--> ('Fri', 'daily_on_Fri')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Fri'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Fri'))
--> ('Sat', 'daily_on_Sat')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Sat'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Sat'))
--> ('Sun', 'daily_on_Sun')
ICatLinearEncoder(encoder_cat=CategoricalEncoder(feature='Sun'),
encoder_num=IdentityEncoder(as_filter=True,
feature='daily_on_Sun'))
[26]:
prediction = model.predict(X)
[27]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
prediction['consumption'][:1344].plot(ax=ax, alpha=0.8) #2 weeks data
y[:1344].plot(ax=ax, alpha=0.5)

Consistency checks
[28]:
design_matrix = model.composer_.transform(X)
[29]:
for i in range(7):
day = calendar.day_abbr[i]
subset_index = model.composer_.component_matrix[
model.composer_.component_matrix[f'{day}:daily_on_{day}'] == 1
].index
subset = pd.DataFrame(design_matrix[:, subset_index], index=X.index)
features_on = subset.columns[(subset.loc[X[X[day]==1].index] == 0).all()]
features_off = subset.columns[(subset.loc[X[X[day]==0].index] == 0).all()]
assert features_on.intersection(features_off).empty
The model works even if we replace:
model_structure = model_structure.add_interaction(
lenc_name=condition_name,
renc_name=seasonality,
lenc_type=CategoricalEncoder(
feature=condition_name, encode_as="onehot"
),
renc_type=IdentityEncoder(
feature=seasonality, as_filter=True, include_bias=False
),
)
with
model_structure = model_structure.add_interaction(
    lenc_name=condition_name,
    renc_name=seasonality,
    lenc_type="categorical",
    renc_type="linear",
    **{
        condition_name: {"encode_as": "onehot"},
        seasonality: {"as_filter": True, "include_bias": False},
    },
)
because the FeatureComposer maps “categorical” to CategoricalEncoder, “linear” to IdentityEncoder and “spline” to SplineEncoder, and passes all additional keyword arguments to the corresponding initializers.
Applications of feature-encoders
In this section, we present two applications:
one for a simple linear regression model, and
one for a grouped linear regression model (a model that constructs one estimator per data group: it splits the data by the values of a single column and fits one estimator per group).
[1]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
%matplotlib inline
[2]:
from feature_encoders.utils import load_config
from feature_encoders.generate import DatetimeFeatures
from feature_encoders.encode import CategoricalEncoder, SplineEncoder, ICatSplineEncoder
from feature_encoders.compose import ModelStructure
from feature_encoders.models import LinearPredictor, GroupedPredictor
[3]:
def cvrmse(y_true, y_pred):
resid = y_true - y_pred
return float(np.sqrt((resid ** 2).sum() / len(resid)) / np.mean(y_true))
def nmbe(y_true, y_pred):
resid = y_true - y_pred
return float(np.mean(resid) / np.mean(y_true))
Load demo data
The data consists of the energy consumption of a building and the outdoor air temperature.
[4]:
data = pd.read_csv('data/data.csv', parse_dates=[0], index_col=0)
data = data[~data['consumption_outlier']]
X = data[['temperature']]
y = data['consumption']
Linear regression model
The simplest model to use is a model that includes only the hour of the week as a feature. The hour of the week is a categorical feature and it can be encoded in one-hot form:
[5]:
features = DatetimeFeatures(subset='hourofweek', remainder='drop').fit_transform(X)
dmatrix = CategoricalEncoder(feature='hourofweek', encode_as='onehot').fit_transform(features)
We can fit a linear model:
[6]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
… and evaluate it in-sample:
[7]:
pred = model.predict(dmatrix)
[8]:
y_true = y.values
print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 19.332298975680697
In-sample NMBE (%): 7.804127359819421e-15
The degrees of freedom of the model are:
[9]:
np.linalg.matrix_rank(dmatrix)
[9]:
168
The impact of the hour of the week on energy consumption is then:
[10]:
pred = pd.DataFrame(data=pred, index=y.index, columns=['hourofweek_impact'])
date_enc = DatetimeFeatures(remainder='passthrough', subset='hourofweek')
to_plot = date_enc.fit_transform(pred).groupby('hourofweek').mean()
colors = ['#8c510a', '#d8b365', '#f6e8c3', '#f5f5f5', '#c7eae5', '#5ab4ac', '#01665e']
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
intervals = np.split(to_plot.index, 7)
for i, item in enumerate(intervals):
ax.axvspan(item[0], item[-1], alpha=0.3, color=colors[i])
to_plot.plot(ax=ax)
ax.set_xlabel('Hour of week')
ax.legend(['Average contribution of hour-of-week feature'], fancybox=True, frameon=True)

We can reduce the number of categories for the hour-of-week feature while retaining as much of the feature's predictive capability as possible:
[11]:
features = DatetimeFeatures(subset='hourofweek', remainder='drop').fit_transform(X)
enc = CategoricalEncoder(feature='hourofweek', encode_as='onehot', max_n_categories=60)
dmatrix = enc.fit_transform(features, y)
[12]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)
print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 19.34946719265366
In-sample NMBE (%): 4.263718362437927e-14
This is practically the same performance with one third of the degrees of freedom:
[13]:
np.linalg.matrix_rank(dmatrix)
[13]:
60
Another component to include in the model is an interaction term between the hour of the week and the temperature.
The TOWT model estimates the temperature effect separately for periods of the day with high and with low energy consumption in order to distinguish between occupied and unoccupied building periods.
To this end, a flexible curve is fitted on the consumption~temperature relationship, and if more than 65% of the data points that correspond to a specific time-of-week are above the fitted curve, the corresponding hour is flagged as “Occupied”; otherwise it is flagged as “Unoccupied”.
We can apply this approach using feature-encoders functionality:
[14]:
enc = SplineEncoder(feature='temperature', degree=1, strategy='uniform').fit(X)
dmatrix = enc.transform(X)
[15]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)
pred = pd.Series(data=pred, index=y.index)
[16]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
ax.scatter(X['temperature'], y, s=1, alpha=0.2)
X_sorted = X.sort_values(by='temperature')
ax.plot(X_sorted, pred.loc[X_sorted.index], c='#cc4c02')

[17]:
resid = y - pred
mask = resid > 0
mask = DatetimeFeatures(subset='hourofweek').fit_transform(mask.to_frame('freq'))
occupied = mask.groupby('hourofweek')['freq'].mean() > 0.65
occupied = occupied.to_dict()
[18]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X)
features['occupied'] = features['hourofweek'].map(lambda x: occupied[x])
features.head()
[18]:
temperature | hourofweek | occupied | |
---|---|---|---|
timestamp | |||
2015-12-07 12:00:00 | 14.300 | 12 | True |
2015-12-07 12:15:00 | 14.525 | 12 | True |
2015-12-07 12:30:00 | 14.750 | 12 | True |
2015-12-07 12:45:00 | 14.975 | 12 | True |
2015-12-07 13:00:00 | 15.200 | 13 | True |
[19]:
enc_temp = SplineEncoder(feature='temperature', degree=1, strategy='uniform')
enc_occ = CategoricalEncoder(feature='occupied', encode_as='onehot')
enc_occ = enc_occ.fit(features) #fit before passed to the interaction
enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_temp)
dmatrix = enc.fit_transform(features)
[20]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)
print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 18.2731614241806
In-sample NMBE (%): 2.152083291755081e-14
[21]:
np.linalg.matrix_rank(dmatrix)
[21]:
10
Alternatively, we can rely on feature-encoders functionality to categorize the hours of the week into the two most dissimilar categories in terms of energy consumption given temperature information:
[22]:
features = DatetimeFeatures(subset='hourofweek').fit_transform(X)
enc_temp = SplineEncoder(feature='temperature',
degree=1,
strategy='uniform'
)
enc_occ = CategoricalEncoder(feature='hourofweek',
max_n_categories=2,
stratify_by='temperature',
min_samples_leaf=15
)
enc_occ = enc_occ.fit(features, y) #fit before passed to the interaction
enc = ICatSplineEncoder(encoder_cat=enc_occ, encoder_num=enc_temp)
dmatrix = enc.fit_transform(features)
[23]:
model = LinearRegression(fit_intercept=False).fit(dmatrix, y)
pred = model.predict(dmatrix)
print(f"In-sample CV(RMSE) (%): {cvrmse(y_true, pred)*100}")
print(f"In-sample NMBE (%): {nmbe(y_true, pred)*100}")
In-sample CV(RMSE) (%): 17.4373587893191
In-sample NMBE (%): 8.789160508284433e-14
The prediction results are better, while the number of degrees of freedom stays the same:
[24]:
np.linalg.matrix_rank(dmatrix)
[24]:
10
Then, the consumption~temperature curves per category of hour of week are:
[25]:
date_enc = DatetimeFeatures(remainder='passthrough', subset='hourofweek')
intervals = pd.concat(
( pd.cut(X['temperature'], 15, precision=0),
pd.DataFrame(data=pred, index=X.index, columns=['temperature_impact'])
),
axis=1
)
enc_cat = enc_occ.feature_pipeline_['reduce_dimension']
intervals = date_enc.fit_transform(intervals)
intervals['hourofweek'] = intervals['hourofweek'].map(lambda x: enc_cat.mapping_[x])
to_plot = (
intervals.groupby(['hourofweek', 'temperature'])['temperature_impact']
.mean()
.unstack()
)
colors = ['#8c510a', '#df65b0']
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
for i, (idx, values) in enumerate(to_plot.iterrows()):
values.plot(ax=ax, lw=2, alpha=0.6, label=f'category {idx}', color=colors[i])
ax.xaxis.set_major_locator(plt.MaxNLocator(10))
ax.set_xlabel('Temperature intervals')
ax.legend(fancybox=True, frameon=True)

In config/models, there is a YAML file (towt.yaml) that defines a linear regression model with the two components above and two additional ones:
A categorical feature for the different months in the dataset.
A linear term for the temperature as a main effect. The interaction term between temperature and the hour of the week “corrects” the predictions of the temperature’s linear term in the main effects.
[26]:
model_conf, feature_conf = load_config(model='towt', features='default')
[27]:
print(json.dumps(model_conf, indent=4))
{
"add_features": {
"time": {
"type": "datetime",
"subset": "month, hourofweek"
}
},
"regressors": {
"month": {
"feature": "month",
"type": "categorical",
"encode_as": "onehot"
},
"tow": {
"feature": "hourofweek",
"type": "categorical",
"max_n_categories": 60,
"encode_as": "onehot"
},
"lin_temperature": {
"feature": "temperature",
"type": "linear"
},
"flex_temperature": {
"feature": "temperature",
"type": "spline",
"n_knots": 5,
"degree": 1,
"strategy": "uniform",
"extrapolation": "constant",
"include_bias": true,
"interaction_only": true
}
},
"interactions": {
"tow, flex_temperature": {
"tow": {
"max_n_categories": 2,
"stratify_by": "temperature",
"min_samples_leaf": 15
}
}
}
}
The feature_encoders.models.LinearPredictor is a linear/ridge regression model that can be created using model configurations such as the above.
[28]:
model_structure = ModelStructure.from_config(model_conf, feature_conf)
model = LinearPredictor(model_structure=model_structure)
Fit with available data:
[29]:
%%time
model = model.fit(X, y)
Wall time: 2.4 s
Evaluate the model in-sample:
[30]:
%%time
pred = model.predict(X)
print(f"In-sample CV(RMSE) (%): {cvrmse(y, pred['consumption'])*100}")
print(f"In-sample NMBE (%): {nmbe(y, pred['consumption'])*100}")
In-sample CV(RMSE) (%): 15.718435024677973
In-sample NMBE (%): 6.953335826993842e-05
Wall time: 290 ms
The effective number of parameters (i.e. the degrees of freedom) is:
[31]:
model.dof
[31]:
79
This is how the columns of the regression's design matrix correspond to each regressor:
[32]:
model.composer_.component_matrix
[32]:
col | lin_temperature | month | tow | tow:flex_temperature
---|---|---|---|---
0 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
3 | 0 | 1 | 0 | 0 |
4 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... |
78 | 0 | 0 | 0 | 1 |
79 | 0 | 0 | 0 | 1 |
80 | 0 | 0 | 0 | 1 |
81 | 0 | 0 | 0 | 1 |
82 | 0 | 0 | 0 | 1 |
83 rows × 4 columns
This makes it easy to decompose the prediction into its components (the regularization term alpha=0.01 in the LinearPredictor is used primarily so that the individual components take reasonable values):
[33]:
%%time
pred = model.predict(X, include_components=True)
pred.head()
Wall time: 337 ms
[33]:
timestamp | consumption | lin_temperature | month | tow | tow:flex_temperature
---|---|---|---|---|---
2015-12-07 12:00:00 | 4087.304937 | 1569.095673 | 593.424166 | 402.220121 | 1522.564976 |
2015-12-07 12:15:00 | 4084.172916 | 1593.784242 | 593.424166 | 402.220121 | 1494.744387 |
2015-12-07 12:30:00 | 4081.040895 | 1618.472810 | 593.424166 | 402.220121 | 1466.923798 |
2015-12-07 12:45:00 | 4077.908875 | 1643.161378 | 593.424166 | 402.220121 | 1439.103209 |
2015-12-07 13:00:00 | 4144.719000 | 1667.849947 | 593.424166 | 472.162267 | 1411.282620 |
[34]:
assert np.allclose(pred['consumption'],
pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)
[35]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
pred['consumption'][:1344].plot(ax=ax, alpha=0.8) #2 weeks data
y[:1344].plot(ax=ax, alpha=0.5)
[Figure: predicted vs. actual consumption over the first two weeks of data]
Grouped linear regression model
The feature_encoders.models.GroupedPredictor can be applied to different clusters of a dataset. For this example, we assume that the clusters are created by applying KMeans to daily consumption profiles, although there are smarter methods for distinguishing between consumption profiles while ensuring that they remain identifiable during prediction, when no consumption data is available (see, for instance, how the eensight tool for automated M&V approaches this problem).
Since each model in the ensemble predicts on a different subset of the input data (an observation cannot belong to more than one cluster), the final prediction is generated by vertically concatenating the individual models' predictions.
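To make the mechanics concrete, here is a minimal, hypothetical sketch of this per-cluster fit-and-concatenate pattern, written with plain pandas and scikit-learn rather than the library's internals; the fit_per_group and predict_per_group helpers are illustrative names and are not part of feature-encoders:

# Hypothetical sketch of the per-cluster pattern (NOT the GroupedPredictor code):
# fit one model per group, then vertically concatenate the per-group predictions
# and restore the original row order.
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_per_group(X: pd.DataFrame, y: pd.Series, group_feature: str) -> dict:
    """Fit one linear model per value of `group_feature`."""
    models = {}
    for name, idx in X.groupby(group_feature).groups.items():
        feats = X.loc[idx].drop(columns=group_feature)
        models[name] = LinearRegression().fit(feats, y.loc[idx])
    return models

def predict_per_group(models: dict, X: pd.DataFrame, group_feature: str) -> pd.Series:
    """Predict each group with its own model and concatenate the results vertically."""
    parts = []
    for name, idx in X.groupby(group_feature).groups.items():
        feats = X.loc[idx].drop(columns=group_feature)
        parts.append(pd.Series(models[name].predict(feats), index=idx))
    return pd.concat(parts).reindex(X.index)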
[36]:
data['time'] = data.index.time
data['date'] = data.index.date
to_cluster = data.pivot(index='date', columns='time', values='consumption')
to_cluster = to_cluster.fillna(method='bfill').fillna(method='ffill')
[37]:
kmeans = KMeans(n_clusters=3).fit(to_cluster.values)
groups = pd.Series(data=kmeans.labels_, index=to_cluster.index)
data['group'] = data['date'].map(lambda x: str(groups[x]))
[38]:
colors = ['#8c510a', '#3690c0', '#dd3497']
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
for i, (_, grouped) in enumerate(data.groupby('group')):
grouped.pivot(index='time', columns='date', values='consumption').plot(
ax=ax, legend=False, alpha=0.05, color=colors[i])
[Figure: daily consumption profiles, colored by KMeans cluster]
[39]:
X = data[['temperature', 'group']]
y = data['consumption']
[40]:
model = GroupedPredictor(
group_feature='group',
model_conf=model_conf,
feature_conf=feature_conf
)
[41]:
%%time
model = model.fit(X, y)
Wall time: 2.83 s
The GroupedPredictor
applies the feature generation transformers defined in model_conf
directly on the dataset before it is split per cluster:
[42]:
model.added_features_
[42]:
[DatetimeFeatures(subset=['month', 'hourofweek'])]
… whereas the cluster predictors do not see or apply any feature generator:
[43]:
for group, est in model.estimators_.items():
print(group, '-->', est.composer_.added_features_)
0 --> []
1 --> []
2 --> []
In addition, GroupedPredictor
fits all categorical encoders in ordinal form, and then passes the encoded data to each cluster predictor:
[44]:
for name, encoder in model.encoders_['main_effects'].items():
print(name, '-->', encoder)
month --> CategoricalEncoder(encode_as='ordinal', feature='month')
tow --> CategoricalEncoder(encode_as='ordinal', feature='hourofweek',
max_n_categories=60)
… and adds the cluster (group) feature to every stratify_by list that is not empty:
[45]:
for pair_name, encoder in model.encoders_['interactions'].items():
print(pair_name, '-->', encoder)
('tow', 'flex_temperature') --> {'tow': CategoricalEncoder(encode_as='ordinal', feature='hourofweek',
max_n_categories=2, min_samples_leaf=15,
stratify_by=['group', 'temperature'])}
The cluster predictors get encoders that operate on data that has been transformed by the categorical encoders of the GroupedPredictor
. In this way, categorical data is always encoded with full information (while numerical data is encoded at the cluster level):
[46]:
for group, est in model.estimators_.items():
print('group ', group)
for name, encoder in est.composer_.encoders_['main_effects'].items():
print(name, '-->', encoder)
print('\n')
group 0
month --> CategoricalEncoder(feature='month__for__month')
tow --> CategoricalEncoder(feature='hourofweek__for__tow')
lin_temperature --> IdentityEncoder(feature='temperature')
group 1
month --> CategoricalEncoder(feature='month__for__month')
tow --> CategoricalEncoder(feature='hourofweek__for__tow')
lin_temperature --> IdentityEncoder(feature='temperature')
group 2
month --> CategoricalEncoder(feature='month__for__month')
tow --> CategoricalEncoder(feature='hourofweek__for__tow')
lin_temperature --> IdentityEncoder(feature='temperature')
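The benefit of fitting the categorical encoders on the full dataset can be seen with a small, library-independent pandas sketch: if one cluster never observes a given category, a per-cluster one-hot encoding would produce design matrices with inconsistent columns, whereas a shared encoding keeps the category space the same for every cluster:

import pandas as pd

toy = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Jan", "Feb"],
                    "group": [0, 0, 0, 1, 1]})

# Per-cluster encoding: group 1 never observes "Mar", so its design matrix
# lacks that column and does not align with group 0's.
per_cluster = {g: pd.get_dummies(sub["month"]) for g, sub in toy.groupby("group")}
print(per_cluster[0].columns.tolist())   # ['Feb', 'Jan', 'Mar']
print(per_cluster[1].columns.tolist())   # ['Feb', 'Jan']

# Encoding on the full dataset first keeps the columns consistent; each
# cluster then works with the same set of encoded categories.
shared = pd.get_dummies(toy["month"])
print(shared.columns.tolist())           # ['Feb', 'Jan', 'Mar']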
[47]:
for group, est in model.estimators_.items():
print('group ', group)
for name, encoder in est.composer_.encoders_['interactions'].items():
print(name, '-->', encoder)
print('\n')
group 0
('tow', 'flex_temperature') --> ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek__for__tow:flex_temperature',
min_samples_leaf=15),
encoder_num=SplineEncoder(degree=1, feature='temperature'))
group 1
('tow', 'flex_temperature') --> ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek__for__tow:flex_temperature',
min_samples_leaf=15),
encoder_num=SplineEncoder(degree=1, feature='temperature'))
group 2
('tow', 'flex_temperature') --> ICatSplineEncoder(encoder_cat=CategoricalEncoder(feature='hourofweek__for__tow:flex_temperature',
min_samples_leaf=15),
encoder_num=SplineEncoder(degree=1, feature='temperature'))
[48]:
%%time
pred = model.predict(X)
print(f"In-sample CV(RMSE) (%): {cvrmse(y, pred['consumption'])*100}")
print(f"In-sample NMBE (%): {nmbe(y, pred['consumption'])*100}")
In-sample CV(RMSE) (%): 13.537423014233884
In-sample NMBE (%): 0.0001238626531768618
Wall time: 567 ms
The number of parameters is:
[49]:
model.n_parameters
[49]:
242
… and the degrees of freedom:
[50]:
model.dof
[50]:
230
Since we have fitted one LinearPredictor
per cluster, it is still easy to decompose the prediction into components:
[51]:
%%time
pred = model.predict(X, include_components=True)
pred.head()
Wall time: 650 ms
[51]:
timestamp | consumption | lin_temperature | month | tow | tow:flex_temperature
---|---|---|---|---|---
2015-12-07 12:00:00 | 4323.340119 | 1439.591296 | 669.917419 | 535.061043 | 1678.770360 |
2015-12-07 12:15:00 | 4315.882619 | 1462.242208 | 669.917419 | 535.061043 | 1648.661948 |
2015-12-07 12:30:00 | 4308.425119 | 1484.893120 | 669.917419 | 535.061043 | 1618.553536 |
2015-12-07 12:45:00 | 4305.942539 | 1507.544032 | 669.917419 | 535.061043 | 1593.420044 |
2015-12-07 13:00:00 | 4398.279232 | 1530.194944 | 669.917419 | 629.880316 | 1568.286553 |
[52]:
assert np.allclose(pred['consumption'],
pred[[col for col in pred.columns if col != 'consumption']].sum(axis=1)
)
[53]:
with plt.style.context('seaborn-whitegrid'):
fig = plt.figure(figsize=(12, 3), dpi=96)
layout = (1, 1)
ax = plt.subplot2grid(layout, (0, 0))
pred['consumption'][:1344].plot(ax=ax, alpha=0.8) #2 weeks data
y[:1344].plot(ax=ax, alpha=0.5)
[Figure: grouped-model predicted vs. actual consumption over the first two weeks of data]
Getting Help
First, please check the issues on GitHub to see whether your question has already been answered there. If no solution is available, feel free to open a new issue; the authors will attempt to respond in a reasonably timely fashion.
Feature Encoders
Functionality
feature-encoders is a library for encoding categorical and numerical features into inputs for linear regression models. In particular, it includes functionality for:
Applying custom feature generators to a dataset. Users can add a feature generator to the existing ones by declaring a class for validating its inputs and a class for creating the features.
Encoding categorical and numerical features. The categorical encoder provides the option to reduce the cardinality of a categorical feature by lumping together categories for which the corresponding distribution of the target values is similar.
Encoding interactions. Interactions are always pairwise and always between encoders (and not features). The supported interactions are between: (a) categorical and categorical encoders, (b) categorical and linear encoders, (c) categorical and spline encoders, (d) linear and linear encoders, and (e) spline and spline encoders.
Composing features for linear regression. feature-encoders includes a ModelStructure class for aggregating feature generators and encoders into main effect and pairwise interaction terms for linear regression models. A ModelStructure instance can get information about additional features and encoders either from YAML files or through its API (a short sketch follows this list).
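As a compact illustration of this configuration-driven workflow (shown at length in the tutorial above), the sketch below builds a LinearPredictor from YAML configurations; the import paths are assumptions and may differ from the installed package layout:

# Compact sketch of the configuration-driven workflow; import paths are assumed,
# not taken from the package documentation.
from feature_encoders.compose import ModelStructure       # assumed import path
from feature_encoders.models import LinearPredictor        # assumed import path
from feature_encoders.utils import load_config             # hypothetical location

# X: a DataFrame with the raw inputs (e.g. a datetime index and a 'temperature'
#    column); y: the target series, as in the tutorial above.
model_conf, feature_conf = load_config(model="towt", features="default")
structure = ModelStructure.from_config(model_conf, feature_conf)

model = LinearPredictor(model_structure=structure).fit(X, y)
pred = model.predict(X, include_components=True)  # prediction decomposed per regressor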
How to use feature-encoders
Please see our API documentation for a complete list of available functions and see our informative tutorials for more comprehensive example use cases.
Python Version
feature-encoders supports Python 3.7+
License
Copyright 2021 Hebes Intelligence. Released under the terms of the Apache License, Version 2.0.
