feature_encoders.encode package

Module contents

class feature_encoders.encode.CategoricalEncoder(*, feature, max_n_categories=None, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=1, max_features='auto', random_state=None, encode_as='onehot')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode categorical features.

If max_n_categories is not None and the number of unique values of the categorical feature, minus the excluded_categories, is larger than max_n_categories, the TargetClusterEncoder will be called.

If encode_as = ‘onehot’, the result comes from a TargetClusterEncoder + SafeOneHotEncoder pipeline, otherwise from a TargetClusterEncoder + SafeOrdinalEncoder one.
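The dispatch condition described above can be sketched in plain Python. This only illustrates the documented rule and is not the library's code: the helper name needs_target_clustering is hypothetical, and the comparison follows the wording used for stratify_by and for the fit errors (unique categories minus the excluded ones, compared with max_n_categories).

```python
def needs_target_clustering(categories, max_n_categories=None, excluded_categories=None):
    """Return True when the documented rule would call for TargetClusterEncoder.

    Hypothetical helper: it only mirrors the condition stated in the text.
    """
    if max_n_categories is None:
        return False
    if isinstance(excluded_categories, str):
        excluded_categories = [excluded_categories]
    excluded = set(excluded_categories or [])
    # Count the unique categories that would take part in the clustering.
    n_candidates = len(set(categories) - excluded)
    return n_candidates > max_n_categories


months = ["jan", "feb", "mar", "apr", "may", "jun"]
print(needs_target_clustering(months))                      # no cap set
print(needs_target_clustering(months, max_n_categories=4))  # 6 candidates vs. cap of 4
print(needs_target_clustering(months, max_n_categories=4,
                              excluded_categories=["jan", "feb"]))  # 4 candidates vs. cap of 4
```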

Parameters

feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.
max_n_categories (int, optional) – The maximum number of categories to produce. Defaults to None.
stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.
excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories are left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.
min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 1.
max_features (int, float or {"auto", "sqrt", "log2"}, optional) –
The number of features that the decision tree considers when looking for the best split:
- If int, then consider max_features features at each split of the decision tree
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split
- If “auto”, then max_features=n_features
- If “sqrt”, then max_features=sqrt(n_features)
- If “log2”, then max_features=log2(n_features)
- If None, then max_features=n_features
Defaults to “auto”.
random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.
encode_as ({'onehot', 'ordinal'}, optional) –
Method used to encode the transformed result.
- If “onehot”, encode the transformed result with one-hot encoding and return a dense array
- If “ordinal”, encode the transformed result as integer values
Defaults to “onehot”.
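The min_samples_leaf scaling and the max_features options listed above can be made concrete with a small calculation. A minimal sketch, assuming the rules apply exactly as written in the parameter descriptions; both helper names are hypothetical, and flooring the "sqrt" and "log2" results to an integer is an assumption the text does not state.

```python
import math


def effective_min_samples_leaf(min_samples_leaf, n_unique_categories):
    # Documented rule: the tree receives min_samples_leaf multiplied by the
    # number of unique values of the categorical feature.
    return min_samples_leaf * n_unique_categories


def resolve_max_features(max_features, n_features):
    # Mirrors the documented options; flooring to int is an assumption.
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        return int(max_features * n_features)
    raise ValueError(f"Unsupported max_features: {max_features!r}")


# A feature with 12 unique categories and min_samples_leaf=5.
print(effective_min_samples_leaf(5, 12))
print(resolve_max_features("sqrt", 9))
```

With many categories the effective leaf size grows quickly, so a small min_samples_leaf can still produce a shallow stratification tree.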

fit(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.frame.DataFrame] = None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (pandas.DataFrame of shape (n_samples, 1), optional) – The target dataframe. Defaults to None.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
ValueError – If the number of categories minus the excluded_categories is larger than max_n_categories but target values (y) are not provided.
ValueError – If any of the values in excluded_categories is not found in the input data.

Returns

Fitted encoder.

Return type

CategoricalEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Raises: ValueError – If the input data does not pass the checks of utils.check_X.
Returns: The encoded features as a numpy array.
Return type: numpy array

class feature_encoders.encode.ICatEncoder(encoder_left: feature_encoders.encode._encoders.CategoricalEncoder, encoder_right: feature_encoders.encode._encoders.CategoricalEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between two categorical features.

Interactions are always pairwise and always between encoders (and not features).

Parameters

encoder_left (CategoricalEncoder) – The encoder for the first of the two features.
encoder_right (CategoricalEncoder) – The encoder for the second of the two features.

Raises

ValueError – If either of the two encoders is not a CategoricalEncoder.
ValueError – If the two encoders do not have the same encode_as parameter.

Note

Both encoders should have the same encode_as parameter. If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
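For one-hot encoded inputs, a pairwise interaction is commonly built as the row-wise product of every left column with every right column. The sketch below illustrates that idea in plain Python. Whether ICatEncoder uses exactly this construction is an assumption, and rowwise_interaction is a hypothetical helper.

```python
def rowwise_interaction(left, right):
    """Multiply every column of `left` with every column of `right`, row by row."""
    if len(left) != len(right):
        raise ValueError("Both inputs must have the same number of rows.")
    return [
        [a * b for a in row_left for b in row_right]
        for row_left, row_right in zip(left, right)
    ]


# One-hot rows for two features with 2 and 3 categories respectively.
season = [[1, 0], [0, 1]]
daytype = [[0, 1, 0], [0, 0, 1]]
interaction = rowwise_interaction(season, daytype)
print(interaction)
```

Each output row has n_left * n_right columns, with a single 1 marking the observed combination of categories.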

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ICatEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Returns: The matrix of interaction features as a numpy array.
Return type: numpy array

class feature_encoders.encode.ICatLinearEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.IdentityEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between one categorical and one linear numerical feature.

Parameters

encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.
encoder_num (IdentityEncoder) – The encoder for the numerical feature.

Raises

ValueError – If encoder_cat is not a CategoricalEncoder.
ValueError – If encoder_num is not an IdentityEncoder.
ValueError – If encoder_cat is not encoded as one-hot.

Note

If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
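The interaction between a one-hot encoded feature and a numerical one can be pictured as scaling each indicator column by the numerical value, which gives a linear model one slope per category. A minimal sketch under that assumption follows; the source does not spell out the construction, and cat_linear_interaction is a hypothetical name.

```python
def cat_linear_interaction(onehot_rows, values):
    """Scale each one-hot row by the matching numerical value."""
    if len(onehot_rows) != len(values):
        raise ValueError("Both inputs must have the same number of rows.")
    return [
        [indicator * value for indicator in row]
        for row, value in zip(onehot_rows, values)
    ]


# Two categories (columns) and three observations (rows).
onehot = [[1, 0], [0, 1], [1, 0]]
temperature = [10.0, 12.5, 7.0]
scaled = cat_linear_interaction(onehot, temperature)
print(scaled)
```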

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ICatLinearEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Returns: The matrix of interaction features as a numpy array.
Return type: numpy array

class feature_encoders.encode.ICatSplineEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.SplineEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between one categorical and one spline-encoded numerical feature.

Parameters

encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.
encoder_num (SplineEncoder) – The encoder for the numerical feature.

Raises

ValueError – If encoder_cat is not a CategoricalEncoder.
ValueError – If encoder_num is not a SplineEncoder.
ValueError – If encoder_cat is not encoded as one-hot.

Note

If the categorical encoder is already fitted, it will not be re-fitted during fit or fit_transform. The numerical encoder will always be (re)fitted (one encoder per level of the categorical feature).

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ICatSplineEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Returns: The matrix of interaction features as a numpy array.
Return type: numpy array

class feature_encoders.encode.ISplineEncoder(encoder_left: feature_encoders.encode._encoders.SplineEncoder, encoder_right: feature_encoders.encode._encoders.SplineEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between two spline-encoded numerical features.

Parameters

encoder_left (SplineEncoder) – The encoder for the first of the two features.
encoder_right (SplineEncoder) – The encoder for the second of the two features.

Raises

ValueError – If either of the two encoders is not a SplineEncoder.

Note

If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

ISplineEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Returns: The matrix of interaction features as a numpy array.
Return type: numpy array

class feature_encoders.encode.IdentityEncoder(feature=None, as_filter=False, include_bias=False)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Create an encoder that returns what it is fed.

This encoder can act as a linear feature encoder.

Parameters

feature (str or list of str, optional) – The name(s) of the input dataframe’s column(s) to return. If None, the whole input dataframe will be returned. Defaults to None.
as_filter (bool, optional) – If True, the encoder will return all columns whose label contains feature as a substring (that is, for which “feature in label” is True). Defaults to False.
include_bias (bool, optional) – If True, a column of ones is added to the output. Defaults to False.
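The as_filter rule amounts to a substring match on column labels. A minimal sketch in plain Python, assuming the labels are strings; select_labels is a hypothetical helper and not part of the package.

```python
def select_labels(labels, feature=None, as_filter=False):
    """Pick column labels the way the feature/as_filter parameters describe."""
    if feature is None:
        # No selection: every column is returned.
        return list(labels)
    if as_filter:
        if not isinstance(feature, str):
            raise ValueError("With as_filter=True, feature must be a single name.")
        # Keep every label that contains `feature` as a substring.
        return [label for label in labels if feature in label]
    # Exact names: a single string or a list of strings.
    return [feature] if isinstance(feature, str) else list(feature)


labels = ["temperature", "temperature_lag_1", "humidity"]
print(select_labels(labels, "temperature"))
print(select_labels(labels, "temperature", as_filter=True))
```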

Raises

ValueError – If as_filter is True and feature includes multiple feature names.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

Fitted encoder.

Return type

IdentityEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If include_bias is True and a column with constant values already exists in the returned columns.

Returns

The selected column subset as a numpy array.

Return type

numpy array

class feature_encoders.encode.ProductEncoder(encoder_left: feature_encoders.encode._encoders.IdentityEncoder, encoder_right: feature_encoders.encode._encoders.IdentityEncoder)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode the interaction between two linear numerical features.

Parameters

encoder_left (IdentityEncoder) – The encoder for the first of the two features.
encoder_right (IdentityEncoder) – The encoder for the second of the two features.

Raises

ValueError – If either of the two encoders is not an IdentityEncoder.

Note

If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Raises

ValueError – If either of the two encoders is not a single-feature encoder.

Returns

Fitted encoder.

Return type

ProductEncoder

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Returns: The matrix of interaction features as a numpy array.
Return type: numpy array

class feature_encoders.encode.SafeOneHotEncoder(feature=None, unknown_value=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode categorical features in a one-hot form.

The encoder uses a SafeOrdinalEncoder to first encode the feature as an integer array and then a sklearn.preprocessing.OneHotEncoder to encode the features as a one-hot array.

Parameters

feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.
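The second step of the pipeline described above, expanding integer codes into one-hot columns, can be sketched without any dependencies. This is an illustration only; onehot_from_codes is a hypothetical helper, and the input codes are assumed to come from an ordinal encoding with values 0 to n_categories - 1.

```python
def onehot_from_codes(codes, n_categories):
    """Expand integer codes in [0, n_categories) into dense one-hot rows."""
    rows = []
    for code in codes:
        if not 0 <= code < n_categories:
            raise ValueError(f"Code {code} is outside 0..{n_categories - 1}.")
        rows.append([1 if index == code else 0 for index in range(n_categories)])
    return rows


# Ordinal codes for a feature with three categories.
onehot = onehot_from_codes([2, 0, 1], n_categories=3)
print(onehot)
```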

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

SafeOneHotEncoder

Raises

ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Raises: ValueError – If the input data does not pass the checks of utils.check_X.
Returns: The encoded column subset as a numpy array.
Return type: numpy array

class feature_encoders.encode.SafeOrdinalEncoder(feature=None, unknown_value=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode categorical features as an integer array.

The encoder converts the features into ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

Parameters

feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value for unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.
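The handling of unknown categories described above can be sketched as a two-stage process: unseen categories first receive the unknown_value marker, and the marker is then replaced by the most frequent code. A minimal sketch in plain Python, assuming that categories are ordered alphabetically and that "most frequent" refers to the data seen during fit; both are assumptions, and the helper names are hypothetical.

```python
from collections import Counter


def fit_ordinal(train_values):
    """Learn a code per category (0..n_categories-1) and the most frequent code."""
    categories = sorted(set(train_values))
    mapping = {category: code for code, category in enumerate(categories)}
    most_frequent_code = mapping[Counter(train_values).most_common(1)[0][0]]
    return mapping, most_frequent_code


def transform_ordinal(values, mapping, most_frequent_code, unknown_value=None):
    """Encode values; unknown categories are marked, then imputed."""
    marker = -1 if unknown_value is None else unknown_value
    encoded = [mapping.get(value, marker) for value in values]
    # Replace the marker with the most frequent code learned during fit.
    return [most_frequent_code if code == marker else code for code in encoded]


mapping, most_frequent_code = fit_ordinal(["low", "high", "high", "mid"])
encoded = transform_ordinal(["mid", "unseen", "low"], mapping, most_frequent_code)
print(mapping)
print(encoded)
```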

fit(X: pandas.core.frame.DataFrame, y=None)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.

Returns

Fitted encoder.

Return type

SafeOrdinalEncoder

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Raises: ValueError – If the input data does not pass the checks of utils.check_X.
Returns: The encoded column subset as a numpy array.
Return type: numpy array

class feature_encoders.encode.SplineEncoder(*, feature, n_knots=5, degree=3, strategy='uniform', extrapolation='constant', include_bias=True, order='C')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Generate univariate B-spline bases for features.

The encoder generates a matrix consisting of n_splines = n_knots + degree - 1 spline basis functions (B-splines) of polynomial order degree for the given feature.
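The column count follows directly from the formula above. A small worked sketch, where n_spline_columns is a hypothetical helper and the one-column reduction for include_bias=False follows the description of that parameter:

```python
def n_spline_columns(n_knots=5, degree=3, include_bias=True):
    """Number of B-spline basis columns produced for a single feature."""
    if n_knots < 2:
        raise ValueError("n_knots must be at least 2.")
    if degree < 0:
        raise ValueError("degree must be a non-negative integer.")
    n_splines = n_knots + degree - 1
    # Without the bias, one basis function is dropped.
    return n_splines if include_bias else n_splines - 1


print(n_spline_columns())                    # defaults: 5 + 3 - 1
print(n_spline_columns(include_bias=False))  # one column fewer
print(n_spline_columns(n_knots=10, degree=1))
```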

Parameters

feature (str) – The name of the column to encode.
n_knots (int, optional) – Number of knots of the splines if strategy equals one of {‘uniform’, ‘quantile’}. Must be greater than or equal to 2. Ignored if strategy is array-like. Defaults to 5.
degree (int, optional) – The polynomial degree of the spline basis. Must be a non-negative integer. Defaults to 3.
strategy ({'uniform', 'quantile'} or array-like of shape (n_knots, n_features), optional) –
Set knot positions such that first knot <= features <= last knot.
- If ‘uniform’, n_knots number of knots are distributed uniformly from min to max values of the features (each bin has the same width)
- If ‘quantile’, they are distributed uniformly along the quantiles of the features (each bin has the same number of observations)
- If an array-like is given, it directly specifies the sorted knot positions including the boundary knots. Note that, internally, degree number of knots are added before the first knot, the same after the last knot
Defaults to “uniform”.
extrapolation ({'error', 'constant', 'linear', 'continue'}, optional) – If ‘error’, values outside the min and max values of the training features raise a ValueError. If ‘constant’, the value of the splines at the minimum and maximum value of the features is used as constant extrapolation. If ‘linear’, a linear extrapolation is used. If ‘continue’, the splines are extrapolated as is, i.e. with the option extrapolate=True in scipy.interpolate.BSpline. Defaults to “constant”.
include_bias (bool, optional) – If False, then the last spline element inside the data range of a feature is dropped. As B-splines sum to one over the spline basis functions for each data point, they implicitly include a bias term. Defaults to True.
order ({'C', 'F'}, optional) – Order of output array. ‘F’ order is faster to compute, but may slow down subsequent estimators. Defaults to “C”.
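The difference between the 'uniform' and 'quantile' strategies above is easiest to see on skewed data. A minimal sketch in plain Python; the helper names are hypothetical, and the quantile computation uses simple linear interpolation between sorted values, which is an assumption about the exact method.

```python
def uniform_knots(values, n_knots=5):
    """Evenly spaced knots from min to max (bins of equal width)."""
    low, high = min(values), max(values)
    step = (high - low) / (n_knots - 1)
    return [low + i * step for i in range(n_knots)]


def quantile_knots(values, n_knots=5):
    """Knots at evenly spaced quantiles, using linear interpolation."""
    ordered = sorted(values)
    knots = []
    for i in range(n_knots):
        position = i * (len(ordered) - 1) / (n_knots - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        knots.append(ordered[lower] + fraction * (ordered[upper] - ordered[lower]))
    return knots


# One large outlier stretches the uniform knots but not the quantile ones.
data = [0.0, 1.0, 2.0, 3.0, 100.0]
print(uniform_knots(data, n_knots=3))
print(quantile_knots(data, n_knots=3))
```

With the outlier at 100, the uniform strategy places its middle knot at 50, far from most observations, while the quantile strategy places it at the median.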

fit(X: pandas.core.frame.DataFrame, y=None, sample_weight=None)[source]

Fit the encoder.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The data to fit.
y (None, optional) – Ignored. Defaults to None.
sample_weight (array-like of shape (n_samples,), optional) – Individual weights for each sample. Used to calculate quantiles if strategy=“quantile”. For strategy=“uniform”, zero-weighted observations are ignored when finding the min and max of X. Defaults to None.

Raises

ValueError – If the input data does not pass the checks of utils.check_X.

Returns

Fitted encoder.

Return type

SplineEncoder

transform(X)[source]

Transform the feature data to B-splines.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The data to transform.
Raises: ValueError – If the input data does not pass the checks of utils.check_X.
Returns: The B-splines matrix.
Return type: numpy.ndarray

class feature_encoders.encode.TargetClusterEncoder(*, feature, max_n_categories, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=5, max_features='auto', random_state=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Encode a categorical feature as clusters of the target’s values.

The purpose of this encoder is to reduce the cardinality of a categorical feature. This encoder does not replace unknown values with the most frequent one during transform. It just assigns them the value of unknown_value.
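The idea of reducing cardinality by grouping categories with similar target values can be illustrated with a deliberately simple stand-in. This section does not describe the library's clustering method, so the sketch below swaps in a much simpler technique: it ranks categories by their mean target and cuts the ranking into contiguous groups of near-equal size. All names are hypothetical.

```python
from collections import defaultdict


def cluster_by_target_mean(categories, targets, max_n_categories):
    """Assign each category a cluster id in 0..max_n_categories-1.

    Simplified stand-in: rank categories by mean target, then split the
    ranking into contiguous groups of near-equal size.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    ranked = sorted(sums, key=lambda category: sums[category] / counts[category])
    n_clusters = min(max_n_categories, len(ranked))
    return {
        category: rank * n_clusters // len(ranked)
        for rank, category in enumerate(ranked)
    }


def encode(values, clusters, unknown_value=-1):
    # Unknown categories are not imputed; they receive unknown_value.
    return [clusters.get(value, unknown_value) for value in values]


hours = ["h0", "h1", "h2", "h3", "h0", "h1", "h2", "h3"]
load = [1.0, 1.2, 5.0, 5.5, 0.8, 1.0, 5.2, 5.3]
clusters = cluster_by_target_mean(hours, load, max_n_categories=2)
print(clusters)
print(encode(["h3", "h0", "h9"], clusters))
```

The unseen category "h9" keeps the unknown_value marker, matching the statement that this encoder does not replace unknown values with the most frequent one.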

Parameters

feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.
max_n_categories (int) – The maximum number of categories to produce.
stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.
excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories are left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.
min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 5.
max_features (int, float or {"auto", "sqrt", "log2"}, optional) –
The number of features that the decision tree considers when looking for the best split:
- If int, then consider max_features features at each split of the decision tree
- If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split
- If “auto”, then max_features=n_features
- If “sqrt”, then max_features=sqrt(n_features)
- If “log2”, then max_features=log2(n_features)
- If None, then max_features=n_features
Defaults to “auto”.
random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.

fit(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame)[source]

Fit the encoder on the available data.

Parameters

X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (pandas.DataFrame of shape (n_samples, 1)) – The target dataframe.

Returns

Fitted encoder.

Return type

TargetClusterEncoder

Raises

ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
ValueError – If any of the values in excluded_categories is not found in the input data.
ValueError – If the number of categories left after removing all in excluded_categories is not larger than max_n_categories.

transform(X: pandas.core.frame.DataFrame)[source]

Apply the encoder.

Parameters: X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
Returns: The encoded column subset as a numpy array.
Return type: numpy array
Raises: ValueError – If the input data does not pass the checks of utils.check_X.