feature_encoders.encode package
Module contents
- class feature_encoders.encode.CategoricalEncoder(*, feature, max_n_categories=None, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=1, max_features='auto', random_state=None, encode_as='onehot')[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode categorical features.
If max_n_categories is not None and the number of unique values of the categorical feature, minus the excluded_categories, is larger than max_n_categories, the TargetClusterEncoder will be called.
If encode_as = ‘onehot’, the result comes from a TargetClusterEncoder + SafeOneHotEncoder pipeline, otherwise from a TargetClusterEncoder + SafeOrdinalEncoder one.
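The condition for invoking the clustering step can be sketched as a small helper. This is only an illustration of the documented rule (unique categories left after removing excluded_categories exceed max_n_categories), not the library's code; the name `needs_clustering` is hypothetical.

```python
def needs_clustering(categories, max_n_categories=None, excluded_categories=None):
    """Return True when the documented condition for clustering holds.

    Hypothetical helper: it mirrors the rule in the text, nothing more.
    """
    if max_n_categories is None:
        return False
    if isinstance(excluded_categories, str):
        excluded_categories = [excluded_categories]
    excluded = set(excluded_categories or [])
    # Count the unique categories that are still subject to clustering.
    n_remaining = len(set(categories) - excluded)
    return n_remaining > max_n_categories


months = ["jan", "feb", "mar", "apr", "may"]
print(needs_clustering(months, max_n_categories=3))
print(needs_clustering(months, max_n_categories=3, excluded_categories=["jan", "feb"]))
```

With five unique values and max_n_categories=3 the first call reports that clustering is needed; excluding two categories leaves three, so the second call does not.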
- Parameters
feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.
max_n_categories (int, optional) – The maximum number of categories to produce. Defaults to None.
stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.
excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories are left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.
min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 1.
max_features (int, float or {"auto", "sqrt", "log2"}, optional) –
The number of features that the decision tree considers when looking for the best split:
If int, then consider max_features features at each split of the decision tree
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split
If “auto”, then max_features=n_features
If “sqrt”, then max_features=sqrt(n_features)
If “log2”, then max_features=log2(n_features)
If None, then max_features=n_features
Defaults to “auto”.
random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.
encode_as ({'onehot', 'ordinal'}, optional) –
Method used to encode the transformed result.
If “onehot”, encode the transformed result with one-hot encoding and return a dense array
If “ordinal”, encode the transformed result as integer values
Defaults to “onehot”.
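The arithmetic behind min_samples_leaf and max_features, as described in the parameter list above, can be sketched as follows. Both function names are hypothetical, and truncating the "sqrt" and "log2" results to integers is an assumption of this sketch rather than documented behaviour.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a feature count (illustrative)."""
    if max_features is None or max_features == "auto":
        return n_features
    if max_features == "sqrt":
        # Truncation to an integer is an assumption of this sketch.
        return int(math.sqrt(n_features))
    if max_features == "log2":
        return int(math.log2(n_features))
    if isinstance(max_features, float):
        # A float is read as a fraction of the available features.
        return int(max_features * n_features)
    return int(max_features)


def effective_min_samples_leaf(min_samples_leaf, n_unique_categories):
    """Value handed to the tree: min_samples_leaf times the number of unique categories."""
    return min_samples_leaf * n_unique_categories


print(resolve_max_features("sqrt", 9))
print(resolve_max_features(0.5, 10))
print(effective_min_samples_leaf(5, 12))
```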
- fit(X: pandas.core.frame.DataFrame, y: Optional[pandas.core.frame.DataFrame] = None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (pandas.DataFrame of shape (n_samples, 1), optional) – The target dataframe. Defaults to None.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
ValueError – If the number of categories minus the excluded_categories is larger than max_n_categories but target values (y) are not provided.
ValueError – If any of the values in excluded_categories is not found in the input data.
- Returns
Fitted encoder.
- Return type
CategoricalEncoder
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The encoded features as a numpy array.
- Return type
numpy array
- class feature_encoders.encode.ICatEncoder(encoder_left: feature_encoders.encode._encoders.CategoricalEncoder, encoder_right: feature_encoders.encode._encoders.CategoricalEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between two categorical features.
Interactions are always pairwise and always between encoders (and not features).
- Parameters
encoder_left (CategoricalEncoder) – The encoder for the first of the two features.
encoder_right (CategoricalEncoder) – The encoder for the second of the two features.
- Raises
ValueError – If either of the two encoders is not a CategoricalEncoder.
ValueError – If the two encoders do not have the same encode_as parameter.
Note
Both encoders should have the same encode_as parameter. If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
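This section does not spell out how the pairwise interaction is computed. One common construction for two one-hot encodings, shown here purely as an assumption, is the row-wise outer product, which yields one indicator column per pair of categories. The arrays below are made-up encoder outputs.

```python
import numpy as np

# One-hot output of two hypothetical categorical encoders for 3 samples.
left = np.array([[1, 0], [0, 1], [1, 0]])            # 2 categories
right = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])  # 3 categories

# Row-wise outer product: one column per (left, right) category pair.
interaction = np.einsum("ij,ik->ijk", left, right).reshape(len(left), -1)

print(interaction.shape)
print(interaction)
```

Each row of the result contains a single 1, marking the combination of categories observed for that sample.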
- class feature_encoders.encode.ICatLinearEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.IdentityEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between one categorical and one linear numerical feature.
- Parameters
encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.
encoder_num (IdentityEncoder) – The encoder for the numerical feature.
- Raises
ValueError – If encoder_cat is not a CategoricalEncoder.
ValueError – If encoder_num is not an IdentityEncoder.
ValueError – If encoder_cat is not encoded as one-hot.
Note
If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
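A plausible reading of why the categorical encoder must be one-hot is that each indicator column is multiplied by the numerical feature, so a linear model can learn one slope per category. The sketch below illustrates that idea with made-up arrays; it is an assumption about the construction, not the library's implementation.

```python
import numpy as np

# One-hot encoding of a two-level categorical feature (4 samples).
onehot = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
# The numerical feature, passed through unchanged.
x = np.array([10.0, 20.0, 30.0, 40.0])

# Broadcasting multiplies every indicator column by the numerical column.
interaction = onehot * x[:, None]

print(interaction)
```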
- class feature_encoders.encode.ICatSplineEncoder(*, encoder_cat: feature_encoders.encode._encoders.CategoricalEncoder, encoder_num: feature_encoders.encode._encoders.SplineEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between one categorical and one spline-encoded numerical feature.
- Parameters
encoder_cat (CategoricalEncoder) – The encoder for the categorical feature. It must encode features in a one-hot form.
encoder_num (SplineEncoder) – The encoder for the numerical feature.
- Raises
ValueError – If encoder_cat is not a CategoricalEncoder.
ValueError – If encoder_num is not a SplineEncoder.
ValueError – If encoder_cat is not encoded as one-hot.
Note
If the categorical encoder is already fitted, it will not be re-fitted during fit or fit_transform. The numerical encoder will always be (re)fitted (one encoder per level of the categorical feature).
- class feature_encoders.encode.ISplineEncoder(encoder_left: feature_encoders.encode._encoders.SplineEncoder, encoder_right: feature_encoders.encode._encoders.SplineEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between two spline-encoded numerical features.
- Parameters
encoder_left (SplineEncoder) – The encoder for the first of the two features.
encoder_right (SplineEncoder) – The encoder for the second of the two features.
- Raises
ValueError – If either of the two encoders is not a SplineEncoder.
Note
If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
- class feature_encoders.encode.IdentityEncoder(feature=None, as_filter=False, include_bias=False)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Create an encoder that returns what it is fed.
This encoder can act as a linear feature encoder.
- Parameters
feature (str or list of str, optional) – The name(s) of the input dataframe’s column(s) to return. If None, the whole input dataframe will be returned. Defaults to None.
as_filter (bool, optional) – If True, the encoder will return all feature labels for which “feature in label == True”. Defaults to False.
include_bias (bool, optional) – If True, a column of ones is added to the output. Defaults to False.
- Raises
ValueError – If as_filter is True and feature includes multiple feature names.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
Fitted encoder.
- Return type
IdentityEncoder
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If include_bias is True and a column with constant values already exists in the returned columns.
- Returns
The selected column subset as a numpy array.
- Return type
numpy array of shape
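The column selection described by feature, as_filter and include_bias can be sketched as below. A plain dict of numpy arrays stands in for the input dataframe, the function name `select` is hypothetical, and appending the column of ones at the end is an assumption.

```python
import numpy as np

# Stand-in for a dataframe: column label -> values.
columns = {
    "temp_in": np.array([20.0, 21.0, 19.5]),
    "temp_out": np.array([5.0, 7.0, 4.0]),
    "humidity": np.array([0.4, 0.5, 0.45]),
}


def select(columns, feature=None, as_filter=False, include_bias=False):
    """Pick columns by name or by substring match, optionally adding a bias column."""
    if feature is None:
        labels = list(columns)
    elif as_filter:
        # Keep every label for which "feature in label" is True.
        labels = [label for label in columns if feature in label]
    else:
        labels = [feature] if isinstance(feature, str) else list(feature)
    out = np.column_stack([columns[label] for label in labels])
    if include_bias:
        # Position of the bias column is an assumption of this sketch.
        out = np.column_stack([out, np.ones(len(out))])
    return labels, out


labels, out = select(columns, "temp", as_filter=True, include_bias=True)
print(labels)
print(out)
```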
- class feature_encoders.encode.ProductEncoder(encoder_left: feature_encoders.encode._encoders.IdentityEncoder, encoder_right: feature_encoders.encode._encoders.IdentityEncoder)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode the interaction between two linear numerical features.
- Parameters
encoder_left (IdentityEncoder) – The encoder for the first of the two features.
encoder_right (IdentityEncoder) – The encoder for the second of the two features.
- Raises
ValueError – If either of the two encoders is not an IdentityEncoder.
Note
If one or both of the encoders is already fitted, it will not be re-fitted during fit or fit_transform.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Raises
ValueError – If any of the two encoders is not a single-feature encoder.
- Returns
Fitted encoder.
- Return type
ProductEncoder
- class feature_encoders.encode.SafeOneHotEncoder(feature=None, unknown_value=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode categorical features in a one-hot form.
The encoder uses a SafeOrdinalEncoder to first encode the feature as an integer array and then a sklearn.preprocessing.OneHotEncoder to encode the features as a one-hot array.
- Parameters
feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
SafeOneHotEncoder
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The encoded column subset as a numpy array.
- Return type
numpy array of shape
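The two-stage idea (integer codes first, one-hot second) can be illustrated with numpy alone. This is a sketch of the concept, not the encoder's implementation; the sample values are made up, and the alphabetical category order comes from `np.unique`, not from the library.

```python
import numpy as np

values = np.array(["red", "blue", "green", "blue"])

# Stage 1: map each category to an integer code (sorted order: blue, green, red).
levels, codes = np.unique(values, return_inverse=True)

# Stage 2: expand the codes into indicator columns, one per category.
onehot = np.eye(len(levels), dtype=int)[codes]

print(codes)
print(onehot)
```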
- class feature_encoders.encode.SafeOrdinalEncoder(feature=None, unknown_value=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode categorical features as an integer array.
The encoder converts the features into ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
- Parameters
feature (str or list of str, optional) – The names of the columns to encode. If None, all categorical columns will be encoded. Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value for unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. During transform, unknown categories will be replaced using the most frequent value along each column. Defaults to None.
- fit(X: pandas.core.frame.DataFrame, y=None)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (None, optional) – Ignored. Defaults to None.
- Returns
Fitted encoder.
- Return type
SafeOrdinalEncoder
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
The encoded column subset as a numpy array.
- Return type
numpy array of shape
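The handling of unknown categories described for unknown_value (mark them first, then replace them with the most frequent value seen during fit) can be sketched in plain Python. The helper names are hypothetical, and assigning codes in sorted order is an assumption.

```python
from collections import Counter


def fit_ordinal(values):
    """Learn a category -> code mapping and the code of the most frequent category."""
    categories = sorted(set(values))  # sorted order is an assumption
    mapping = {cat: code for code, cat in enumerate(categories)}
    most_frequent = Counter(values).most_common(1)[0][0]
    return mapping, mapping[most_frequent]


def transform_ordinal(values, mapping, fill_code, unknown_value=-1):
    """Encode values; unseen categories get unknown_value, then the most frequent code."""
    codes = [mapping.get(v, unknown_value) for v in values]
    return [fill_code if c == unknown_value else c for c in codes]


mapping, fill_code = fit_ordinal(["b", "a", "b", "c"])
print(transform_ordinal(["a", "z", "c"], mapping, fill_code))
```

Here "z" was never seen during fitting, so it ends up with the code of "b", the most frequent training category.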
- class feature_encoders.encode.SplineEncoder(*, feature, n_knots=5, degree=3, strategy='uniform', extrapolation='constant', include_bias=True, order='C')[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Generate univariate B-spline bases for features.
The encoder generates a matrix consisting of n_splines = n_knots + degree - 1 spline basis functions (B-splines) of polynomial order=degree for the given feature.
- Parameters
feature (str) – The name of the column to encode.
n_knots (int, optional) – Number of knots of the splines if strategy equals one of {‘uniform’, ‘quantile’}. Must be larger than or equal to 2. Ignored if strategy is array-like. Defaults to 5.
degree (int, optional) – The polynomial degree of the spline basis. Must be a non-negative integer. Defaults to 3.
strategy ({'uniform', 'quantile'} or array-like of shape (n_knots, n_features), optional) –
Set knot positions such that first knot <= features <= last knot.
If ‘uniform’, n_knots number of knots are distributed uniformly from min to max values of the features (each bin has the same width)
If ‘quantile’, they are distributed uniformly along the quantiles of the features (each bin has the same number of observations)
If an array-like is given, it directly specifies the sorted knot positions including the boundary knots. Note that, internally, degree number of knots are added before the first knot, the same after the last knot
Defaults to “uniform”.
extrapolation ({'error', 'constant', 'linear', 'continue'}, optional) – If ‘error’, values outside the min and max values of the training features raise a ValueError. If ‘constant’, the value of the splines at minimum and maximum value of the features is used as constant extrapolation. If ‘linear’, a linear extrapolation is used. If ‘continue’, the splines are extrapolated as is, option extrapolate=True in scipy.interpolate.BSpline. Defaults to “constant”.
include_bias (bool, optional) – If False, then the last spline element inside the data range of a feature is dropped. As B-splines sum to one over the spline basis functions for each data point, they implicitly include a bias term. Defaults to True.
order ({'C', 'F'}, optional) – Order of output array. ‘F’ order is faster to compute, but may slow down subsequent estimators. Defaults to “C”.
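The count n_splines = n_knots + degree - 1 follows from the knot construction described above: degree extra knots are added on each side of the base knots, and a B-spline basis of a given degree over m knots has m - degree - 1 functions. A minimal numpy check, with a hypothetical helper name and uniform knots on [0, 1] assumed:

```python
import numpy as np


def n_spline_bases(n_knots, degree):
    """Count B-spline basis functions for uniformly spaced knots on [0, 1]."""
    base_knots = np.linspace(0.0, 1.0, n_knots)
    step = base_knots[1] - base_knots[0]
    # Add `degree` knots before the first and after the last base knot.
    before = base_knots[0] - step * np.arange(degree, 0, -1)
    after = base_knots[-1] + step * np.arange(1, degree + 1)
    full = np.concatenate([before, base_knots, after])
    # A knot vector of length m supports m - degree - 1 basis functions.
    return len(full) - degree - 1


print(n_spline_bases(n_knots=5, degree=3))
```

With the defaults n_knots=5 and degree=3 the extended knot vector has 11 entries, giving 7 basis functions.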
- fit(X: pandas.core.frame.DataFrame, y=None, sample_weight=None)[source]
Fit the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The data to fit.
y (None, optional) – Ignored. Defaults to None.
sample_weight (array-like of shape (n_samples,), optional) – Individual weights for each sample. Used to calculate quantiles if strategy=”quantile”. For strategy=”uniform”, zero weighted observations are ignored for finding the min and max of X. Defaults to None.
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
- Returns
Fitted encoder.
- Return type
SplineEncoder
- class feature_encoders.encode.TargetClusterEncoder(*, feature, max_n_categories, stratify_by=None, excluded_categories=None, unknown_value=None, min_samples_leaf=5, max_features='auto', random_state=None)[source]
Bases:
sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Encode a categorical feature as clusters of the target’s values.
The purpose of this encoder is to reduce the cardinality of a categorical feature. This encoder does not replace unknown values with the most frequent one during transform. It just assigns them the value of unknown_value.
- Parameters
feature (str) – The name of the categorical feature to transform. This encoder operates on a single feature.
max_n_categories (int) – The maximum number of categories to produce.
stratify_by (str or list of str, optional) – If not None, the encoder will first stratify the categorical feature into groups that have similar values of the features in stratify_by, and then cluster based on the relationship between the categorical feature and the target. It is used only if the number of unique categories minus the excluded_categories is larger than max_n_categories. Defaults to None.
excluded_categories (str or list of str, optional) – The names of the categories to be excluded from the clustering process. These categories are left intact by the encoding process, so they cannot have the same values as the encoder’s results (the encoder acts as an OrdinalEncoder in the sense that the feature is converted into a column of integers 0 to n_categories - 1). Defaults to None.
unknown_value (int, optional) – This parameter will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If None, the value -1 is used. Defaults to None.
min_samples_leaf (int, optional) – The minimum number of samples required to be at a leaf node of the decision tree model that is used for stratifying the categorical feature if stratify_by is not None. The actual number that will be passed to the tree model is min_samples_leaf multiplied by the number of unique values in the categorical feature to transform. Defaults to 5.
max_features (int, float or {"auto", "sqrt", "log2"}, optional) –
The number of features that the decision tree considers when looking for the best split:
If int, then consider max_features features at each split of the decision tree
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split
If “auto”, then max_features=n_features
If “sqrt”, then max_features=sqrt(n_features)
If “log2”, then max_features=log2(n_features)
If None, then max_features=n_features
Defaults to “auto”.
random_state (int or RandomState instance, optional) – Controls the randomness of the decision tree estimator. To obtain a deterministic behaviour during its fitting, random_state has to be fixed to an integer. Defaults to None.
- fit(X: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame)[source]
Fit the encoder on the available data.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
y (pandas.DataFrame of shape (n_samples, 1)) – The target dataframe.
- Returns
Fitted encoder.
- Return type
TargetClusterEncoder
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
ValueError – If the encoder is applied on numerical (float) data.
ValueError – If any of the values in excluded_categories is not found in the input data.
ValueError – If the number of categories left after removing all in excluded_categories is not larger than max_n_categories.
- transform(X: pandas.core.frame.DataFrame)[source]
Apply the encoder.
- Parameters
X (pandas.DataFrame of shape (n_samples, n_features)) – The input dataframe.
- Returns
The encoded column subset as a numpy array.
- Return type
numpy array
- Raises
ValueError – If the input data does not pass the checks of utils.check_X.
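The general idea of reducing cardinality by grouping categories with similar target values can be illustrated as follows. The clustering method used by the library is not described in this section, so the sketch swaps in a deliberately simple stand-in: categories are sorted by mean target and split into max_n_categories equally sized groups. The function name and the data are hypothetical.

```python
import numpy as np


def cluster_by_target_mean(categories, target, max_n_categories):
    """Map each category to a cluster id based on its mean target value."""
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    levels = np.unique(categories)
    means = np.array([target[categories == lvl].mean() for lvl in levels])
    # Order categories by mean target, then cut into contiguous groups.
    ordered = levels[np.argsort(means)]
    groups = np.array_split(ordered, max_n_categories)
    return {lvl.item(): code for code, group in enumerate(groups) for lvl in group}


cats = ["a", "a", "b", "b", "c", "c", "d", "d"]
y = [1.0, 1.2, 9.0, 9.5, 1.1, 0.9, 10.0, 10.5]
mapping = cluster_by_target_mean(cats, y, max_n_categories=2)

# Unknown categories are not imputed; they simply receive unknown_value (-1 here).
encoded = [mapping.get(v, -1) for v in ["a", "d", "z"]]
print(mapping)
print(encoded)
```

The four original categories collapse into two clusters (low-target "a" and "c", high-target "b" and "d"), and the unseen category "z" is encoded as -1, in line with the behaviour described for unknown_value.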