explorica.data_quality
explorica.data_quality.data_preprocessing
Data preprocessing utilities for exploratory data analysis (EDA).
Functions
- get_missing(data, ascending=None, round_digits=None)
Return the number and proportion of missing (NaN) values per column.
- drop_missing(data, axis=0, threshold_pct=0.05, threshold_abs=None, verbose=False)
Drops rows or columns containing NaNs according to a specified threshold.
- get_constant_features(data, method="top_value_ratio", threshold=1.0, nan_policy="drop")
Identify constant and quasi-constant features based on the frequency of the most common value. Returns a DataFrame with columns: is_constant and top_value_ratio.
- get_categorical_features(data, threshold=30, **kwargs)
Identify categorical features in a dataset using dtype filtering and a uniqueness threshold.
- set_categorical(data, threshold=30, nan_policy="drop", verbose=False, **kwargs)
Convert eligible columns to Pandas category dtype for memory optimization and improved performance in certain operations.
Notes
All methods are implemented as @staticmethod, so the class does not maintain any state.
- explorica.data_quality.data_preprocessing.drop_missing(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], axis: int | None = 0, threshold_pct: float | None = 0.05, threshold_abs: int | None = None, verbose: bool | None = False) DataFrame[source]
Drop rows or columns containing NaNs according to a specified threshold.
This function removes rows (axis=0) or columns (axis=1) that contain NaN values in columns whose proportion of missing values is below (axis=0) or above (axis=1) the specified threshold. Threshold can be specified as a proportion (threshold_pct) or an absolute number (threshold_abs). Absolute threshold, if provided, overrides the proportion threshold.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- axisint, optional, default=0
Axis along which to remove NaNs:
0 : drop rows with NaNs in columns under the threshold,
1 : drop columns with NaNs above the threshold.
- threshold_pctfloat, optional, default=0.05
The maximum allowed proportion of NaNs for a feature to be retained.
When axis=0 (row-wise deletion): rows are removed if the proportion of NaNs in their columns exceeds this threshold.
When axis=1 (column-wise deletion): columns are removed if the proportion of NaNs exceeds this threshold. Ignored if threshold_abs is provided.
- threshold_absint, optional
The maximum allowed absolute number of NaNs for a feature to be retained.
When axis=0: rows are removed if the number of NaNs per column exceeds this threshold.
When axis=1: columns are removed if the number of NaNs exceeds this threshold.
Overrides threshold_pct if provided.
- verbosebool, optional, default=False
If True, logs detailed information about the operation including:
number of rows or columns removed,
columns affected,
original and resulting DataFrame shape.
- Returns:
- pd.DataFrame
DataFrame after dropping rows or columns according to the threshold.
- Raises:
- ValueError
If data has keys and they are not unique. If threshold_abs is not a non-negative integer. If threshold_abs is greater than the length of data. If threshold_pct is not in [0, 1]. If axis is not 0 or 1.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import drop_missing
>>> df = pd.DataFrame({"A": [1,2,3,4,5,np.nan],
...                    "B": [1,2,3,4,5,6],
...                    "C": [np.nan, 2, np.nan, np.nan, np.nan, np.nan]})
>>> # Only removes rows if NaN is less than 2 per feature
>>> print(drop_missing(df, axis=0, threshold_abs=2))
     A  B    C
0  1.0  1  NaN
1  2.0  2  2.0
2  3.0  3  NaN
3  4.0  4  NaN
4  5.0  5  NaN
>>> # Only removes columns if more than 20% of values per feature are NaN
>>> print(drop_missing(df, axis=1, threshold_pct=0.2))
     A  B
0  1.0  1
1  2.0  2
2  3.0  3
3  4.0  4
4  5.0  5
5  NaN  6
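The column-wise rule (axis=1) boils down to a per-column NaN proportion check. A minimal plain-pandas sketch of that idea follows; the helper name is hypothetical, and explorica's drop_missing adds validation, the absolute-threshold path, and logging on top of this.

```python
import numpy as np
import pandas as pd

def drop_cols_by_nan_pct(df: pd.DataFrame, threshold_pct: float = 0.05) -> pd.DataFrame:
    """Drop columns whose proportion of NaNs exceeds threshold_pct."""
    nan_pct = df.isna().mean()                    # per-column NaN proportion
    keep = nan_pct[nan_pct <= threshold_pct].index
    return df[keep]

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [1, 2, 3]})
# "A" has 1/3 missing values, which exceeds the 0.2 threshold
print(drop_cols_by_nan_pct(df, threshold_pct=0.2).columns.tolist())
```

The same reduction (`df.isna().mean()`) drives the `threshold_pct` semantics in both axis modes.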
- explorica.data_quality.data_preprocessing.get_categorical_features(data: Sequence[Any] | Sequence[Sequence[Any]] | Mapping[str, Sequence[Any]], threshold: int | Sequence[int] | Mapping[str, int] | None = 30, **kwargs) DataFrame[source]
Identify categorical features in a dataset.
Identifying categorical features is a necessary preprocessing step before applying encoding strategies or statistical tests that require knowledge of feature types. This function combines dtype-based filtering with a uniqueness threshold, and optionally flags binary and constant columns, providing a flexible single-pass audit of categorical structure in the dataset.
- Parameters:
- dataSequence[Any] | Sequence[Sequence[Any]] | Mapping[str, Sequence[Any]]
Input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- thresholdint, Sequence[int] or Mapping[str, int], optional, default=30
Maximum number of unique values allowed for a column to be considered categorical. If a mapping is provided, values are applied per column. Scalar values are broadcast to all columns; sequences or mappings are aligned by column name.
- sign_binbool, default=False
If True, append an is_binary column to the result, marking columns with exactly two unique values.
- sign_constbool, default=False
If True, append an is_constant column to the result, marking columns with only one unique value.
- include_numberbool, default=False
Include numeric (number) columns that satisfy the threshold.
- include_intbool, default=False
Include integer (int) columns that satisfy the threshold.
- include_strbool, default=False
Include string (object) columns that satisfy the threshold.
- include_boolbool, default=False
Include boolean columns.
- include_datetimebool, default=False
Include datetime columns.
- include_binbool, default=False
Include binary columns (exactly two unique values).
- include_constbool, default=False
Include constant columns (exactly one unique value).
- include_allbool, default=False
Disable dtype filtering; only threshold is applied.
- includeIterable[str], default={“object”}
Explicit set of dtype aliases to include (e.g. {“object”, “number”} or {“int”, “bin”}). The parameter has the highest priority among inclusion rules:
Explicit include argument (user-defined)
Flag parameters (e.g., include_int, include_str, etc.)
Default value {“object”}
If include is provided directly, all flags are ignored.
- nan_policystr | Literal[‘drop’, ‘raise’, ‘include’], default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- Returns:
- pd.DataFrame
DataFrame indexed by column names with:
categories_count : number of unique values in each column
is_category : flag for categorical columns
is_binary : (optional) flag for binary columns
is_constant : (optional) flag for constant columns
- Raises:
- ValueError
If input data contains duplicate column names or invalid nan_policy.
- TypeError
If threshold is not scalar, list, or mapping convertible to per-column limits.
Notes
The function supports combined filtering: first by unique value count (threshold), then by dtype matching.
Internal helper functions _filter_standard_dtypes, _filter_bin_const and _filter_categories provide modular filtering logic.
The original data are not modified.
Compatible with get_constant_features for constant detection.
Examples
>>> import pandas as pd
>>> import seaborn as sns
>>> from explorica.data_quality import get_categorical_features
>>> df = sns.load_dataset("titanic")
>>> # marks as a category string and integer columns
>>> # with 4 or fewer unique objects
>>> get_categorical_features(
...     df, threshold=4, include={"str", "int"})
             categories_count  is_category
survived                    2            1
pclass                      3            1
sex                         2            1
age                        63            0
sibsp                       4            1
parch                       4            1
fare                       93            0
embarked                    3            1
class                       3            0
who                         3            1
adult_male                  2            0
deck                        7            0
embark_town                 3            1
alive                       2            1
alone                       2            0
>>> df["constant_feature"] = 0
>>> # Additionally signs binary and constant features
>>> get_categorical_features(df, threshold=10, sign_bin=True, sign_const=True)
                  categories_count  is_category  is_binary  is_constant
survived                         2            0          1            0
pclass                           3            0          0            0
sex                              2            1          1            0
age                             63            0          0            0
sibsp                            4            0          0            0
parch                            4            0          0            0
fare                            93            0          0            0
embarked                         3            1          0            0
class                            3            0          0            0
who                              3            1          0            0
adult_male                       2            0          1            0
deck                             7            0          0            0
embark_town                      3            1          0            0
alive                            2            1          1            0
alone                            2            0          1            0
constant_feature                 1            0          0            1
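The core of the uniqueness-threshold rule can be sketched with plain pandas; the helper below is hypothetical and omits the dtype filtering, binary/constant flags, and nan_policy handling that explorica layers on top.

```python
import pandas as pd

def uniqueness_audit(df: pd.DataFrame, threshold: int = 30) -> pd.DataFrame:
    """Flag columns whose unique-value count is within the threshold."""
    counts = df.nunique()
    return pd.DataFrame({
        "categories_count": counts,
        "is_category": (counts <= threshold).astype(int),
    })

df = pd.DataFrame({"grade": ["A", "B", "A", "C"], "score": [91, 84, 90, 77]})
print(uniqueness_audit(df, threshold=3))
```

Here "grade" (3 unique values) passes the threshold while "score" (4 unique values) does not.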
- explorica.data_quality.data_preprocessing.get_constant_features(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], method: str = 'top_value_ratio', threshold: float | None = 1.0, nan_policy: str | Literal['drop', 'raise', 'include'] = 'drop') DataFrame[source]
Identify constant and quasi-constant features in the dataset.
Constant and quasi-constant features carry little to no predictive information and can negatively affect model training by introducing noise or causing numerical instability. This function supports multiple detection strategies: a ratio-based approach, a uniqueness-based metric, and Shannon entropy, allowing the threshold to be interpreted either as a dominance criterion or as an information-theoretic bound.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- methodstr, default ‘top_value_ratio’
Metric used to detect constant features:
“top_value_ratio”: proportion of the most frequent value.
“non_uniqueness”: 1 - number of unique values / total count.
“entropy”: Shannon entropy of the feature.
- thresholdfloat, default=1.0
Non-negative threshold value in the range [0, +∞). Decision boundary for each method:
For “top_value_ratio” or “non_uniqueness”: values >= threshold are flagged constant.
For “entropy”: values <= threshold are flagged constant.
- nan_policystr | Literal, default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- Returns:
- pd.DataFrame
A DataFrame indexed by column names with:
‘is_const’: bool flag if column is (quasi-)constant
‘top_value_ratio’: proportion of the most frequent value
- Raises:
- ValueError
If an unsupported method or nan_policy is provided. If input contains duplicate column names. If threshold is negative.
Examples
>>> # Basic usage
>>> # Demonstrates a simple use case with the default ``top_value_ratio`` method,
>>> # which identifies constant or quasi-constant features based on the most
>>> # frequent value ratio.
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import get_constant_features
>>> data = [[1, 3, 3, 3, 3, 6], [1, 2, 3, 4, 5, 5]]
>>> print(get_constant_features(
...     data, method="top_value_ratio", threshold=0.5))
   top_value_ratio  is_const
0         0.666667       1.0
1         0.333333       0.0
>>> # Entropy-based threshold interpretation
>>> # Illustrates how an entropy threshold can be interpreted as a fraction of the
>>> # maximum information capacity (in bits) for each feature. This approach allows
>>> # defining thresholds relative to the diversity of feature values.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 6],
...                      "B": [0, 0, 0, 0, 0, 1, 1]})
>>> thresh = 0.7
>>> thresh_bits_dim = thresh * np.log2(data.nunique())
>>> print(thresh_bits_dim)
A    1.809474
B    0.700000
dtype: float64
>>> get_constant_features(
...     data, method="entropy", threshold=thresh_bits_dim.mean())
    entropy  is_const
A  2.521641       0.0
B  0.863121       1.0
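The three detection metrics can be hand-computed for a single feature to make the threshold semantics concrete. This is a plain numpy/pandas sketch, not explorica's internal code.

```python
import numpy as np
import pandas as pd

x = pd.Series([3, 3, 3, 3, 1, 6])
freqs = x.value_counts(normalize=True)     # sorted descending by frequency

top_value_ratio = freqs.iloc[0]            # share of the dominant value (4/6)
non_uniqueness = 1 - x.nunique() / len(x)  # 1 - unique count / total count
entropy = -(freqs * np.log2(freqs)).sum()  # Shannon entropy in bits

print(top_value_ratio, non_uniqueness, entropy)
```

With threshold=0.5, top_value_ratio (≈0.667) and non_uniqueness (0.5) would both flag this feature as quasi-constant, while the entropy method flags it only if the threshold is set at or above its entropy.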
- explorica.data_quality.data_preprocessing.get_missing(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], ascending=None, round_digits=None) DataFrame[source]
Calculate the number and percentage of missing (NaN) values for each column.
Identifying missing values is typically one of the first steps in exploratory data analysis, as their presence and distribution can significantly affect downstream modeling and analysis. This function provides a concise per-column summary of missing value counts and their relative proportions.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- ascendingbool, optional
If specified, sorts the result by the count_of_nans column.
If True, sorts in ascending order (fewest missing values first).
If False, sorts in descending order (most missing values first).
If None (default), no sorting is performed.
- round_digitsint, optional
Number of decimal places to round the pct_of_nans values to.
Must be a non-negative integer (x >= 0).
If None (default), no rounding is applied.
- Returns:
- pd.DataFrame
A DataFrame with the following columns:
count_of_nans : int Number of NaN values in each column.
pct_of_nans : float Proportion of NaN values in each column (0.0 to 1.0).
- Raises:
- ValueError
If data has keys and they are not unique. If round_digits is not a non-negative integer.
Notes
The pct_of_nans values are calculated as the fraction of missing values relative to the total number of rows in the dataset.
Useful for quickly identifying columns with high proportions of missing data before applying data cleaning or imputation.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import get_missing
>>> # Simple usage
>>> df = pd.DataFrame({"A": [1, 2, pd.NA, np.nan, 5, 6, 7],
...                    "B": [7, None, 5, 4, 3, 2, 1]})
>>> get_missing(df, round_digits=4)
   count_of_nans  pct_of_nans
A              2       0.2857
B              1       0.1429
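The per-column summary reduces to two pandas reductions; a minimal sketch follows (the helper name is hypothetical, and explorica adds sorting, rounding, and validation on top).

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and proportion of NaNs per column, as in the Notes above."""
    nans = df.isna().sum()               # count of NaNs per column
    return pd.DataFrame({
        "count_of_nans": nans,
        "pct_of_nans": nans / len(df),   # fraction relative to total row count
    })

df = pd.DataFrame({"A": [1, np.nan, np.nan, 4], "B": [1, 2, 3, 4]})
print(missing_summary(df))
```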
- explorica.data_quality.data_preprocessing.set_categorical(data: Sequence[Any] | Sequence[Sequence[Any]] | Mapping[str, Sequence[Any]], threshold: int | Sequence[int] | Mapping[str, int] | None = 30, nan_policy: str | Literal['drop', 'raise', 'include'] = 'drop', verbose: bool | None = False, **kwargs) DataFrame[source]
Convert eligible columns to Pandas category dtype.
Useful for memory optimization and improved performance in certain operations.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Input data. Can be 1D, 2D (sequence of sequences), or a mapping of column names to sequences.
- thresholdint, Sequence[int] or Mapping[str, int], optional, default=30
Maximum number of unique values allowed for a column to be considered categorical. If a mapping is provided, values are applied per column. Scalar values are broadcast to all columns; sequences or mappings are aligned by column name.
- include_numberbool, default=False
Include numeric (number) columns that satisfy the threshold.
- include_intbool, default=False
Include integer (int) columns that satisfy the threshold.
- include_strbool, default=False
Include string (object) columns that satisfy the threshold.
- include_boolbool, default=False
Include boolean columns.
- include_datetimebool, default=False
Include datetime columns.
- include_binbool, default=False
Include binary columns (exactly two unique values).
- include_constbool, default=False
Include constant columns (exactly one unique value).
- include_allbool, default=False
Disable dtype filtering; only threshold is applied.
- includeIterable[str], default={“object”}
Explicit set of dtype aliases to include (e.g. {“object”, “number”} or {“int”, “bin”}). The parameter has the highest priority among inclusion rules:
Explicit include argument (user-defined)
Flag parameters (e.g., include_int, include_str, etc.)
Default value {“object”}
If include is provided directly, all flags are ignored.
- nan_policystr | Literal[‘drop’, ‘raise’, ‘include’], default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- verbosebool, optional, default=False
If True, logs detailed information about the operation including:
count and names of affected columns.
- Returns:
- pd.DataFrame
A copy of the original DataFrame with selected columns converted to category dtype.
- Raises:
- Exception
Propagates exceptions from get_categorical_features for parameter validation errors. See get_categorical_features documentation for specific error conditions.
Notes
Converting to category can significantly reduce memory usage, especially for string/object columns with many repeated values.
category stores integer codes (int8/int16) and a category mapping, making comparisons and filtering faster than for object dtype.
For numeric columns, memory savings may be smaller, but grouping and filtering can still be faster.
The original DataFrame is not modified - a copy is returned.
Examples
>>> # Basic usage example
>>> import pandas as pd
>>> import seaborn as sns
>>> from explorica.data_quality import set_categorical
>>> df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8],
...                    "B": ["A", "A", "B", "C", "A", "B", "C", "A"],
...                    "C": [1, 0, 1, 0, 1, 1, 1, 1]})
>>> df = set_categorical(df, include_bin=True, include_str=True)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       8 non-null      int64
 1   B       8 non-null      category
 2   C       8 non-null      category
dtypes: category(2), int64(1)
memory usage: 468.0 bytes
>>> # Memory usage reducing example
>>> df = sns.load_dataset('titanic')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
>>> set_categorical(df).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    category
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    category
 8   class        891 non-null    category
 9   who          891 non-null    category
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    category
 13  alive        891 non-null    category
 14  alone        891 non-null    bool
dtypes: bool(2), category(7), float64(2), int64(4)
memory usage: 50.8 KB
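The memory effect described in the Notes can be seen with plain pandas alone, independent of explorica: category stores small integer codes plus one copy of each label, so a repetitive object column shrinks substantially.

```python
import pandas as pd

# object dtype stores every string per row; category stores 3 labels + int8 codes
s_object = pd.Series(["red", "green", "blue"] * 1000)
s_category = s_object.astype("category")

obj_bytes = s_object.memory_usage(deep=True)
cat_bytes = s_category.memory_usage(deep=True)
print(obj_bytes, cat_bytes)   # category is far smaller for this column
```

The savings shrink as the ratio of unique values to rows grows, which is why the threshold parameter gates the conversion.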
explorica.data_quality.feature_engineering
Module for feature engineering on numeric and categorical data.
This module provides utilities for fast and flexible feature transformation. It is focused on common preprocessing tasks: frequency encoding, ordinal encoding and discretization (binning) of continuous variables. Implementations accept pandas Series/DataFrame, NumPy arrays, Python sequences and mappings (dict-like inputs).
Functions
- freq_encode(data, axis=0, normalize=True, round_digits=None)
Performs frequency encoding on a categorical feature(s).
- ordinal_encode(data, axis=0, order_method=”frequency”, order_ascending=False, **kwargs)
Encode categorical values with ordinal integers based on a specified ordering rule.
- discretize_continuous(data, bins=None, binning_method=”uniform”, intervals=”pandas”)
Discretize continuous numeric data into categorical intervals.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import freq_encode
>>> # Simple encoder usage
>>> df = pd.DataFrame({
... "color": ["red", "blue", "red", "green", "blue", "red"],
... "shape": ["circle", "square", "circle", "triangle", "square", "circle"]
... })
>>> encoded = freq_encode(df, round_digits=4)
>>> print(encoded)
color shape
0 0.5000 0.5000
1 0.3333 0.3333
2 0.5000 0.5000
3 0.1667 0.1667
4 0.3333 0.3333
5 0.5000 0.5000
- explorica.data_quality.feature_engineering.discretize_continuous(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], bins: int = None, binning_method: Literal['uniform', 'quantile'] = 'uniform', intervals: str | Sequence = 'pandas') Series | DataFrame[source]
Discretize continuous numeric data into categorical intervals.
Discretization converts continuous numeric features into ordered categorical intervals, which can improve interpretability, reduce the effect of outliers, and serve as a preprocessing step for models that benefit from categorical inputs. The function supports both uniform and quantile-based binning strategies, with flexible control over bin count and interval labeling on a per-column basis.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- binsint | Sequence[int] | Mapping[str, int], default=None
Number of discrete bins (intervals) to split each numeric feature into.
int — applies the same number of bins to all columns. Example: bins=5 is equivalent to bins={'col1': 5, 'col2': 5, ..., 'colN': 5}.
Sequence[int] — specifies an individual number of bins for each column, in order of appearance. The sequence length must match the number of columns. Example: bins=[3, 2, 3] ≡ bins={'col1': 3, 'col2': 2, 'col3': 3}.
Mapping[str, int] — explicit per-column specification. Keys must exactly match the column names present in data. Missing or extra keys will raise KeyError.
The number of bins must be a positive integer for every column. If not provided, the number of bins is automatically estimated using Sturges' formula:
\[k = 1 + 3.322\log_{10}(n)\]
where n is the number of samples per column. Priority of bin determination:
If intervals is a sequence of custom labels, its length defines the number of bins (even if bins is specified).
Otherwise, bins is used as provided.
If neither bins nor intervals defines the bin count, Sturges' rule is applied.
- Returns:
- pd.Series or pd.DataFrame
Categorical representation of the binned data. Returns a Series for a single-column input, or a DataFrame for multi-column input.
- Raises:
- ValueError
If binning_method or intervals is unsupported. If intervals is a sequence and its length does not match the bin count. If data contains NaN values. If intervals is a sequence and contains NaN values. If bins is negative or not an integer.
- KeyError
If the keys of bins or intervals (when provided as a mapping) do not match the column names of the input data. Also raised if intervals or bins are provided as sequences whose lengths do not correspond to the number of input features.
- Warns:
- UserWarning
If the number of bins specified for a feature exceeds the number of its unique values. In this case, the number of bins will be automatically reduced to n_unique - 1 for the corresponding column. A warning message will inform the user of this adjustment.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import discretize_continuous
>>> # Simple usage
>>> df = pd.DataFrame({"f1": np.linspace(0, 1000, 100),
...                    "f2": np.linspace(0, 2150, 100)})
>>> discretize_continuous(df, bins=[10, 15])
                 f1                  f2
0   (-1.001, 100.0]   (-2.151, 143.333]
1   (-1.001, 100.0]   (-2.151, 143.333]
2   (-1.001, 100.0]   (-2.151, 143.333]
3   (-1.001, 100.0]   (-2.151, 143.333]
4   (-1.001, 100.0]   (-2.151, 143.333]
..              ...                 ...
95  (900.0, 1000.0]  (2006.667, 2150.0]
96  (900.0, 1000.0]  (2006.667, 2150.0]
97  (900.0, 1000.0]  (2006.667, 2150.0]
98  (900.0, 1000.0]  (2006.667, 2150.0]
99  (900.0, 1000.0]  (2006.667, 2150.0]
[100 rows x 2 columns]
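When bins is omitted, the docstring says Sturges' formula supplies the estimate. A plain numpy/pandas sketch of that default follows; the ceil-rounding to an integer is an assumption here, not confirmed from explorica's source.

```python
import numpy as np
import pandas as pd

def sturges_bins(n: int) -> int:
    """Sturges' estimate k = 1 + 3.322*log10(n); ceil-rounding is an assumption."""
    return int(np.ceil(1 + 3.322 * np.log10(n)))

x = pd.Series(np.linspace(0, 1000, 100))
k = sturges_bins(len(x))        # 100 samples -> 8 bins
binned = pd.cut(x, bins=k)      # uniform-width bins, like binning_method="uniform"
print(k, binned.nunique())
```

For quantile-based binning the analogous pandas primitive would be pd.qcut, which equalizes bin populations rather than bin widths.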
- explorica.data_quality.feature_engineering.freq_encode(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], axis: int = 0, normalize: bool = True, round_digits: int = None) Series | DataFrame[source]
Perform frequency encoding on a categorical feature(s).
Frequency encoding replaces each category with its frequency of occurrence in the data. This is particularly useful as a preprocessing step for machine learning models that require numerical input, as it preserves information about the distribution of categorical values without introducing arbitrary ordinal relationships.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- axisint, {0, 1}, default 0
Applicable only if input is 2D:
0: encode each column independently (column-wise), returns pd.DataFrame.
1: encode each row based on the combination of column values (row-wise), returns pd.Series.
Ignored if input is 1D.
- normalizebool, default=True
If True, encodes as relative frequency (proportion), otherwise as absolute count.
- round_digitsint, optional
Number of decimal digits to round the encoded frequencies to. Applicable only when normalize=True. If None (default), no rounding is performed. Must be a non-negative integer.
- Returns:
- pd.Series or pd.DataFrame
Frequency-encoded feature(s). Returns Series for row-wise encoding or 1D input, DataFrame for column-wise encoding of multiple features.
- Raises:
- ValueError
If input contains NaNs. If axis is not 0 or 1. If round_digits is negative or not an integer.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import freq_encode
>>> # Simple usage
>>> dataset = pd.DataFrame({
...     "groups1": ["A", "A", "A", "B", "B", "C"],
...     "groups2": ["D", "D", "D", "E", "F", "G"]
... })
>>> dataset = freq_encode(dataset, round_digits=4)
>>> dataset
   groups1  groups2
0   0.5000   0.5000
1   0.5000   0.5000
2   0.5000   0.5000
3   0.3333   0.1667
4   0.3333   0.1667
5   0.1667   0.1667
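Column-wise frequency encoding is essentially value_counts followed by map. This plain-pandas sketch illustrates the idea for one column; it is not explorica's implementation, which adds axis handling, rounding, and validation.

```python
import pandas as pd

s = pd.Series(["A", "A", "A", "B", "B", "C"])
freqs = s.value_counts(normalize=True)   # relative frequency per category
encoded = s.map(freqs)                   # replace each label with its frequency

print(encoded.round(4).tolist())
```

With normalize=False the same pattern would substitute absolute counts (`s.value_counts()`) instead of proportions.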
- explorica.data_quality.feature_engineering.ordinal_encode(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], axis: int = 0, order_method: str = 'frequency', order_ascending: bool = False, **kwargs) Series | DataFrame[source]
Encode categorical values with ordinal integers.
This method converts categorical data into integer-encoded representations according to the chosen ordering strategy. Supported strategies include frequency-based, alphabetical, or target-based orderings (using mean, median, or mode of a provided reference variable).
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- axisint, {0, 1}, default 0
Applicable only if input is 2D:
- 0: encode each column independently (column-wise).
Returns a DataFrame if multiple columns are provided.
- 1: encode each row based on the combination of column values (row-wise).
Always returns a Series.
Ignored if input is 1D.
- order_methodstr | Literal, default “frequency”
Ordering rule to determine integer assignment:
“frequency” : order by category frequency.
“alphabetical” : order alphabetically by category label.
“mean” : order by the mean of corresponding order_by values per category.
“median” : order by the median of corresponding order_by values per category.
“mode” : order by the most frequent corresponding order_by value per category.
- order_ascendingbool, default False
Whether to assign integers in ascending (True) or descending (False) order.
- order_bySequence, pandas.Series, pandas.DataFrame, or Mapping, optional
Numerical data to use for computing central tendency measures (mean, median, mode). Required when order_method is one of {“mean”, “median”, “mode”}. Must be aligned by shape with data.
- offsetint, default 0
The starting integer value for encoding categories. Each group label is incremented by this offset, so setting offset=1 makes encoded values start from 1 instead of 0.
- Returns:
- pandas.Series or pandas.DataFrame
Encoded data, where each unique category or category combination is replaced by an integer reflecting its relative order:
If axis=1, returns a Series with encoded values per row.
If axis=0 and multiple columns were passed, returns a DataFrame where each column is encoded independently.
If axis=0 and a single column was passed, returns a Series.
- Raises:
- ValueError:
If input contains NaNs. If the provided order_method is not supported. If order_by is missing when required. If data and order_by have mismatched lengths.
Notes
When order_method is "mode", if multiple modes exist within a group, the first encountered mode is used for ordering (tie-breaking is deterministic).
Examples
>>> import pandas as pd
>>> from explorica.data_quality import ordinal_encode
>>> # Simple usage
>>> df = pd.DataFrame({"category_1": ["A", "B", "C", "A", "A", "B", "A"]})
>>> ordinal_encode(
...     df, order_method="abc", order_ascending=True, offset=1)
0    1
1    2
2    3
3    1
4    1
5    2
6    1
dtype: int64
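The frequency ordering rule can be sketched with plain pandas: rank categories by count (descending, matching order_ascending=False) and map each to its rank. This is an illustration of the strategy, not explorica's source; the offset parameter would simply shift the resulting integers.

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A", "A", "B", "A"])
order = s.value_counts().index           # A (4), B (2), C (1): descending frequency
codes = {cat: i for i, cat in enumerate(order)}

print(s.map(codes).tolist())             # most frequent category gets 0
```

Target-based variants ("mean", "median", "mode") would instead sort categories by a groupby aggregation of the order_by values before assigning the integers.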
explorica.data_quality.information_metrics
Module for information-theoretic metrics used in data quality assessment.
This module provides utilities for quantifying uncertainty, variability, and information content in datasets. Currently, it implements Shannon entropy as a measure of feature uncertainty. Future extensions may include divergence measures (e.g., KL divergence) and cross-entropy for comparing distributions.
Classes
- InformationMetrics
Provides static methods for computing information metrics such as Shannon entropy.
Notes
Currently, only Shannon entropy is implemented. Other metrics may be added in future releases.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import get_entropy
>>>
>>> data = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [1, 1, 1, 2, 2]})
>>> get_entropy(data)
A 2.321928
B 0.970951
dtype: float64
- explorica.data_quality.information_metrics.get_entropy(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], method: str = 'shannon', nan_policy: Literal['drop', 'raise', 'include'] = 'drop') float | Series[source]
Compute the Shannon entropy of the input data.
Shannon entropy is a measure of uncertainty or randomness in a dataset. For a single feature, it quantifies how evenly the values are distributed. Lower values indicate more predictability (potentially constant or quasi-constant features), while higher values indicate more variability or diversity.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- methodstr, default=”shannon”
Entropy calculation method. Currently, only “shannon” is supported. Other methods (e.g., differential entropy) may be added in future releases. Entropy is calculated as:
\[H(x) = - \sum_i w_i \log_2(w_i)\]
where w_i is the relative frequency of each unique element of the sample x.
- nan_policy{‘drop’, ‘raise’, ‘include’}, default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- Returns:
- float or pd.Series
If input is 1D, returns a float representing the Shannon entropy.
If input is 2D or dict, returns a pd.Series indexed by column names.
- Raises:
- ValueError
If column names are not unique (in case of dict or DataFrame input).
If method is not supported.
Notes
When nan_policy='include', NaN values are counted as a distinct category in the computation.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import get_entropy
>>> # Simple usage
>>> data = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
>>> get_entropy(data)
A    1.0
B    2.0
dtype: float64
>>> data = [1, 1, 1, 1, 1, 1]
>>> get_entropy(data)
np.float64(0.0)
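As a cross-check of the formula above, here is a minimal Shannon-entropy sketch using plain pandas. This is an illustration of the definition, not explorica's actual implementation; the helper name shannon_entropy is made up.

```python
import math
import pandas as pd

def shannon_entropy(values):
    """Shannon entropy (in bits) of a 1D sequence; NaNs are dropped."""
    counts = pd.Series(values).value_counts(dropna=True)
    w = counts / counts.sum()  # relative frequencies w_i
    return float(sum(-p * math.log2(p) for p in w))

print(shannon_entropy([1, 2, 3, 4, 5]))     # 2.321928... (5 equally likely values)
print(shannon_entropy([1, 1, 1, 2, 2]))     # 0.970951... (uneven distribution)
print(shannon_entropy([1, 1, 1, 1, 1, 1]))  # 0.0 (constant feature)
```

The first two values match the module-level example for get_entropy above, and the constant feature yields zero entropy, which is what makes entropy useful for flagging quasi-constant columns.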
explorica.data_quality.summary
Data-quality summary utilities.
This module provides the get_summary function, which computes a full data-quality summary for a dataset, including metrics for missing values, duplicates, distribution, basic statistics, and multicollinearity. The summary can be returned as a pandas DataFrame with MultiIndex columns or as a JSON-serializable nested dictionary.
Functions
- get_summary(data, return_as=’dataframe’, auto_round=True, round_digits=4, **kwargs)
Compute a data-quality summary for a dataset. Supports saving the summary to CSV, Excel, or JSON, and can return either a pandas DataFrame or a JSON-friendly nested dictionary.
Notes
The module contains internal helper functions for get_summary, which are not intended for standalone use.
Saved JSON outputs are fully serializable, with NaNs converted to None and non-numeric metrics (like mode) converted to strings.
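The serialization rules above (NaN to None, non-numeric metrics to strings) can be sketched as follows. The helper name to_json_safe is hypothetical and not part of explorica; this only illustrates the conversion.

```python
import json
import math

def to_json_safe(value):
    """Cast a metric value to a JSON-safe type (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return None        # NaN -> null
    if isinstance(value, (int, float)):
        return value       # numeric values pass through
    return str(value)      # non-numeric metrics (e.g. mode) -> string

# Nested {group: {metric: {feature: value}}} structure, as in the summary dict
nested = {"stats": {"mean": {"x1": 2.0, "x2": float("nan")}}}
safe = {g: {m: {f: to_json_safe(v) for f, v in feats.items()}
            for m, feats in metrics.items()}
        for g, metrics in nested.items()}
print(json.dumps(safe))  # {"stats": {"mean": {"x1": 2.0, "x2": null}}}
```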
Examples
>>> # Minimal usage (DataFrame output)
>>> import pandas as pd
>>> from explorica.data_quality import get_summary
>>> df = pd.DataFrame({
... "x1": [1, 2, 3],
... "x2": [2, 4, 6]
... })
>>> summary = get_summary(df, return_as="dataframe")
>>> summary
nans ... multicollinearity
count_of_nans pct_of_nans ... is_multicollinearity VIF
x1 0 0.0 ... 1.0 inf
x2 0 0.0 ... 1.0 inf
[2 rows x 16 columns]
>>> # Saving summary as JSON (nested dict, JSON-friendly)
>>> summary_dict = get_summary(
... df,
... return_as="dict",
... directory="summary.json"
... )
>>> # The JSON file summary.json is saved in the current directory.
>>> summary_dict["duplicates"]
{'count_of_unique': {'x1': 3.0, 'x2': 3.0},
'pct_of_unique': {'x1': 1.0, 'x2': 1.0},
'quasi_constant_pct': {'x1': 0.3333, 'x2': 0.3333}}
>>> # Verbose logging (optional)
>>> summary_verbose = get_summary(df, verbose=True)
>>> # verbose=True will log computation steps but does not affect returned object
>>> summary_verbose[["nans", "duplicates"]]
nans duplicates
count_of_nans pct_of_nans count_of_unique pct_of_unique quasi_constant_pct
x1 0 0.0 3 1.0 0.3333
x2 0 0.0 3 1.0 0.3333
- explorica.data_quality.summary.get_summary(data: Sequence[Sequence], return_as: str = 'dataframe', auto_round=True, round_digits=4, **kwargs) DataFrame | dict[source]
Compute a data-quality summary for a dataset.
The summary includes metrics for missing values, duplicates, distribution, basic statistics, and multicollinearity. The result can be returned as a pandas DataFrame with MultiIndex columns or as a JSON-serializable nested dict.
- Parameters:
- dataSequence | Sequence[Sequence] | Mapping[str, Sequence]
Input data. Can be 1D, 2D (sequence of sequences), or a mapping of column names to sequences.
- return_as{‘dataframe’, ‘dict’}, optional, default=’dataframe’
Output format. If 'dataframe' (default), returns a pandas.DataFrame with columns arranged as a MultiIndex (group, metric). If 'dict' or 'json', returns a nested Python dict (JSON-friendly) of the form: {"group": {"metric": {"feature": value, …}, …}, …}.
- auto_roundbool, optional, default=True
If True, numeric values are rounded for human-friendly output.
- round_digitsint, optional, default=4
Number of decimal digits used when auto_round=True.
- Returns:
- pd.DataFrame or dict
If return_as is ‘dataframe’ or ‘df’:
pd.DataFrame with MultiIndex columns (section, metric), index = feature names.
If return_as is ‘dict’ or ‘mapping’:
Nested dict of the form {section: {metric: {feature: value}}}, JSON-friendly with NaNs converted to None and non-numeric values serialized as strings.
Sections and metrics included:
nans
count_of_nans: number of missing values per feature
pct_of_nans: fraction of missing values per feature (0..1)
duplicates
count_of_unique: number of unique values per feature
pct_of_unique: fraction of unique values per feature (0..1)
quasi_constant_pct: top value ratio (quasi-constant score)
distribution
is_normal: 0/1 flag, distribution approximately normal if \(|\gamma_1| \leq 0.25\) and \(|\gamma_2| \leq 0.25\)
desc: qualitative description of distribution shape (“normal”, “left-skewed”, “right-skewed”, etc.)
skewness: sample skewness
kurtosis: sample excess kurtosis
stats
mean: mean value per feature
std: standard deviation per feature
median: median value per feature
mode: most frequent value per feature (original type)
count_of_modes: number of mode values found per feature
multicollinearity
VIF: Variance Inflation Factor (numeric features only)
is_multicollinearity: 0/1 flag if VIF ≥ threshold_vif (default=10)
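The VIF metric listed above can be approximated with a small NumPy sketch: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. This is an illustration of the definition (using ordinary least squares with an intercept), not explorica's actual implementation.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column of a 2D numeric array."""
    X = np.asarray(X, dtype=float)
    result = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on the other columns (plus an intercept)
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - np.var(y - A @ beta) / np.var(y)
        result.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return result

# Perfectly collinear columns (x2 = 2 * x1) give infinite VIF,
# matching the `inf` values in the module-level example above.
X = np.column_stack([np.arange(5.0), 2 * np.arange(5.0)])
print(vif(X))  # [inf, inf]
```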
- Other Parameters:
- nan_policy{‘drop’, ‘raise’, ‘include’}, default=’drop’
How to handle missing values during computations.
‘drop’ - missing values are removed before computing metrics.
‘raise’ - missing values cause an exception.
‘include’ - missing values are kept only for categorical metrics (e.g. quasi_constant_pct, mode), where NaN can be treated as a category. For numerical metrics, NaN values are still dropped, as their interpretation remains undefined.
- threshold_viffloat, default=10
VIF threshold for multicollinearity.
- directorystr, optional
Path to save the summary. Supports:
‘.csv’: saved as a CSV file with MultiIndex columns (section, metric). Can be reopened via pd.read_csv(directory, header=[0, 1], index_col=0).
‘.xlsx’: saved as an Excel file with MultiIndex columns, preserving grouping visually.
‘.json’: saved as a JSON file using nested dict format {section: {metric: {feature: value}}}, fully JSON-serializable.
Saving may raise FileExistsError if the target file already exists and overwrite=False, or PermissionError if access is denied.
- overwritebool, default=True
Whether to overwrite an existing file when saving the summary.
True (default): existing files will be overwritten without error.
False: if the target file already exists, a FileExistsError is raised.
- verbosebool, optional
Enable info-level logging.
- Raises:
- ValueError
If return_as is not a supported value, or if data contains NaNs and nan_policy is 'raise'.
Notes
VIF is computed for numeric features only; non-numeric features will not have VIF values.
When saving to JSON/nested dict, numeric values (e.g. NumPy scalars) are cast to native Python int/float, and non-numeric values (e.g. mode) are stored as strings to guarantee JSON safety.
In DataFrame/Excel outputs, original types are preserved.
In CSV outputs, values are stringified implicitly by pandas during to_csv().
CSV output uses flattened column names joined by ':'; this improves portability but loses the MultiIndex structure. To read a flattened CSV back and restore the MultiIndex, split the column names again: df.columns = df.columns.str.split(':', expand=True).
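The flatten-and-restore round trip can be sketched as follows. The column names here are made up for illustration, and an in-memory buffer stands in for a real CSV file.

```python
import io
import pandas as pd

# A tiny summary-like frame with MultiIndex (section, metric) columns
cols = pd.MultiIndex.from_tuples(
    [("nans", "count_of_nans"), ("nans", "pct_of_nans")]
)
summary = pd.DataFrame([[0, 0.0], [1, 0.5]], index=["x1", "x2"], columns=cols)

# Flatten columns the way the CSV output does: join levels with ':'
flat = summary.copy()
flat.columns = [":".join(pair) for pair in summary.columns]
buf = io.StringIO()
flat.to_csv(buf)
buf.seek(0)

# Read back and restore the MultiIndex by splitting on ':'
# (Index.str.split with expand=True returns a MultiIndex)
restored = pd.read_csv(buf, index_col=0)
restored.columns = restored.columns.str.split(":", expand=True)
print(restored.columns.tolist())  # [('nans', 'count_of_nans'), ('nans', 'pct_of_nans')]
```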