explorica.data_quality
explorica.data_quality.data_preprocessing
Data preprocessing utilities for exploratory data analysis (EDA).
Functions
- get_missing(data, ascending=None, round_digits=None)
Return the number and proportion of missing (NaN) values per column.
- drop_missing(data, axis=0, threshold_pct=0.05, threshold_abs=None, verbose=False)
Drops rows or columns containing NaNs according to a specified threshold.
- get_constant_features(data, method="top_value_ratio", threshold=1.0, nan_policy="drop")
Identify constant and quasi-constant features based on the frequency of the most common value. Returns a DataFrame with columns: is_constant and top_value_ratio.
- get_categorical_features(data, threshold=30, **kwargs)
Identify categorical features in a dataset using dtype filtering and a uniqueness threshold.
- set_categorical(data, threshold=30, nan_policy="drop", verbose=False, **kwargs)
Convert eligible columns to Pandas category dtype for memory optimization and improved performance in certain operations.
Notes
All methods are implemented as @staticmethod, so the class does not maintain any state.
- explorica.data_quality.data_preprocessing.drop_missing(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], axis: int | None = 0, threshold_pct: float | None = 0.05, threshold_abs: int | None = None, verbose: bool | None = False) DataFrame[source]
Drop rows or columns containing NaNs according to a specified threshold.
This function removes rows (axis=0) or columns (axis=1) that contain NaN values in columns whose proportion of missing values is below (axis=0) or above (axis=1) the specified threshold. Threshold can be specified as a proportion (threshold_pct) or an absolute number (threshold_abs). Absolute threshold, if provided, overrides the proportion threshold.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- axisint, optional, default=0
Axis along which to remove NaNs:
0 : drop rows with NaNs in columns under the threshold,
1 : drop columns with NaNs above the threshold.
- threshold_pctfloat, optional, default=0.05
The maximum allowed proportion of NaNs for a feature to be retained.
When axis=0 (row-wise deletion): rows are removed if the proportion of NaNs in their columns exceeds this threshold.
When axis=1 (column-wise deletion): columns are removed if the proportion of NaNs exceeds this threshold. Ignored if threshold_abs is provided.
- threshold_absint, optional
The maximum allowed absolute number of NaNs for a feature to be retained.
When axis=0: rows are removed if the number of NaNs per column exceeds this threshold.
When axis=1: columns are removed if the number of NaNs exceeds this threshold.
Overrides threshold_pct if provided.
- verbosebool, optional, default=False
If True, logs detailed information about the operation including:
number of rows or columns removed,
columns affected,
original and resulting DataFrame shape.
- Returns:
- pd.DataFrame
DataFrame after dropping rows or columns according to the threshold.
- Raises:
- ValueError
If data has keys and they are not unique. If threshold_abs is not a non-negative integer. If threshold_abs is greater than the length of data. If threshold_pct is not in [0, 1]. If axis is not 0 or 1.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import drop_missing
>>> df = pd.DataFrame({"A": [1,2,3,4,5,np.nan],
...                    "B": [1,2,3,4,5,6],
...                    "C": [np.nan, 2, np.nan, np.nan, np.nan, np.nan]})
>>> # Only removes rows if NaN is less than 2 per feature
>>> print(drop_missing(df, axis=0, threshold_abs=2))
     A  B    C
0  1.0  1  NaN
1  2.0  2  2.0
2  3.0  3  NaN
3  4.0  4  NaN
4  5.0  5  NaN
>>> # Only removes columns if more than 20% of values per feature are NaN
>>> print(drop_missing(df, axis=1, threshold_pct=0.2))
     A  B
0  1.0  1
1  2.0  2
2  3.0  3
3  4.0  4
4  5.0  5
5  NaN  6
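The column-wise rule (axis=1) boils down to a per-column NaN proportion check. A minimal plain-pandas sketch of that idea follows; the helper name is hypothetical, and explorica's drop_missing adds validation, the absolute-threshold path, and logging on top of this.

```python
import numpy as np
import pandas as pd

def drop_cols_by_nan_pct(df: pd.DataFrame, threshold_pct: float = 0.05) -> pd.DataFrame:
    """Drop columns whose proportion of NaNs exceeds threshold_pct."""
    nan_pct = df.isna().mean()                    # per-column NaN proportion
    keep = nan_pct[nan_pct <= threshold_pct].index
    return df[keep]

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [1, 2, 3]})
# "A" has 1/3 missing values, which exceeds the 0.2 threshold
print(drop_cols_by_nan_pct(df, threshold_pct=0.2).columns.tolist())
```

The same reduction (`df.isna().mean()`) drives the `threshold_pct` semantics in both axis modes.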
- explorica.data_quality.data_preprocessing.get_categorical_features(data: Sequence[Any] | Sequence[Sequence[Any]] | Mapping[str, Sequence[Any]], threshold: int | Sequence[int] | Mapping[str, int] | None = 30, **kwargs) DataFrame[source]
Identify categorical features in a dataset.
Identifying categorical features is a necessary preprocessing step before applying encoding strategies or statistical tests that require knowledge of feature types. This function combines dtype-based filtering with a uniqueness threshold, and optionally flags binary and constant columns, providing a flexible single-pass audit of categorical structure in the dataset.
- Parameters:
- dataSequence[Any] | Sequence[Sequence[Any]] | Mapping[str, Sequence[Any]]
Input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- thresholdint, Sequence[int] or Mapping[str, int], optional, default=30
Maximum number of unique values allowed for a column to be considered categorical. If a mapping is provided, values are applied per column. Scalar values are broadcast to all columns; sequences or mappings are aligned by column name.
- sign_binbool, default=False
If True, append an is_binary column to the result, marking columns with exactly two unique values.
- sign_constbool, default=False
If True, append an is_constant column to the result, marking columns with only one unique value.
- include_numberbool, default=False
Include numeric (number) columns that satisfy the threshold.
- include_intbool, default=False
Include integer (int) columns that satisfy the threshold.
- include_strbool, default=False
Include string (object) columns that satisfy the threshold.
- include_boolbool, default=False
Include boolean columns.
- include_datetimebool, default=False
Include datetime columns.
- include_binbool, default=False
Include binary columns (exactly two unique values).
- include_constbool, default=False
Include constant columns (exactly one unique value).
- include_allbool, default=False
Disable dtype filtering; only threshold is applied.
- includeIterable[str], default={“object”}
Explicit set of dtype aliases to include (e.g. {“object”, “number”} or {“int”, “bin”}). The parameter has the highest priority among inclusion rules:
Explicit include argument (user-defined)
Flag parameters (e.g., include_int, include_str, etc.)
Default value {“object”}
If include is provided directly, all flags are ignored.
- nan_policystr | Literal[‘drop’, ‘raise’, ‘include’], default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- Returns:
- pd.DataFrame
DataFrame indexed by column names with:
categories_count : number of unique values in each column
is_category : flag for categorical columns
is_binary : (optional) flag for binary columns
is_constant : (optional) flag for constant columns
- Raises:
- ValueError
If input data contains duplicate column names or invalid nan_policy.
- TypeError
If threshold is not scalar, list, or mapping convertible to per-column limits.
Notes
The function supports combined filtering: first by unique value count (threshold), then by dtype matching.
Internal helper functions _filter_standard_dtypes, _filter_bin_const and _filter_categories provide modular filtering logic.
The original data are not modified.
Compatible with get_constant_features for constant detection.
Examples
>>> import pandas as pd
>>> import seaborn as sns
>>> from explorica.data_quality import get_categorical_features
>>> df = sns.load_dataset("titanic")
>>> # marks as a category string and integer columns
>>> # with 4 or fewer unique objects
>>> get_categorical_features(
...     df, threshold=4, include={"str", "int"})
             categories_count  is_category
survived                    2            1
pclass                      3            1
sex                         2            1
age                        63            0
sibsp                       4            1
parch                       4            1
fare                       93            0
embarked                    3            1
class                       3            0
who                         3            1
adult_male                  2            0
deck                        7            0
embark_town                 3            1
alive                       2            1
alone                       2            0
>>> df["constant_feature"] = 0
>>> # Additionally signs binary and constant features
>>> get_categorical_features(df, threshold=10, sign_bin=True, sign_const=True)
                  categories_count  is_category  is_binary  is_constant
survived                         2            0          1            0
pclass                           3            0          0            0
sex                              2            1          1            0
age                             63            0          0            0
sibsp                            4            0          0            0
parch                            4            0          0            0
fare                            93            0          0            0
embarked                         3            1          0            0
class                            3            0          0            0
who                              3            1          0            0
adult_male                       2            0          1            0
deck                             7            0          0            0
embark_town                      3            1          0            0
alive                            2            1          1            0
alone                            2            0          1            0
constant_feature                 1            0          0            1
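The core of the uniqueness-threshold rule can be sketched with plain pandas; the helper below is hypothetical and omits the dtype filtering, binary/constant flags, and nan_policy handling that explorica layers on top.

```python
import pandas as pd

def uniqueness_audit(df: pd.DataFrame, threshold: int = 30) -> pd.DataFrame:
    """Flag columns whose unique-value count is within the threshold."""
    counts = df.nunique()
    return pd.DataFrame({
        "categories_count": counts,
        "is_category": (counts <= threshold).astype(int),
    })

df = pd.DataFrame({"grade": ["A", "B", "A", "C"], "score": [91, 84, 90, 77]})
print(uniqueness_audit(df, threshold=3))
```

Here "grade" (3 unique values) passes the threshold while "score" (4 unique values) does not.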
- explorica.data_quality.data_preprocessing.get_constant_features(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], method: str = 'top_value_ratio', threshold: float | None = 1.0, nan_policy: str | Literal['drop', 'raise', 'include'] = 'drop') DataFrame[source]
Identify constant and quasi-constant features in the dataset.
Constant and quasi-constant features carry little to no predictive information and can negatively affect model training by introducing noise or causing numerical instability. This function supports multiple detection strategies: a ratio-based approach, a uniqueness-based metric, and Shannon entropy, allowing the threshold to be interpreted either as a dominance criterion or as an information-theoretic bound.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- methodstr, default ‘top_value_ratio’
Metric used to detect constant features:
“top_value_ratio”: proportion of the most frequent value.
“non_uniqueness”: 1 - number of unique values / total count.
“entropy”: Shannon entropy of the feature.
- thresholdfloat, default=1.0
Non-negative threshold value in the range [0, +∞). Decision boundary for each method:
For “top_value_ratio” or “non_uniqueness”: values >= threshold are flagged constant.
For “entropy”: values <= threshold are flagged constant.
- nan_policystr | Literal, default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- Returns:
- pd.DataFrame
A DataFrame indexed by column names with:
‘is_const’: bool flag if column is (quasi-)constant
‘top_value_ratio’: proportion of the most frequent value
- Raises:
- ValueError
If an unsupported method or nan_policy is provided. If input contains duplicate column names. If threshold is negative.
Examples
>>> # Basic usage
>>> # Demonstrates a simple use case with the default ``top_value_ratio`` method,
>>> # which identifies constant or quasi-constant features based on the most
>>> # frequent value ratio.
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import get_constant_features
>>> data = [[1, 3, 3, 3, 3, 6], [1, 2, 3, 4, 5, 5]]
>>> print(get_constant_features(
...     data, method="top_value_ratio", threshold=0.5))
   top_value_ratio  is_const
0         0.666667       1.0
1         0.333333       0.0
>>> # Entropy-based threshold interpretation
>>> # Illustrates how an entropy threshold can be interpreted as a fraction of the
>>> # maximum information capacity (in bits) for each feature. This approach allows
>>> # defining thresholds relative to the diversity of feature values.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 6],
...                      "B": [0, 0, 0, 0, 0, 1, 1]})
>>> thresh = 0.7
>>> thresh_bits_dim = thresh * np.log2(data.nunique())
>>> print(thresh_bits_dim)
A    1.809474
B    0.700000
dtype: float64
>>> get_constant_features(
...     data, method="entropy", threshold=thresh_bits_dim.mean())
    entropy  is_const
A  2.521641       0.0
B  0.863121       1.0
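The three detection metrics can be hand-computed for a single feature to make the threshold semantics concrete. This is a plain numpy/pandas sketch, not explorica's internal code.

```python
import numpy as np
import pandas as pd

x = pd.Series([3, 3, 3, 3, 1, 6])
freqs = x.value_counts(normalize=True)     # sorted descending by frequency

top_value_ratio = freqs.iloc[0]            # share of the dominant value (4/6)
non_uniqueness = 1 - x.nunique() / len(x)  # 1 - unique count / total count
entropy = -(freqs * np.log2(freqs)).sum()  # Shannon entropy in bits

print(top_value_ratio, non_uniqueness, entropy)
```

With threshold=0.5, top_value_ratio (≈0.667) and non_uniqueness (0.5) would both flag this feature as quasi-constant, while the entropy method flags it only if the threshold is set at or above its entropy.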
- explorica.data_quality.data_preprocessing.get_missing(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], ascending=None, round_digits=None) DataFrame[source]
Calculate the number and percentage of missing (NaN) values for each column.
Identifying missing values is typically one of the first steps in exploratory data analysis, as their presence and distribution can significantly affect downstream modeling and analysis. This function provides a concise per-column summary of missing value counts and their relative proportions.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- ascendingbool, optional
If specified, sorts the result by the count_of_nans column.
If True, sorts in ascending order (fewest missing values first).
If False, sorts in descending order (most missing values first).
If None (default), no sorting is performed.
- round_digitsint, optional
Number of decimal places to round the pct_of_nans values to.
Must be a non-negative integer (x >= 0).
If None (default), no rounding is applied.
- Returns:
- pd.DataFrame
A DataFrame with the following columns:
count_of_nans : int Number of NaN values in each column.
pct_of_nans : float Proportion of NaN values in each column (0.0 to 1.0).
- Raises:
- ValueError
If data has keys and they are not unique. If round_digits is not a non-negative integer.
Notes
The pct_of_nans values are calculated as the fraction of missing values relative to the total number of rows in the dataset.
Useful for quickly identifying columns with high proportions of missing data before applying data cleaning or imputation.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import get_missing
>>> # Simple usage
>>> df = pd.DataFrame({"A": [1, 2, pd.NA, np.nan, 5, 6, 7],
...                    "B": [7, None, 5, 4, 3, 2, 1]})
>>> get_missing(df, round_digits=4)
   count_of_nans  pct_of_nans
A              2       0.2857
B              1       0.1429
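The per-column summary reduces to two pandas reductions; a minimal sketch follows (the helper name is hypothetical, and explorica adds sorting, rounding, and validation on top).

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and proportion of NaNs per column, as in the Notes above."""
    nans = df.isna().sum()               # count of NaNs per column
    return pd.DataFrame({
        "count_of_nans": nans,
        "pct_of_nans": nans / len(df),   # fraction relative to total row count
    })

df = pd.DataFrame({"A": [1, np.nan, np.nan, 4], "B": [1, 2, 3, 4]})
print(missing_summary(df))
```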
- explorica.data_quality.data_preprocessing.set_categorical(data: Sequence[Any] | Sequence[Sequence[Any]] | Mapping[str, Sequence[Any]], threshold: int | Sequence[int] | Mapping[str, int] | None = 30, nan_policy: str | Literal['drop', 'raise', 'include'] = 'drop', verbose: bool | None = False, **kwargs) DataFrame[source]
Convert eligible columns to Pandas category dtype.
Useful for memory optimization and improved performance in certain operations.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Input data. Can be 1D, 2D (sequence of sequences), or a mapping of column names to sequences.
- thresholdint, Sequence[int] or Mapping[str, int], optional, default=30
Maximum number of unique values allowed for a column to be considered categorical. If a mapping is provided, values are applied per column. Scalar values are broadcast to all columns; sequences or mappings are aligned by column name.
- include_numberbool, default=False
Include numeric (number) columns that satisfy the threshold.
- include_intbool, default=False
Include integer (int) columns that satisfy the threshold.
- include_strbool, default=False
Include string (object) columns that satisfy the threshold.
- include_boolbool, default=False
Include boolean columns.
- include_datetimebool, default=False
Include datetime columns.
- include_binbool, default=False
Include binary columns (exactly two unique values).
- include_constbool, default=False
Include constant columns (exactly one unique value).
- include_allbool, default=False
Disable dtype filtering; only threshold is applied.
- includeIterable[str], default={“object”}
Explicit set of dtype aliases to include (e.g. {“object”, “number”} or {“int”, “bin”}). The parameter has the highest priority among inclusion rules:
Explicit include argument (user-defined)
Flag parameters (e.g., include_int, include_str, etc.)
Default value {“object”}
If include is provided directly, all flags are ignored.
- nan_policystr | Literal[‘drop’, ‘raise’, ‘include’], default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- verbosebool, optional, default=False
If True, logs detailed information about the operation including:
count and names of affected columns.
- Returns:
- pd.DataFrame
A copy of the original DataFrame with selected columns converted to category dtype.
- Raises:
- Exception
Propagates exceptions from get_categorical_features for parameter validation errors. See get_categorical_features documentation for specific error conditions.
Notes
Converting to category can significantly reduce memory usage, especially for string/object columns with many repeated values.
category stores integer codes (int8/int16) and a category mapping, making comparisons and filtering faster than for object dtype.
For numeric columns, memory savings may be smaller, but grouping and filtering can still be faster.
The original DataFrame is not modified - a copy is returned.
Examples
>>> # Basic usage example
>>> import pandas as pd
>>> import seaborn as sns
>>> from explorica.data_quality import set_categorical
>>> df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8],
...                    "B": ["A", "A", "B", "C", "A", "B", "C", "A"],
...                    "C": [1, 0, 1, 0, 1, 1, 1, 1]})
>>> df = set_categorical(df, include_bin=True, include_str=True)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       8 non-null      int64
 1   B       8 non-null      category
 2   C       8 non-null      category
dtypes: category(2), int64(1)
memory usage: 468.0 bytes
>>> # Memory usage reducing example
>>> df = sns.load_dataset('titanic')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    object
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object
 8   class        891 non-null    category
 9   who          891 non-null    object
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object
 13  alive        891 non-null    object
 14  alone        891 non-null    bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
>>> set_categorical(df).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   survived     891 non-null    int64
 1   pclass       891 non-null    int64
 2   sex          891 non-null    category
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64
 5   parch        891 non-null    int64
 6   fare         891 non-null    float64
 7   embarked     889 non-null    category
 8   class        891 non-null    category
 9   who          891 non-null    category
 10  adult_male   891 non-null    bool
 11  deck         203 non-null    category
 12  embark_town  889 non-null    category
 13  alive        891 non-null    category
 14  alone        891 non-null    bool
dtypes: bool(2), category(7), float64(2), int64(4)
memory usage: 50.8 KB
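The memory effect described in the Notes can be seen with plain pandas alone, independent of explorica: category stores small integer codes plus one copy of each label, so a repetitive object column shrinks substantially.

```python
import pandas as pd

# object dtype stores every string per row; category stores 3 labels + int8 codes
s_object = pd.Series(["red", "green", "blue"] * 1000)
s_category = s_object.astype("category")

obj_bytes = s_object.memory_usage(deep=True)
cat_bytes = s_category.memory_usage(deep=True)
print(obj_bytes, cat_bytes)   # category is far smaller for this column
```

The savings shrink as the ratio of unique values to rows grows, which is why the threshold parameter gates the conversion.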
explorica.data_quality.feature_engineering
Module for feature engineering on numeric and categorical data.
This module provides utilities for fast and flexible feature transformation. It is focused on common preprocessing tasks: frequency encoding, ordinal encoding and discretization (binning) of continuous variables. Implementations accept pandas Series/DataFrame, NumPy arrays, Python sequences and mappings (dict-like inputs).
Functions
- freq_encode(data, axis=0, normalize=True, round_digits=None)
Performs frequency encoding on a categorical feature(s).
- ordinal_encode(data, axis=0, order_method=”frequency”, order_ascending=False, **kwargs)
Encode categorical values with ordinal integers based on a specified ordering rule.
- discretize_continuous(data, bins=None, binning_method=”uniform”, intervals=”pandas”)
Discretize continuous numeric data into categorical intervals.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import freq_encode
>>> # Simple encoder usage
>>> df = pd.DataFrame({
... "color": ["red", "blue", "red", "green", "blue", "red"],
... "shape": ["circle", "square", "circle", "triangle", "square", "circle"]
... })
>>> encoded = freq_encode(df, round_digits=4)
>>> print(encoded)
color shape
0 0.5000 0.5000
1 0.3333 0.3333
2 0.5000 0.5000
3 0.1667 0.1667
4 0.3333 0.3333
5 0.5000 0.5000
- explorica.data_quality.feature_engineering.discretize_continuous(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], bins: int = None, binning_method: Literal['uniform', 'quantile'] = 'uniform', intervals: str | Sequence = 'pandas') Series | DataFrame[source]
Discretize continuous numeric data into categorical intervals.
Discretization converts continuous numeric features into ordered categorical intervals, which can improve interpretability, reduce the effect of outliers, and serve as a preprocessing step for models that benefit from categorical inputs. The function supports both uniform and quantile-based binning strategies, with flexible control over bin count and interval labeling on a per-column basis.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- binsint | Sequence[int] | Mapping[str, int], default=None
Number of discrete bins (intervals) to split each numeric feature into.
int — applies the same number of bins to all columns. Example: bins=5 is equivalent to bins={'col1': 5, 'col2': 5, ..., 'colN': 5}.
Sequence[int] — specifies an individual number of bins for each column, in order of appearance. The sequence length must match the number of columns. Example: bins=[3, 2, 3] ≡ bins={'col1': 3, 'col2': 2, 'col3': 3}.
Mapping[str, int] — explicit per-column specification. Keys must exactly match the column names present in data. Missing or extra keys will raise KeyError.
The number of bins must be a positive integer for every column. If not provided, the number of bins is automatically estimated using Sturges' formula:
\[k = 1 + 3.322\log_{10}(n)\]
where n is the number of samples per column. Priority of bin determination:
If intervals is a sequence of custom labels, its length defines the number of bins (even if bins is specified).
Otherwise, bins is used as provided.
If neither bins nor intervals defines the bin count, Sturges' rule is applied.
- Returns:
- pd.Series or pd.DataFrame
Categorical representation of the binned data. Returns a Series for a single-column input, or a DataFrame for multi-column input.
- Raises:
- ValueError
If binning_method or intervals is unsupported. If intervals is a sequence and its length does not match the bin count. If data contains NaN values. If intervals is a sequence and contains NaN values. If bins is negative or not an integer.
- KeyError
If the keys of bins or intervals (when provided as a mapping) do not match the column names of the input data. Also raised if intervals or bins are provided as sequences whose lengths do not correspond to the number of input features.
- Warns:
- UserWarning
If the number of bins specified for a feature exceeds the number of its unique values. In this case, the number of bins will be automatically reduced to n_unique - 1 for the corresponding column. A warning message will inform the user of this adjustment.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.data_quality import discretize_continuous
>>> # Simple usage
>>> df = pd.DataFrame({"f1": np.linspace(0, 1000, 100),
...                    "f2": np.linspace(0, 2150, 100)})
>>> discretize_continuous(df, bins=[10, 15])
                 f1                  f2
0   (-1.001, 100.0]   (-2.151, 143.333]
1   (-1.001, 100.0]   (-2.151, 143.333]
2   (-1.001, 100.0]   (-2.151, 143.333]
3   (-1.001, 100.0]   (-2.151, 143.333]
4   (-1.001, 100.0]   (-2.151, 143.333]
..              ...                 ...
95  (900.0, 1000.0]  (2006.667, 2150.0]
96  (900.0, 1000.0]  (2006.667, 2150.0]
97  (900.0, 1000.0]  (2006.667, 2150.0]
98  (900.0, 1000.0]  (2006.667, 2150.0]
99  (900.0, 1000.0]  (2006.667, 2150.0]
[100 rows x 2 columns]
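When bins is omitted, the docstring says Sturges' formula supplies the estimate. A plain numpy/pandas sketch of that default follows; the ceil-rounding to an integer is an assumption here, not confirmed from explorica's source.

```python
import numpy as np
import pandas as pd

def sturges_bins(n: int) -> int:
    """Sturges' estimate k = 1 + 3.322*log10(n); ceil-rounding is an assumption."""
    return int(np.ceil(1 + 3.322 * np.log10(n)))

x = pd.Series(np.linspace(0, 1000, 100))
k = sturges_bins(len(x))        # 100 samples -> 8 bins
binned = pd.cut(x, bins=k)      # uniform-width bins, like binning_method="uniform"
print(k, binned.nunique())
```

For quantile-based binning the analogous pandas primitive would be pd.qcut, which equalizes bin populations rather than bin widths.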
- explorica.data_quality.feature_engineering.freq_encode(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], axis: int = 0, normalize: bool = True, round_digits: int = None) Series | DataFrame[source]
Perform frequency encoding on a categorical feature(s).
Frequency encoding replaces each category with its frequency of occurrence in the data. This is particularly useful as a preprocessing step for machine learning models that require numerical input, as it preserves information about the distribution of categorical values without introducing arbitrary ordinal relationships.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- axisint, {0, 1}, default 0
Applicable only if input is 2D:
0: encode each column independently (column-wise), returns pd.DataFrame.
1: encode each row based on the combination of column values (row-wise), returns pd.Series.
Ignored if input is 1D.
- normalizebool, default=True
If True, encodes as relative frequency (proportion), otherwise as absolute count.
- round_digitsint, optional
Number of decimal digits to round the encoded frequencies to. Applicable only when normalize=True. If None (default), no rounding is performed. Must be a non-negative integer.
- Returns:
- pd.Series or pd.DataFrame
Frequency-encoded feature(s). Returns Series for row-wise encoding or 1D input, DataFrame for column-wise encoding of multiple features.
- Raises:
- ValueError
If input contains NaNs. If axis is not 0 or 1. If round_digits is negative or not an integer.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import freq_encode
>>> # Simple usage
>>> dataset = pd.DataFrame({
...     "groups1": ["A", "A", "A", "B", "B", "C"],
...     "groups2": ["D", "D", "D", "E", "F", "G"]
... })
>>> dataset = freq_encode(dataset, round_digits=4)
>>> dataset
   groups1  groups2
0   0.5000   0.5000
1   0.5000   0.5000
2   0.5000   0.5000
3   0.3333   0.1667
4   0.3333   0.1667
5   0.1667   0.1667
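Column-wise frequency encoding is essentially value_counts followed by map. This plain-pandas sketch illustrates the idea for one column; it is not explorica's implementation, which adds axis handling, rounding, and validation.

```python
import pandas as pd

s = pd.Series(["A", "A", "A", "B", "B", "C"])
freqs = s.value_counts(normalize=True)   # relative frequency per category
encoded = s.map(freqs)                   # replace each label with its frequency

print(encoded.round(4).tolist())
```

With normalize=False the same pattern would substitute absolute counts (`s.value_counts()`) instead of proportions.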
- explorica.data_quality.feature_engineering.ordinal_encode(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], axis: int = 0, order_method: str = 'frequency', order_ascending: bool = False, **kwargs) Series | DataFrame[source]
Encode categorical values with ordinal integers.
This method converts categorical data into integer-encoded representations according to the chosen ordering strategy. Supported strategies include frequency-based, alphabetical, or target-based orderings (using mean, median, or mode of a provided reference variable).
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- axisint, {0, 1}, default 0
Applicable only if input is 2D:
- 0: encode each column independently (column-wise).
Returns a DataFrame if multiple columns are provided.
- 1: encode each row based on the combination of column values (row-wise).
Always returns a Series.
Ignored if input is 1D.
- order_methodstr | Literal, default “frequency”
Ordering rule to determine integer assignment:
“frequency” : order by category frequency.
“alphabetical” : order alphabetically by category label.
“mean” : order by the mean of corresponding order_by values per category.
“median” : order by the median of corresponding order_by values per category.
“mode” : order by the most frequent corresponding order_by value per category.
- order_ascendingbool, default False
Whether to assign integers in ascending (True) or descending (False) order.
- order_bySequence, pandas.Series, pandas.DataFrame, or Mapping, optional
Numerical data to use for computing central tendency measures (mean, median, mode). Required when order_method is one of {“mean”, “median”, “mode”}. Must be aligned by shape with data.
- offsetint, default 0
The starting integer value for encoding categories. Each group label is incremented by this offset, so setting offset=1 makes encoded values start from 1 instead of 0.
- Returns:
- pandas.Series or pandas.DataFrame
Encoded data, where each unique category or category combination is replaced by an integer reflecting its relative order:
If axis=1, returns a Series with encoded values per row.
If axis=0 and multiple columns were passed, returns a DataFrame where each column is encoded independently.
If axis=0 and a single column was passed, returns a Series.
- Raises:
- ValueError:
If input contains NaNs. If the provided order_method is not supported. If order_by is missing when required. If data and order_by have mismatched lengths.
Notes
When order_method is "mode", if multiple modes exist within a group, the first encountered mode is used for ordering (tie-breaking is deterministic).
Examples
>>> import pandas as pd
>>> from explorica.data_quality import ordinal_encode
>>> # Simple usage
>>> df = pd.DataFrame({"category_1": ["A", "B", "C", "A", "A", "B", "A"]})
>>> ordinal_encode(
...     df, order_method="abc", order_ascending=True, offset=1)
0    1
1    2
2    3
3    1
4    1
5    2
6    1
dtype: int64
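The frequency ordering rule can be sketched with plain pandas: rank categories by count (descending, matching order_ascending=False) and map each to its rank. This is an illustration of the strategy, not explorica's source; the offset parameter would simply shift the resulting integers.

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A", "A", "B", "A"])
order = s.value_counts().index           # A (4), B (2), C (1): descending frequency
codes = {cat: i for i, cat in enumerate(order)}

print(s.map(codes).tolist())             # most frequent category gets 0
```

Target-based variants ("mean", "median", "mode") would instead sort categories by a groupby aggregation of the order_by values before assigning the integers.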
explorica.data_quality.information_metrics
Module for information-theoretic metrics used in data quality assessment.
This module provides utilities for quantifying uncertainty, variability, and information content in datasets. Currently, it implements Shannon entropy as a measure of feature uncertainty. Future extensions may include divergence measures (e.g., KL divergence) and cross-entropy for comparing distributions.
Classes
- InformationMetrics
Provides static methods for computing information metrics such as Shannon entropy.
Notes
Currently, only Shannon entropy is implemented. Other metrics may be added in future releases.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import get_entropy
>>>
>>> data = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [1, 1, 1, 2, 2]})
>>> get_entropy(data)
A 2.321928
B 0.970951
dtype: float64
- explorica.data_quality.information_metrics.get_entropy(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], method: str = 'shannon', nan_policy: Literal['drop', 'raise', 'include'] = 'drop') float | Series[source]
Compute the Shannon entropy of the input data.
Shannon entropy is a measure of uncertainty or randomness in a dataset. For a single feature, it quantifies how evenly the values are distributed. Lower values indicate more predictability (potentially constant or quasi-constant features), while higher values indicate more variability or diversity.
- Parameters:
- dataSequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- methodstr, default=”shannon”
Entropy calculation method. Currently, only “shannon” is supported. Other methods (e.g., differential entropy) may be added in future releases. Entropy is calculated as:
\[H(x) = - \sum_i w_i \log_2(w_i)\]
where w_i is the relative frequency of each unique element of the sample x.
- nan_policy{‘drop’, ‘raise’, ‘include’}, default=’drop’
Policy for handling NaN values in input data:
‘raise’ : raise ValueError if any NaNs are present in data.
‘drop’ : drop rows (axis=0) containing NaNs before computation. This does not drop entire columns.
‘include’ : treat NaN as a valid value and include them in computations.
- Returns:
- float or pd.Series
If input is 1D, returns a float representing the Shannon entropy.
If input is 2D or dict, returns a pd.Series indexed by column names.
- Raises:
- ValueError
If column names are not unique (in case of dict or DataFrame input).
If method is not supported.
Notes
When nan_policy='include', NaN values are counted as a distinct category in the computation.
Examples
>>> import pandas as pd
>>> from explorica.data_quality import get_entropy
>>> # Simple usage
>>> data = pd.DataFrame({"A": [1, 1, 2, 2], "B": [1, 2, 3, 4]})
>>> get_entropy(data)
A    1.0
B    2.0
dtype: float64
>>> data = [1, 1, 1, 1, 1, 1]
>>> get_entropy(data)
np.float64(0.0)
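As a cross-check of the formula above, here is a minimal Shannon-entropy sketch using plain pandas. This is an illustration of the definition, not explorica's actual implementation; the helper name shannon_entropy is made up.

```python
import math
import pandas as pd

def shannon_entropy(values):
    """Shannon entropy (in bits) of a 1D sequence; NaNs are dropped."""
    counts = pd.Series(values).value_counts(dropna=True)
    w = counts / counts.sum()  # relative frequencies w_i
    return float(sum(-p * math.log2(p) for p in w))

print(shannon_entropy([1, 2, 3, 4, 5]))     # 2.321928... (5 equally likely values)
print(shannon_entropy([1, 1, 1, 2, 2]))     # 0.970951... (uneven distribution)
print(shannon_entropy([1, 1, 1, 1, 1, 1]))  # 0.0 (constant feature)
```

The first two values match the module-level example for get_entropy above, and the constant feature yields zero entropy, which is what makes entropy useful for flagging quasi-constant columns.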
explorica.data_quality.summary
Data-quality summary utilities.
This module provides the get_summary function, which computes a full data-quality summary for a dataset, including metrics for missing values, duplicates, distribution, basic statistics, and multicollinearity. The summary can be returned as a pandas DataFrame with MultiIndex columns or as a JSON-serializable nested dictionary.
Functions
- get_summary(data, return_as=’dataframe’, auto_round=True, round_digits=4, **kwargs)
Compute a data-quality summary for a dataset. Supports saving the summary to CSV, Excel, or JSON, and can return either a pandas DataFrame or a JSON-friendly nested dictionary.
Notes
The module contains internal helper functions for get_summary, which are not intended for standalone use.
Saved JSON outputs are fully serializable, with NaNs converted to None and non-numeric metrics (like mode) converted to strings.
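The serialization rules above (NaN to None, non-numeric metrics to strings) can be sketched as follows. The helper name to_json_safe is hypothetical and not part of explorica; this only illustrates the conversion.

```python
import json
import math

def to_json_safe(value):
    """Cast a metric value to a JSON-safe type (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return None        # NaN -> null
    if isinstance(value, (int, float)):
        return value       # numeric values pass through
    return str(value)      # non-numeric metrics (e.g. mode) -> string

# Nested {group: {metric: {feature: value}}} structure, as in the summary dict
nested = {"stats": {"mean": {"x1": 2.0, "x2": float("nan")}}}
safe = {g: {m: {f: to_json_safe(v) for f, v in feats.items()}
            for m, feats in metrics.items()}
        for g, metrics in nested.items()}
print(json.dumps(safe))  # {"stats": {"mean": {"x1": 2.0, "x2": null}}}
```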
Examples
>>> # Minimal usage (DataFrame output)
>>> import pandas as pd
>>> from explorica.data_quality import get_summary
>>> df = pd.DataFrame({
... "x1": [1, 2, 3],
... "x2": [2, 4, 6]
... })
>>> summary = get_summary(df, return_as="dataframe")
>>> summary
nans ... multicollinearity
count_of_nans pct_of_nans ... is_multicollinearity VIF
x1 0 0.0 ... 1.0 inf
x2 0 0.0 ... 1.0 inf
[2 rows x 16 columns]
>>> # Saving summary as JSON (nested dict, JSON-friendly)
>>> summary_dict = get_summary(
... df,
... return_as="dict",
... directory="summary.json"
... )
>>> # The JSON file summary.json is saved in the current directory.
>>> summary_dict["duplicates"]
{'count_of_unique': {'x1': 3.0, 'x2': 3.0},
'pct_of_unique': {'x1': 1.0, 'x2': 1.0},
'quasi_constant_pct': {'x1': 0.3333, 'x2': 0.3333}}
>>> # Verbose logging (optional)
>>> summary_verbose = get_summary(df, verbose=True)
>>> # verbose=True will log computation steps but does not affect returned object
>>> summary_verbose[["nans", "duplicates"]]
nans duplicates
count_of_nans pct_of_nans count_of_unique pct_of_unique quasi_constant_pct
x1 0 0.0 3 1.0 0.3333
x2 0 0.0 3 1.0 0.3333
- explorica.data_quality.summary.get_summary(data: Sequence[Sequence], return_as: str = 'dataframe', auto_round=True, round_digits=4, **kwargs) DataFrame | dict[source]
Compute a data-quality summary for a dataset.
The summary includes metrics for missing values, duplicates, distribution, basic statistics, and multicollinearity. The result can be returned as a pandas DataFrame with MultiIndex columns or as a JSON-serializable nested dict.
- Parameters:
- dataSequence | Sequence[Sequence] | Mapping[str, Sequence]
Input data. Can be 1D, 2D (sequence of sequences), or a mapping of column names to sequences.
- return_as{‘dataframe’, ‘dict’}, optional, default=’dataframe’
Output format. If 'dataframe' (default), returns a pandas.DataFrame with columns arranged as a MultiIndex (group, metric). If 'dict' or 'json', returns a nested Python dict (JSON-friendly) of the form: {"group": {"metric": {"feature": value, …}, …}, …}.
- auto_roundbool, optional, default=True
If True, numeric values are rounded for human-friendly output.
- round_digitsint, optional, default=4
Number of decimal digits used when auto_round=True.
- Returns:
- pd.DataFrame or dict
If return_as is ‘dataframe’ or ‘df’:
pd.DataFrame with MultiIndex columns (section, metric), index = feature names.
If return_as is ‘dict’ or ‘mapping’:
Nested dict of the form {section: {metric: {feature: value}}}, JSON-friendly with NaNs converted to None and non-numeric values serialized as strings.
Sections and metrics included:
nans
count_of_nans: number of missing values per feature
pct_of_nans: fraction of missing values per feature (0..1)
duplicates
count_of_unique: number of unique values per feature
pct_of_unique: fraction of unique values per feature (0..1)
quasi_constant_pct: top value ratio (quasi-constant score)
distribution
is_normal: 0/1 flag, distribution approximately normal if \(|\gamma_1| \leq 0.25\) and \(|\gamma_2| \leq 0.25\)
desc: qualitative description of distribution shape (“normal”, “left-skewed”, “right-skewed”, etc.)
skewness: sample skewness
kurtosis: sample excess kurtosis
stats
mean: mean value per feature
std: standard deviation per feature
median: median value per feature
mode: most frequent value per feature (original type)
count_of_modes: number of mode values found per feature
multicollinearity
VIF: Variance Inflation Factor (numeric features only)
is_multicollinearity: 0/1 flag if VIF ≥ threshold_vif (default=10)
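The VIF metric listed above can be approximated with a small NumPy sketch: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns. This is an illustration of the definition (using ordinary least squares with an intercept), not explorica's actual implementation.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column of a 2D numeric array."""
    X = np.asarray(X, dtype=float)
    result = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on the other columns (plus an intercept)
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - np.var(y - A @ beta) / np.var(y)
        result.append(np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return result

# Perfectly collinear columns (x2 = 2 * x1) give infinite VIF,
# matching the `inf` values in the module-level example above.
X = np.column_stack([np.arange(5.0), 2 * np.arange(5.0)])
print(vif(X))  # [inf, inf]
```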
- Other Parameters:
- nan_policy{‘drop’, ‘raise’, ‘include’}, default=’drop’
How to handle missing values during computations.
‘drop’ - missing values are removed before computing metrics.
‘raise’ - missing values cause an exception.
‘include’ - missing values are kept only for categorical metrics (e.g. quasi_constant_pct, mode), where NaN can be treated as a category. For numerical metrics, NaN values are still dropped, as their interpretation remains undefined.
- threshold_viffloat, default=10
VIF threshold for multicollinearity.
- directorystr, optional
Path to save the summary. Supports:
‘.csv’: saved as a CSV file with MultiIndex columns (section, metric). Can be reopened via pd.read_csv(directory, header=[0, 1], index_col=0).
‘.xlsx’: saved as an Excel file with MultiIndex columns, preserving grouping visually.
‘.json’: saved as a JSON file using nested dict format {section: {metric: {feature: value}}}, fully JSON-serializable.
Saving may raise FileExistsError if the target file already exists and overwrite=False, or PermissionError if access is denied.
- overwritebool, default=True
Whether to overwrite an existing file when saving the summary.
True (default): existing files will be overwritten without error.
False: if the target file already exists, a FileExistsError is raised.
- verbosebool, optional
Enable info-level logging.
- Raises:
- ValueError
If return_as is not a supported value, or if data contains NaNs and nan_policy is 'raise'.
Notes
VIF is computed for numeric features only; non-numeric features will not have VIF values.
When saving to JSON/nested dict, numeric values (e.g. NumPy scalars) are cast to native Python int/float, and non-numeric values (e.g. mode) are stored as strings to guarantee JSON safety.
In DataFrame/Excel outputs, original types are preserved.
In CSV outputs, values are stringified implicitly by pandas during to_csv().
CSV output uses flattened column names joined by ':'; this improves portability but loses the MultiIndex structure. To read a flattened CSV back and restore the MultiIndex, split the column names again: df.columns = df.columns.str.split(':', expand=True).
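The flatten-and-restore round trip can be sketched as follows. The column names here are made up for illustration, and an in-memory buffer stands in for a real CSV file.

```python
import io
import pandas as pd

# A tiny summary-like frame with MultiIndex (section, metric) columns
cols = pd.MultiIndex.from_tuples(
    [("nans", "count_of_nans"), ("nans", "pct_of_nans")]
)
summary = pd.DataFrame([[0, 0.0], [1, 0.5]], index=["x1", "x2"], columns=cols)

# Flatten columns the way the CSV output does: join levels with ':'
flat = summary.copy()
flat.columns = [":".join(pair) for pair in summary.columns]
buf = io.StringIO()
flat.to_csv(buf)
buf.seek(0)

# Read back and restore the MultiIndex by splitting on ':'
# (Index.str.split with expand=True returns a MultiIndex)
restored = pd.read_csv(buf, index_col=0)
restored.columns = restored.columns.str.split(":", expand=True)
print(restored.columns.tolist())  # [('nans', 'count_of_nans'), ('nans', 'pct_of_nans')]
```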