explorica.data_quality.outliers

explorica.data_quality.outliers.detection

Module for detecting outliers in numeric data.

This module contains the DetectionMethods class, which provides a collection of outlier detection methods that can be applied to pandas Series or numeric arrays.

Functions

detect_iqr(data, iqr_factor=1.5, get_boxplot=False, nan_policy="drop", boxplot_kws=None)

Detects outliers in a numerical series using the Interquartile Range (IQR) method.

detect_zscore(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], threshold: Optional[float] = 2.5)

Detects outliers in a numerical series using the Z-score method.

Examples

>>> import pandas as pd
>>> from explorica.data_quality.outliers import detect_iqr, detect_zscore
>>> df = pd.DataFrame({"x": [1,2,3,4,100]})
>>> # Detect IQR outliers
>>> detect_iqr(df)
4    100.0
Name: x, dtype: float64
>>> # Detect Z-score outliers
>>> detect_zscore(df, threshold=1.5)
4    100.0
Name: x, dtype: float64
explorica.data_quality.outliers.detection.detect_iqr(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], iqr_factor: float = 1.5, get_boxplot: bool | None = False, nan_policy: str = 'drop', boxplot_kws: dict = None) -> Series | DataFrame | tuple[Series | DataFrame, VisualizationResult]

Detect outliers in a numerical series using the Interquartile Range method.

This method identifies values that are significantly lower or higher than the typical range of the data. For 1D input, returns a Series of outliers; for 2D input, returns a DataFrame where non-outlier positions are NaN. Optionally, a boxplot visualization can be generated for the first column to visually inspect outliers.

Parameters:
data : Sequence[float] | Sequence[Sequence[float]]

A numeric sequence (1D) or a sequence of sequences (2D) to analyze. Will be converted to a pandas DataFrame internally. Each inner sequence is treated as a separate column.

iqr_factor : float, default 1.5

Multiplier for the Interquartile Range used to define outlier bounds.

get_boxplot : bool, optional

If True, returns a tuple (outliers, boxplot_figure) where boxplot_figure is a VisualizationResult for the first column only.

nan_policy : {"drop", "raise"}, default="drop"

How to handle NaN values.

boxplot_kws : dict, optional

Additional keyword arguments passed to explorica.visualizations.boxplot (e.g., color, figsize, title). Only applied if get_boxplot=True.

Returns:
pd.Series or pd.DataFrame or tuple[pd.Series | pd.DataFrame, VisualizationResult]
  • Single column input: returns a pandas.Series with outlier values at original indices.

  • Multi-column input: returns a pandas.DataFrame with outlier values and NaN elsewhere.

  • If get_boxplot=True, returns a tuple with outliers and the boxplot figure.

Raises:
ValueError

If nan_policy=’raise’ and missing values (NaN/null) are found in the data. If iqr_factor is negative.

Warns:
UserWarning

If any features have constant or nearly constant values, as outliers cannot exist in such series.

Notes

  • An outlier is defined as a value below Q1 - iqr_factor * IQR or above Q3 + iqr_factor * IQR.

  • For 2D inputs, each column is processed independently.

  • The boxplot is always generated only for the first column if get_boxplot=True.
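The fence rule in the first note can be sketched in plain Python. This is an illustration of the rule only, not explorica's implementation: the helper name `iqr_outliers` is hypothetical, and it assumes linearly interpolated quartiles (the `inclusive` method of `statistics.quantiles`).

```python
from statistics import quantiles


def iqr_outliers(values, iqr_factor=1.5):
    """Return {position: value} for points outside the IQR fences.

    Hypothetical helper for illustration; not part of explorica.
    """
    # Assumption: quartiles are linearly interpolated between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - iqr_factor * iqr
    upper = q3 + iqr_factor * iqr
    return {i: v for i, v in enumerate(values) if v < lower or v > upper}


print(iqr_outliers([1, 2, 2, 3, 13, 1, 100, 90]))  # {6: 100, 7: 90}
```

Under these assumptions the fences for this data are -44 and 78, so 13 stays inside while 100 and 90 are flagged.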

Examples

>>> import pandas as pd
>>> from explorica.data_quality.outliers import detect_iqr
>>> s = pd.Series([1, 2, 2, 3, 13, 1, 100, 90])
>>> outliers = detect_iqr(s, iqr_factor=1.5)
>>> outliers
6    100.0
7     90.0
Name: 0, dtype: float64
>>> # Several columns DataFrame
>>> df = pd.DataFrame({"A": [1, 2, 3, 50], "B": [5, 6, 7, 8]})
>>> outliers_df = detect_iqr(df)
>>> outliers_df
      A   B
3  50.0 NaN
>>> # With boxplot and custom styling
>>> outliers, plot_result = detect_iqr(s, get_boxplot=True,
...     boxplot_kws={"style": "whitegrid", "figsize": (8, 4)})
>>> plot_result.figure.show()
explorica.data_quality.outliers.detection.detect_zscore(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], threshold: float | None = 2.5) -> Series | DataFrame

Detect outliers in a numerical series using the Z-score method.

The Z-score method identifies outliers based on their standardized distance from the mean:

\[Z = \frac{x - \overline{x}}{\sigma}\]

where \(\overline{x}\) is the mean and \(\sigma\) is the standard deviation.
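As a rough illustration of the formula, the sketch below flags values whose absolute Z-score exceeds a threshold. The helper name `zscore_outliers` is hypothetical, and the use of the population standard deviation is an assumption; the library may use a different estimator.

```python
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=2.5):
    """Return {position: value} where |Z| > threshold.

    Hypothetical helper for illustration; not part of explorica.
    """
    mean = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant series has no spread, so nothing can be an outlier.
        return {}
    return {
        i: v for i, v in enumerate(values)
        if abs((v - mean) / sigma) > threshold
    }


print(zscore_outliers([1, 2, 3, 4, 100], threshold=1.5))  # {4: 100}
```

Here the mean is 22 and the population standard deviation is about 39, so 100 has a Z-score of about 2.0 and is the only value above the threshold.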

Parameters:
data : Sequence[float] or Sequence[Sequence[float]]

Input numeric data. Can be a 1D sequence or a 2D structure convertible to a pandas DataFrame.

threshold : float, default=2.5

Z-score threshold for identifying outliers. Values with an absolute Z-score greater than this threshold are considered outliers.

Returns:
pd.Series or pd.DataFrame

If the input contains a single feature, returns a Series of outlier values. If multiple features are provided, returns a DataFrame with NaN for non-outlier positions.

Raises:
ValueError

If threshold is not positive. If the input contains any NaN values.

Warns:
UserWarning

If any features have constant or nearly constant values, as outliers cannot exist in such series.

Examples

>>> import pandas as pd
>>> from explorica.data_quality.outliers import detect_zscore
>>> # Simple usage
>>> s = pd.Series([1, 2, 2, 3, 13, 1, 1000, 2, -1000])
>>> outliers = detect_zscore(s, threshold=1)
>>> outliers
6    1000.0
8   -1000.0
Name: 0, dtype: float64
>>> # Returns a Series with outlier values and their original indices

explorica.data_quality.outliers.handling

Module for handling outliers in numerical datasets.

This module defines the HandlingMethods class, which provides utility methods to detect, remove, and replace outliers in number sequences using common statistical techniques such as the Interquartile Range (IQR) and Z-score methods.

Functions

replace_outliers(data, detection_method="iqr", strategy="median", recursive=False, **kwargs)

Replaces outliers in sequences or mappings according to the specified detection method and replacement strategy.

remove_outliers(data, subset=None, detection_method="iqr", recursive=False, **kwargs)

Remove outliers from a given sequence of numerical data.

Examples

>>> import pandas as pd
>>> from explorica.data_quality.outliers import remove_outliers
>>> df = pd.DataFrame([2, 1, 5, 4, 4, 3, 500, 9, 2, 10])
>>> cleaned = remove_outliers(df, detection_method="iqr")
>>> cleaned
0     2
1     1
2     5
3     4
4     4
5     3
7     9
8     2
9    10
Name: 0, dtype: int64
explorica.data_quality.outliers.handling.remove_outliers(data: Sequence[float] | Sequence[Sequence[float]], subset: Sequence[str] | None = None, detection_method: str | None = 'iqr', recursive: bool | None = False, **kwargs) -> Series | DataFrame

Remove outliers from a given sequence of numerical data.

This method supports two outlier detection techniques:

  1. IQR (Interquartile Range)

  2. Z-score

Outliers can be removed in three modes:

  • Single removal (default)

  • Iterative removal (iters > 0)

  • Recursive removal until no outliers remain (recursive=True)
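The three modes differ only in how many detection passes run. A minimal sketch, assuming IQR detection with linearly interpolated quartiles; the helper names are hypothetical and this is not the library's code:

```python
from statistics import quantiles


def _outlier_flags(values, iqr_factor=1.5):
    # Assumption: linearly interpolated quartiles, 1.5 * IQR fences.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return [v < lower or v > upper for v in values]


def remove_iqr_outliers(values, iters=None, recursive=False):
    """Hypothetical helper: drop IQR outliers in one or more passes."""
    if iters is not None and iters < 1:
        raise ValueError("iters must be a positive integer")
    # iters takes precedence over recursive, as the parameter docs state.
    if iters is not None:
        max_passes = iters
    elif recursive:
        max_passes = None  # keep going until a pass finds nothing
    else:
        max_passes = 1     # single removal (default)
    kept = list(values)
    passes = 0
    while (max_passes is None or passes < max_passes) and len(kept) >= 2:
        flags = _outlier_flags(kept)
        if not any(flags):
            break
        kept = [v for v, bad in zip(kept, flags) if not bad]
        passes += 1
    return kept


data = [1, 2, 3, 4, 5, 6, 11, 20]
print(remove_iqr_outliers(data))                  # [1, 2, 3, 4, 5, 6, 11]
print(remove_iqr_outliers(data, recursive=True))  # [1, 2, 3, 4, 5, 6]
```

A single pass removes only 20; once it is gone the fences tighten and 11 is flagged on the second pass.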

Parameters:
data : Sequence[float] or Sequence[Sequence[float]]

Input data from which outliers should be removed. Can be a list, NumPy array, pandas Series, DataFrame, etc.

subset : Sequence[str], default None

Features subset by column names. If specified, i_subset is ignored.

i_subset : Sequence[int], default None

Features subset by column positions (like iloc). Used only if subset is None.

detection_method : str, default 'iqr'

Method used for outlier detection. Supported methods are:

  • ‘iqr’ : Interquartile Range method

  • ‘zscore’ : Z-score method

recursive : bool, default False

If True, removes outliers repeatedly until no outliers remain. Ignored if iters is specified.

iters : int, optional

Number of iterations to remove outliers. Must be a positive integer. If specified, recursive is ignored.

remove_mode : {'any', 'all'}, default 'any'

Defines how to treat multi-column outliers:

  • 'any': remove a row if any feature in subset is an outlier

  • 'all': remove a row only if all features in subset are outliers

zscore_threshold : float, default 2.0

Threshold in units of standard deviations for Z-score detection. Z-values beyond this threshold are considered outliers. Has effect only if detection_method=’zscore’. If set, it overrides the “threshold” key in zscore_kws.

zscore_kws : dict, default {"threshold": 2.0}

Dictionary of additional parameters to pass to Outliers.detect_zscore. Can be used to customize detection behavior. Has effect only if detection_method=’zscore’.

Returns:
pd.Series or pd.DataFrame

Cleaned data with outliers removed. Returns a Series if the input has a single column, otherwise returns a DataFrame.

Raises:
ValueError

If input data contains NaN values. If the provided detection_method or remove_mode is not supported. If iters is not a positive integer.
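How remove_mode combines per-column flags into a row decision can be sketched as follows. The helper names `outlier_flags` and `rows_to_keep` are hypothetical, and IQR detection with linearly interpolated quartiles is an assumption made for illustration:

```python
from statistics import quantiles


def outlier_flags(column, iqr_factor=1.5):
    # Assumption: linearly interpolated quartiles, 1.5 * IQR fences.
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return [v < lower or v > upper for v in column]


def rows_to_keep(table, remove_mode="any"):
    """Hypothetical helper: positions of rows that survive removal."""
    if remove_mode not in ("any", "all"):
        raise ValueError(f"unsupported remove_mode: {remove_mode!r}")
    flags_by_column = [outlier_flags(col) for col in table.values()]
    combine = any if remove_mode == "any" else all
    # zip(*...) turns per-column flags into per-row tuples of flags.
    return [
        i for i, row_flags in enumerate(zip(*flags_by_column))
        if not combine(row_flags)
    ]


table = {
    "feature1": [1, 2, 3, 4, 5, 10],
    "feature2": [2, 3, 4, 5, 1000, 1],
    "feature3": [0, 10003, 10004, 10005, 10006, 10008],
}
print(rows_to_keep(table, remove_mode="any"))  # [1, 2, 3]
print(rows_to_keep(table, remove_mode="all"))  # [0, 1, 2, 3, 4, 5]
```

Each of rows 0, 4 and 5 holds an outlier in exactly one column, so 'any' drops all three while 'all' drops none.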

Examples

>>> import pandas as pd
>>> from explorica.data_quality.outliers import remove_outliers
>>> # Simple usage
>>> table = pd.DataFrame(
...     {
...         "feature1": [1, 2, 3, 4, 5, 10],
...         "feature2": [2, 3, 4, 5, 1000, 1],
...         "feature3": [0, 10003, 10004, 10005, 10006, 10008],
...     },
... )
>>> table = remove_outliers(
...     table, detection_method="iqr", remove_mode="any"
... )
>>> table
   feature1  feature2  feature3
1         2         3     10003
2         3         4     10004
3         4         5     10005
>>> # Recursive drop method usage
>>> data_series = [1, 2, 3, 4, 5, 6, 11, 20]
>>> result = remove_outliers(
... data_series, detection_method="iqr", recursive=True
... )
>>> result
0    1
1    2
2    3
3    4
4    5
5    6
Name: 0, dtype: int64
>>> # In this case, '11' is only classified as an outlier after '20' is removed.
>>> # This is equivalent to calling:
>>> # remove_outliers(remove_outliers(data_series))
explorica.data_quality.outliers.handling.replace_outliers(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], detection_method: str | None = 'iqr', strategy: str | None = 'median', recursive: bool | None = False, **kwargs) -> Series | DataFrame

Replace outliers in sequences or mappings.

Replaces outliers according to the specified detection method and replacement strategy.

Parameters:
data : Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]]

Input data to process. Can be:

  • 1D sequence -> returns pd.Series

  • 2D sequence -> returns pd.DataFrame

  • Mapping of column names to sequences -> returns pd.DataFrame

detection_method : str, default 'iqr'

Method to detect outliers. Supported options:

  • ‘iqr’ : Interquartile Range method

  • ‘zscore’ : Z-score method

strategy : str, default 'median'

Method to replace detected outliers. Supported options:

  • ‘median’ : replace with median of the column

  • ‘mean’ : replace with mean of the column

  • ‘mode’ : replace with mode of the column

  • ‘random’ : replace with a random value sampled from the non-outlier values

  • ‘custom’ : replace with a user-provided value (see custom_value)

custom_value : scalar, optional, default None

Value to use when strategy=’custom’. Must be provided in this case.

random_state : int, optional, default None

Seed for random number generator used in ‘random’ replacement strategy. Ensures reproducible replacements.

recursive : bool, default False

If True, replaces outliers repeatedly until no outliers remain. Ignored if iters is specified.

iters : int, optional

Number of iterations to replace outliers. Must be a positive integer. If specified, recursive is ignored.

subset : Sequence[str], default None

Features subset by column names. If specified, i_subset is ignored.

i_subset : Sequence[int], default None

Features subset by column positions (like iloc). Used only if subset is None.

zscore_threshold : float, default 2.0

Threshold in units of standard deviations for Z-score detection. Z-values beyond this threshold are considered outliers. Has effect only if detection_method=’zscore’. If set, it overrides the “threshold” key in zscore_kws.

iqr_factor : float, default 1.5

Multiplier for the Interquartile Range used to define outlier bounds. Has effect only if detection_method='iqr'. If set, it overrides the "iqr_factor" key in iqr_kws.

zscore_kws : dict, optional

Additional keyword arguments passed to data_quality.detect_zscore. See data_quality.detect_zscore for full details.

iqr_kws : dict, optional

Additional keyword arguments passed to data_quality.detect_iqr. See data_quality.detect_iqr for full details.

Returns:
pd.Series or pd.DataFrame

Object of same shape as input with outliers replaced.

  • Returns pd.Series if input is 1D or if the DataFrame has only one column.

  • Returns pd.DataFrame otherwise.

Replacement values respect the original data types: in integer columns, a float replacement value is rounded automatically.

Raises:
ValueError

If input data contains NaN values. If the provided detection_method or strategy is not supported. If iters is not a positive integer. If strategy='custom' and custom_value is not provided.
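The 'mean' and 'median' strategies can be sketched for a single column as below. The helper name is hypothetical, and the sketch makes two assumptions: IQR detection uses linearly interpolated quartiles, and the replacement statistic is computed from the non-outlier values only. The second assumption is at least consistent with the 3.1333 shown for feature_1 in the example that follows, but it is not confirmed by the parameter descriptions.

```python
from statistics import fmean, median, quantiles


def replace_iqr_outliers(values, strategy="median", iqr_factor=1.5):
    """Hypothetical helper: replace IQR outliers in one column."""
    # Assumption: linearly interpolated quartiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    inliers = [v for v in values if lower <= v <= upper]
    # Assumption: the fill value is computed from non-outliers only.
    if strategy == "median":
        fill = median(inliers)
    elif strategy == "mean":
        fill = fmean(inliers)
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    return [v if lower <= v <= upper else fill for v in values]


feature_1 = [1.0, 2.4, 1.6, 12, 1.2, 501.1, 0.6]
result = replace_iqr_outliers(feature_1, strategy="mean")
print(round(result[5], 4))  # 3.1333
```

Only 501.1 lies outside the fences here; 12 is below the upper fence of 16.35 and is left untouched.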

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from explorica.data_quality.outliers import replace_outliers
>>> data = pd.DataFrame({
...     "feature_1": [1.0, 2.4, 1.6, 12, 1.2, 501.1, 0.6],
...     "feature_2": [10, 11, 9, 12, 10, 11, 500]
... })
>>> result = replace_outliers(data, detection_method="iqr", strategy="mean")
>>> np.round(result, 4)
   feature_1  feature_2
0     1.0000         10
1     2.4000         11
2     1.6000          9
3    12.0000         12
4     1.2000         10
5     3.1333         11
6     0.6000         10

explorica.data_quality.outliers.stats

Module for statistical metrics and distribution analysis.

This module defines tools for computing standardized statistical moments (skewness and excess kurtosis) and for describing the shape of numeric distributions.

Functions

get_skewness(data, method="general")

Compute the skewness (third standardized moment) of a numeric sequence.

get_kurtosis(data, method="general")

Compute the excess kurtosis (fourth standardized moment minus 3) of a numeric sequence.

describe_distributions(data, threshold_skewness=0.25, threshold_kurtosis=0.25, return_as="dataframe", **kwargs)

Describe shape (skewness / kurtosis) of one or multiple numeric distributions.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from explorica.data_quality.outliers import get_skewness
>>> df = pd.DataFrame({
...     "x": [1, 2, 3, 4, 5],
...     "y": [1, 4, 8, 16, 32]
... })
>>> skewness = get_skewness(df, method="general")
>>> np.round(skewness, 4)
x    0.0000
y    0.8447
dtype: float64
explorica.data_quality.outliers.stats.describe_distributions(data: Sequence[Sequence[float]] | DataFrame | Mapping[str, Sequence[float]], threshold_skewness: float | None = 0.25, threshold_kurtosis: float | None = 0.25, return_as: str | None = 'dataframe', **kwargs) -> DataFrame | dict

Describe shape (skewness / kurtosis) of one or multiple numeric distributions.

The function computes skewness and excess kurtosis for each 1-D sequence in data and classifies the distribution shape according to the provided absolute thresholds. Distributions whose absolute skewness and absolute excess kurtosis are both less than or equal to the corresponding thresholds are considered “normal”.

Parameters:
data : Sequence, Mapping[str, Sequence[Number]]

Input container with one or more numeric sequences (distributions). Supported forms:

  • 2D sequence (e.g. list of lists, list/array of 1D arrays): each inner sequence represents one distribution;

  • pandas.DataFrame: each column is treated as a separate distribution;

  • Mapping (e.g. dict, OrderedDict): mapping keys are used as feature names and mapping values should be 1D numeric sequences.

In the Mapping and DataFrame cases the order of returned metrics follows the order of mapping keys or DataFrame columns respectively. For plain sequences the order follows the sequence order and the resulting DataFrame will use a RangeIndex.

threshold_skewness : float, optional, default=0.25

Absolute skewness threshold. If abs(skewness) <= threshold_skewness the distribution is considered not skewed (with respect to this threshold).

threshold_kurtosis : float, optional, default=0.25

Absolute excess kurtosis threshold. If abs(kurtosis) <= threshold_kurtosis the distribution is considered not kurtotic (with respect to this threshold). Note: this function uses excess kurtosis (kurtosis - 3), so a normal distribution is approximately 0.

return_as : {'dataframe', 'dict'}, optional, default='dataframe'

Output format:

  • 'dataframe' — return a pandas.DataFrame with columns: ['is_normal', 'desc', 'skewness', 'kurtosis']. If input was a DataFrame or Mapping the index will reflect column names / mapping keys.

  • 'dict' — return a dict with keys 'is_normal', 'desc', 'skewness', 'kurtosis' and list-like values in the same order as the features.

Returns:
pandas.DataFrame or dict

Either a DataFrame (if return_as='dataframe') or a dict (if return_as='dict') containing the following entries per feature:

  • is_normal (int) - 1 if both \(|\gamma_1|\) and \(|\gamma_2|\) are within thresholds.

  • desc (str) - human-friendly description, one of: 'normal', 'left-skewed', 'right-skewed', 'low-pitched' (platykurtic) and/or 'high-pitched' (leptokurtic). Multiple descriptors are joined by a comma (e.g. 'right-skewed, high-pitched').

  • skewness \(\gamma_1\) (float) - skewness (third standardized moment).

  • kurtosis \(\gamma_2\) (float) - excess kurtosis (fourth standardized moment minus 3).
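The mapping from the two coefficients to is_normal and desc can be sketched as below. The function name `describe_shape` is hypothetical; the sign conventions (positive skewness is right-skewed, positive excess kurtosis is high-pitched) are the standard ones and are assumed to match the library.

```python
def describe_shape(skewness, kurtosis,
                   threshold_skewness=0.25, threshold_kurtosis=0.25):
    """Hypothetical helper: classify a distribution's shape.

    Returns (is_normal, desc). Threshold checks are inclusive, so a
    coefficient exactly equal to its threshold counts as within.
    """
    labels = []
    if abs(skewness) > threshold_skewness:
        labels.append("right-skewed" if skewness > 0 else "left-skewed")
    if abs(kurtosis) > threshold_kurtosis:
        labels.append("high-pitched" if kurtosis > 0 else "low-pitched")
    is_normal = int(not labels)
    return is_normal, ", ".join(labels) or "normal"


print(describe_shape(0.1168, 0.0662, threshold_skewness=0.3))
# (1, 'normal')
print(describe_shape(1.9808, 5.3794, threshold_skewness=0.3))
# (0, 'right-skewed, high-pitched')
```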

Other Parameters:
method_skewness : {"general", "sample"}, default="general"

Method used to compute skewness. Passed to data_quality.get_skewness; see that function for full details.

method_kurtosis : {"general", "sample"}, default="general"

Method used to compute kurtosis. Passed to data_quality.get_kurtosis; see that function for full details.

Raises:
ValueError

If return_as is not in {'dataframe', 'dict'}.

See also

explorica.data_quality.outliers.stats.get_skewness

The underlying computation function.

explorica.data_quality.outliers.stats.get_kurtosis

The underlying computation function.

Notes

  • The function expects numeric, one-dimensional sequences for each distribution. If mapping values are heterogeneous (different lengths / non-sequences) the behavior may be unexpected — prefer passing a DataFrame or a well-formed Mapping.

  • Threshold checks are inclusive: equality to threshold counts as within.

  • For programmatic consumption prefer return_as='dataframe' (tabular form). The dict form returns lists of values aligned to the feature order (not a transposed mapping of feature -> single-structure per feature).

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from explorica.data_quality.outliers import describe_distributions
>>> # Simple usage
>>> np.random.seed(42) # Set seed for reproducibility
>>> df = pd.DataFrame({
...     "x": np.random.normal(size=1000),
...     "y": np.random.exponential(size=1000)
... })
>>> result = describe_distributions(df, threshold_skewness=0.3)
>>> np.round(result, 4)
   skewness  kurtosis  is_normal                        desc
x    0.1168    0.0662          1                      normal
y    1.9808    5.3794          0  right-skewed, high-pitched
>>> result = describe_distributions(df, return_as='dict')
>>> list(result.keys())
['skewness', 'kurtosis', 'is_normal', 'desc']
explorica.data_quality.outliers.stats.get_kurtosis(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[float]], method: str = 'general') -> float

Compute the excess kurtosis of a numeric sequence.

Computed as:

\[\gamma_2 = \frac{m_4}{\sigma^4} - 3\]

Where \(m_4\) is:

\[m_4 = \frac{\sum{(x_i - \overline{x})^4}}{n}\]
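For a single 1D sequence, the population ("general") formula can be written directly in plain Python. This is a sketch of the formula only; the function name `excess_kurtosis` and the exact low-variance cutoff are assumptions for illustration.

```python
def excess_kurtosis(values):
    """Hypothetical helper: population excess kurtosis, m4 / sigma^4 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # sigma^2
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 < 1e-8:
        # Near-constant data: the ratio is undefined, so return NaN.
        return float("nan")
    return m4 / m2 ** 2 - 3


data_series = [2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 12]
print(round(excess_kurtosis(data_series), 4))  # -0.4778
```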
Parameters:
data : Sequence | Mapping[str, Sequence[Number]]

Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.

method : {"general", "sample"}, default "general"

Method to compute excess kurtosis:

  • “general”: population excess kurtosis, computed as

    \(\frac{m_4}{\sigma^4} - 3\)

  • “sample”: biased sample excess kurtosis,

    computed as \(\frac{m_4}{(S^2 * \frac{n}{n-1})^2} - 3\)

Note that this function does not yet implement the unbiased Fisher correction for sample kurtosis.

Returns:
pd.Series | float

Excess kurtosis value of the input data. 0.0 for normal distribution, positive values indicate heavier tails, negative values indicate lighter tails. If the sample variance is close to zero, the excess kurtosis value will be replaced by np.nan.

Raises:
ValueError

If input contains NaNs. If provided method is not supported.

Warns:
UserWarning

If any features have variance < 1e-8.

Examples

>>> import numpy as np
>>> from explorica.data_quality.outliers import get_kurtosis
>>> # Simple usage
>>> data_series = [2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 12]
>>> result = get_kurtosis(data_series)
>>> # Round coefficients for doctests reproducibility
>>> np.round(result, 4)
np.float64(-0.4778)
explorica.data_quality.outliers.stats.get_skewness(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[float]], method: str = 'general') -> float | Series

Compute the skewness of a numeric sequence.

Computed as:

\[\gamma_1 = \frac{m_3}{\sigma^3}\]

Where \(m_3\) is:

\[m_3 = \frac{\sum{(x_i - \overline{x})^3}}{n}\]
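A plain-Python sketch of the "general" method for one 1D sequence, taking skewness as the third central moment divided by the cubed population standard deviation. The function name `skewness` and the NaN return for near-constant data are assumptions for illustration, not the library's behavior.

```python
def skewness(values):
    """Hypothetical helper: population skewness, m3 / sigma^3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # sigma^2
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 < 1e-8:
        # Assumption: near-zero variance is reported as NaN here.
        return float("nan")
    return m3 / m2 ** 1.5


print(round(skewness([1, 4, 8, 16, 32]), 4))  # 0.8447
print(skewness([1, 2, 3, 4, 5]))              # 0.0
```

Both values agree with the x and y columns of the module-level example.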
Parameters:
data : Sequence | Mapping[str, Sequence[Number]]

Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.

method : {"general", "sample"}, default "general"

Method to compute skewness:

  • “general”: standard formula \(\gamma_1 = \frac{m_3}{\sigma^3}\)

  • “sample”: corrected for sample size, \(\gamma_1 = \frac{m_3}{(S^2*\frac{n}{n-1})^{3/2}}\)

Returns:
float or pd.Series

Skewness of input data. Returns a single float if input is 1D or a Series of skewness values (one per column) if input is 2D or a mapping.

Raises:
ValueError

If input contains NaNs. If provided method is not supported.

Warns:
UserWarning

If any features have variance < 1e-8.

Notes

For numerical stability, variance close to zero is treated as zero.

Examples

>>> from explorica.data_quality.outliers import get_skewness
>>> # Simple usage
>>> print(get_skewness({"a": [1,2,3], "b": [2,3,4]}, method="sample"))
a    0.0
b    0.0
dtype: float64