explorica.data_quality.outliers
explorica.data_quality.outliers.detection
Module for detecting outliers in numeric data.
This module contains the DetectionMethods class, which provides a collection of outlier detection methods that can be applied to pandas Series or numeric arrays.
Functions
detect_iqr(data, iqr_factor=1.5, get_boxplot=False, nan_policy="drop", boxplot_kws=None)
Detects outliers in a numerical series using the Interquartile Range (IQR) method.
detect_zscore(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], threshold: Optional[float] = 2.5)
Detects outliers in a numerical series using the Z-score method.
Examples
>>> import pandas as pd
>>> from explorica.data_quality.outliers import detect_iqr, detect_zscore
>>> df = pd.DataFrame({"x": [1,2,3,4,100]})
>>> # Detect IQR outliers
>>> detect_iqr(df)
4 100.0
Name: x, dtype: float64
>>> # Detect Z-score outliers
>>> detect_zscore(df, threshold=1.5)
4 100.0
Name: x, dtype: float64
- explorica.data_quality.outliers.detection.detect_iqr(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], iqr_factor: float = 1.5, get_boxplot: bool | None = False, nan_policy: str = 'drop', boxplot_kws: dict = None) → Series | DataFrame | tuple[Series | DataFrame, VisualizationResult]
Detect outliers in a numerical series using the Interquartile Range method.
This method identifies values that are significantly lower or higher than the typical range of the data. For 1D input, returns a Series of outliers; for 2D input, returns a DataFrame where non-outlier positions are NaN. Optionally, a boxplot visualization can be generated for the first column to visually inspect outliers.
- Parameters:
- data : Sequence[float] | Sequence[Sequence[float]]
A numeric sequence (1D) or a sequence of sequences (2D) to analyze. Will be converted to a pandas DataFrame internally. Each inner sequence is treated as a separate column.
- iqr_factor : float, default 1.5
Multiplier for the Interquartile Range used to define outlier bounds.
- get_boxplot : bool, optional
If True, returns a tuple (outliers, boxplot_figure) where boxplot_figure is a VisualizationResult for the first column only.
- nan_policy : {"drop", "raise"}, default="drop"
How to handle NaN values. With "raise", a ValueError is raised if any are present.
- boxplot_kws : dict, optional
Additional keyword arguments passed to explorica.visualizations.boxplot (e.g., color, figsize, title). Only applied if get_boxplot=True.
- Returns:
- pd.Series or pd.DataFrame or tuple[pd.Series | pd.DataFrame, VisualizationResult]
Single column input: returns a pandas.Series with outlier values at original indices.
Multi-column input: returns a pandas.DataFrame with outlier values and NaN elsewhere.
If get_boxplot=True, returns a tuple with outliers and the boxplot figure.
- Raises:
- ValueError
If nan_policy="raise" and missing values (NaN/null) are found in the data.
If iqr_factor is negative.
- Warns:
- UserWarning
If any features have constant or nearly constant values, as outliers cannot exist in such series.
Notes
An outlier is defined as a value below Q1 - iqr_factor * IQR or above Q3 + iqr_factor * IQR.
For 2D inputs, each column is processed independently.
The boxplot is always generated only for the first column if get_boxplot=True.
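The fence rule in these notes can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name `iqr_outliers` is hypothetical, and the use of pandas' default (linear) quantile interpolation is an assumption.

```python
import pandas as pd


def iqr_outliers(values, iqr_factor=1.5):
    """Return the values lying outside the IQR fences (hypothetical helper)."""
    s = pd.Series(values, dtype="float64")
    # Assumption: quartiles use pandas' default linear interpolation.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - iqr_factor * iqr
    upper = q3 + iqr_factor * iqr
    # Keep only the values strictly below the lower or above the upper fence.
    return s[(s < lower) | (s > upper)]


outliers = iqr_outliers([1, 2, 2, 3, 13, 1, 100, 90])
print(outliers)
```

For this input the quartiles are 1.75 and 32.25, so the upper fence is 78 and only the values 100 and 90 (indices 6 and 7) are returned.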
Examples
>>> import pandas as pd
>>> from explorica.data_quality.outliers import detect_iqr
>>> s = pd.Series([1, 2, 2, 3, 13, 1, 100, 90])
>>> outliers = detect_iqr(s, iqr_factor=1.5)
>>> outliers
6    100.0
7     90.0
Name: 0, dtype: float64
>>> # Several columns DataFrame
>>> df = pd.DataFrame({"A": [1, 2, 3, 50], "B": [5, 6, 7, 8]})
>>> outliers_df = detect_iqr(df)
>>> outliers_df
      A   B
3  50.0 NaN
>>> # With boxplot and custom styling
>>> outliers, plot_result = detect_iqr(s, get_boxplot=True,
...     boxplot_kws={"style": "whitegrid", "figsize": (8, 4)})
>>> plot_result.figure.show()
- explorica.data_quality.outliers.detection.detect_zscore(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], threshold: float | None = 2.5) → Series | DataFrame
Detect outliers in a numerical series using the Z-score method.
The Z-score method identifies outliers based on their standardized distance from the mean:
\[Z = \frac{x - \overline{x}}{\sigma}\]
where \(\overline{x}\) is the mean and \(\sigma\) is the standard deviation.
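A minimal sketch of the rule above with plain pandas. The helper name `zscore_outliers` is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption; the documentation does not state which estimator the library uses.

```python
import pandas as pd


def zscore_outliers(values, threshold=2.5):
    """Return values whose absolute Z-score exceeds threshold (hypothetical helper)."""
    s = pd.Series(values, dtype="float64")
    # Assumption: sigma is the population standard deviation (ddof=0).
    z = (s - s.mean()) / s.std(ddof=0)
    return s[z.abs() > threshold]


outliers = zscore_outliers([1, 2, 2, 3, 13, 1, 1000, 2, -1000], threshold=1)
print(outliers)
```

Here the two extreme values have absolute Z-scores of roughly 2.1, while every other value stays well below 1, so only indices 6 and 8 are returned.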
- Parameters:
- data : Sequence[float] or Sequence[Sequence[float]]
Input numeric data. Can be a 1D sequence or a 2D structure convertible to a pandas DataFrame.
- threshold : float, default=2.5
Z-score threshold for identifying outliers. Values with an absolute Z-score greater than this threshold are considered outliers.
- Returns:
- pd.Series or pd.DataFrame
If the input contains a single feature, returns a Series of outlier values. If multiple features are provided, returns a DataFrame with NaN for non-outlier positions.
- Raises:
- ValueError
If threshold is not positive.
If the input contains NaN values.
- Warns:
- UserWarning
If any features have constant or nearly constant values, as outliers cannot exist in such series.
Examples
>>> import pandas as pd
>>> from explorica.data_quality.outliers import detect_zscore
>>> # Simple usage
>>> s = pd.Series([1, 2, 2, 3, 13, 1, 1000, 2, -1000])
>>> outliers = detect_zscore(s, threshold=1)
>>> outliers
6    1000.0
8   -1000.0
Name: 0, dtype: float64
>>> # Returns a Series with outlier values and their original indices
explorica.data_quality.outliers.handling
Module for handling outliers in numerical datasets.
This module defines the HandlingMethods class, which provides utility methods to detect, remove, and replace outliers in number sequences using common statistical techniques such as the Interquartile Range (IQR) and Z-score methods.
Functions
replace_outliers(data, detection_method="iqr", strategy="median", recursive=False, **kwargs)
Replaces outliers in sequences or mappings according to the specified detection method and replacement strategy.
remove_outliers(data, subset=None, detection_method="iqr", recursive=False, **kwargs)
Remove outliers from a given sequence of numerical data.
Examples
>>> import pandas as pd
>>> from explorica.data_quality.outliers import remove_outliers
>>> df = pd.DataFrame([2, 1, 5, 4, 4, 3, 500, 9, 2, 10])
>>> cleaned = remove_outliers(df, detection_method="iqr")
>>> cleaned
0 2
1 1
2 5
3 4
4 4
5 3
7 9
8 2
9 10
Name: 0, dtype: int64
- explorica.data_quality.outliers.handling.remove_outliers(data: Sequence[float] | Sequence[Sequence[float]], subset: Sequence[str] | None = None, detection_method: str | None = 'iqr', recursive: bool | None = False, **kwargs) → Series | DataFrame
Remove outliers from a given sequence of numerical data.
This method supports two outlier detection techniques:
IQR (Interquartile Range)
Z-score
Outliers can be removed in three modes:
Single removal (default)
Iterative removal (iters > 0)
Recursive removal until no outliers remain (recursive=True)
- Parameters:
- data : Sequence[float] or Sequence[Sequence[float]]
Input data from which outliers should be removed. Can be a list, NumPy array, pandas Series, DataFrame, etc.
- subset : Sequence[str], default None
Features subset by column names. If specified, i_subset is ignored.
- i_subset : Sequence[int], default None
Features subset by column positions (like iloc). Used only if subset is None.
- detection_method : str, default 'iqr'
Method used for outlier detection. Supported methods are:
'iqr' : Interquartile Range method
'zscore' : Z-score method
- recursive : bool, default False
If True, removes outliers repeatedly until no outliers remain. Ignored if iters is specified.
- iters : int, optional
Number of iterations to remove outliers. Must be a positive integer. If specified, recursive is ignored.
- remove_mode : {'any', 'all'}, default 'any'
Defines how to treat multi-column outliers:
'any': remove a row if any feature in subset is an outlier
'all': remove a row only if all features in subset are outliers
- zscore_threshold : float, default 2.0
Threshold in units of standard deviations for Z-score detection. Z-values beyond this threshold are considered outliers. Has effect only if detection_method='zscore'. If set, it overrides the "threshold" key in zscore_kws.
- zscore_kws : dict, default {"threshold": 2.0}
Dictionary of additional parameters to pass to Outliers.detect_zscore. Can be used to customize detection behavior. Has effect only if detection_method='zscore'.
- Returns:
- pd.Series or pd.DataFrame
Cleaned data with outliers removed. Returns a Series if the input has a single column, otherwise returns a DataFrame.
- Raises:
- ValueError
If input data contains NaN values.
If the provided detection_method or remove_mode is not supported.
If iters is not a positive integer.
Examples
>>> import pandas as pd
>>> from explorica.data_quality.outliers import remove_outliers
>>> # Simple usage
>>> table = pd.DataFrame(
...     {
...         "feature1": [1, 2, 3, 4, 5, 10],
...         "feature2": [2, 3, 4, 5, 1000, 1],
...         "feature3": [0, 10003, 10004, 10005, 10006, 10008],
...     },
... )
>>> table = remove_outliers(
...     table, detection_method="iqr", remove_mode="any"
... )
>>> table
   feature1  feature2  feature3
1         2         3     10003
2         3         4     10004
3         4         5     10005
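The difference between remove_mode='any' and remove_mode='all' can be illustrated with a per-cell outlier mask. This is a sketch under assumptions, not the library's code: `iqr_mask` is a hypothetical helper and pandas' default quantile interpolation is assumed.

```python
import pandas as pd


def iqr_mask(df, iqr_factor=1.5):
    """Boolean frame: True where a cell lies outside its column's IQR fences."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons between a DataFrame and a Series align on column names.
    return (df < q1 - iqr_factor * iqr) | (df > q3 + iqr_factor * iqr)


table = pd.DataFrame(
    {
        "feature1": [1, 2, 3, 4, 5, 10],
        "feature2": [2, 3, 4, 5, 1000, 1],
        "feature3": [0, 10003, 10004, 10005, 10006, 10008],
    }
)
mask = iqr_mask(table)

# 'any': drop a row as soon as one feature is flagged.
kept_any = table[~mask.any(axis=1)]
# 'all': drop a row only when every feature is flagged.
kept_all = table[~mask.all(axis=1)]

print(kept_any.index.tolist())
print(len(kept_all))
```

Rows 0, 4 and 5 each contain exactly one flagged cell, so 'any' keeps rows 1 to 3, while 'all' keeps all six rows because no row is an outlier in every feature.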
>>> # Recursive drop method usage
>>> data_series = [1, 2, 3, 4, 5, 6, 11, 20]
>>> result = remove_outliers(
...     data_series, detection_method="iqr", recursive=True
... )
>>> result
0    1
1    2
2    3
3    4
4    5
5    6
Name: 0, dtype: int64
>>> # In this case, '11' is only classified as an outlier after '20' is removed.
>>> # This is equivalent to calling:
>>> # remove_outliers(remove_outliers(data_series))
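Why a second pass finds a new outlier can be seen in a small standalone loop. The sketch below re-implements the idea with plain pandas; both helper names are hypothetical and the quantile interpolation is assumed to be pandas' default.

```python
import pandas as pd


def drop_iqr_outliers_once(s, iqr_factor=1.5):
    """Single pass: keep only values inside the IQR fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - iqr_factor * iqr) & (s <= q3 + iqr_factor * iqr)]


def drop_iqr_outliers_until_stable(values, iqr_factor=1.5):
    """Repeat single passes until a pass removes nothing."""
    s = pd.Series(values)
    passes = 0
    while True:
        cleaned = drop_iqr_outliers_once(s, iqr_factor)
        if len(cleaned) == len(s):
            # Nothing was removed, so the series is stable.
            return cleaned, passes
        s, passes = cleaned, passes + 1


result, passes = drop_iqr_outliers_until_stable([1, 2, 3, 4, 5, 6, 11, 20])
print(result.tolist(), passes)
```

The first pass has an upper fence of 14 and removes only 20. Recomputing the quartiles on the shortened series lowers the fence to 10, so the second pass removes 11, and a third pass changes nothing.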
- explorica.data_quality.outliers.handling.replace_outliers(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]], detection_method: str | None = 'iqr', strategy: str | None = 'median', recursive: bool | None = False, **kwargs) → Series | DataFrame
Replace outliers in sequences or mappings.
Replaces outliers according to the specified detection method and replacement strategy.
- Parameters:
- data : Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[Any]]
Input data to process. Can be:
1D sequence -> returns pd.Series
2D sequence -> returns pd.DataFrame
Mapping of column names to sequences -> returns pd.DataFrame
- detection_method : str, default 'iqr'
Method to detect outliers. Supported options:
'iqr' : Interquartile Range method
'zscore' : Z-score method
- strategy : str, default 'median'
Method to replace detected outliers. Supported options:
'median' : replace with median of the column
'mean' : replace with mean of the column
'mode' : replace with mode of the column
'random' : replace with a random value sampled from the non-outlier values
'custom' : replace with a user-provided value (see custom_value)
- custom_value : scalar, optional, default None
Value to use when strategy='custom'. Must be provided in this case.
- random_state : int, optional, default None
Seed for random number generator used in 'random' replacement strategy. Ensures reproducible replacements.
- recursive : bool, default False
If True, replaces outliers repeatedly until no outliers remain. Ignored if iters is specified.
- iters : int, optional
Number of iterations to replace outliers. Must be a positive integer. If specified, recursive is ignored.
- subset : Sequence[str], default None
Features subset by column names. If specified, i_subset is ignored.
- i_subset : Sequence[int], default None
Features subset by column positions (like iloc). Used only if subset is None.
- zscore_threshold : float, default 2.0
Threshold in units of standard deviations for Z-score detection. Z-values beyond this threshold are considered outliers. Has effect only if detection_method='zscore'. If set, it overrides the "threshold" key in zscore_kws.
- iqr_factor : float, default 1.5
Multiplier for the Interquartile Range used to define outlier bounds. Has effect only if detection_method='iqr'. If set, it overrides the "iqr_factor" key in iqr_kws.
- zscore_kws : dict, optional
Additional keyword arguments passed to data_quality.detect_zscore. See Outliers.detect_zscore for full details.
- iqr_kws : dict, optional
Additional keyword arguments passed to data_quality.detect_iqr. See data_quality.detect_iqr for full details.
- Returns:
- pd.Series or pd.DataFrame
Object of same shape as input with outliers replaced.
Returns pd.Series if input is 1D or if the DataFrame has only one column.
Returns pd.DataFrame otherwise.
Replacement values respect original data types: integers are rounded automatically if replacement value is float.
- Raises:
- ValueError
If input data contains NaN values.
If the provided detection_method or strategy is not supported.
If iters is not a positive integer.
If strategy='custom' and custom_value is not provided.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from explorica.data_quality.outliers import replace_outliers
>>> data = pd.DataFrame({
...     "feature_1": [1.0, 2.4, 1.6, 12, 1.2, 501.1, 0.6],
...     "feature_2": [10, 11, 9, 12, 10, 11, 500]
... })
>>> result = replace_outliers(data, detection_method="iqr", strategy="mean")
>>> np.round(result, 4)
   feature_1  feature_2
0     1.0000         10
1     2.4000         11
2     1.6000          9
3    12.0000         12
4     1.2000         10
5     3.1333         11
6     0.6000         10
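The replacement value 3.1333 in the example above is consistent with taking the mean over the non-outlier values only. The sketch below reproduces that number for feature_1 with plain pandas; the helper name is hypothetical, and both the "mean of non-outliers" reading and the default quantile interpolation are assumptions rather than documented behavior.

```python
import pandas as pd


def replace_iqr_outliers_with_mean(s, iqr_factor=1.5):
    """Replace IQR outliers with the mean of the remaining values (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (s < q1 - iqr_factor * iqr) | (s > q3 + iqr_factor * iqr)
    # Assumption: the fill value is computed from the non-outlier values only.
    fill = s[~is_outlier].mean()
    return s.astype("float64").mask(is_outlier, fill)


feature_1 = pd.Series([1.0, 2.4, 1.6, 12, 1.2, 501.1, 0.6])
replaced = replace_iqr_outliers_with_mean(feature_1)
print(replaced.round(4).tolist())
```

Only 501.1 falls outside the fences (upper fence 16.35); the value 12 stays, and the remaining six values average to 18.8 / 6 ≈ 3.1333.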
explorica.data_quality.outliers.stats
Module for statistical metrics and distribution analysis.
This module defines tools for computing standardized statistical moments (skewness and excess kurtosis) and for describing the shape of numeric distributions.
Functions
get_skewness(data, method="general")
Compute the skewness (third standardized moment) of a numeric sequence.
get_kurtosis(data, method="general")
Compute the excess kurtosis (fourth standardized moment minus 3) of a numeric sequence.
describe_distributions(data, threshold_skewness=0.25, threshold_kurtosis=0.25, return_as="dataframe", **kwargs)
Describe shape (skewness / kurtosis) of one or multiple numeric distributions.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from explorica.data_quality.outliers import get_skewness
>>> df = pd.DataFrame({
... "x": [1, 2, 3, 4, 5],
... "y": [1, 4, 8, 16, 32]
... })
>>> skewness = get_skewness(df, method="general")
>>> np.round(skewness, 4)
x 0.0000
y 0.8447
dtype: float64
- explorica.data_quality.outliers.stats.describe_distributions(data: Sequence[Sequence[float]] | DataFrame | Mapping[str, Sequence[float]], threshold_skewness: float | None = 0.25, threshold_kurtosis: float | None = 0.25, return_as: str | None = 'dataframe', **kwargs) → DataFrame | dict
Describe shape (skewness / kurtosis) of one or multiple numeric distributions.
The function computes skewness and excess kurtosis for each 1-D sequence in data and classifies the distribution shape according to the provided absolute thresholds. Distributions whose absolute skewness and absolute excess kurtosis are both less than or equal to the corresponding thresholds are considered “normal”.
- Parameters:
- data : Sequence, Mapping[str, Sequence[Number]]
Input container with one or more numeric sequences (distributions). Supported forms:
2D sequence (e.g. list of lists, list/array of 1D arrays): each inner sequence represents one distribution;
pandas.DataFrame: each column is treated as a separate distribution;
Mapping (e.g. dict, OrderedDict): mapping keys are used as feature names and mapping values should be 1D numeric sequences.
In the Mapping and DataFrame cases the order of returned metrics follows the order of mapping keys or DataFrame columns respectively. For plain sequences the order follows the sequence order and the resulting DataFrame will use a RangeIndex.
- threshold_skewness : float, optional, default=0.25
Absolute skewness threshold. If abs(skewness) <= threshold_skewness the distribution is considered not skewed (with respect to this threshold).
- threshold_kurtosis : float, optional, default=0.25
Absolute excess kurtosis threshold. If abs(kurtosis) <= threshold_kurtosis the distribution is considered not kurtotic (with respect to this threshold). Note: this function uses excess kurtosis (kurtosis - 3), so a normal distribution has a value of approximately 0.
- return_as : {'dataframe', 'dict'}, optional, default='dataframe'
Output format:
'dataframe': return a pandas.DataFrame with columns ['is_normal', 'desc', 'skewness', 'kurtosis']. If input was a DataFrame or Mapping the index will reflect column names / mapping keys.
'dict': return a dict with keys 'is_normal', 'desc', 'skewness', 'kurtosis' and list-like values in the same order as the features.
- Returns:
- pandas.DataFrame or dict
Either a DataFrame (if return_as='dataframe') or a dict (if return_as='dict') containing the following entries per feature:
is_normal (int) - 1 if both \(|\gamma_1|\) and \(|\gamma_2|\) are within thresholds, otherwise 0.
desc (str) - human-friendly description, one of: 'normal', 'left-skewed', 'right-skewed', 'low-pitched' (platykurtic) and/or 'high-pitched' (leptokurtic). Multiple descriptors are joined by a comma (e.g. 'right-skewed, high-pitched').
skewness \(\gamma_1\) (float) - skewness (third standardized moment).
kurtosis \(\gamma_2\) (float) - excess kurtosis (fourth standardized moment minus 3).
- Other Parameters:
- method_skewness : {"general", "sample"}, default="general"
Method to compute skewness. Passed to data_quality.get_skewness; see that function for full details.
- method_kurtosis : {"general", "sample"}, default="general"
Method to compute kurtosis. Passed to data_quality.get_kurtosis; see that function for full details.
- Raises:
- ValueError
If return_as is not in {'dataframe', 'dict'}.
See also
explorica.data_quality.outliers.stats.get_skewness : The underlying computation function.
explorica.data_quality.outliers.stats.get_kurtosis : The underlying computation function.
Notes
The function expects numeric, one-dimensional sequences for each distribution. If mapping values are heterogeneous (different lengths / non-sequences) the behavior may be unexpected — prefer passing a DataFrame or a well-formed Mapping.
Threshold checks are inclusive: equality to threshold counts as within.
For programmatic consumption prefer return_as='dataframe' (tabular form). The dict form returns lists of values aligned to the feature order (not a transposed mapping of feature -> single-structure per feature).
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from explorica.data_quality.outliers import describe_distributions
>>> # Simple usage
>>> np.random.seed(42)  # Set seed for reproducibility
>>> df = pd.DataFrame({
...     "x": np.random.normal(size=1000),
...     "y": np.random.exponential(size=1000)
... })
>>> result = describe_distributions(df, threshold_skewness=0.3)
>>> np.round(result, 4)
   skewness  kurtosis  is_normal                        desc
x    0.1168    0.0662          1                      normal
y    1.9808    5.3794          0  right-skewed, high-pitched
>>> result = describe_distributions(df, return_as='dict')
>>> list(result.keys())
['skewness', 'kurtosis', 'is_normal', 'desc']
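The classification step described for this function can be sketched as a small pure-Python helper. The name `describe_shape` is hypothetical, and the sign convention (negative skewness is 'left-skewed', negative excess kurtosis is 'low-pitched') is the usual one and is assumed here rather than taken from the library source.

```python
def describe_shape(skewness, kurtosis,
                   threshold_skewness=0.25, threshold_kurtosis=0.25):
    """Return (is_normal, desc) for one distribution (hypothetical helper)."""
    labels = []
    # Threshold checks are inclusive: equality counts as "within".
    if skewness > threshold_skewness:
        labels.append("right-skewed")
    elif skewness < -threshold_skewness:
        labels.append("left-skewed")
    if kurtosis > threshold_kurtosis:
        labels.append("high-pitched")
    elif kurtosis < -threshold_kurtosis:
        labels.append("low-pitched")
    is_normal = int(not labels)
    return is_normal, ", ".join(labels) or "normal"


print(describe_shape(0.1168, 0.0662, threshold_skewness=0.3))
print(describe_shape(1.9808, 5.3794, threshold_skewness=0.3))
```

Fed with the skewness and kurtosis values from the example above, the helper yields `(1, 'normal')` for x and `(0, 'right-skewed, high-pitched')` for y.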
- explorica.data_quality.outliers.stats.get_kurtosis(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[float]], method: str = 'general') → float
Compute the excess kurtosis of a numeric sequence.
Computed as:
\[\gamma_2 = \frac{m_4}{\sigma^4} - 3\]
where \(m_4\) is:
\[m_4 = \frac{\sum{(x_i - \overline{x})^4}}{n}\]
- Parameters:
- data : Sequence | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- method : {"general", "sample"}, default "general"
Method to compute excess kurtosis:
"general": population excess kurtosis, computed as \(\frac{m_4}{\sigma^4} - 3\)
"sample": biased sample excess kurtosis, computed as \(\frac{m_4}{(S^2 \cdot \frac{n}{n-1})^2} - 3\)
Note that this function does not yet implement the unbiased Fisher correction for sample kurtosis.
- Returns:
- pd.Series | float
Excess kurtosis value of the input data. 0.0 for normal distribution, positive values indicate heavier tails, negative values indicate lighter tails. If the sample variance is close to zero, the excess kurtosis value will be replaced by np.nan.
- Raises:
- ValueError
If input contains NaNs. If provided method is not supported.
- Warns:
- UserWarning
If any features have variance < 1e-8.
Examples
>>> import numpy as np
>>> from explorica.data_quality.outliers import get_kurtosis
>>> # Simple usage
>>> data_series = [2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 12]
>>> result = get_kurtosis(data_series)
>>> # Round coefficients for doctests reproducibility
>>> np.round(result, 4)
np.float64(-0.4778)
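The "general" formula can be checked by hand with NumPy. The function below is a standalone sketch of the population formula \(m_4 / \sigma^4 - 3\); the name `excess_kurtosis` is hypothetical and it is not the library's implementation.

```python
import numpy as np


def excess_kurtosis(values):
    """Population excess kurtosis m4 / sigma^4 - 3 (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    variance = np.mean(centered ** 2)   # sigma^2, divided by n
    m4 = np.mean(centered ** 4)         # fourth central moment
    return m4 / variance ** 2 - 3.0


data_series = [2, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 12]
result = excess_kurtosis(data_series)
print(round(float(result), 4))
```

For this data the mean is 7.2, the variance is about 6.4267 and \(m_4\) is about 104.1739, which gives roughly -0.4778, matching the value in the example above.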
- explorica.data_quality.outliers.stats.get_skewness(data: Sequence[float] | Sequence[Sequence[float]] | Mapping[str, Sequence[float]], method: str = 'general') → float | Series
Compute the skewness of a numeric sequence.
Computed as:
\[\gamma_1 = \frac{m_3}{\sigma^3}\]
where \(m_3\) is:
\[m_3 = \frac{\sum{(x_i - \overline{x})^3}}{n}\]
- Parameters:
- data : Sequence | Mapping[str, Sequence[Number]]
Numeric input data. Can be 1D (sequence of numbers), 2D (sequence of sequences), or a mapping of column names to sequences.
- method : str, {"general", "sample"}, default "general"
Method to compute skewness:
"general": standard formula \(\gamma_1 = \frac{m_3}{\sigma^3}\)
"sample": corrected for sample size, \(\gamma_1 = \frac{m_3}{(S^2 \cdot \frac{n}{n-1})^{3/2}}\)
- Returns:
- float or pd.Series
Skewness of input data. Returns a single float if input is 1D or a Series of skewness values (one per column) if input is 2D or a mapping.
- Raises:
- ValueError
If input contains NaNs. If provided method is not supported.
- Warns:
- UserWarning
If any features have variance < 1e-8.
Notes
For numerical stability, variance close to zero is treated as zero.
Examples
>>> from explorica.data_quality.outliers import get_skewness
>>> # Simple usage
>>> print(get_skewness({"a": [1,2,3], "b": [2,3,4]}, method="sample"))
a    0.0
b    0.0
dtype: float64
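Both skewness variants can be written out directly with NumPy. This is a sketch, not the library's code: the function name `skewness` is hypothetical, and \(S^2\) in the "sample" formula is assumed to be the biased variance (divided by n), so that \(S^2 \cdot \frac{n}{n-1}\) is the usual sample variance.

```python
import numpy as np


def skewness(values, method="general"):
    """Skewness m3 / variance^(3/2) for a 1D sequence (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    centered = x - x.mean()
    m3 = np.mean(centered ** 3)         # third central moment
    variance = np.mean(centered ** 2)   # assumption: S^2 is the biased variance
    if method == "sample":
        variance *= n / (n - 1)         # rescale to the sample variance
    elif method != "general":
        raise ValueError(f"unsupported method: {method!r}")
    return m3 / variance ** 1.5


y = [1, 4, 8, 16, 32]
general = skewness(y)
sample = skewness(y, method="sample")
print(round(float(general), 4), round(float(sample), 4))
```

With method="general" the result for y is about 0.8447, in line with the module-level example. Under the stated assumption the "sample" variant divides by a larger variance, so it returns a smaller magnitude for the same data, and symmetric inputs such as [1, 2, 3] give 0 with either method.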