explorica.interactions

explorica.interactions.aggregators

This module provides utilities for aggregating interactions between features in a dataset. It contains functions to identify and return significant feature pairs based on various correlation and association measures.

The main function, high_corr_pairs, evaluates feature-to-feature relationships using linear (Pearson, Spearman), non-linear (e.g. exponential, binomial, power-law), categorical (Cramér’s V), and hybrid (η²) measures. Users can optionally enable non-linear and multiple-correlation modes.

Functions

detect_multicollinearity(numeric_features=None, category_features=None, method=”VIF”, return_as=”dataframe”, **kwargs)

Detect multicollinearity among features using either Variance Inflation Factor (VIF) or correlation-based methods.

high_corr_pairs(numeric_features=None, category_features=None, threshold=0.7, **kwargs)

Finds and returns all significant pairs of correlated features from the input datasets.

explorica.interactions.aggregators.detect_multicollinearity(numeric_features: Sequence[Sequence[Number]] = None, category_features: Sequence[Sequence] = None, method: str = 'VIF', return_as: str = 'dataframe', **kwargs) → dict | DataFrame

Detect multicollinearity using either VIF or correlation-based methods.

Multicollinearity occurs when features are highly correlated with each other, which can destabilize model coefficients and reduce interpretability. This function provides two approaches: VIF quantifies how much the variance of a regression coefficient is inflated due to collinearity with other features, while the correlation-based method offers a broader assessment covering numeric-numeric, numeric-categorical, and categorical-categorical feature pairs.

Parameters:
numeric_features : Sequence of sequences of numbers, optional

Numerical feature matrix or compatible structure (array-like or DataFrame). Required for method='VIF'. Used together with category_features when correlation-based method is selected.

category_features : Sequence of sequences, optional

Categorical feature matrix or compatible structure (array-like or DataFrame). Only used with method='corr'. Not evaluated under VIF.

method : {“VIF”, “corr”}, default=”VIF”

Method to detect multicollinearity:

  • “VIF” : Compute Variance Inflation Factor for numerical features.

  • “corr” : Detect multicollinearity based on the highest pairwise absolute correlation between features (numeric–numeric, numeric–categorical, categorical–categorical). Supported correlation metrics include: sqrt_eta_squared, cramer_v, pearson, spearman.

return_as : {“dataframe”, “dict”}, default=”dataframe”

Output format of the result:

  • “dataframe” : Pandas DataFrame with features as index and metrics as columns.

  • “dict” : Nested dictionary of the form {metric: {feature: value, ...}, ...}.

variance_inflation_threshold : float, default=10

Threshold above which a feature is considered collinear in VIF method.

correlation_threshold : float, default=0.95

Threshold for the highest absolute correlation of a feature with any other feature. If this value is exceeded, the feature is considered collinear.

Returns:
dict or pd.DataFrame

Multicollinearity assessment, depending on return_as:

  • If “dataframe”: DataFrame with columns for metrics (e.g., “VIF”, “multicollinearity”) and rows corresponding to features.

  • If “dict”: Mapping of metrics to per-feature values.

Raises:
ValueError

If all inputs are empty. If lengths of numeric_features and category_features do not match. If any input array contains NaN values. If method or return_as is not one of the supported values.

Notes

  • VIF can be infinite if the dataset contains functionally dependent features.

  • Categorical features are not evaluated under VIF.

Examples

>>> import pandas as pd
>>> from explorica.interactions.aggregators import detect_multicollinearity
>>> # Simple usage
>>> X_num = pd.DataFrame({"x1": [1, 2, 3], "x2": [2, 4, 6], "x3": [1, 0, 1]})
>>> detect_multicollinearity(X_num, method="VIF", return_as="dataframe")
    VIF  multicollinearity
x1  inf                1.0
x2  inf                1.0
x3  1.0                0.0
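The VIF criterion itself can be reproduced with plain numpy for intuition (a standard formulation sketched here for illustration, not necessarily explorica's exact implementation; the data are hypothetical). The VIF of feature i equals the i-th diagonal element of the inverse of the feature correlation matrix:

```python
import numpy as np
import pandas as pd

# Hypothetical data: x1 and x2 are nearly collinear, x3 is unrelated.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],
    "x3": [1.0, 0.0, 1.0, 0.0, 1.0],
})

# VIF_i = 1 / (1 - R_i^2), which equals the i-th diagonal element
# of the inverse of the feature correlation matrix.
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)
print(vif)  # x1 and x2 get large VIFs, x3 stays near 1
```

With the default threshold of 10, x1 and x2 would be flagged as collinear here.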
explorica.interactions.aggregators.high_corr_pairs(numeric_features: Sequence[Sequence[Number]] = None, category_features: Sequence[Sequence] = None, threshold: float = 0.7, **kwargs) → DataFrame | None

Find and return all significant pairs of correlated features from the dataset.

This method evaluates feature-to-feature relationships using a set of correlation measures, including linear (Pearson, Spearman), non-linear (e.g. exponential, binomial, power-law), categorical (Cramér’s V), and hybrid (η²). Users can optionally enable non-linear and multiple-correlation modes.

Parameters:
numeric_features : pd.DataFrame, optional

A DataFrame of numerical features. Required for linear, η², non-linear, and multiple correlation.

category_features : pd.DataFrame, optional

A DataFrame of categorical features. Required for Cramér’s V and η² computations.

y : str, optional

Target feature name to compute correlations with. If None, all pairwise comparisons are evaluated.

nonlinear_included : bool, default=False

Whether to include non-linear correlation measures for numeric features.

multiple_included : bool, default=False

Whether to include multiple correlation analysis (for numeric features only).

threshold : float, default=0.7

Minimum absolute value of correlation to consider a pair as significantly dependent.

Returns:
pd.DataFrame or None

A DataFrame with columns [‘X’, ‘Y’, ‘coef’, ‘method’], listing feature pairs whose correlation (in absolute value) exceeds the threshold. Returns None if no such pairs found.

Raises:
ValueError

If neither input DataFrame is provided. If numeric_features or category_features are of unequal lengths. If numeric_features or category_features contain NaN values. If numeric_features or category_features contain duplicate column names.

Notes

  • Linear correlation methods: Pearson, Spearman

  • Non-linear methods (enabled via nonlinear_included): exp, binomial, ln, hyperbolic, power

  • Categorical methods: Cramér’s V, η² (eta)

  • The method skips self-comparisons.

  • Targeted correlation (y) will produce only pairs involving the specified target.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.aggregators import high_corr_pairs
>>> # Simple usage
>>> data = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5, 6],
...     "X2": [12, 10, 8, 6, 4, 2],
...     "X3": [9, 3, 5, 2, 6, 1],
...     "X4": [3, 2, 1, 3, 2, 1],
...     "target": [2, 3, 4, 6, 8, 10],
... })
>>> y = pd.Series([2, 3, 4, 6, 8, 10], name="y")
>>> result_df = high_corr_pairs(data, y="target", threshold=0.0)
>>> # Round coefficients for doctests reproducibility
>>> result_df["coef"] = np.round(result_df["coef"], 4)
>>> result_df
    X       Y    coef    method
0  X2  target -1.0000  spearman
1  X1  target  1.0000  spearman
2  X1  target  0.9885   pearson
3  X2  target -0.9885   pearson
4  X3  target -0.6000  spearman
5  X3  target -0.5731   pearson
6  X4  target -0.4781  spearman
7  X4  target -0.4353   pearson
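The pair-selection step can be approximated with plain pandas (a simplified sketch, not explorica's implementation; only Pearson is shown, and the data are hypothetical): compute a correlation matrix, unstack it into (X, Y, coef) rows, and keep off-diagonal pairs whose absolute coefficient meets the threshold.

```python
import pandas as pd

data = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [12, 10, 8, 6, 4, 2],
    "target": [2, 3, 4, 6, 8, 10],
})

threshold = 0.7
corr = data.corr(method="pearson")
# Unstack the matrix into (X, Y, coef) rows, then keep each unordered
# pair once (X < Y drops the diagonal and mirrored duplicates).
pairs = corr.stack().reset_index().set_axis(["X", "Y", "coef"], axis=1)
pairs = pairs[(pairs["X"] < pairs["Y"]) & (pairs["coef"].abs() >= threshold)]
print(pairs)
```

Here X2 is an exact linear function of X1, so the (X1, X2) pair appears with a coefficient of -1.0.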

explorica.interactions.correlation_matrices

This module provides tools for constructing various types of correlation and dependence matrices for both numerical and categorical features in a dataset. This module is intended to be used via the public facade InteractionAnalyzer but can also be used directly for advanced analyses.

The module supports linear correlations (Pearson, Spearman), categorical associations (Cramér’s V, η²), multiple-factor correlations, and correlation indices for non-linear dependencies.

Functions

corr_matrix(dataset, method=”pearson”, groups=None)

Compute a correlation or association matrix using the specified method.

corr_matrix_linear(dataset, method=”pearson”)

Compute a correlation matrix for numeric features using Pearson or Spearman correlation.

corr_matrix_cramer_v(dataset, bias_correction=True)

Compute the Cramér’s V correlation matrix for categorical variables.

corr_matrix_eta(dataset, categories)

Compute a correlation matrix between numerical features and categorical features using the square root of eta-squared (η²).

corr_vector_multiple(x, y)

Calculates multiple correlation coefficients between the target variable y and all possible combinations of 2 or more features from x.

corr_matrix_multiple(dataset)

Compute a multi-factor correlation matrix for all target features in a dataset.

corr_matrix_corr_index(dataset, method=”linear”)

Compute a correlation index matrix for all features in a dataset.

explorica.interactions.correlation_matrices.corr_matrix(dataset: Sequence[Sequence], method: str = 'pearson', groups: Sequence[Sequence] = None) → DataFrame

Compute a correlation or association matrix using the specified method.

This function supports both classical correlation coefficients and a set of nonlinear correlation indices based on specific functional dependencies. For nonlinear methods, the correlation is evaluated by fitting transformations (e.g., exponential, logarithmic) and computing the strength of the relationship between features accordingly.

Parameters:
dataset : Sequence[Sequence]

Sequence with numeric or categorical features. For ‘pearson’, ‘spearman’, ‘multiple’, and all nonlinear methods, all columns must be numeric. For ‘cramer_v’, all columns should be categorical. For ‘eta’, numeric features in dataset are compared to categorical features in groups.

method : str, optional

Method used to compute correlation or association:

  • ‘pearson’ : Pearson correlation (linear, continuous features).

  • ‘spearman’ : Spearman rank correlation (monotonic, non-parametric).

  • ‘cramer_v’ : Cramér’s V (categorical-categorical association).

  • ‘eta’ : Eta coefficient (numeric-categorical association, asymmetric).

  • ‘multiple’ : Multiple correlation coefficients for each numeric feature as a target and all remaining numeric features as predictors.

  • ‘exp’ : Nonlinear correlation index assuming exponential dependence.

  • ‘binomial’ : Nonlinear correlation index assuming binomial dependence.

  • ‘ln’ : Nonlinear correlation index assuming logarithmic dependence.

  • ‘hyperbolic’ : Nonlinear correlation index assuming hyperbolic dependence.

  • ‘power’ : Nonlinear correlation index assuming power-law dependence.

groups : Sequence[Sequence], optional

Sequence of categorical grouping variables required for the ‘eta’ method. Must have the same number of rows as dataset.

Returns:
pd.DataFrame
  • For ‘pearson’ and ‘spearman’: symmetric correlation matrix of shape (n_numeric_features, n_numeric_features).

  • For ‘cramer_v’: symmetric matrix of shape (n_features, n_features), representing categorical associations.

  • For ‘eta’: asymmetric matrix of shape (n_numeric_features, n_grouping_features), showing strength of association between numeric and categorical variables.

  • For nonlinear methods (‘exp’, ‘binomial’, ‘ln’, ‘hyperbolic’, ‘power’): asymmetric matrix of shape (n_numeric_features, n_numeric_features), showing the correlation index based on the specified nonlinear model.

Raises:
ValueError

If the specified method is not supported. If groups is required (for ‘eta’) but not provided or mismatched in length.

Warns:
UserWarning

If multicollinearity is detected in the dataset (for ‘multiple’), i.e. some features are linearly dependent.

Notes

  • Pearson, Spearman, and other numerical correlation methods internally select only features of numeric type (Number) from the provided DataFrame. Non-numeric columns (e.g., categorical strings or object types) are ignored.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix
>>> # Simple usage
>>> data = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5, 6],
...     "X2": [12, 10, 8, 6, 4, 2],
...     "X3": [9, 3, 5, 2, 6, 1],
...     "X4": [3, 2, 1, 3, 2, 1],
... })
>>> result_df = corr_matrix(data, method="spearman")
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000 -1.0000 -0.6000 -0.4781
X2 -1.0000  1.0000  0.6000  0.4781
X3 -0.6000  0.6000  1.0000  0.3586
X4 -0.4781  0.4781  0.3586  1.0000
explorica.interactions.correlation_matrices.corr_matrix_corr_index(dataset: Sequence[Sequence[Number]], method: str = 'linear') → DataFrame

Compute a correlation index matrix for all features in a dataset.

This method computes a pairwise correlation index (√R²) between features using non-linear regression-based methods. The supported methods are: ‘linear’, ‘exp’, ‘binomial’, ‘ln’, ‘hyperbolic’, and ‘power’.

Parameters:
dataset : Sequence[Sequence[Number]]

Input data, where each inner sequence represents a feature/column. Can be a pandas DataFrame, numpy array, dict or nested lists. Must not contain NaN values, and column names must be unique.

method : str, default=’linear’

Method used to compute the correlation index. Supported options are: {‘exp’, ‘binomial’, ‘ln’, ‘hyperbolic’, ‘power’, ‘linear’}.

Returns:
pd.DataFrame

DataFrame of size (n_features x n_features) containing the correlation index (\(\sqrt{R^2}\)) values for each pair of features.

Raises:
ValueError

If the dataset contains NaN values. If column names are duplicated. If the selected method is not supported.

See also

explorica.interactions.correlation_metrics.corr_index

The underlying computation function.

Notes

  • Any feature pair that violates the domain restrictions of the selected method will have NaN as a result.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_corr_index
>>>
>>> data = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5, 6],
...     "X2": [2, 4, 6, 8, 10, 12],
...     "X3": [1, 4, 9, 16, 25, 36],
...     "X4": [3, 2, 1, 3, 2, 1],
... })
>>> result_df = corr_matrix_corr_index(data, method="binomial")
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000  1.0000  1.0000  0.4781
X2  1.0000  1.0000  1.0000  0.4781
X3  0.9969  0.9969  1.0000  0.4964
X4  0.4781  0.4781  0.4696  1.0000
explorica.interactions.correlation_matrices.corr_matrix_cramer_v(dataset: Sequence[Sequence], bias_correction: bool = True) → DataFrame

Compute the Cramér’s V dependency matrix for categorical variables.

Useful for exploratory analysis of datasets with multiple categorical variables, providing a pairwise overview of their associations. Bias correction option is available.

Parameters:
dataset : Sequence of sequences or pandas.DataFrame

Input dataset containing categorical variables. If a non-DataFrame sequence is passed, it will be converted into a DataFrame internally.

bias_correction : bool, default=True

Whether to apply bias correction in the calculation of Cramér’s V.

Returns:
pandas.DataFrame

A square symmetric correlation matrix where each entry (i, j) represents the Cramér’s V correlation between columns i and j. The diagonal entries are equal to 1.

Raises:
ValueError

If ‘dataset’ contains NaN values.

See also

explorica.interactions.correlation_metrics.cramer_v

The underlying computation function.

Notes

  • Cramér’s V is a measure of association between two nominal (categorical) variables, ranging from 0 (no association) to 1 (perfect association).

  • This implementation ensures the matrix is symmetric and always has ones on the diagonal.

  • The underlying cramer_v function may currently produce biased results in some cases due to known issues with bias correction.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_cramer_v
>>> # Simple usage
>>> groups_table = pd.DataFrame({
...     "Group_A": ["A", "A", "A", "B", "B", "B"],
...     "Group_B": [1, 2, 3, 1, 2, 3],
...     "Group_C": ["C", "C", "C", "D", "D", "D"],
... })
>>> result_df = corr_matrix_cramer_v(groups_table, bias_correction=False)
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
         Group_A  Group_B  Group_C
Group_A      1.0      0.0      1.0
Group_B      0.0      1.0      0.0
Group_C      1.0      0.0      1.0
explorica.interactions.correlation_matrices.corr_matrix_eta(dataset: Sequence[Sequence[Number]], categories: Sequence[Sequence]) → DataFrame

Compute a dependency matrix based on square root of eta-squared (η²).

This function measures the strength of association between continuous (numeric) variables and categorical variables. The result is a matrix of eta coefficients, where rows correspond to numeric features and columns correspond to categorical features.

Parameters:
dataset : Sequence[Sequence[Number]] or pandas.DataFrame

The input dataset containing numeric features. Will be converted to a DataFrame if not already.

categories : Sequence[Sequence] or pandas.DataFrame

The categorical grouping variables. Will be converted to a DataFrame if not already.

Returns:
pandas.DataFrame

A matrix of shape (n_numeric_features, n_categorical_features), where each entry is the eta coefficient (sqrt(η²)) between the corresponding numeric and categorical variable.

Raises:
ValueError

If ‘dataset’ or ‘categories’ contains NaN values. If input sequence lengths mismatch.

See also

explorica.interactions.correlation_metrics.eta_squared

The underlying computation function.

Notes

  • Eta coefficient values are in [0, 1], with higher values indicating stronger association.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_eta
>>> # Simple usage
>>> data = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1, 8],
...     "X2": [0, 0, 0, 1, 1, 1],
...     "X3": [7, 4, 2, 5, 1, 1],
...     "X4": [3, 2, 1, 3, 2, 1],
... })
>>> groups_table = pd.DataFrame({
...     "Group_A": ["A", "A", "A", "B", "B", "B"],
...     "Group_B": [1, 2, 3, 1, 2, 3],
...     "Group_C": ["C", "C", "C", "D", "D", "D"],
... })
>>> result_df = corr_matrix_eta(data, groups_table)
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
    Group_A  Group_B  Group_C
X1   0.3873   0.7246   0.3873
X2   1.0000   0.0000   1.0000
X3   0.4523   0.8726   0.4523
X4   0.0000   1.0000   0.0000
explorica.interactions.correlation_matrices.corr_matrix_linear(dataset: Sequence[Sequence[Number]], method: str = 'pearson') → DataFrame

Compute a correlation matrix for numeric features.

Computes using Pearson or Spearman correlation.

Parameters:
dataset : Sequence[Sequence[Number]]

Input dataset. Must be convertible to a pandas DataFrame with numeric columns.

method : str, default=”pearson”

Correlation method to use. Supported values: “pearson”, “spearman”.

Returns:
pd.DataFrame

Correlation matrix of numeric features. Rows and columns correspond to feature names.

Raises:
ValueError

If method is not in {“pearson”, “spearman”}.

Notes

  • Only numeric columns are considered; non-numeric columns are ignored.

  • The dataset is automatically converted to a pandas DataFrame if it isn’t one.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_linear
>>> # Pearson method usage
>>>
>>> X = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1],
...     "X2": [2, 3, 4, 1, 9],
...     "X3": [7, 4, 2, 5, 1],
...     "X4": [1, 2, 3, 4, 5],
... })
>>> result_df = corr_matrix_linear(X, method="pearson")
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000 -0.5210 -0.0367  0.2080
X2 -0.5210  1.0000 -0.8136  0.6092
X3 -0.0367 -0.8136  1.0000 -0.7285
X4  0.2080  0.6092 -0.7285  1.0000
>>> # Spearman method usage
>>> result_df = corr_matrix_linear(X, method="spearman")
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000 -0.4617  0.1026  0.2052
X2 -0.4617  1.0000 -0.9000  0.4000
X3  0.1026 -0.9000  1.0000 -0.7000
X4  0.2052  0.4000 -0.7000  1.0000
explorica.interactions.correlation_matrices.corr_matrix_multiple(dataset: Sequence[Sequence[Number]]) → DataFrame

Compute a multi-factor correlation matrix for all target features in a dataset.

For each column in the dataset, this method treats it as a target variable and computes correlations between that target and all remaining features (predictors) using corr_vector_multiple. The results are concatenated into a single DataFrame with columns [‘corr_coef’, ‘feature_combination’, ‘target’].

Parameters:
dataset : Sequence[Sequence[Number]]

Input data, where each inner sequence represents a feature/column. Can be a pandas DataFrame, numpy array, or nested lists. Note: The dataset must contain at least 3 columns, since one column is treated as the target and at least two others are required as predictors.

Returns:
pd.DataFrame

DataFrame with the following columns:

  • corr_coef: correlation coefficient for the target-predictor combination.

  • feature_combination: tuple or list of predictor feature names.

  • target: the name of the current target feature.

Raises:
ValueError

If the input DataFrame contains duplicate column names. If the input DataFrame contains NaN values.

Warns:
UserWarning

If a set of predictors is linearly dependent (multicollinearity detected), a UserWarning is issued. This warning appears only once, regardless of how many targets trigger it.

See also

explorica.interactions.correlation_matrices.corr_vector_multiple

The underlying computation function.

explorica.interactions.correlation_metrics.corr_multiple

The underlying computation function.

Notes

  • The function handles datasets with any number of features >= 3.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_multiple
>>> # Simple usage
>>> X = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1],
...     "X2": [2, 3, 4, 1, 9],
...     "X3": [7, 4, 2, 5, 1],
...     "X4": [1, 2, 3, 4, 5],
... })
>>> result_df = corr_matrix_multiple(X)
>>> # Round coefficients for doctests reproducibility
>>> result_df["corr_coef"] = np.round(result_df["corr_coef"], 4)
>>> result_df = result_df.sort_values(by="corr_coef", ascending=False)
>>> result_df
    corr_coef feature_combination target
7      0.9985        (X1, X3, X4)     X2
11     0.9963        (X1, X2, X4)     X3
3      0.9958        (X2, X3, X4)     X1
4      0.9828            (X1, X3)     X2
15     0.9803        (X1, X2, X3)     X4
8      0.9763            (X1, X2)     X3
0      0.9482            (X2, X3)     X1
5      0.8998            (X1, X4)     X2
12     0.8660            (X1, X2)     X4
10     0.8650            (X2, X4)     X3
1      0.8428            (X2, X4)     X1
6      0.8140            (X3, X4)     X2
13     0.7507            (X1, X3)     X4
9      0.7379            (X1, X4)     X3
14     0.7290            (X2, X3)     X4
2      0.2671            (X3, X4)     X1
explorica.interactions.correlation_matrices.corr_vector_multiple(x: DataFrame, y: Series) → DataFrame

Compute multiple correlation between y and predictor combinations from x.

Calculates multiple correlation coefficients between the target variable y and all possible combinations of 2 or more features from x.

Parameters:
x : pd.DataFrame

Feature matrix (only numeric features should be used).

y : pd.Series

Target vector. The function computes correlation between this target and feature combinations.

Returns:
pd.DataFrame

A DataFrame with columns:

  • ‘corr_coef’: multiple correlation coefficient for a given combination

  • ‘feature_combination’: tuple of feature names used in the combination

  • ‘target’: name of the target variable

Raises:
ValueError
  • If x or y contains NaN values.

  • If the number of samples in x and y do not match.

Warns:
UserWarning

If the predictors in x are found to be multicollinear (i.e., the determinant of their correlation matrix is zero).

See also

explorica.interactions.correlation_metrics.corr_multiple

The underlying computation function.

Notes

  • The method renames columns as ‘X1’, ‘X2’, etc., to handle duplicate names.

  • The method computes all possible combinations of features of size 2 and larger.

  • This method can be computationally expensive for large datasets, as it evaluates all possible combinations of features.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_vector_multiple
>>>
>>>
>>> X = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1],
...     "X2": [2, 3, 4, 1, 9],
...     "X3": [7, 4, 2, 5, 1],
... })
>>> y = pd.Series([1, 2, 3, 4, 5], name="y")
>>> result_df = corr_vector_multiple(X, y)
>>> # Round coefficients for doctests reproducibility
>>> result_df["corr_coef"] = np.round(result_df["corr_coef"], 4)
>>> result_df
   corr_coef feature_combination target
0     0.8660            (X1, X2)      y
1     0.7507            (X1, X3)      y
2     0.7290            (X2, X3)      y
3     0.9803        (X1, X2, X3)      y
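The enumeration of predictor combinations can be sketched with itertools, using the determinant-based multiple correlation described under corr_multiple (an illustrative reimplementation, not explorica's code; it reproduces the documented coefficients above):

```python
from itertools import combinations

import numpy as np
import pandas as pd

X = pd.DataFrame({
    "X1": [1, 3, 5, 6, 1],
    "X2": [2, 3, 4, 1, 9],
    "X3": [7, 4, 2, 5, 1],
})
y = pd.Series([1, 2, 3, 4, 5], name="y")

def multiple_r(preds: pd.DataFrame, target: pd.Series) -> float:
    # R = sqrt(1 - |r| / |r_x|), where r is the correlation matrix of
    # predictors plus the target, and r_x that of the predictors alone.
    full = pd.concat([preds, target], axis=1).corr().to_numpy()
    r_x = preds.corr().to_numpy()
    det_x = np.linalg.det(r_x)
    if np.isclose(det_x, 0):
        return 0.0  # singular predictor matrix: multicollinearity
    return float(np.sqrt(max(0.0, 1 - np.linalg.det(full) / det_x)))

# All combinations of 2 or more predictors, as corr_vector_multiple does.
rows = [
    {"corr_coef": multiple_r(X[list(c)], y), "feature_combination": c, "target": y.name}
    for k in range(2, X.shape[1] + 1)
    for c in combinations(X.columns, k)
]
result = pd.DataFrame(rows)
print(result)
```

For this data the sketch yields 0.8660 for (X1, X2) and 0.9803 for (X1, X2, X3), matching the example output.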

explorica.interactions.correlation_metrics

Statistical correlation and association measures.

This module provides a collection of functions for computing various statistical dependency measures between variables.

Functions

cramer_v(x, y, bias_correction=True, yates_correction=False)

Calculates Cramér’s V statistic for measuring the association between two categorical variables.

eta_squared(values, categories)

Calculate the eta-squared (η²) statistic for categorical and numeric variables. η² (eta squared) is a measure of effect size used to quantify the proportion of variance in a numerical variable that can be attributed to differences between categories of a categorical variable.

corr_index(x, y, method=”linear”, custom_function=None, normalization_bounds=None)

Calculates a nonlinear correlation index between two series x and y, based on the proportion of variance explained by the fitted function.

corr_multiple(x, y)

Computes the multiple correlation coefficient (R) between a set of predictor variables (x) and a response variable (y), based on the determinant of their correlation matrices.

explorica.interactions.correlation_metrics.corr_index(x: Sequence[Number], y: Sequence[Number], method: str = 'linear', custom_function: Callable[[Number], Number] | None = None, normalization_bounds: Sequence[Number] = None) → float

Calculate a nonlinear correlation index.

Calculates a nonlinear correlation index between two series x and y, based on the proportion of variance explained by the fitted function.

The index is computed as:

\[R_I=\sqrt{1-\frac{Q_R}{Q}}= \sqrt{1-\frac{\sum{\left(y_i-y(x_i)\right)^2}} {\sum(y_i-\overline y)^2}}\]

where \(Q_R\) is the sum of squared errors between the true \(y_i\) and predicted \(y(x_i)\), and \(Q\) is the total sum of squares of \(y\) with respect to its mean.

This method is designed to handle non-linear dependencies and is robust to monotonic transformations. In degenerate cases (e.g., constant \(x\) or \(y\)), a correlation index of 0 is returned.

Important: If the model fit yields \(Q_R\) > \(Q\) (i.e., performs worse than a mean-based predictor), the index is treated as 0. This reflects the model’s failure to explain any meaningful variance.

Supported methods:

  • ‘linear’: \(y = ax + b\)

  • ‘binomial’: \(y = ax^2 + bx + c\)

  • ‘exp’: \(y = be^{ax}\)

  • ‘ln’: \(y = a\ln(x) + b\)

  • ‘hyperbolic’: \(y = \frac{a}{x} + b\)

  • ‘power’: \(y = ax^b\)

  • ‘custom’: \(y =\) your function

Parameters:
x : Sequence[Number]

The explanatory variable.

y : Sequence[Number]

The response variable.

method : str, default=’linear’

Regression model to use. See supported methods above.

custom_function : Callable, optional

A user-defined function that takes numeric values and returns numeric predictions. Required when method=’custom’.

normalization_bounds : Sequence[Number], optional

A pair of numeric values (lower_bound, upper_bound) specifying the range to which the input x values are rescaled before fitting. If None (default), automatic normalization is applied only for methods with domain restrictions:

  • ‘exp’ → scaled to [0, 5]

  • ‘ln’ → scaled to [1, 10]

  • ‘power’ → scaled to [1, 10]

  • ‘hyperbolic’ → scaled to [2, 20]

For all other methods, no normalization is applied unless explicitly provided by the user.

Returns:
float

A nonlinear correlation index in the range [0, 1].

Raises:
ValueError

If:

  • an unsupported method is passed

  • x and y are of unequal lengths

  • any NaNs are present in x or y

  • input values violate the domain of the selected function (e.g., log(0))

  • method=’custom’ is specified but no custom_function is provided.

Notes

  • If x or y is constant (i.e., has zero variance), the correlation index is defined as 0 to avoid division by zero or meaningless regression.

  • If the fitted model performs worse than the mean (\(Q_R\) > \(Q\)), \(R_I\) is also defined as 0.

  • User-provided normalization_bounds take precedence and will be applied regardless of the chosen method.

  • Automatic normalization is enabled selectively for models with restricted domains to:

    1. Prevent numerical instability due to extremely large/small values.

    2. Ensure compatibility with the domain restrictions of certain functions (e.g., logarithmic, power, or hyperbolic forms).

  • Parameter estimation is performed using scipy.optimize.curve_fit with a least-squares objective.

Examples

>>> import numpy as np
>>> from explorica.interactions.correlation_metrics import corr_index
>>> # Linear method usage
>>> x = [1, 2, 3, 4, 5]
>>> y = [2, 4, 6, 8, 10]
>>> np.round(corr_index(x, y), 4)
np.float64(1.0)
>>> # Non-linear method usage
>>> x = [1, 2, 3, 4, 5]
>>> y = [2.7, 7.4, 20.1, 54.6, 148.4]
>>> np.round(corr_index(x, y, method='exp'), 4)
np.float64(1.0)
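For the linear model, the index can be sketched in a few lines of numpy (an illustrative formulation using the residual and total sums of squares; explorica's implementation additionally handles normalization and the degenerate cases described above):

```python
import numpy as np

def corr_index_linear(x, y):
    # Fit y = a*x + b by least squares, then compute
    # R_I = sqrt(max(0, 1 - Q_R / Q)), with Q_R the residual sum of
    # squares and Q the total sum of squares of y about its mean.
    x, y = np.asarray(x, float), np.asarray(y, float)
    a, b = np.polyfit(x, y, deg=1)
    pred = a * x + b
    q_r = np.sum((y - pred) ** 2)
    q = np.sum((y - y.mean()) ** 2)
    if q == 0:
        return 0.0  # constant y: index defined as 0
    return float(np.sqrt(max(0.0, 1 - q_r / q)))

print(corr_index_linear([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

Perfectly linear data yields an index of 1.0, matching the doctest above; a fit worse than the mean would be clamped to 0.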
explorica.interactions.correlation_metrics.corr_multiple(x: Sequence[Sequence[Number]], y: Sequence[Number]) → float

Compute the multiple correlation coefficient.

Multiple correlation coefficient (\(R\)) between a set of predictor variables (\(x\)) and a response variable (\(y\)), computed from the determinants of their correlation matrices.

The function uses the formula:

\[R = \sqrt{R^2}=\sqrt{1-\frac{|r|}{|r_x|}}\]

where \(r\) is the correlation matrix of predictors and the response, and \(r_x\) is the correlation matrix of the predictors alone.

This implementation is robust to singular correlation matrices. If \(r_x\) is singular (\(|r_x|=0\)), the function returns \(R = 0\).

Parameters:
x : Sequence[Sequence[Number]]

A sequence of predictor (explanatory) variables. Each column is assumed to represent one independent variable.

y : Sequence[Number]

A series representing the response (dependent) variable.

Returns:
float

The multiple correlation coefficient (R), ranging from 0 to 1.

Raises:
ValueError

If the number of samples in x and y do not match. If any NaN values are present in x or y.

Warns:
UserWarning

If the predictors in x are found to be multicollinear (i.e., the determinant of their correlation matrix is zero).

Examples

>>> import numpy as np
>>> from explorica.interactions.correlation_metrics import corr_multiple
>>> # Simple usage
>>> X = [[1, 2, 3, 4, 5], [2, 3, 5, 6, 8]]
>>> y = [2, 4, 5, 7, 9]
>>> result = corr_multiple(X, y)
>>> np.round(result, 4)
np.float64(0.9954)
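The determinant formula above can be verified with a short numpy sketch (illustrative, not explorica's exact code; it reproduces the documented example):

```python
import numpy as np

def corr_multiple_det(x, y):
    # x: sequence of predictor columns; y: response vector.
    # R = sqrt(1 - |r| / |r_x|), where r is the correlation matrix of
    # predictors plus response and r_x that of the predictors alone.
    data = np.vstack([np.asarray(x, float), np.asarray(y, float)])
    r = np.corrcoef(data)                       # predictors + response
    r_x = np.atleast_2d(np.corrcoef(data[:-1])) # predictors only
    det_x = np.linalg.det(r_x)
    if np.isclose(det_x, 0):
        return 0.0  # singular predictor matrix: R defined as 0
    return float(np.sqrt(max(0.0, 1 - np.linalg.det(r) / det_x)))

X = [[1, 2, 3, 4, 5], [2, 3, 5, 6, 8]]
y = [2, 4, 5, 7, 9]
print(round(corr_multiple_det(X, y), 4))
```

For the example inputs this sketch reproduces the documented value of 0.9954.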
explorica.interactions.correlation_metrics.cramer_v(x: Sequence, y: Sequence, bias_correction: bool = True, yates_correction: bool = False) → float

Calculate Cramér’s V statistic.

Calculates Cramér’s V (\(V\)) for measuring the association between two categorical variables. Bias correction and Yates’ correction are available as options.

Calculated as:

Without bias correction:

\[V = \sqrt{ \frac{\phi^2}{\min(r - 1, k - 1)}}\]

With bias correction (Bergsma, 2013):

\[V = \sqrt{\frac{\phi^2_{\text{corr}}}{\min(r - 1, k - 1)}}\]

\(\phi^2_{\text{corr}}\) is:

\[\phi^2_{\text{corr}} = \max\left( 0, \phi^2 - \frac{(r-1)(k-1)}{n-1} \right)\]

Where \(\phi^2\) is:

\[\phi^2 = \frac{\chi^2}{n}\]

\(n\) is sample size. \(r\) and \(k\) are number of rows and columns in the contingency table respectively.

Parameters:
x : Sequence

First categorical variable.

y : Sequence

Second categorical variable.

bias_correction : bool, optional, default=True

Whether to apply bias correction (recommended for small samples).

yates_correction : bool, optional, default=False

Whether to apply Yates’ correction for continuity (only applies to 2x2 tables; usually set to False when using Cramér’s V).

Returns:
float

Cramér’s V value, ranging from 0 (no association) to 1 (perfect association). Returns 0 if the statistic is undefined (e.g., due to zero denominator).

Raises:
ValueError

If ‘x’ or ‘y’ contains NaN values. If input sequence lengths mismatch.

Examples

>>> # Simple usage
>>> from explorica.interactions.correlation_metrics import cramer_v
>>> categories1 = ['A', 'A', 'A', 'B', 'B', 'B']
>>> categories2 = [1, 1, 1, 2, 2, 2]
>>> cramer_v(categories1, categories2, bias_correction=False)
np.float64(1.0)
>>> categories1 = ['A', 'A', 'A', 'B', 'B', 'B']
>>> categories2 = ['C', 'D', 'E', 'C', 'D', 'E']
>>> cramer_v(categories1, categories2)
np.float64(0.0)
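The uncorrected statistic can be reproduced without explorica using only numpy and pandas (an illustrative sketch of the formulas above, without bias or Yates’ correction):

```python
import numpy as np
import pandas as pd

def cramer_v_sketch(x, y):
    # Chi-square statistic from the contingency table, then
    # V = sqrt((chi2 / n) / min(r - 1, k - 1)).
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy()
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    denom = min(table.shape[0] - 1, table.shape[1] - 1)
    if denom == 0:
        return 0.0  # undefined: single row or column
    return float(np.sqrt((chi2 / n) / denom))

print(cramer_v_sketch(["A", "A", "A", "B", "B", "B"], [1, 1, 1, 2, 2, 2]))
```

Perfectly associated categories give V = 1.0, as in the first doctest above.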
explorica.interactions.correlation_metrics.eta_squared(values: Sequence[Number], categories: Sequence) → float

Calculate the eta-squared (\(\eta^2\)) statistic.

\(\eta^2\) (eta squared) is a measure of effect size used to quantify the proportion of variance in a numerical variable that can be attributed to differences between categories of a categorical variable.

Calculated as:

\[\eta^2=\frac{Q_A}{Q}=\frac{\sum{(\overline{x}_i-\overline{x})^2n_i}} {\sum(x_j-\overline x)^2}\]
Parameters:
values : Sequence[Number]

A numerical sequence representing the dependent (response) variable.

categories : Sequence

A categorical sequence representing the independent (grouping) variable.

Returns:
float

Eta-squared statistic in the range [0, 1], where:

  • 0 means no association between variables,

  • 1 means perfect association (all variance explained by groups).

Raises:
ValueError

If the lengths of values and categories do not match, or if either of them contains NaN values.

Notes

If the total variance of values is zero, the function returns 0. NaN values should be handled before calling this function.

Examples

>>> import numpy as np
>>> from explorica.interactions.correlation_metrics import eta_squared
>>> # Simple usage
>>> values = [2.3, 3.1, 2.8, 5.5, 6.0, 5.8]
>>> categories = ['A', 'A', 'A', 'B', 'B', 'B']
>>> result = eta_squared(values, categories)
>>> np.round(result, 4)
np.float64(0.9682)
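The formula can be reproduced with a short pandas sketch (illustrative, not explorica's code); it matches the documented example value of 0.9682:

```python
import pandas as pd

def eta_squared_sketch(values, categories):
    # eta^2 = Q_A / Q: between-group sum of squares over total sum of squares.
    s = pd.Series(values, dtype=float)
    grand_mean = s.mean()
    q = ((s - grand_mean) ** 2).sum()
    if q == 0:
        return 0.0  # zero total variance: defined as 0
    grouped = s.groupby(pd.Series(categories))
    q_a = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(q_a / q)

values = [2.3, 3.1, 2.8, 5.5, 6.0, 5.8]
categories = ["A", "A", "A", "B", "B", "B"]
print(round(eta_squared_sketch(values, categories), 4))
```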