explorica.interactions
explorica.interactions.aggregators
Module provides utilities for aggregating interactions between features in a dataset. It contains functions to identify and return significant feature pairs based on various correlation and association measures.
The main function, high_corr_pairs, evaluates feature-to-feature relationships using linear (Pearson, Spearman), non-linear (e.g. exponential, binomial, power-law), categorical (Cramér's V), and hybrid (η²) measures. Users can optionally enable non-linear and multiple-correlation modes.
Functions
- detect_multicollinearity(numeric_features=None, category_features=None, method="VIF", return_as="dataframe", **kwargs)
Detect multicollinearity among features using either Variance Inflation Factor (VIF) or correlation-based methods.
- high_corr_pairs(numeric_features=None, category_features=None, threshold=0.7, **kwargs)
Finds and returns all significant pairs of correlated features from the input datasets.
- explorica.interactions.aggregators.detect_multicollinearity(numeric_features: Sequence[Sequence[Number]] = None, category_features: Sequence[Sequence] = None, method: str = 'VIF', return_as: str = 'dataframe', **kwargs) → dict | DataFrame
Detect multicollinearity using either VIF or correlation-based methods.
Multicollinearity occurs when features are highly correlated with each other, which can destabilize model coefficients and reduce interpretability. This function provides two approaches: VIF quantifies how much the variance of a regression coefficient is inflated due to collinearity with other features, while the correlation-based method offers a broader assessment covering numeric-numeric, numeric-categorical, and categorical-categorical feature pairs.
- Parameters:
- numeric_features : Sequence of sequences of numbers, optional
Numerical feature matrix or compatible structure (array-like or DataFrame). Required for method='VIF'. Used together with category_features when the correlation-based method is selected.
- category_features : Sequence of sequences, optional
Categorical feature matrix or compatible structure (array-like or DataFrame). Only used with method='corr'. Not evaluated under VIF.
- method : {"VIF", "corr"}, default="VIF"
Method to detect multicollinearity:
"VIF" : Compute Variance Inflation Factor for numerical features.
"corr" : Detect multicollinearity based on the highest pairwise absolute correlation between features (numeric-numeric, numeric-categorical, categorical-categorical). Supported correlation metrics include: sqrt_eta_squared, cramer_v, pearson, spearman.
- return_as : {"dataframe", "dict"}, default="dataframe"
Output format of the result:
"dataframe" : Pandas DataFrame with features as index and metrics as columns.
"dict" : Nested dictionary of the form {metric: {feature: value, ...}, ...}.
- variance_inflation_threshold : float, default=10
Threshold above which a feature is considered collinear in the VIF method.
- correlation_threshold : float, default=0.95
Threshold for the highest absolute correlation of a feature with any other feature. If this value is exceeded, the feature is considered collinear.
- Returns:
- dict or pd.DataFrame
Multicollinearity assessment, depending on return_as:
If "dataframe": DataFrame with columns for metrics (e.g., "VIF", "multicollinearity") and rows corresponding to features.
If "dict": Mapping of metrics to per-feature values.
- Raises:
- ValueError
If all inputs are empty. If lengths of numeric_features and category_features do not match. If any input array contains NaN values. If method or return_as is not one of the supported values.
Notes
VIF can be infinite if the dataset contains functionally dependent features.
Categorical features are not evaluated under VIF.
Examples
>>> import pandas as pd
>>> from explorica.interactions.aggregators import detect_multicollinearity
>>> # Simple usage
>>> X_num = pd.DataFrame({"x1": [1, 2, 3], "x2": [2, 4, 6], "x3": [1, 0, 1]})
>>> detect_multicollinearity(X_num, method="VIF", return_as="dataframe")
     VIF  multicollinearity
x1   inf                1.0
x2   inf                1.0
x3   1.0                0.0
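For intuition about the VIF metric itself, the computation can be sketched with plain numpy. This is an illustrative re-implementation, not explorica's internal code; the helper name `vif` is hypothetical:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column: VIF_j = 1 / (1 - R_j^2),
    where R_j^2 comes from regressing column j on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(n)])  # predictors plus intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
        result.append(float("inf") if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2))
    return result

# x2 is exactly 2 * x1, so both get infinite VIF; x3 is uncorrelated with them
X = [[1, 2, 1], [2, 4, 0], [3, 6, 1], [4, 8, 0], [5, 10, 1]]
print(vif(X))
```

Functionally dependent columns drive the regression R² to 1, which is why VIF becomes infinite for such features, as noted above.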
- explorica.interactions.aggregators.high_corr_pairs(numeric_features: Sequence[Sequence[Number]] = None, category_features: Sequence[Sequence] = None, threshold: float = 0.7, **kwargs) → DataFrame | None
Find and return all significant pairs of correlated features from the dataset.
This method evaluates feature-to-feature relationships using a set of correlation measures, including linear (Pearson, Spearman), non-linear (e.g. exponential, binomial, power-law), categorical (Cramér’s V), and hybrid (η²). Users can optionally enable non-linear and multiple-correlation modes.
- Parameters:
- numeric_features : pd.DataFrame, optional
A DataFrame of numerical features. Required for linear, η², non-linear, and multiple correlation.
- category_features : pd.DataFrame, optional
A DataFrame of categorical features. Required for Cramér's V and η² computations.
- y : str, optional
Target feature name to compute correlations with. If None, all pairwise comparisons are evaluated.
- nonlinear_included : bool, default=False
Whether to include non-linear correlation measures for numeric features.
- multiple_included : bool, default=False
Whether to include multiple correlation analysis (for numeric features only).
- threshold : float, default=0.7
Minimum absolute value of correlation to consider a pair as significantly dependent.
Minimum absolute value of correlation to consider a pair as significantly dependent.
- Returns:
- pd.DataFrame or None
A DataFrame with columns ['X', 'Y', 'coef', 'method'], listing feature pairs whose correlation (in absolute value) exceeds the threshold. Returns None if no such pairs are found.
- Raises:
- ValueError
If neither input DataFrame is provided. If numeric_features or category_features are of unequal lengths. If numeric_features or category_features contain NaN values. If numeric_features or category_features contain duplicate column names.
Notes
Linear correlation methods: Pearson, Spearman
Non-linear methods (enabled via nonlinear_included): exp, binomial, ln, hyperbolic, power
Categorical methods: Cramér’s V, η² (eta)
The method skips self-comparisons.
Targeted correlation (y) will produce only pairs involving the specified target.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.aggregators import high_corr_pairs
>>> # Simple usage
>>> data = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5, 6],
...     "X2": [12, 10, 8, 6, 4, 2],
...     "X3": [9, 3, 5, 2, 6, 1],
...     "X4": [3, 2, 1, 3, 2, 1],
...     "target": [2, 3, 4, 6, 8, 10],
... })
>>> y = pd.Series([2, 3, 4, 6, 8, 10], name="y")
>>> result_df = high_corr_pairs(data, y="target", threshold=0.0)
>>> # Round coefficients for doctests reproducibility
>>> result_df["coef"] = np.round(result_df["coef"], 4)
>>> result_df
    X       Y    coef    method
0  X2  target -1.0000  spearman
1  X1  target  1.0000  spearman
2  X1  target  0.9885   pearson
3  X2  target -0.9885   pearson
4  X3  target -0.6000  spearman
5  X3  target -0.5731   pearson
6  X4  target -0.4781  spearman
7  X4  target -0.4353   pearson
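The core of the pair search can be illustrated with pandas alone. A minimal Pearson-only sketch (the function name `simple_high_corr_pairs` is hypothetical; the real high_corr_pairs additionally covers Spearman, categorical, and non-linear measures):

```python
import pandas as pd

def simple_high_corr_pairs(df, threshold=0.7):
    """Pearson-only sketch of the pair search: return feature pairs whose
    absolute correlation exceeds the threshold (self-pairs are skipped)."""
    corr = df.corr(method="pearson")
    cols = list(corr.columns)
    rows = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only: no self-pairs
            coef = corr.iloc[i, j]
            if abs(coef) > threshold:
                rows.append({"X": cols[i], "Y": cols[j],
                             "coef": coef, "method": "pearson"})
    return pd.DataFrame(rows) if rows else None

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(simple_high_corr_pairs(df, threshold=0.9))  # only the (a, b) pair survives
```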
explorica.interactions.correlation_matrices
Module provides tools for constructing various types of correlation and dependence matrices for both numerical and categorical features in a dataset. This module is intended to be used via the public facade InteractionAnalyzer but can also be used directly for advanced analyses.
The module supports linear correlations (Pearson, Spearman), categorical associations (Cramér’s V, η²), multiple-factor correlations, and correlation indices for non-linear dependencies.
Functions
- corr_matrix(dataset, method="pearson", groups=None)
Compute a correlation or association matrix using the specified method.
- corr_matrix_linear(dataset, method="pearson")
Compute a correlation matrix for numeric features using Pearson or Spearman correlation.
- corr_matrix_cramer_v(dataset, bias_correction=True)
Compute the Cramér’s V correlation matrix for categorical variables.
- corr_matrix_eta(dataset, categories)
Compute a correlation matrix between numerical features and categorical features using the square root of eta-squared (η²).
- corr_vector_multiple(x, y)
Calculates multiple correlation coefficients between the target variable y and all possible combinations of 2 or more features from x.
- corr_matrix_multiple(dataset)
Compute a multi-factor correlation matrix for all target features in a dataset.
- corr_matrix_corr_index(dataset, method="linear")
Compute a correlation index matrix for all features in a dataset.
- explorica.interactions.correlation_matrices.corr_matrix(dataset: Sequence[Sequence], method: str = 'pearson', groups: Sequence[Sequence] = None) → DataFrame
Compute a correlation or association matrix using the specified method.
This function supports both classical correlation coefficients and a set of nonlinear correlation indices based on specific functional dependencies. For nonlinear methods, the correlation is evaluated by fitting transformations (e.g., exponential, logarithmic) and computing the strength of the relationship between features accordingly.
- Parameters:
- dataset : Sequence[Sequence]
Sequence with numeric or categorical features. For 'pearson', 'spearman', 'multiple', and all nonlinear methods, all columns must be numeric. For 'cramer_v', all columns should be categorical. For 'eta', numeric features in dataset are compared to categorical features in groups.
- method : str, optional
Method used to compute correlation or association:
‘pearson’ : Pearson correlation (linear, continuous features).
‘spearman’ : Spearman rank correlation (monotonic, non-parametric).
‘cramer_v’ : Cramér’s V (categorical-categorical association).
‘eta’ : Eta coefficient (numeric-categorical association, asymmetric).
'multiple' : Multiple correlation coefficients for each numeric feature as a target and all remaining numeric features as predictors.
‘exp’ : Nonlinear correlation index assuming exponential dependence.
‘binomial’ : Nonlinear correlation index assuming binomial dependence.
‘ln’ : Nonlinear correlation index assuming logarithmic dependence.
‘hyperbolic’ : Nonlinear correlation index assuming hyperbolic dependence.
‘power’ : Nonlinear correlation index assuming power-law dependence.
- groups : Sequence[Sequence], optional
Sequence of categorical grouping variables required for the 'eta' method. Must have the same number of rows as dataset.
- Returns:
- pd.DataFrame
For ‘pearson’ and ‘spearman’: symmetric correlation matrix of shape (n_numeric_features, n_numeric_features).
For ‘cramer_v’: symmetric matrix of shape (n_features, n_features), representing categorical associations.
For ‘eta’: asymmetric matrix of shape (n_numeric_features, n_grouping_features), showing strength of association between numeric and categorical variables.
For nonlinear methods (‘exp’, ‘binomial’, ‘ln’, ‘hyperbolic’, ‘power’): asymmetric matrix of shape (n_numeric_features, n_numeric_features), showing the correlation index based on the specified nonlinear model.
- Raises:
- ValueError
If the specified method is not supported. If groups is required (for ‘eta’) but not provided or mismatched in length.
- Warns:
- UserWarning
If multicollinearity is detected in the dataset (for 'multiple'), i.e., some features are linearly dependent.
Notes
Pearson, Spearman, and other numerical correlation methods internally select only features of numeric type (Number) from the provided DataFrame. Non-numeric columns (e.g., categorical strings or object types) are ignored.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix
>>> # Simple usage
>>> data = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5, 6],
...     "X2": [12, 10, 8, 6, 4, 2],
...     "X3": [9, 3, 5, 2, 6, 1],
...     "X4": [3, 2, 1, 3, 2, 1],
... })
>>> result_df = corr_matrix(data, method="spearman")
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000 -1.0000 -0.6000 -0.4781
X2 -1.0000  1.0000  0.6000  0.4781
X3 -0.6000  0.6000  1.0000  0.3586
X4 -0.4781  0.4781  0.3586  1.0000
- explorica.interactions.correlation_matrices.corr_matrix_corr_index(dataset: Sequence[Sequence[Number]], method: str = 'linear') → DataFrame
Compute a correlation index matrix for all features in a dataset.
This method computes a pairwise correlation index (√R²) between features using non-linear regression-based methods. The supported methods are: ‘linear’, ‘exp’, ‘binomial’, ‘ln’, ‘hyperbolic’, and ‘power’.
- Parameters:
- dataset : Sequence[Sequence[Number]]
Input data, where each inner sequence represents a feature/column. Can be a pandas DataFrame, numpy array, dict or nested lists. Must not contain NaN values, and column names must be unique.
- method : str, default='linear'
Method used to compute the correlation index. Supported options are: {'exp', 'binomial', 'ln', 'hyperbolic', 'power', 'linear'}.
- Returns:
- pd.DataFrame
DataFrame of size (n_features x n_features) containing the correlation index (\(\sqrt{R^2}\)) values for each pair of features.
- Raises:
- ValueError
If the dataset contains NaN values. If column names are duplicated. If the selected method is not supported.
See also
explorica.interactions.correlation_metrics.corr_index : The underlying computation function.
Notes
Any pair that violates the domain constraints of the selected method will have NaN as its result.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_corr_index
>>> data = pd.DataFrame({
...     "X1": [1, 2, 3, 4, 5, 6],
...     "X2": [2, 4, 6, 8, 10, 12],
...     "X3": [1, 4, 9, 16, 25, 36],
...     "X4": [3, 2, 1, 3, 2, 1],
... })
>>> result_df = corr_matrix_corr_index(data, method="binomial")
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000  1.0000  1.0000  0.4781
X2  1.0000  1.0000  1.0000  0.4781
X3  0.9969  0.9969  1.0000  0.4964
X4  0.4781  0.4781  0.4696  1.0000
- explorica.interactions.correlation_matrices.corr_matrix_cramer_v(dataset: Sequence[Sequence], bias_correction: bool = True) → DataFrame
Compute the Cramér’s V dependency matrix for categorical variables.
Useful for exploratory analysis of datasets with multiple categorical variables, providing a pairwise overview of their associations. Bias correction option is available.
- Parameters:
- dataset : Sequence of sequences or pandas.DataFrame
Input dataset containing categorical variables. If a non-DataFrame sequence is passed, it will be converted into a DataFrame internally.
- bias_correction : bool, default=True
Whether to apply bias correction in the calculation of Cramér's V.
- Returns:
- pandas.DataFrame
A square symmetric correlation matrix where each entry (i, j) represents the Cramér’s V correlation between columns i and j. The diagonal entries are equal to 1.
- Raises:
- ValueError
If ‘dataset’ contains NaN values.
See also
explorica.interactions.correlation_metrics.cramer_v : The underlying computation function.
Notes
Cramér’s V is a measure of association between two nominal (categorical) variables, ranging from 0 (no association) to 1 (perfect association).
This implementation ensures the matrix is symmetric and always has ones on the diagonal.
The underlying cramer_v function may currently produce biased results in some cases due to known issues with bias correction.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_cramer_v
>>> # Simple usage
>>> groups_table = pd.DataFrame({
...     "Group_A": ["A", "A", "A", "B", "B", "B"],
...     "Group_B": [1, 2, 3, 1, 2, 3],
...     "Group_C": ["C", "C", "C", "D", "D", "D"],
... })
>>> result_df = corr_matrix_cramer_v(groups_table, bias_correction=False)
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
         Group_A  Group_B  Group_C
Group_A      1.0      0.0      1.0
Group_B      0.0      1.0      0.0
Group_C      1.0      0.0      1.0
- explorica.interactions.correlation_matrices.corr_matrix_eta(dataset: Sequence[Sequence[Number]], categories: Sequence[Sequence]) → DataFrame
Compute a dependency matrix based on square root of eta-squared (η²).
This function measures the strength of association between continuous (numeric) variables and categorical variables. The result is a matrix of eta coefficients, where rows correspond to numeric features and columns correspond to categorical features.
- Parameters:
- dataset : Sequence[Sequence[Number]] or pandas.DataFrame
The input dataset containing numeric features. Will be converted to a DataFrame if not already.
- categories : Sequence[Sequence] or pandas.DataFrame
The categorical grouping variables. Will be converted to a DataFrame if not already.
- Returns:
- pandas.DataFrame
A matrix of shape (n_numeric_features, n_categorical_features), where each entry is the eta coefficient (sqrt(η²)) between the corresponding numeric and categorical variable.
- Raises:
- ValueError
If 'dataset' or 'categories' contains NaN values. If input sequence lengths mismatch.
See also
explorica.interactions.correlation_metrics.eta_squared : The underlying computation function.
Notes
Eta coefficient values are in [0, 1], with higher values indicating stronger association.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_eta
>>> # Simple usage
>>> data = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1, 8],
...     "X2": [0, 0, 0, 1, 1, 1],
...     "X3": [7, 4, 2, 5, 1, 1],
...     "X4": [3, 2, 1, 3, 2, 1],
... })
>>> groups_table = pd.DataFrame({
...     "Group_A": ["A", "A", "A", "B", "B", "B"],
...     "Group_B": [1, 2, 3, 1, 2, 3],
...     "Group_C": ["C", "C", "C", "D", "D", "D"],
... })
>>> result_df = corr_matrix_eta(data, groups_table)
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
    Group_A  Group_B  Group_C
X1   0.3873   0.7246   0.3873
X2   1.0000   0.0000   1.0000
X3   0.4523   0.8726   0.4523
X4   0.0000   1.0000   0.0000
- explorica.interactions.correlation_matrices.corr_matrix_linear(dataset: Sequence[Sequence[Number]], method: str = 'pearson') → DataFrame
Compute a correlation matrix for numeric features.
Computes using Pearson or Spearman correlation.
- Parameters:
- dataset : Sequence[Sequence[Number]]
Input dataset. Must be convertible to a pandas DataFrame with numeric columns.
- method : str, default="pearson"
Correlation method to use. Supported values: "pearson", "spearman".
- Returns:
- pd.DataFrame
Correlation matrix of numeric features. Rows and columns correspond to feature names.
- Raises:
- ValueError
If method is not in {“pearson”, “spearman”}.
Notes
Only numeric columns are considered; non-numeric columns are ignored.
The dataset is automatically converted to a pandas DataFrame if it isn’t one.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_linear
>>> # Pearson method usage
>>> X = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1],
...     "X2": [2, 3, 4, 1, 9],
...     "X3": [7, 4, 2, 5, 1],
...     "X4": [1, 2, 3, 4, 5],
... })
>>> result_df = corr_matrix_linear(X, method="pearson")
>>> # Round coefficients for doctests reproducibility
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000 -0.5210 -0.0367  0.2080
X2 -0.5210  1.0000 -0.8136  0.6092
X3 -0.0367 -0.8136  1.0000 -0.7285
X4  0.2080  0.6092 -0.7285  1.0000
>>> # Spearman method usage
>>> result_df = corr_matrix_linear(X, method="spearman")
>>> np.round(result_df, 4)
        X1      X2      X3      X4
X1  1.0000 -0.4617  0.1026  0.2052
X2 -0.4617  1.0000 -0.9000  0.4000
X3  0.1026 -0.9000  1.0000 -0.7000
X4  0.2052  0.4000 -0.7000  1.0000
- explorica.interactions.correlation_matrices.corr_matrix_multiple(dataset: Sequence[Sequence[Number]]) → DataFrame
Compute a multi-factor correlation matrix for all target features in a dataset.
For each column in the dataset, this method treats it as a target variable and computes correlations between that target and all remaining features (predictors) using corr_vector_multiple. The results are concatenated into a single DataFrame with columns [‘corr_coef’, ‘feature_combination’, ‘target’].
- Parameters:
- dataset : Sequence[Sequence[Number]]
Input data, where each inner sequence represents a feature/column. Can be a pandas DataFrame, numpy array, or nested lists. Note: the dataset must contain at least 3 columns, since one column is treated as the target and at least two others are required as predictors.
- Returns:
- pd.DataFrame
DataFrame with the following columns:
corr_coef : correlation coefficient for the target-predictor combination.
feature_combination : tuple or list of predictor feature names.
target : the name of the current target feature.
- Raises:
- ValueError
If the input DataFrame contains duplicate column names. If the input DataFrame contains NaN values.
- Warns:
- UserWarning
If a set of predictors is linearly dependent (multicollinearity detected), a UserWarning is issued. This warning appears only once, regardless of how many targets trigger it.
See also
explorica.interactions.correlation_matrices.corr_vector_multiple : The underlying computation function.
explorica.interactions.correlation_metrics.corr_multiple : The underlying computation function.
Notes
The function handles datasets with any number of features >= 3.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_matrix_multiple
>>> # Simple usage
>>> X = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1],
...     "X2": [2, 3, 4, 1, 9],
...     "X3": [7, 4, 2, 5, 1],
...     "X4": [1, 2, 3, 4, 5],
... })
>>> result_df = corr_matrix_multiple(X)
>>> # Round coefficients for doctests reproducibility
>>> result_df["corr_coef"] = np.round(result_df["corr_coef"], 4)
>>> result_df = result_df.sort_values(by="corr_coef", ascending=False)
>>> result_df
    corr_coef feature_combination target
7      0.9985        (X1, X3, X4)     X2
11     0.9963        (X1, X2, X4)     X3
3      0.9958        (X2, X3, X4)     X1
4      0.9828            (X1, X3)     X2
15     0.9803        (X1, X2, X3)     X4
8      0.9763            (X1, X2)     X3
0      0.9482            (X2, X3)     X1
5      0.8998            (X1, X4)     X2
12     0.8660            (X1, X2)     X4
10     0.8650            (X2, X4)     X3
1      0.8428            (X2, X4)     X1
6      0.8140            (X3, X4)     X2
13     0.7507            (X1, X3)     X4
9      0.7379            (X1, X4)     X3
14     0.7290            (X2, X3)     X4
2      0.2671            (X3, X4)     X1
- explorica.interactions.correlation_matrices.corr_vector_multiple(x: DataFrame, y: Series) → DataFrame
Compute multiple correlation between y and predictor combinations from x.
Calculates multiple correlation coefficients between the target variable y and all possible combinations of 2 or more features from x.
- Parameters:
- x : pd.DataFrame
Feature matrix (only numeric features should be used).
- y : pd.Series
Target vector. The function computes correlation between this target and feature combinations.
- Returns:
- pd.DataFrame
A DataFrame with columns:
'corr_coef' : multiple correlation coefficient for a given combination
'feature_combination' : tuple of feature names used in the combination
'target' : name of the target variable
- Raises:
- ValueError
If x or y contains NaN values.
If the number of samples in x and y do not match.
- Warns:
- UserWarning
If the predictors in x are found to be multicollinear (i.e., the determinant of their correlation matrix is zero).
See also
explorica.interactions.correlation_metrics.corr_multiple : The underlying computation function.
Notes
The method renames columns as 'X1', 'X2', etc., to handle duplicate names.
The method computes all possible combinations of features of size 2 and larger.
This method can be computationally expensive for large datasets, as it evaluates all possible combinations of features.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from explorica.interactions.correlation_matrices import corr_vector_multiple
>>> X = pd.DataFrame({
...     "X1": [1, 3, 5, 6, 1],
...     "X2": [2, 3, 4, 1, 9],
...     "X3": [7, 4, 2, 5, 1],
... })
>>> y = pd.Series([1, 2, 3, 4, 5], name="y")
>>> result_df = corr_vector_multiple(X, y)
>>> # Round coefficients for doctests reproducibility
>>> result_df["corr_coef"] = np.round(result_df["corr_coef"], 4)
>>> result_df
   corr_coef feature_combination target
0     0.8660            (X1, X2)      y
1     0.7507            (X1, X3)      y
2     0.7290            (X2, X3)      y
3     0.9803        (X1, X2, X3)      y
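The combination enumeration described above can be sketched with itertools and the determinant-based R formula from corr_multiple. This is an illustrative sketch independent of explorica; the helper names `multiple_r` and `vector_multiple` are hypothetical:

```python
import itertools
import numpy as np
import pandas as pd

def multiple_r(x_mat, y):
    """Multiple correlation R = sqrt(1 - det(r_full) / det(r_x));
    returns 0 when the predictor correlation matrix is singular."""
    data = np.vstack([np.asarray(x_mat, dtype=float).T, np.asarray(y, dtype=float)])
    r_full = np.corrcoef(data)       # predictors plus response
    r_x = np.corrcoef(data[:-1])     # predictors alone
    det_x = np.linalg.det(r_x)
    if np.isclose(det_x, 0):
        return 0.0  # multicollinear predictors
    return float(np.sqrt(max(0.0, 1.0 - np.linalg.det(r_full) / det_x)))

def vector_multiple(x, y):
    """Evaluate R for every predictor combination of size >= 2."""
    rows = [
        {"corr_coef": multiple_r(x[list(combo)].to_numpy(), y),
         "feature_combination": combo,
         "target": y.name}
        for k in range(2, x.shape[1] + 1)
        for combo in itertools.combinations(x.columns, k)
    ]
    return pd.DataFrame(rows)

X = pd.DataFrame({"X1": [1, 3, 5, 6, 1], "X2": [2, 3, 4, 1, 9], "X3": [7, 4, 2, 5, 1]})
y = pd.Series([1, 2, 3, 4, 5], name="y")
print(vector_multiple(X, y))
```

As the Notes warn, the number of combinations grows exponentially with the number of features, which is what makes the real function expensive on wide datasets.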
explorica.interactions.correlation_metrics
Statistical correlation and association measures.
This module provides a collection of functions for computing various statistical dependency measures between variables.
Functions
- cramer_v(x, y, bias_correction=True, yates_correction=False)
Calculates Cramér’s V statistic for measuring the association between two categorical variables.
- eta_squared(values, categories)
Calculate the eta-squared (η²) statistic for categorical and numeric variables. η² (eta squared) is a measure of effect size used to quantify the proportion of variance in a numerical variable that can be attributed to differences between categories of a categorical variable.
- corr_index(x, y, method="linear", custom_function=None, normalization_bounds=None)
Calculates a nonlinear correlation index between two series x and y, based on the proportion of variance explained by the fitted function.
- corr_multiple(x, y)
Computes the multiple correlation coefficient (R) between a set of predictor variables (x) and a response variable (y), based on the determinant of their correlation matrices.
- explorica.interactions.correlation_metrics.corr_index(x: Sequence[Number], y: Sequence[Number], method: str = 'linear', custom_function: Callable[[Number], Number] | None = None, normalization_bounds: Sequence[Number] = None) → float
Calculate a nonlinear correlation index.
Calculates a nonlinear correlation index between two series x and y, based on the proportion of variance explained by the fitted function.
The index is computed as:
\[R_I=\sqrt{1-\frac{Q_R}{Q}}= \sqrt{1-\frac{\sum\left(y_i-y(x_i)\right)^2}{\sum(y_i-\overline y)^2}}\]
where \(Q_R\) is the sum of squared errors between the true \(y_i\) and the predicted \(y(x_i)\), and \(Q\) is the total sum of squares of \(y\) with respect to its mean.
This method is designed to handle non-linear dependencies and is robust to monotonic transformations. In degenerate cases (e.g., constant \(x\) or \(y\)), a correlation index of 0 is returned.
Important: If the model fit yields \(Q_R\) > \(Q\) (i.e., performs worse than a mean-based predictor), the index is treated as 0. This reflects the model’s failure to explain any meaningful variance.
Supported methods:
‘linear’: \(y = ax + b\)
‘binomial’: \(y = ax^2 + bx + c\)
‘exp’: \(y = be^{ax}\)
‘ln’: \(y = a\ln(x) + b\)
‘hyperbolic’: \(y = \frac{a}{x} + b\)
‘power’: \(y = ax^b\)
‘custom’: \(y =\) your function
- Parameters:
- x : Sequence[Number]
The explanatory variable.
- y : Sequence[Number]
The response variable.
- method : str, default='linear'
Regression model to use. See supported methods above.
- custom_function : Callable, optional
A user-defined function that takes numeric values and returns numeric predictions. Required when method='custom'.
- normalization_bounds : Sequence[Number], optional
A pair of numeric values (lower_bound, upper_bound) specifying the range to which the input x values are rescaled before fitting. If None (default), automatic normalization is applied only for methods with domain restrictions:
'exp' → scaled to [0, 5]
'ln' → scaled to [1, 10]
'power' → scaled to [1, 10]
'hyperbolic' → scaled to [2, 20]
For all other methods, no normalization is applied unless explicitly provided by the user.
- Returns:
- float
A nonlinear correlation index in the range (0 ≤ r ≤ 1).
- Raises:
- ValueError
If:
an unsupported method is passed
x and y are of unequal lengths
any NaNs are present in x or y
input values violate the domain of the selected function (e.g., log(0))
method=’custom’ is specified but no custom_function is provided.
Notes
If x or y is constant (i.e., has zero variance), the correlation index is defined as 0 to avoid division by zero or meaningless regression.
If the fitted model performs worse than the mean (\(Q_R\) > \(Q\)), \(R_I\) is also defined as 0.
User-provided normalization_bounds take precedence and will be applied regardless of the chosen method.
Automatic normalization is enabled selectively for models with restricted domains to:
Prevent numerical instability due to extremely large/small values.
Ensure compatibility with the domain restrictions of certain functions (e.g., logarithmic, power, or hyperbolic forms).
Parameter estimation is performed using scipy.optimize.curve_fit with a least-squares objective.
Examples
>>> import numpy as np
>>> from explorica.interactions.correlation_metrics import corr_index
>>> # Linear method usage
>>> x = [1, 2, 3, 4, 5]
>>> y = [2, 4, 6, 8, 10]
>>> np.round(corr_index(x, y), 4)
np.float64(1.0)
>>> # Non-linear method usage
>>> x = [1, 2, 3, 4, 5]
>>> y = [2.7, 7.4, 20.1, 54.6, 148.4]
>>> np.round(corr_index(x, y, method='exp'), 4)
np.float64(1.0)
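The index for the 'linear' method can be sketched with numpy alone, using the residual form sqrt(1 - SS_res/SS_tot) and the degenerate-case rules described above. This is an illustrative sketch, not explorica's implementation (which fits the other models via scipy.optimize.curve_fit); the name `corr_index_linear` is hypothetical:

```python
import numpy as np

def corr_index_linear(x, y):
    """Correlation index for a linear fit y = a*x + b, computed as
    sqrt(1 - Q_R / Q); degenerate inputs yield 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if np.var(x) == 0 or np.var(y) == 0:
        return 0.0  # constant series: index defined as 0
    a, b = np.polyfit(x, y, 1)            # least-squares fit
    q_r = ((y - (a * x + b)) ** 2).sum()  # residual sum of squares
    q = ((y - y.mean()) ** 2).sum()       # total sum of squares
    return float(np.sqrt(max(0.0, 1.0 - q_r / q)))

print(corr_index_linear([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

The max(0.0, ...) clamp mirrors the rule that a model performing worse than the mean predictor yields an index of 0.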
- explorica.interactions.correlation_metrics.corr_multiple(x: Sequence[Sequence[Number]], y: Sequence[Number]) → float
Compute the multiple correlation coefficient.
Multiple correlation coefficient (\(R\)) between a set of predictor variables (\(x\)) and a response variable (\(y\)), computed from the determinants of their correlation matrices.
The function uses the formula:
\[R = \sqrt{R^2}=\sqrt{1-\frac{|r|}{|r_x|}}\]
where \(r\) is the correlation matrix of predictors and the response, and \(r_x\) is the correlation matrix of the predictors alone.
This implementation is robust to singular correlation matrices. If \(r_x\) is singular (\(|r_x|=0\)), the function returns \(R = 0\).
- Parameters:
- x : Sequence[Sequence[Number]]
A sequence of predictor (explanatory) variables. Each column is assumed to represent one independent variable.
- y : Sequence[Number]
A series representing the response (dependent) variable.
- Returns:
- float
The multiple correlation coefficient (R), ranging from 0 to 1.
- Raises:
- ValueError
If the number of samples in x and y do not match. If any NaN values are present in x or y.
- Warns:
- UserWarning
If the predictors in x are found to be multicollinear (i.e., the determinant of their correlation matrix is zero).
Examples
>>> import numpy as np
>>> from explorica.interactions.correlation_metrics import corr_multiple
>>> # Simple usage
>>> X = [[1, 2, 3, 4, 5], [2, 3, 5, 6, 8]]
>>> y = [2, 4, 5, 7, 9]
>>> result = corr_multiple(X, y)
>>> np.round(result, 4)
np.float64(0.9954)
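The determinant formula above can be reproduced with numpy directly. A minimal sketch (the name `corr_multiple_det` is hypothetical; this is not explorica's code, just the same formula):

```python
import numpy as np

def corr_multiple_det(x, y):
    """Multiple correlation R = sqrt(1 - |r| / |r_x|), where r is the
    correlation matrix of predictors plus response and r_x that of the
    predictors alone; a singular r_x yields R = 0."""
    data = np.vstack([np.asarray(col, dtype=float) for col in x]
                     + [np.asarray(y, dtype=float)])
    r = np.corrcoef(data)        # predictors + response
    r_x = np.corrcoef(data[:-1]) # predictors only
    det_x = np.linalg.det(r_x)
    if np.isclose(det_x, 0):
        return 0.0  # multicollinear predictors
    return float(np.sqrt(max(0.0, 1.0 - np.linalg.det(r) / det_x)))

X = [[1, 2, 3, 4, 5], [2, 3, 5, 6, 8]]
y = [2, 4, 5, 7, 9]
print(round(corr_multiple_det(X, y), 4))  # 0.9954, matching the doctest above
```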
- explorica.interactions.correlation_metrics.cramer_v(x: Sequence, y: Sequence, bias_correction: bool = True, yates_correction: bool = False) → float
Calculate Cramér’s V statistic.
Calculates Cramér’s V (\(V\)) for measuring the association between two categorical variables. Bias correction and Yates’ correction are available as options.
Calculated as:
Without bias correction:
\[V = \sqrt{\frac{\phi^2}{\min(r - 1, k - 1)}}\]
With bias correction (Bergsma, 2013):
\[V = \sqrt{\frac{\phi^2_{\text{corr}}}{\min(r - 1, k - 1)}}\]
where \(\phi^2_{\text{corr}}\) is:
\[\phi^2_{\text{corr}} = \max\left( 0, \phi^2 - \frac{(r-1)(k-1)}{n-1} \right)\]
and \(\phi^2\) is:
\[\phi^2 = \frac{\chi^2}{n}\]
Here \(n\) is the sample size, and \(r\) and \(k\) are the number of rows and columns in the contingency table, respectively.
- Parameters:
- x : Sequence
First categorical variable.
- y : Sequence
Second categorical variable.
- bias_correction : bool, optional, default=True
Whether to apply bias correction (recommended for small samples).
- yates_correction : bool, optional, default=False
Whether to apply Yates' correction for continuity (only applies to 2x2 tables; usually set to False when using Cramér's V).
- Returns:
- float
Cramér’s V value, ranging from 0 (no association) to 1 (perfect association). Returns 0 if the statistic is undefined (e.g., due to zero denominator).
- Raises:
- ValueError
If ‘x’ or ‘y’ contains NaN values. If input sequences lengths mismatch.
Examples
>>> # Simple usage
>>> from explorica.interactions.correlation_metrics import cramer_v
>>> categories1 = ['A', 'A', 'A', 'B', 'B', 'B']
>>> categories2 = [1, 1, 1, 2, 2, 2]
>>> cramer_v(categories1, categories2, bias_correction=False)
np.float64(1.0)
>>> categories1 = ['A', 'A', 'A', 'B', 'B', 'B'] >>> categories2 = ['C', 'D', 'E', 'C', 'D', 'E'] >>> cramer_v(categories1, categories2) np.float64(0.0)
- explorica.interactions.correlation_metrics.eta_squared(values: Sequence[Number], categories: Sequence) → float
Calculate the eta-squared (\(\eta^2\)) statistic.
\(\eta^2\) (eta squared) is a measure of effect size used to quantify the proportion of variance in a numerical variable that can be attributed to differences between categories of a categorical variable.
Calculated as:
\[\eta^2=\frac{Q_A}{Q}=\frac{\sum{(\overline{x}_i-\overline{x})^2 n_i}}{\sum(x_j-\overline x)^2}\]
- Parameters:
- values : Sequence[Number]
A numerical sequence representing the dependent (response) variable.
- categories : Sequence
A categorical sequence representing the independent (grouping) variable.
- Returns:
- float
Eta-squared statistic in the range [0, 1], where:
0 means no association between variables,
1 means perfect association (all variance explained by groups).
- Raises:
- ValueError
If the lengths of values and categories do not match, or if either of them contains NaN values.
Notes
If the total variance of values is zero, the function returns 0. NaN values should be handled before calling this function.
Examples
>>> import numpy as np
>>> from explorica.interactions.correlation_metrics import eta_squared
>>> # Simple usage
>>> values = [2.3, 3.1, 2.8, 5.5, 6.0, 5.8]
>>> categories = ['A', 'A', 'A', 'B', 'B', 'B']
>>> result = eta_squared(values, categories)
>>> np.round(result, 4)
np.float64(0.9682)
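The ratio of between-group to total sum of squares from the formula above can be reproduced with pandas alone. An illustrative sketch (the name `eta_squared_sketch` is hypothetical; not explorica's code):

```python
import pandas as pd

def eta_squared_sketch(values, categories):
    """eta^2 = Q_A / Q: between-group sum of squares over total sum of squares."""
    s = pd.Series(values, dtype=float)
    grand_mean = s.mean()
    q_total = ((s - grand_mean) ** 2).sum()
    if q_total == 0:
        return 0.0  # zero total variance: defined as 0
    q_between = sum(
        len(group) * (group.mean() - grand_mean) ** 2
        for _, group in s.groupby(pd.Series(categories))
    )
    return float(q_between / q_total)

values = [2.3, 3.1, 2.8, 5.5, 6.0, 5.8]
categories = ["A", "A", "A", "B", "B", "B"]
print(round(eta_squared_sketch(values, categories), 4))  # 0.9682, as in the doctest
```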