pearsonr
scipy.stats.pearsonr(x, y, *, alternative='two-sided', method=None, axis=0)
Pearson correlation coefficient and p-value for testing non-correlation.

The Pearson correlation coefficient [1] measures the linear relationship between two datasets. Like other correlation coefficients, this one varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y; negative correlations imply that as x increases, y decreases.

This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.

Parameters:
x : array_like
    Input array.

y : array_like
    Input array.

axis : int or None, default: 0
    Axis along which to perform the calculation. Default is 0. If None, ravel both arrays before performing the calculation.

    Added in version 1.13.0.

alternative : {'two-sided', 'greater', 'less'}, optional
    Defines the alternative hypothesis. Default is 'two-sided'. The following options are available (a one-sided example follows this parameter list):

    - 'two-sided': the correlation is nonzero
    - 'less': the correlation is negative (less than zero)
    - 'greater': the correlation is positive (greater than zero)

    Added in version 1.9.0.

method : ResamplingMethod, optional
    Defines the method used to compute the p-value. If method is an instance of PermutationMethod/MonteCarloMethod, the p-value is computed using scipy.stats.permutation_test/scipy.stats.monte_carlo_test with the provided configuration options and other appropriate settings. Otherwise, the p-value is computed as documented in the Notes.

    Added in version 1.11.0.
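As a sketch of the alternative parameter (using the same data as in the Examples below): because the observed correlation there is negative and the null distribution is symmetric, the p-value for alternative='less' is half of the default two-sided p-value.

>>> import numpy as np
>>> from scipy import stats
>>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
>>> res_two = stats.pearsonr(x, y)                       # alternative='two-sided' (default)
>>> res_less = stats.pearsonr(x, y, alternative='less')  # H1: correlation < 0
>>> np.isclose(res_less.pvalue, res_two.pvalue / 2)
True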
 
Returns:

result : PearsonRResult
    An object with the following attributes:

    statistic : float
        Pearson product-moment correlation coefficient.

    pvalue : float
        The p-value associated with the chosen alternative.

    The object has the following method:

    confidence_interval(confidence_level, method)
        Computes the confidence interval of the correlation coefficient statistic for the given confidence level. The confidence interval is returned in a namedtuple with fields low and high. If method is not provided, the confidence interval is computed using the Fisher transformation [1]; a sketch of this computation follows. If method is an instance of BootstrapMethod, the confidence interval is computed using scipy.stats.bootstrap with the provided configuration options and other appropriate settings. In some cases, confidence limits may be NaN due to a degenerate resample; this is typical for very small samples (~6 observations).
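A minimal sketch of the Fisher-transformation interval (assuming the standard large-sample form with standard error 1/sqrt(n - 3) on the transformed scale; data as in the Examples below):

>>> import numpy as np
>>> from scipy import stats
>>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
>>> res = stats.pearsonr(x, y)
>>> r, n = res.statistic, len(x)
>>> zr = np.arctanh(r)                     # Fisher transformation of r
>>> se = 1 / np.sqrt(n - 3)                # large-sample standard error on the z scale
>>> zcrit = stats.norm.ppf(0.95)           # two-sided critical value for 90% confidence
>>> low, high = np.tanh([zr - zcrit * se, zr + zcrit * se])
>>> ci = res.confidence_interval(confidence_level=0.9)
>>> np.allclose([low, high], [ci.low, ci.high])
True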
 
 
Raises:

ValueError
    If x and y do not have length at least 2.
 
Warns:

ConstantInputWarning
    Raised if an input is a constant array. The correlation coefficient is not defined in this case, so np.nan is returned (a brief demonstration follows this list).

NearConstantInputWarning
    Raised if an input is "nearly" constant. The array x is considered nearly constant if norm(x - mean(x)) < 1e-13 * abs(mean(x)). Numerical errors in the calculation x - mean(x) in this case might result in an inaccurate calculation of r.
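For instance, a constant input makes the coefficient undefined (a minimal sketch, with the warning suppressed for display):

>>> import warnings
>>> import numpy as np
>>> from scipy import stats
>>> with warnings.catch_warnings():
...     warnings.simplefilter("ignore")  # hide the ConstantInputWarning for display
...     res = stats.pearsonr([2., 2., 2.], [1., 2., 3.])
>>> np.isnan(res.statistic)
True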
 
See also

spearmanr
    Spearman rank-order correlation coefficient.

kendalltau
    Kendall's tau, a correlation measure for ordinal data.
Notes

The correlation coefficient is calculated as follows:

\[r = \frac{\sum (x - m_x) (y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}\]

where \(m_x\) is the mean of the vector x and \(m_y\) is the mean of the vector y.

Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is ([1], [2]):

\[f(r) = \frac{(1-r^2)^{n/2-2}}{\mathrm{B}(\frac{1}{2},\frac{n}{2}-1)}\]

where n is the number of samples and \(\mathrm{B}\) is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution used in pearsonr to compute the p-value when the method parameter is left at its default value (None). The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 - 1. In terms of SciPy's implementation of the beta distribution, the distribution of r is:

dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)

The default p-value returned by pearsonr is a two-sided p-value. For a given sample with correlation coefficient r, the p-value is the probability that abs(r') of a random sample x' and y' drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the object dist shown above, the p-value for a given r and length n can be computed as:

p = 2*dist.cdf(-abs(r))

When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values of r are 1 and -1. Because abs(r') for any sample x' and y' of length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.

For backwards compatibility, the object that is returned also behaves like a tuple of length two that holds the statistic and the p-value.

References

[1] "Pearson correlation coefficient", Wikipedia, https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

[2] Student, "Probable error of a correlation coefficient", Biometrika, Volume 6, Issue 2-3, 1 September 1908, pp. 302-310.

[3] C. J. Kowalski, "On the Effects of Non-Normality on the Distribution of the Sample Product-Moment Correlation Coefficient", Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 21, No. 1 (1972), pp. 1-12.
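As a quick check, the default two-sided p-value can be reproduced from this beta distribution (a minimal sketch using the same data as the first example below):

>>> import numpy as np
>>> from scipy import stats
>>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
>>> res = stats.pearsonr(x, y)
>>> n = len(x)
>>> dist = stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)  # exact null distribution of r
>>> np.isclose(2 * dist.cdf(-abs(res.statistic)), res.pvalue)
True

Likewise, the degenerate n = 2 case described above always yields a two-sided p-value of 1:

>>> float(stats.pearsonr([1, 2], [3, 5]).pvalue)
1.0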
Examples

>>> import numpy as np
>>> from scipy import stats
>>> x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
>>> res = stats.pearsonr(x, y)
>>> res
PearsonRResult(statistic=-0.828503883588428, pvalue=0.021280260007523286)

To perform an exact permutation version of the test:

>>> rng = np.random.default_rng()
>>> method = stats.PermutationMethod(n_resamples=np.inf, rng=rng)
>>> stats.pearsonr(x, y, method=method)
PearsonRResult(statistic=-0.828503883588428, pvalue=0.028174603174603175)

To perform the test under the null hypothesis that the data were drawn from uniform distributions:

>>> method = stats.MonteCarloMethod(rvs=(rng.uniform, rng.uniform))
>>> stats.pearsonr(x, y, method=method)
PearsonRResult(statistic=-0.828503883588428, pvalue=0.0188)

To produce an asymptotic 90% confidence interval:

>>> res.confidence_interval(confidence_level=0.9)
ConfidenceInterval(low=-0.9644331982722841, high=-0.3460237473272273)

And for a bootstrap confidence interval:

>>> method = stats.BootstrapMethod(method='BCa', rng=rng)
>>> res.confidence_interval(confidence_level=0.9, method=method)
ConfidenceInterval(low=-0.9983163756488651, high=-0.22771001702132443)  # may vary

If N-dimensional arrays are provided, multiple tests are performed in a single call according to the same conventions as most scipy.stats functions:

>>> rng = np.random.default_rng()
>>> x = rng.standard_normal((8, 15))
>>> y = rng.standard_normal((8, 15))
>>> stats.pearsonr(x, y, axis=0).statistic.shape  # between corresponding columns
(15,)
>>> stats.pearsonr(x, y, axis=1).statistic.shape  # between corresponding rows
(8,)

To perform all pairwise comparisons between slices of the arrays, use standard NumPy broadcasting techniques. For instance, to compute the correlation between all pairs of rows:

>>> stats.pearsonr(x[:, np.newaxis, :], y, axis=-1).statistic.shape
(8, 8)

There is a linear dependence between x and y if y = a + b*x + e, where a and b are constants and e is a random error term, assumed to be independent of x. For simplicity, assume that x is standard normal, a = 0, b = 1, and let e follow a normal distribution with mean zero and standard deviation s > 0.

>>> rng = np.random.default_rng()
>>> s = 0.5
>>> x = stats.norm.rvs(size=500, random_state=rng)
>>> e = stats.norm.rvs(scale=s, size=500, random_state=rng)
>>> y = x + e
>>> stats.pearsonr(x, y).statistic
0.9001942438244763

This should be close to the exact value given by

>>> 1/np.sqrt(1 + s**2)
0.8944271909999159

For s=0.5, we observe a high level of correlation. In general, a large variance of the noise reduces the correlation, while the correlation approaches one as the variance of the error goes to zero.

It is important to keep in mind that no correlation does not imply independence unless (x, y) is jointly normal. Correlation can even be zero when there is a very simple dependence structure: if x follows a standard normal distribution, let y = abs(x). Note that the correlation between x and y is zero. Indeed, since the expectation of x is zero, cov(x, y) = E[x*y]. By definition, this equals E[x*abs(x)], which is zero by symmetry. The following lines of code illustrate this observation:

>>> y = np.abs(x)
>>> stats.pearsonr(x, y)
PearsonRResult(statistic=-0.05444919272687482, pvalue=0.22422294836207743)

A non-zero correlation coefficient can be misleading. For example, if x has a standard normal distribution, define y = x if x < 0 and y = 0 otherwise.
A simple calculation shows that corr(x, y) = sqrt(pi / (2*(pi - 1))) ≈ 0.856, implying a high level of correlation:

>>> y = np.where(x < 0, x, 0)
>>> stats.pearsonr(x, y)
PearsonRResult(statistic=0.861985781588, pvalue=4.813432002751103e-149)

This is counterintuitive, since there is no dependence of y on x when x is larger than zero, which happens in about half of the cases if we sample x and y.
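The moment calculation behind this value is a short sketch (with x standard normal and y = min(x, 0), so var(x) = 1):

>>> cov_xy = 0.5                   # E[x*y] = E[x**2 * (x < 0)] = 1/2 by symmetry
>>> var_y = 0.5 - 1/(2*np.pi)      # E[y**2] - E[y]**2, with E[y] = -1/sqrt(2*pi)
>>> float(np.round(cov_xy / np.sqrt(var_y), 3))  # corr(x, y)
0.856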