Karl Pearson’s Coefficient of Correlation: Concept, Uses, Methods, Properties, Assumptions and Limitations

Karl Pearson’s Coefficient of Correlation is a statistical measure that evaluates the strength and direction of the linear relationship between two continuous variables. It is denoted by ‘r’ and ranges between –1 and +1. A value of +1 indicates a perfect positive linear correlation, meaning both variables increase together; –1 denotes a perfect negative linear correlation, where one variable increases while the other decreases. A value of 0 implies no linear relationship.

Developed by British statistician Karl Pearson, this method is one of the most widely used techniques in correlation analysis. The coefficient is calculated using either raw scores or deviations from the mean, and it considers all paired values in the dataset. It is particularly useful in fields like economics, business, psychology, and natural sciences for forecasting, hypothesis testing, and decision-making.

However, it assumes a linear relationship and is highly sensitive to outliers, which can distort results. Also, while it shows association, it does not imply causation. Despite these limitations, it remains a powerful and foundational tool for understanding relationships between variables in statistical analysis.

Uses of Karl Pearson’s Coefficient:

  • Analyzing the correlation between price and demand in economics

  • Understanding student performance across subjects

  • Measuring marketing expenditure vs. sales

  • Identifying trends in medical and social sciences

Methods of Karl Pearson’s Coefficient of Correlation:

1. Actual Mean Method (Deviation from Actual Mean)

Formula:

r = ∑(x – x̄)(y – ȳ) / √[∑(x – x̄)² × ∑(y – ȳ)²]

Where:

r = Karl Pearson’s correlation coefficient

x̄ = Mean of variable X

ȳ = Mean of variable Y

x, y = Individual values of variables X and Y

Use When:

  • You have small datasets

  • You can calculate the actual mean for both variables

Example Use Case: Used in classroom or exam performance correlation where averages are easily calculated.
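A minimal sketch of the actual mean method in Python; the exam scores below are illustrative, not taken from the text:

```python
from math import sqrt

def pearson_actual_mean(x, y):
    """Pearson's r via deviations from the actual means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of summed squared deviations
    den = sqrt(sum((xi - x_bar) ** 2 for xi in x) *
               sum((yi - y_bar) ** 2 for yi in y))
    return num / den

# Illustrative exam scores for five students in two subjects
maths   = [35, 40, 60, 79, 83]
physics = [30, 45, 55, 70, 85]
print(round(pearson_actual_mean(maths, physics), 4))  # ≈ 0.9652
```

The deviations sum to zero by construction, which is why this form is considered the most transparent for hand calculation on small datasets.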

2. Assumed Mean Method

Formula:

r = [∑dx·dy – (∑dx)(∑dy)/n] / √{[∑dx² – (∑dx)²/n] × [∑dy² – (∑dy)²/n]}

Where:

r = Karl Pearson’s correlation coefficient

dx = x – A (Deviation of X from assumed mean A)

dy = y – B (Deviation of Y from assumed mean B)

n = Number of observations

Use When:

  • Data values are large, making exact means awkward to compute

  • You want to simplify calculations

Example Use Case: Used when data like income, population, or marks are large, and approximate means make calculations easier.
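A sketch of the assumed mean method; the marks data and the assumed means are illustrative. Note that the choice of A and B does not change the result, which is the whole point of the shortcut:

```python
from math import sqrt

def pearson_assumed_mean(x, y, A, B):
    """Pearson's r from deviations about assumed means A and B."""
    n = len(x)
    dx = [xi - A for xi in x]   # deviations of X from assumed mean A
    dy = [yi - B for yi in y]   # deviations of Y from assumed mean B
    num = sum(d1 * d2 for d1, d2 in zip(dx, dy)) - sum(dx) * sum(dy) / n
    den = sqrt((sum(d ** 2 for d in dx) - sum(dx) ** 2 / n) *
               (sum(d ** 2 for d in dy) - sum(dy) ** 2 / n))
    return num / den

marks_x = [35, 40, 60, 79, 83]
marks_y = [30, 45, 55, 70, 85]
# Any convenient assumed means yield the same r (round numbers ease hand work)
r1 = pearson_assumed_mean(marks_x, marks_y, A=60, B=55)
r2 = pearson_assumed_mean(marks_x, marks_y, A=50, B=50)
print(round(r1, 4), round(r2, 4))  # both ≈ 0.9652
```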

3. Direct Method (Raw Score Method)

Formula:

r = [n(∑xy) – (∑x)(∑y)] / √{[n(∑x²) – (∑x)²] × [n(∑y²) – (∑y)²]}

Where:

r = Karl Pearson’s correlation coefficient

n = Number of data pairs

∑xy = Sum of the products of paired scores

∑x = Sum of X values

∑y = Sum of Y values

∑x² = Sum of squares of X

∑y² = Sum of squares of Y

Use When:

  • You have complete raw scores (not deviations)

  • Data is entered directly into software or spreadsheets

Example Use Case: Used in software-based or spreadsheet-based analysis like Excel, SPSS, or R, where summations can be automated.
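A sketch of the direct (raw score) method; the marketing spend and sales figures below are illustrative:

```python
from math import sqrt

def pearson_direct(x, y):
    """Pearson's r from raw sums — the form spreadsheets automate."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    num = n * sum_xy - sum_x * sum_y
    den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

ad_spend = [10, 12, 15, 23, 20]   # illustrative marketing expenditure
sales    = [14, 17, 23, 25, 21]   # illustrative sales figures
print(round(pearson_direct(ad_spend, sales), 4))  # ≈ 0.8646
```

Because this form needs only running totals and never the means themselves, it is the version implemented by spreadsheet functions and statistical software.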

Summary Table of Methods of Karl Pearson’s Coefficient

Method              | Formula Type                 | Best For                          | Advantage
Actual Mean Method  | Deviation from actual mean   | Small datasets                    | Accurate; uses the true central tendency
Assumed Mean Method | Deviation from assumed mean  | Large datasets with large values  | Simplifies calculation with approximations
Direct Method       | Raw score formula            | Software or spreadsheet analysis  | Fastest with computing tools

Properties of Coefficient of Correlation:

1. Value Lies Between –1 and +1

The coefficient of correlation always ranges from –1 to +1.

  • r = +1: Perfect positive linear correlation
  • r = –1: Perfect negative linear correlation
  • r = 0: No linear correlation

2. Unit-Free (Dimensionless)

The coefficient of correlation is a pure number without units. It remains the same regardless of the scale or units of measurement, such as kilograms, dollars, or centimeters.

3. Symmetrical Between Variables

The correlation between X and Y is identical to the correlation between Y and X.

r(X,Y) = r(Y,X)

4. Unaffected by Origin and Scale (Except Multiplication by Negative Number)

If the variables are transformed linearly (e.g., u = aX + b), the value of r remains unchanged, provided a > 0.

  • Addition or subtraction (change in origin): no effect
  • Multiplication by a positive constant (change in scale): no effect
  • Multiplication by a negative constant: changes the sign of r
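These three cases can be checked numerically. The helper below uses the raw-score formula from the methods section, and the data is illustrative:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r via the raw score (direct) formula."""
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
               (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 11, 14]
r = pearson(x, y)
# Change of origin (+7) and positive change of scale (×3) leave r untouched
assert abs(pearson([3 * xi + 7 for xi in x], y) - r) < 1e-9
# Multiplying one variable by a negative constant flips the sign of r
assert abs(pearson([-2 * xi for xi in x], y) + r) < 1e-9
print("invariance holds, r =", round(r, 4))
```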

5. Indicates Direction of Relationship

  • If r > 0: X and Y increase together (positive relationship)
  • If r < 0: X increases as Y decreases (negative relationship)
  • If r = 0: No linear relationship

6. Sensitive to Outliers

Pearson’s r is highly sensitive to extreme values. A single outlier can significantly distort the value of the correlation coefficient, making the result unreliable.
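The effect of a single outlier can be demonstrated with a small illustrative dataset (the raw-score formula from above is reused as a helper):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r via the raw score (direct) formula."""
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
               (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

# Nine well-behaved points with a strong positive trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 3, 5, 6, 8, 9, 11, 12, 14]
print(round(pearson(x, y), 3))                    # ≈ 0.998

# Appending one extreme point drags the coefficient sharply downward
print(round(pearson(x + [10], y + [-40]), 3))     # ≈ -0.294
```

A near-perfect positive correlation becomes a weak negative one on the strength of a single point, which is why plotting the data before trusting r is essential.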

7. Only Measures Linear Relationship

The coefficient measures only linear association between variables.
If the relationship is non-linear, Pearson’s r may be close to 0 even if a strong association exists in another form (e.g., quadratic, exponential).
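A quadratic relationship illustrates this: below, y is completely determined by x, yet Pearson's r is exactly zero because the association is not linear (data is illustrative; the raw-score formula is reused as a helper):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r via the raw score (direct) formula."""
    n = len(x)
    num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    den = sqrt((n * sum(a * a for a in x) - sum(x) ** 2) *
               (n * sum(b * b for b in y) - sum(y) ** 2))
    return num / den

# A perfect (but non-linear) relationship: y = x**2 over a symmetric range
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]
print(pearson(x, y))  # 0.0 — yet y is fully determined by x
```

The positive and negative halves of the parabola cancel in the cross-product sum, so the linear measure sees nothing.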

8. Does Not Imply Causation

Even a strong correlation does not mean one variable causes the other. Correlation simply shows that the variables move together, not why they do so.

Assumptions of Karl Pearson’s Coefficient of Correlation:

  • Linearity

It assumes a linear relationship between the two variables; that is, a change in one variable corresponds to a proportional change in the other. If the relationship is non-linear (e.g., curved), Pearson’s coefficient may give misleading results.

  • Quantitative and Continuous Data

Both variables must be quantitative (numerical) and measured on an interval or ratio scale. Pearson’s method is not suitable for categorical or ordinal data.

  • No Extreme Outliers

The data should be free from extreme outliers or influential values, as they can significantly distort the correlation coefficient and misrepresent the actual relationship.

  • Normal Distribution (for inference)

While not required for calculating correlation, a bivariate normal distribution is assumed when performing hypothesis tests or significance testing based on Pearson’s r.

  • Homoscedasticity

The variance of one variable should be relatively constant across levels of the other variable. In other words, the data points should form a roughly even “cloud” in a scatter plot rather than a funnel shape.

  • Independence of Observations

Each data pair (xi,yi) should be independent of others. Repeated or related observations violate this assumption and can bias the result.

  • Both Variables Should Be Random

Both variables should ideally be from random samples. If one or both are fixed or deterministic, the result may not reflect a general relationship.

Limitations of Karl Pearson’s Coefficient of Correlation:

  • Assumes linear relationship only

  • Sensitive to extreme values (outliers)

  • Requires quantitative data

  • Can be misinterpreted without context or scatter plot
