What Is Regression Analysis?
Regression analysis is a statistical method that models the relationship between a dependent variable (outcome) and one or more independent variables (predictors). It estimates how changes in the predictors are associated with changes in the outcome, producing an equation you can use for both explanation and prediction. Simple linear regression uses one predictor. Multiple regression uses two or more. The output tells you the direction of each relationship (positive or negative), its magnitude (how much the outcome changes per unit change in the predictor), and how well the overall model fits the data. Researchers, analysts, and data scientists use regression across virtually every field, from predicting customer churn to identifying which product features drive satisfaction.
Why Regression Analysis Matters
Cross-tabs and t-tests tell you whether a difference exists. Regression tells you how much each factor contributes and which ones matter most. That's a fundamentally different, and more actionable, level of insight.
Consider a brand tracking study. You know overall satisfaction dropped 4 points this quarter. A t-test confirms the drop is significant. But which aspect of the experience caused it? Regression can model satisfaction as a function of product quality, customer support, delivery speed, and price perception simultaneously. It might reveal that delivery speed has a coefficient of 0.45 while price perception's coefficient is only 0.08, meaning delivery has more than five times the influence, provided both attributes are rated on the same scale. Now you know where to invest.
Regression also enables prediction. Once you've fit a model, you can estimate what would happen if you improved delivery speed by 1 point, or what satisfaction score to expect from a customer segment you haven't surveyed yet.
How Regression Analysis Works
Simple Linear Regression
Simple regression models the relationship between one predictor (X) and one outcome (Y) as a straight line.
The formula:
Y = b0 + b1 * X + e
Where Y is the observed outcome, b0 is the intercept (the expected value of Y when X = 0), b1 is the slope (how much Y changes for each one-unit increase in X), and e is the error term (the variation in Y that X doesn't explain).
The coefficients are calculated as:
b1 = SUM((xi - x-bar)(yi - y-bar)) / SUM((xi - x-bar)^2)
b0 = y-bar - b1 * x-bar
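The two formulas above translate almost line for line into code. The following is a minimal sketch in plain Python; the function name `fit_simple_regression` and the four check points are illustrative assumptions, not part of any particular library.

```python
def fit_simple_regression(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator of b1: sum of cross-products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of b1: sum of squared deviations of x from its mean.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Illustrative check: these points lie exactly on y = 1 + 2x,
# so the fit should recover an intercept of 1 and a slope of 2.
b0, b1 = fit_simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

The guard on `s_xx` matters in practice: if every X value is identical, there is no variation to learn a slope from.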
Worked Example: Simple Regression
A retail chain wants to know how advertising spend relates to store traffic. Data from 6 months:
| Month | Ad Spend ($K) | Store Visits (K) |
|---|---|---|
| Jan | 10 | 22 |
| Feb | 15 | 28 |
| Mar | 12 | 25 |
| Apr | 20 | 35 |
| May | 18 | 32 |
| Jun | 25 | 40 |
Step 1: Calculate means.
x-bar = (10 + 15 + 12 + 20 + 18 + 25) / 6 = 100 / 6 = 16.67
y-bar = (22 + 28 + 25 + 35 + 32 + 40) / 6 = 182 / 6 = 30.33
Step 2: Calculate b1 (slope).
SUM((xi - x-bar)(yi - y-bar)):
(10-16.67)(22-30.33) + (15-16.67)(28-30.33) + (12-16.67)(25-30.33) + (20-16.67)(35-30.33) + (18-16.67)(32-30.33) + (25-16.67)(40-30.33)
= (-6.67)(-8.33) + (-1.67)(-2.33) + (-4.67)(-5.33) + (3.33)(4.67) + (1.33)(1.67) + (8.33)(9.67)
= 55.56 + 3.89 + 24.89 + 15.56 + 2.22 + 80.56
= 182.68
SUM((xi - x-bar)^2): 44.49 + 2.79 + 21.81 + 11.09 + 1.77 + 69.39 = 151.34
b1 = 182.68 / 151.34 = 1.207
Step 3: Calculate b0 (intercept).
b0 = 30.33 - 1.207 * 16.67 = 30.33 - 20.12 = 10.21
The regression equation:
Store Visits = 10.21 + 1.207 * Ad Spend
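Interpretation: for every additional $1,000 in advertising, the model predicts an increase of approximately 1,207 store visits. The intercept of 10.21 (about 10,210 visits) is the model's estimate of traffic with zero ad spend, but treat it cautiously: zero lies well outside the observed $10K-$25K range.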
Prediction: If the chain spends $22K next month: Store Visits = 10.21 + 1.207 * 22 = 10.21 + 26.55 = 36.76K
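If you'd like to check the hand calculation, here is a short sketch that reruns the worked example on the six months of data from the table. The variable and function names are illustrative. Because the code carries full precision instead of rounding at each step, expect the intercept and the $22K prediction to differ from the hand-worked figures in the second decimal place (roughly 10.22 and 36.77K).

```python
ad_spend = [10, 15, 12, 20, 18, 25]      # $K, Jan through Jun
store_visits = [22, 28, 25, 35, 32, 40]  # thousands, Jan through Jun

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(store_visits) / n

# Step 2: slope = sum of cross-products / sum of squared x deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, store_visits))
s_xx = sum((x - x_bar) ** 2 for x in ad_spend)
slope = s_xy / s_xx

# Step 3: intercept from the two means and the slope.
intercept = y_bar - slope * x_bar


def predict_visits(spend_k):
    """Predicted store visits (thousands) for ad spend given in $K."""
    return intercept + slope * spend_k


print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"predicted visits at $22K: {predict_visits(22):.2f}K")
```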
R-Squared: How Good Is the Fit?
R-squared (R^2) measures the proportion of variance in the outcome that the model explains. It ranges from 0 to 1.
Formula:
R^2 = 1 - (SS_residual / SS_total)
Where SS_residual is the sum of squared differences between observed and predicted values, and SS_total is the sum of squared differences between observed values and the mean.
For the ad spend example:
SS_total = SUM((yi - y-bar)^2) = 69.39 + 5.43 + 28.41 + 21.81 + 2.79 + 93.51 = 221.34
Predicted values: 22.28, 28.32, 24.70, 34.35, 31.94, 40.39
SS_residual = (22-22.28)^2 + (28-28.32)^2 + (25-24.70)^2 + (35-34.35)^2 + (32-31.94)^2 + (40-40.39)^2 = 0.08 + 0.10 + 0.09 + 0.42 + 0.004 + 0.15 = 0.844
R^2 = 1 - (0.844 / 221.34) = 1 - 0.0038 = 0.996
An R^2 of 0.996 means ad spend explains 99.6% of the variation in store visits in this dataset. That's unusually high. In real-world research with survey data, R^2 values between 0.30 and 0.70 are more typical and still considered useful.
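The R-squared calculation for the ad spend example can be sketched the same way. The function name `r_squared` is an assumption made for illustration; it refits the line internally so the block runs on its own.

```python
ad_spend = [10, 15, 12, 20, 18, 25]      # $K
store_visits = [22, 28, 25, 35, 32, 40]  # thousands


def r_squared(xs, ys):
    """R^2 = 1 - SS_residual / SS_total for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    predicted = [intercept + slope * x for x in xs]
    # Variation the line leaves unexplained.
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Total variation of the outcome around its mean.
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


r2 = r_squared(ad_spend, store_visits)
print(f"R^2 = {r2:.3f}")
```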
Multiple Regression
Multiple regression adds more predictors to the model:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk + e
For example, modeling customer satisfaction as:
Satisfaction = 1.2 + 0.45*(Product Quality) + 0.30*(Support Speed) + 0.15*(Price Fairness) + 0.08*(Website Ease)
Each coefficient represents the effect of that predictor while holding all other predictors constant. Product quality has the largest coefficient (0.45), meaning a one-unit improvement in perceived product quality is associated with a 0.45-point increase in satisfaction, assuming support speed, price fairness, and website ease stay the same.
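To see what "holding all other predictors constant" means in practice, here is a small sketch that plugs ratings into the satisfaction equation above. The coefficients come from that equation; the respondent ratings and all names are hypothetical, chosen only for illustration.

```python
INTERCEPT = 1.2
COEFFICIENTS = {
    "product_quality": 0.45,
    "support_speed": 0.30,
    "price_fairness": 0.15,
    "website_ease": 0.08,
}


def predict_satisfaction(ratings):
    """Apply the fitted equation to one respondent's ratings."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * ratings[name] for name in COEFFICIENTS
    )


# Hypothetical respondent; ratings are made up for illustration.
baseline = {
    "product_quality": 4,
    "support_speed": 3,
    "price_fairness": 3,
    "website_ease": 4,
}
# Same respondent with product quality one point higher, all else unchanged.
improved = dict(baseline, product_quality=5)

base_score = predict_satisfaction(baseline)
lift = predict_satisfaction(improved) - base_score
print(f"baseline: {base_score:.2f}, lift from +1 quality: {lift:.2f}")
```

The lift equals the product quality coefficient because nothing else in the equation moved.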
Adjusted R-Squared
When you add predictors, R^2 never decreases and almost always increases, even if the new variable adds no real explanatory power. Adjusted R^2 penalizes for model complexity:
Adjusted R^2 = 1 - ((1 - R^2)(n - 1) / (n - k - 1))
Where n is the sample size and k is the number of predictors. If adding a variable increases adjusted R^2, its contribution outweighs the complexity penalty, though that alone doesn't make it statistically significant. If adjusted R^2 drops, the variable is adding little beyond noise.
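A quick sketch of the adjusted R^2 formula shows the penalty at work. The function name is an assumption, and the second scenario (30 respondents, a fifth predictor that nudges R^2 from 0.50 to 0.51) is hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Ad spend example from earlier: R^2 = 0.996, n = 6 months, k = 1 predictor.
adj_example = adjusted_r_squared(0.996, n=6, k=1)

# Hypothetical: adding a fifth predictor lifts R^2 only from 0.50 to 0.51.
adj_before = adjusted_r_squared(0.50, n=30, k=4)
adj_after = adjusted_r_squared(0.51, n=30, k=5)

print(f"{adj_example:.3f}")
print(f"{adj_before:.3f} -> {adj_after:.3f}")
```

In the hypothetical case, raw R^2 goes up while adjusted R^2 goes down, which is the signal that the extra variable isn't paying for itself.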
Interpreting Coefficients
Each coefficient tells you three things:
- Direction: positive means the predictor and outcome move together; negative means they move in opposite directions
- Magnitude: the unstandardized coefficient tells you the predicted change in Y for a one-unit change in X
- Significance: each coefficient has its own p-value; a non-significant coefficient (p > 0.05) means the data don't give reliable evidence that the predictor contributes to the model
Standardized coefficients (beta weights) let you compare the relative importance of predictors measured on different scales. A beta of 0.42 for product quality vs. 0.18 for price fairness means quality has roughly twice the impact, regardless of how each variable was measured.
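One common way to get a standardized coefficient is to rescale the unstandardized one by the ratio of standard deviations: beta = b * (SD of X / SD of Y). The sketch below applies that to the ad spend example; the function name is illustrative, and it uses the sample standard deviation from Python's `statistics` module.

```python
from statistics import stdev

ad_spend = [10, 15, 12, 20, 18, 25]      # $K
store_visits = [22, 28, 25, 35, 32, 40]  # thousands


def standardized_beta(unstandardized_b, xs, ys):
    """Convert a raw coefficient to a beta weight: b * sd(x) / sd(y)."""
    return unstandardized_b * stdev(xs) / stdev(ys)


# 1.207 is the slope from the worked example.
beta = standardized_beta(1.207, ad_spend, store_visits)
print(f"beta = {beta:.3f}")
```

With a single predictor, the beta weight equals the correlation between X and Y, which is why it lands close to 1 here. In a multiple regression you would apply the same rescaling to each predictor's coefficient.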
When to Use Regression Analysis
- Key driver analysis: identifying which attributes have the strongest influence on overall satisfaction, loyalty, or purchase intent
- Forecasting: predicting future sales, traffic, or response rates based on controllable inputs
- Controlling for confounds: isolating the effect of one variable while accounting for others that might explain the same outcome
- Segmentation validation: testing whether demographic or behavioral variables meaningfully predict membership in a customer segment
- Pricing research: modeling the relationship between price and demand across different product configurations
Common Mistakes
- Confusing correlation with causation: regression shows association, not causation, unless the study design supports causal inference (randomized experiment, natural experiment)
- Ignoring multicollinearity: when two predictors are highly correlated (e.g., product quality and build quality), coefficients become unstable and hard to interpret; check Variance Inflation Factors (VIF > 5 is a warning sign)
- Overfitting with too many predictors: a model with 20 predictors and 50 observations can fit the sample closely yet predict new data poorly; aim for at least 10-20 observations per predictor
- Extrapolating beyond the data range: if your ad spend ranged from $10K to $25K, predicting the outcome at $100K is unreliable since the linear relationship may not hold
- Reporting R^2 without context: an R^2 of 0.35 is excellent for predicting human behavior from survey data; presenting it as "the model only explains 35%" misleads stakeholders
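The multicollinearity check from the list above can be sketched for the simplest case. With exactly two predictors, the VIF for either one reduces to 1 / (1 - r^2), where r is the correlation between them; with more predictors, each VIF instead uses the R^2 from regressing that predictor on all the others. The ratings below are made up for illustration, and the function names are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    if abs(r) >= 1:
        return float("inf")  # perfectly collinear predictors
    return 1 / (1 - r ** 2)


# Hypothetical 1-5 ratings from eight respondents.
product_quality = [3, 4, 5, 2, 4, 5, 3, 1]
build_quality = [3, 4, 5, 2, 4, 5, 3, 2]

vif = vif_two_predictors(product_quality, build_quality)
print(f"VIF = {vif:.1f}")
```

These two made-up attributes move almost in lockstep, so the VIF comes out well above the warning threshold of 5.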
How Quali-Fi Supports Regression Analysis
Quali-Fi's Intelligence tier ($2,750+/project) includes key driver analysis powered by multiple regression, automatically identifying which survey attributes most strongly predict your outcome variable. The platform outputs standardized coefficients, R-squared values, and significance levels in a visual priority matrix, with no statistical software required. For teams running their own analysis, Quali-Fi exports clean, labeled datasets compatible with SPSS, R, and Python.
Frequently Asked Questions
What's the difference between regression and correlation?
Correlation measures the strength and direction of a linear relationship between two variables (ranging from -1 to +1). Regression goes further: it provides an equation that predicts one variable from another and quantifies how much the outcome changes per unit change in the predictor. Correlation is symmetric (X with Y = Y with X), while regression is directional (X predicts Y).
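The symmetric-versus-directional distinction is easy to see numerically. This sketch reuses the ad spend data from the worked example; the helper names are illustrative. Swapping X and Y leaves the correlation unchanged but gives a different regression slope.

```python
from math import sqrt

ad_spend = [10, 15, 12, 20, 18, 25]      # $K
store_visits = [22, 28, 25, 35, 32, 40]  # thousands


def deviations_sums(xs, ys):
    """Return (s_xy, s_xx, s_yy): sums of products and squares of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy, s_xx, s_yy


def correlation(xs, ys):
    s_xy, s_xx, s_yy = deviations_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)


def slope(xs, ys):
    """Slope from regressing ys on xs."""
    s_xy, s_xx, _ = deviations_sums(xs, ys)
    return s_xy / s_xx


r_xy = correlation(ad_spend, store_visits)
r_yx = correlation(store_visits, ad_spend)
visits_on_spend = slope(ad_spend, store_visits)
spend_on_visits = slope(store_visits, ad_spend)

print(f"r: {r_xy:.3f} vs {r_yx:.3f}")
print(f"slopes: {visits_on_spend:.3f} vs {spend_on_visits:.3f}")
```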
How large should my sample be for regression?
A common guideline is at least 10-20 observations per predictor variable. For a model with 5 predictors, you'd want at least 50-100 observations. With fewer, coefficient estimates become unstable and the model may overfit. As a further rule of thumb, Green (1991) suggests N >= 50 + 8k (where k = number of predictors) for testing the overall model and N >= 104 + k for testing individual coefficients.
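These guidelines are simple enough to wrap in a helper. The sketch below covers the 10-20 observations per predictor range and the 50 + 8k formula attributed to Green (1991); the function name and dictionary keys are assumptions made for illustration.

```python
def sample_size_guidelines(k):
    """Rule-of-thumb minimum sample sizes for a model with k predictors."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {
        "per_predictor_low": 10 * k,    # 10 observations per predictor
        "per_predictor_high": 20 * k,   # 20 observations per predictor
        "green_50_plus_8k": 50 + 8 * k,
    }


guidelines = sample_size_guidelines(5)
print(guidelines)
```

Treat the results as starting points. A formal power analysis based on your expected effect size gives a better-grounded answer.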
Can regression handle non-linear relationships?
Standard linear regression models straight-line relationships, but you can model curves by adding polynomial terms (X^2, X^3) or by transforming variables (log, square root). When the outcome itself isn't continuous, a different model family fits better: logistic regression for binary outcomes, ordinal regression for ranked outcomes. For more complex non-linear patterns, machine learning methods may be more appropriate.
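As a sketch of the transformation approach, the example below fits an ordinary straight line to log(X) instead of X. The data are synthetic, generated from Y = 2 + 3 * ln(X) purely for illustration, so the fit should recover those two numbers; the function names are assumptions.

```python
from math import log


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


# Synthetic data with diminishing returns: each doubling of x adds the same amount.
x_raw = [1, 2, 4, 8, 16]
y = [2 + 3 * log(x) for x in x_raw]

# Transform the predictor, then fit a straight line to the transformed values.
x_logged = [log(x) for x in x_raw]
intercept, slope = fit_line(x_logged, y)


def predict(x):
    """Predict y for a raw (untransformed) x value."""
    return intercept + slope * log(x)


print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The model is still linear in its coefficients; only the predictor has been rescaled. Remember to apply the same transformation when predicting for new values.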
What does a negative coefficient mean?
A negative coefficient means the predictor and outcome move in opposite directions. For example, a coefficient of -0.3 for "price" in a purchase intent model means that for every one-unit increase in perceived price, predicted purchase intent drops by 0.3 points. This doesn't mean the variable is unimportant; it just means the relationship is inverse.