Building Models

Sociology 312/412/512, University of Oregon

Aaron Gullickson

The OLS Regression Line

Drawing straight lines

Elements of a straight line

The formula for a straight line

in high school

\[y=a+bx\]

  • \(a\) is the y-intercept: the value of \(y\) when \(x\) is zero.
  • \(b\) is the slope: the change in \(y\) for a one-unit increase in \(x\) (the rise over the run).
  • The values of \(a\) and \(b\) are called coefficients: constant values that are multiplied by a variable.

How we do it in statistics

\[\hat{y}_i=b_0+b_1x_i\]

  • \(\hat{y}_i\): The predicted value of \(y\) for the \(i\)th observation from the linear formula.
  • \(b_0\): The predicted value of \(y\) when \(x\) is zero.
  • \(b_1\): The predicted change in \(y\) for a one-unit increase in \(x\).

How do we know which line is best?

We choose the line that minimizes the error in our prediction

For a given observation \(i\), the value \(y_i-\hat{y}_i\) gives the residual or error in the prediction.

To get the total error in prediction, we can calculate the sum of squared residuals:

\[SSR=\sum_{i=1}^n (y_i-\hat{y}_i)^2=\sum_{i=1}^n (y_i-b_0-b_1x_i)^2\]

The best-fitting line is the one with the smallest possible sum of squared residuals. This is called the Ordinary Least Squares (OLS) regression line.
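To make this concrete, here is a minimal R sketch (assuming the crimes data used below is loaded) that computes the sum of squared residuals for any candidate line; the OLS coefficients should produce a smaller SSR than any alternative:

ssr <- function(b0, b1) {
  # predicted values from the candidate line
  yhat <- b0 + b1*crimes$unemploy_rate
  # sum of squared residuals
  sum((crimes$property_rate - yhat)^2)
}
ssr(1238.1, 220.8)  # near the OLS minimum
ssr(1000, 300)      # any other line gives a larger SSR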

Formulas for the best-fitting line

\[b_1=r\frac{s_y}{s_x}\] \[b_0=\bar{y}-b_1\bar{x}\]

We can calculate these by hand in R, although we will learn an easier way later:

# slope: b1 = r * (sd of y)/(sd of x)
slope <- cor(crimes$property_rate,crimes$unemploy_rate)*sd(crimes$property_rate)/sd(crimes$unemploy_rate)
slope
[1] 220.7796
# intercept: b0 = mean(y) - b1 * mean(x)
intercept <- mean(crimes$property_rate)-slope*mean(crimes$unemploy_rate)
intercept
[1] 1238.073

\[\hat{\texttt{property_crimes}}_i=1238.1+220.8(\texttt{unemploy_rate}_i)\]
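To generate a prediction, plug an unemployment rate into this formula. For example, for a state with 5% unemployment (simple arithmetic, done in R):

1238.073 + 220.7796*5
[1] 2341.971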

The OLS regression line as a model

The OLS regression line is often called a linear model because we characterize the relationship between the two variables with a linear function.

The lm command can be used to create a model object in R:

model <- lm(property_rate~unemploy_rate, data=crimes)

The tilde (~) indicates the relationship between the two variables, with the dependent variable on the left-hand side.

I can then use the coef command on this model to get my coefficients (i.e. intercept and slope).

coef(model)
  (Intercept) unemploy_rate 
    1238.0725      220.7796 
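If you want predicted values from the model object itself, the built-in predict function does the plug-and-chug for you; note that newdata must use the variable name from the model formula:

predict(model, newdata=data.frame(unemploy_rate=c(3, 5, 8)))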

Use summary for model TMI

summary(model)

Call:
lm(formula = property_rate ~ unemploy_rate, data = crimes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1007.08  -529.05    32.59   409.89  1785.02 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)    1238.07     408.59   3.030  0.00390 **
unemploy_rate   220.78      72.04   3.065  0.00354 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 607.6 on 49 degrees of freedom
Multiple R-squared:  0.1608,    Adjusted R-squared:  0.1437 
F-statistic: 9.392 on 1 and 49 DF,  p-value: 0.003539

Add the best-fitting line to your scatterplot

ggplot(crimes, aes(x=unemploy_rate, y=property_rate))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)+
  labs(x="unemployment rate",
       y="property crimes (per 100,000)")

geom_smooth with the argument method="lm" will add the OLS regression line to your scatterplot.

se=FALSE will suppress a confidence band which I will show later.

Interpreting the results

\[\hat{\texttt{property_crimes}}_i=1238.1+220.8(\texttt{unemploy_rate}_i)\]

Intercept

The model predicts that states with no unemployment will have a property crime rate of 1238.1 crimes per 100,000, on average.

Slope

The model predicts that a one percentage point increase in the unemployment rate is associated with an increase of 220.8 property crimes per 100,000, on average.

Language Matters!

The model predicts
We always preface results with this phrase, because we want to be clear that the results are driven by a model, which could also be a bad model.
is associated with
We use this to describe the association between variables while avoiding causal language.
on average
We don’t expect to see the exact same value for all cases with \(x=0\) or the same change in \(y\) for all one-unit increases in \(x\). Rather, we expect to see those values on average. If we don’t include this qualifier, our results seem too deterministic.

Try interpreting these numbers

Try interpreting these numbers from a regression model where the dependent variable is box office returns (in millions of dollars) and the independent variable is the metascore (from 0 to 100 in “points”).

\[\hat{\texttt{box_office}}_i=4.98+0.77(\texttt{metascore}_i)\]

Intercept

The model predicts that movies that receive a zero metascore rating will make $4.98 million, on average.

Slope

The model predicts that a one point increase in the metascore rating is associated with a $770,000 increase in box office returns, on average.

Nonsensical Intercepts

Try interpreting these numbers from a regression model where the dependent variable is sexual frequency (sexual encounters per year) and the independent variable is age in years.

\[\hat{\texttt{sex}}_i=88.32-0.83(\texttt{age}_i)\]

Intercept

The model predicts that newborns will have sex 88.32 times per year, on average.

😮 Say what??!!

Getting meaningful intercepts

Let’s subtract some constant \(a\) from the variable \(x\):

\[x^*=x-a\]

The value for zero on our new re-centered \(x^*\) will be \(a\) on the original scale.

In the formula of the lm command in R, we can do this easily by surrounding our math with I(), which tells R to evaluate the arithmetic inside and treat the result as a new variable:

model <- lm(sexf~I(age-18), data=sex)
round(coef(model), 2)
(Intercept) I(age - 18) 
      73.45       -0.83 
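The new intercept is just the original regression line evaluated at age 18: from the next slide's coefficients, \(88.32 - 0.83 \times 18 \approx 73.4\), matching 73.45 up to rounding. A quick sketch to verify, assuming the sex data frame from above:

model_raw <- lm(sexf~age, data=sex)
# the prediction at age 18 equals the recentered model's intercept
predict(model_raw, newdata=data.frame(age=18))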

Now interpret these numbers

\[\hat{\texttt{sex}}_i=73.45-0.83(\texttt{age}_i-18)\]

Intercept

The model predicts that 18-year-old individuals have sex 73.45 times per year, on average.

Slope

The model predicts that a one year increase in age is associated with 0.83 fewer sexual encounters per year, on average.

How good is \(x\) as a predictor of \(y\)?

I pick a random observation from the dataset and ask you to guess the value of \(y\). What is your best guess?

Choose \(\bar{y}\)

  • Because it is the balancing point, the mean will give you the smallest error, on average.
  • If you repeat this procedure, your average error in prediction will be equal to \(s_y\).

How good is \(x\) as a predictor of \(y\)?

I pick a random observation from the dataset and tell you its value of \(x\), and then ask you to guess the value of \(y\). What is your best guess?

Choose \(\hat{y}_i\) from the linear model

  • Assuming that a linear model is reasonable, the predicted value from this model will be your best guess.
  • The average error in your prediction will be equal to the average size of the residuals from the model, \(|y_i-\hat{y}_i|\).

How much did we reduce the error?

In a scatterplot, each point’s vertical distance from \(\bar{y}\) (red) measures the error of guessing the mean, and its distance from the regression line (green) measures the error of the model’s prediction. Proportionally, how much smaller is the green error than the red error across all observations?

Red (error around the mean): \(\sum_{i=1}^n (y_i-\bar{y})^2\)

Green (error around the line): \(\sum_{i=1}^n (y_i-\hat{y}_i)^2\)

Proportion of error eliminated: \(1-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2}\)

It turns out:

\[1-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2}=r^2\]

cor(crimes$unemploy_rate, crimes$property_rate)^2
[1] 0.1608377
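The same number can be pulled directly out of a fitted model object:

model <- lm(property_rate~unemploy_rate, data=crimes)
summary(model)$r.squared
[1] 0.1608377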

R-squared is a measure of goodness of fit

cor(crimes$unemploy_rate, crimes$property_rate)^2
[1] 0.1608377

About 16.1% of the variation in property crime rates across states can be accounted for by variation in unemployment rates across states.

summary(lm(property_rate~unemploy_rate, data=crimes))

Call:
lm(formula = property_rate ~ unemploy_rate, data = crimes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1007.08  -529.05    32.59   409.89  1785.02 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)    1238.07     408.59   3.030  0.00390 **
unemploy_rate   220.78      72.04   3.065  0.00354 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 607.6 on 49 degrees of freedom
Multiple R-squared:  0.1608,    Adjusted R-squared:  0.1437 
F-statistic: 9.392 on 1 and 49 DF,  p-value: 0.003539

Statistical inference for linear models

The population model is: \[\hat{y}_i=\beta_0+\beta_1(x_i)\]

The null hypothesis of no relationship is given by: \[H_0: \beta_1=0\]

How do we test?

summary(lm(property_rate~unemploy_rate, data=crimes))

Call:
lm(formula = property_rate ~ unemploy_rate, data = crimes)

Residuals:
     Min       1Q   Median       3Q      Max 
-1007.08  -529.05    32.59   409.89  1785.02 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)    1238.07     408.59   3.030  0.00390 **
unemploy_rate   220.78      72.04   3.065  0.00354 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 607.6 on 49 degrees of freedom
Multiple R-squared:  0.1608,    Adjusted R-squared:  0.1437 
F-statistic: 9.392 on 1 and 49 DF,  p-value: 0.003539

Just look at a summary of the model! 😎
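If you only want the test statistics rather than the full printout, coef applied to the summary returns the coefficient table as a matrix that you can index by row and column name:

# t statistic and p-value for the slope on unemployment
coef(summary(model))["unemploy_rate", c("t value", "Pr(>|t|)")]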

⚠️ Linear models only fit straight lines

⚠️ Outliers can be influential

⚠️ Don’t extrapolate beyond range of data

The Power of Controlling for Other Variables

Does lack of education lead to crime?

model <- lm(property_rate~percent_lhs, data=crimes)
coef(model)
(Intercept) percent_lhs 
  1693.3291     71.4142 

The model predicts that a one percentage point increase in the percent of a state’s population without a high school diploma is associated with 71.4 more property crimes per 100,000, on average.

Why might this be the case?

⚠️ Potential spuriousness!

cor(crimes$poverty_rate, crimes$property_rate)
[1] 0.5000783
cor(crimes$poverty_rate, crimes$percent_lhs)
[1] 0.8241208
  • States with a higher poverty rate have higher rates of property crime, on average.
  • States with a higher percent of the population without a high school diploma have a higher poverty rate, on average.
  • What if the positive relationship between percent of the population with no high school diploma and property crime is really operating through this effect on poverty rates?

Poverty might be a confounding variable

Causal

Spurious

Account for a confounding variable

Just add the potential confounder to the model:

\[\hat{\texttt{property_rate}}_i=b_0+b_1(\texttt{percent_lhs}_i)+b_2(\texttt{poverty_rate}_i)\]

😮 That’s right, you can have more than one independent variable in a linear model. But what does it mean?

Plot in three dimensions

Calculating the model

model <- lm(property_rate~percent_lhs+poverty_rate, data=crimes)
coef(model)
 (Intercept)  percent_lhs poverty_rate 
  1418.48554    -79.44948    198.28423 

Why these numbers?

The intercept and slopes are chosen to minimize the sum of squared residuals, just as in a bivariate OLS regression model.
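One way to see what the percent_lhs slope means here (a small sketch using hypothetical values): compare two states with the same poverty rate whose dropout percentages differ by one point; their predicted crime rates differ by exactly that slope:

newstates <- data.frame(percent_lhs=c(10, 11), poverty_rate=c(12, 12))
# difference in predictions equals the percent_lhs slope, about -79.4
diff(predict(model, newdata=newstates))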

Interpretation

  • The model predicts that, holding constant the poverty rate, a one percentage point increase in the percent of the population with no high school diploma is associated with 79.4 fewer property crimes per 100,000, on average.
  • The model predicts that, holding constant the percent of the population with no high school diploma, a one percentage point increase in the poverty rate is associated with 198.3 more property crimes per 100,000, on average.

🤔 Holding Constant?

Because both independent variables are in the model at the same time, the effect of each variable is net of the indirect effect of the other variable.

We can say this in different ways:

  • The model predicts that, holding constant the poverty rate, a one percentage point increase in the percent of the population with no high school diploma is associated with 79.4 fewer property crimes per 100,000, on average.
  • The model predicts that, among states with the same poverty rate, a one percentage point increase in the percent of the population with no high school diploma is associated with 79.4 fewer property crimes per 100,000, on average.
  • The model predicts that, controlling for the poverty rate, a one percentage point increase in the percent of the population with no high school diploma is associated with 79.4 fewer property crimes per 100,000, on average.

What is the effect of education on crime?

The relationship seemed positive…

but it was negative once we controlled for the poverty rate!

How to present multiple linear models

Linear models predicting property crime rates across states
                    Model 1       Model 2       Model 3
Intercept           1693.33***    1418.49***    6010.54***
                    (355.99)      (327.76)      (1137.94)
Percent no HS         71.41*       -79.45       -101.70*
                     (32.00)       (50.60)       (44.33)
Poverty rate                       198.28***     150.80**
                                   (54.81)       (49.26)
Unemployment rate                                161.14
                                                 (82.13)
Median age                                      -125.30***
                                                 (28.33)
R-squared             0.09          0.29          0.51
N                       51            51            51
***p < 0.001; **p < 0.01; *p < 0.05. Standard errors in parentheses.
  • The dependent variable is identified in the caption.
  • Each model is shown in a column.
  • Independent variables are on the rows. If a cell is blank, then the given variable is not in the model.
  • Within each cell:
    • The top number is the slope.
    • The bottom number in parentheses is the standard error for the slope.
  • The asterisks give benchmarks of the p-value for rejecting the null hypothesis that a slope is zero.
  • Summary statistics are shown at the bottom.
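Tables like this are typically generated rather than typed by hand. As one possibility, a sketch using the texreg package (assuming it is installed; median_age is a guess at the variable name for median age in the crimes data):

library(texreg)
m1 <- lm(property_rate~percent_lhs, data=crimes)
m2 <- lm(property_rate~percent_lhs+poverty_rate, data=crimes)
m3 <- lm(property_rate~percent_lhs+poverty_rate+unemploy_rate+median_age,
         data=crimes)
# screenreg prints a plain-text table; texreg() produces a LaTeX version
screenreg(list(m1, m2, m3))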

Including Categorical Variables as Predictors

Gender and sexual frequency

tapply(sex$sexf, sex$gender, mean)
    Male   Female 
48.11560 45.94162 
45.94162-48.11560
[1] -2.17398

Women report 2.17 fewer sexual encounters per year than men.

Warning

Note that I use the term report here because it’s not exactly clear why these numbers would be different. The difference could reflect differences by sexual orientation, or it could just be that either men over-report or women under-report sexual frequency. It could also be sampling error.

Make an indicator variable

\[\texttt{female}_i=\begin{cases} 1 & \text{if female}\\ 0 & \text{otherwise} \end{cases}\]

  • male is the reference category.
  • female is the indicated category.
  • It operates like an on/off switch.
sex$female <- as.numeric(sex$gender=="Female")
table(sex$gender, sex$female)
        
            0    1
  Male   5242    0
  Female    0 6543

Make a scatterplot with indicator

tapply vs. lm

Two separate means

tapply(sex$sexf, sex$female, mean)
       0        1 
48.11560 45.94162 

Mean and mean difference

model <- lm(sexf~female, data=sex)
coef(model)
(Intercept)      female 
  48.115605   -2.173988 
  • Intercept is the mean for the reference (men)
  • Slope is the mean difference, which in this case tells us that women report 2.17 fewer instances of sex per year than men.

Reverse reference - what changes and why?

sex$male <- as.numeric(sex$gender=="Male")
model <- lm(sexf~male, data=sex)
coef(model)
(Intercept)        male 
  45.941617    2.173988 

Categorical variables in lm

There is no need to create indicator variables. Just feed in categorical variables directly:

model <- lm(sexf~gender, data=sex)
coef(model)
 (Intercept) genderFemale 
   48.115605    -2.173988 
  • R knows what to do with the variable. It creates its own indicator variable.
  • The reference for the categorical variable is already set as the first category, which in this case is male. You can use the relevel command to change the reference:
model <- lm(sexf~relevel(gender, "Female"), data=sex)
coef(model)
                  (Intercept) relevel(gender, "Female")Male 
                    45.941617                      2.173988 

More than two categories

model <- lm(sexf~marital, data=sex)
coef(model)
    (Intercept)  maritalMarried maritalDivorced  maritalWidowed 
      53.455737       -7.010258       -5.576478      -27.573078 
  • Each category gets an indicator variable, except for one. Which one is missing here?
  • Never-married is the reference category. The category not included is always the reference category.
  • Each coefficient gives the mean difference between the indicated category and the reference category.
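Because each coefficient is a difference from the reference, any group’s mean can be recovered by adding its coefficient to the intercept:

b <- coef(model)
b["(Intercept)"]                        # never-married mean: 53.46
b["(Intercept)"] + b["maritalWidowed"]  # widowed mean: 53.46 - 27.57 = 25.88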

Interpretations

model <- lm(sexf~marital, data=sex)
coef(model)
    (Intercept)  maritalMarried maritalDivorced  maritalWidowed 
      53.455737       -7.010258       -5.576478      -27.573078 
  • Never-married individuals have sex 53.5 times per year, on average.
  • Widowed individuals have sex 27.6 fewer times per year than never-married individuals, on average.
  • Divorced individuals have sex 5.6 fewer times per year than never-married individuals, on average.
  • Married individuals have sex 7.0 fewer times per year than never-married individuals, on average.

Why use a model?

In a model, we can calculate mean differences while holding constant other variables.

For example, how much of the differences in sexual frequency by marital status result from differences in age?

model <- lm(sexf~marital+I(age-18), data=sex)
coef(model)
    (Intercept)  maritalMarried maritalDivorced  maritalWidowed     I(age - 18) 
     70.2052710       7.3318447      10.9010137       4.9924914      -0.9080391 

These results are very different!

Compare the models

Linear models predicting sexual frequency
            Model 1      Model 2
Intercept    53.46***     70.21***
             (0.79)       (0.90)
Married      -7.01***      7.33***
             (0.99)       (1.03)
Divorced     -5.58***     10.90***
             (1.21)       (1.25)
Widowed     -27.57***      4.99**
             (1.62)       (1.81)
Age                       -0.91***
                          (0.03)
R-squared     0.02         0.11
N            11785        11785
***p < 0.001; **p < 0.01; *p < 0.05. Standard errors in parentheses. Age centered on 18 years. Never-married is the reference category for marital status.

Controlling for age gets rid of the bias between marital groups due to age differences.

  • In Model 1, all of the differences are negative, indicating that never-married individuals, on average, have sex more than the other marital categories.
  • In Model 2, all of the differences are positive, indicating that never-married individuals, on average, have sex less than the other marital categories when comparing individuals of the same age.
  • The substantially higher sexual frequency of never-married individuals is entirely a function of age.

Interaction Terms

Adding context to a relationship

ggplot(earnings, aes(x=nchild, y=wages, 
                     color=gender))+
  geom_jitter(alpha=0.05)+
  scale_color_viridis_d(end=0.75)+
  geom_smooth(method="lm", se=FALSE)+
  labs(x="number of children",
       y="hourly wages in USD")

Additive models will miss context

model_add <- lm(wages~nchild+gender, data=earnings)
coef(model_add)
 (Intercept)       nchild genderFemale 
   25.239547     1.155477    -3.974105 

\[\hat{\texttt{wages}}_i=25.24+1.16(\texttt{nchild}_i)-3.97(\texttt{female}_i)\]

What is the relationship between wages and number of children for men and women?

Men

The \(\texttt{female}_i\) variable is an indicator variable that is zero for men, so:

\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i)-3.97(0)\\ \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i) \end{eqnarray*}\]

Women

The \(\texttt{female}_i\) variable is an indicator variable that is one for women, so:

\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i)-3.97(1)\\ \hat{\texttt{wages}}_i & = & (25.24-3.97)+1.16(\texttt{nchild}_i)\\ \hat{\texttt{wages}}_i & = & 21.27+1.16(\texttt{nchild}_i)\\ \end{eqnarray*}\]

Additive models make parallel lines

\[\begin{eqnarray*} \hat{\texttt{wages}}_i&=&25.24+1.16(\texttt{nchild}_i)\\ & & -3.97(\texttt{female}_i) \end{eqnarray*}\]

  • The effect of number of children on wages ($1.16) is forced to be the same for men and women.
  • The wage difference between men and women ($3.97) is forced to be the same at all values of number of children.
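You can verify the parallelism directly: predictions for men and women from the additive model differ by the same $3.97 at every number of children:

men   <- predict(model_add, newdata=data.frame(nchild=0:3, gender="Male"))
women <- predict(model_add, newdata=data.frame(nchild=0:3, gender="Female"))
men - women  # a constant 3.97 gap regardless of number of children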

We need a multiplicative model

model_mult <- lm(wages~nchild*gender, data=earnings)
coef(model_mult)
        (Intercept)              nchild        genderFemale nchild:genderFemale 
          24.719748            1.778860           -2.839198           -1.334728 

\[\hat{\texttt{wages}}_i=24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)-1.33(\texttt{nchild}_i)(\texttt{female}_i)\]

What is the relationship between wages and number of children for men and women?

Men

The \(\texttt{female}_i\) variable is an indicator variable that is zero for men, so:

\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(0)\\ & & -1.33(\texttt{nchild}_i)(0)\\ \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i) \end{eqnarray*}\]

Women

The \(\texttt{female}_i\) variable is an indicator variable that is one for women, so:

\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(1)\\ & & -1.33(\texttt{nchild}_i)(1)\\ \hat{\texttt{wages}}_i & = & (24.72-2.84)+(1.78-1.33)(\texttt{nchild}_i)\\ \hat{\texttt{wages}}_i & = & 21.88+0.45(\texttt{nchild}_i)\\ \end{eqnarray*}\]

Multiplicative models give non-parallel lines

\[\begin{eqnarray*}\hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)\\ & & -1.33(\texttt{nchild}_i)(\texttt{female}_i)\\\end{eqnarray*}\]

  • This model shows that men and women get different wage returns to the number of children, with men getting a much greater return ($1.78 vs. $0.45).
  • This model shows that the wage gap starts at $2.84 when men and women have no children and grows by $1.33 for every child.
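The gender-specific intercepts and slopes can be read straight off the coefficients:

b <- coef(model_mult)
b["nchild"]                             # men's slope: 1.78
b["nchild"] + b["nchild:genderFemale"]  # women's slope: 1.78 - 1.33 = 0.45
b["(Intercept)"] + b["genderFemale"]    # women's intercept: 24.72 - 2.84 = 21.88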

Two approaches

Separate models

coef(lm(wages~nchild,
        data=subset(earnings, gender=="Female")))
(Intercept)      nchild 
 21.8805499   0.4441326 
coef(lm(wages~nchild,
        data=subset(earnings, gender=="Male")))
(Intercept)      nchild 
   24.71975     1.77886 
  • Two intercepts (women and men)
  • Two slopes (women and men)

Interaction term

coef(lm(wages~nchild*gender, data=earnings))
        (Intercept)              nchild 
          24.719748            1.778860 
       genderFemale nchild:genderFemale 
          -2.839198           -1.334728 
  • One intercept (men) and one difference in intercept (women)
  • One slope (men) and one difference in slope (women)

Interaction terms give difference in slopes

Value                                                      Separate models   Interaction terms
Intercept
  Men’s wages with no children                             $24.72            $24.72
  Women’s wages with no children                           $21.88
  Difference in men’s and women’s wages with no children                     -$2.84
Slope
  Men’s return for an additional child                     $1.78             $1.78
  Women’s return for an additional child                   $0.45
  Difference in men’s and women’s return for an additional child             -$1.33

Interpretation

coef(lm(wages~nchild*gender, data=earnings))
        (Intercept)              nchild        genderFemale nchild:genderFemale 
          24.719748            1.778860           -2.839198           -1.334728 

\[\hat{\texttt{wages}}_i = 24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)-1.33(\texttt{nchild}_i)(\texttt{female}_i)\]

  • The model predicts that men with no children make $24.72/hour, on average.
  • The model predicts that among workers with no children, women make $2.84 less than men, on average.
  • The model predicts that among men, having an additional child at home is associated with a $1.78 increase in hourly wages.
  • The model predicts that the gain in hourly wages from having an additional child at home is $1.33 smaller for women than it is for men.
  • The main effect of each variable in the interaction term is only the effect when the other variable in the interaction term is zero/the reference category.

Why not always separate models?

Linear models predicting hourly wages

                              Model 1     Model 2
Intercept                      24.72***    24.79***
                               (0.07)      (0.07)
number of children              1.78***     1.47***
                               (0.05)      (0.05)
woman                          -2.84***    -3.13***
                               (0.10)      (0.10)
woman x number of children     -1.33***    -1.08***
                               (0.07)      (0.07)
age                                         0.27***
                                           (0.00)
R-squared                       0.02        0.07
N                             145647      145647
***p < 0.001; **p < 0.01; *p < 0.05. Standard errors in parentheses. Age centered on 40 years.
model1 <- lm(wages~nchild*gender, data=earnings)
model2 <- lm(wages~nchild*gender+I(age-40), 
             data=earnings)
  • We can add in additional control variables without forcing everything to vary by context.
  • We can do a hypothesis test directly on whether the slopes really are different by looking at the p-value on the interaction term.

Interacting two categorical variables

An additive model:

round(coef(lm(wages~gender+education, data=earnings)), 2)
              (Intercept)              genderFemale       educationHS Diploma 
                    16.27                     -5.06                      4.71 
       educationAA Degree educationBachelors Degree  educationGraduate Degree 
                     8.40                     17.13                     24.63 
                    LHS          HS                AA                BA                 Grad
Man                 16.27        16.27+4.71        16.27+8.40        16.27+17.13        16.27+24.63
Woman               16.27-5.06   16.27+4.71-5.06   16.27+8.40-5.06   16.27+17.13-5.06   16.27+24.63-5.06
Gender difference   -5.06        -5.06             -5.06             -5.06              -5.06
  • Gender differences are forced to be the same at every education level
  • Returns to degree are forced to be the same for men and women

Interacting two categorical variables

A multiplicative model:

round(coef(lm(wages~gender*education, data=earnings)), 2)
                           (Intercept)                           genderFemale 
                                 15.79                                  -3.84 
                   educationHS Diploma                     educationAA Degree 
                                  4.82                                   8.58 
             educationBachelors Degree               educationGraduate Degree 
                                 18.30                                  25.82 
      genderFemale:educationHS Diploma        genderFemale:educationAA Degree 
                                 -0.41                                  -0.67 
genderFemale:educationBachelors Degree  genderFemale:educationGraduate Degree 
                                 -2.54                                  -2.52 
                    LHS          HS                     AA                     BA                      Grad
Man                 15.79        15.79+4.82             15.79+8.58             15.79+18.30             15.79+25.82
Woman               15.79-3.84   15.79+4.82-3.84-0.41   15.79+8.58-3.84-0.67   15.79+18.30-3.84-2.54   15.79+25.82-3.84-2.52
Gender difference   -3.84        -4.25                  -4.51                  -6.38                   -6.36

Two ways to view it

Differences in gender gap by degree

Degree   Gender gap
None     -3.84
HS       -3.84-0.41
AA       -3.84-0.67
BA       -3.84-2.54
Grad     -3.84-2.52

Differences in returns for men and women

Degree   Men’s return   Women’s return
HS       4.82           4.82-0.41
AA       8.58           8.58-0.67
BA       18.30          18.30-2.54
Grad     25.82          25.82-2.52
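Both tables come from the same coefficients. For example, the gender gap column can be computed directly from the fitted model (coefficient names as shown in the output above):

b <- coef(lm(wages~gender*education, data=earnings))
# gender gap at each education level: main gender effect plus interaction
b["genderFemale"] + c(None=0,
                      HS=b[["genderFemale:educationHS Diploma"]],
                      AA=b[["genderFemale:educationAA Degree"]],
                      BA=b[["genderFemale:educationBachelors Degree"]],
                      Grad=b[["genderFemale:educationGraduate Degree"]])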

Same underlying reality