Sociology 312/412/512, University of Oregon
\[y=a+bx\]
\[\hat{y}_i=b_0+b_1x_i\]
For a given observation \(i\), the value \(y_i-\hat{y}_i\) gives the residual or error in the prediction.
To get the total error in prediction, we can calculate the sum of squared residuals:
\[SSR=\sum_{i=1}^n (y_i-\hat{y}_i)^2=\sum_{i=1}^n (y_i-b_0-b_1x_i)^2\]
The best-fitting line is the one with the smallest possible sum of squared residuals. This is called the Ordinary Least Squares (OLS) regression line.
\[b_1=r\frac{s_y}{s_x}\] \[b_0=\bar{y}-b_1\bar{x}\]
We can calculate the slope and intercept by hand in R, although we will learn an easier way later:
\[\hat{\texttt{property_crimes}}_i=1238.1+220.8(\texttt{unemploy_rate}_i)\]
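The slope and intercept formulas can be checked by hand. The crimes data are not reproduced here, so this minimal sketch assumes R's built-in `mtcars` data instead, with car weight (`wt`) as the independent variable and miles per gallon (`mpg`) as the dependent variable. The object names are illustrative only.

```r
# Built-in example data (an assumption; the slides use a state-level crimes dataset)
x <- mtcars$wt   # independent variable
y <- mtcars$mpg  # dependent variable

# Slope: the correlation coefficient times the ratio of standard deviations
b1 <- cor(x, y) * sd(y) / sd(x)

# Intercept: puts the line through the point (mean of x, mean of y)
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)

# Sum of squared residuals for this line
ssr <- sum((y - b0 - b1 * x)^2)
ssr
```

Nudging either `b0` or `b1` away from these values and recomputing `ssr` should always give a larger sum of squared residuals.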
The OLS regression line is often called a linear model because it characterizes the relationship between two variables with a linear function.
The `lm` command can be used to create a model object in R:
The tilde (~) is used to indicate the relationship between the two variables, with the dependent variable on the left-hand side.
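As a rough illustration of the formula syntax, the sketch below fits a bivariate model on the built-in `mtcars` data. That dataset is an assumption made only so the example runs on its own; the call has the same shape as the one used for the crimes data.

```r
# Dependent variable on the left of the tilde, independent variable on the right
model <- lm(mpg ~ wt, data = mtcars)

# Just the intercept and slope
coef(model)

# The full summary is more than we usually need
summary(model)
```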
`summary` for model TMI
Call:
lm(formula = property_rate ~ unemploy_rate, data = crimes)
Residuals:
Min 1Q Median 3Q Max
-1007.08 -529.05 32.59 409.89 1785.02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1238.07 408.59 3.030 0.00390 **
unemploy_rate 220.78 72.04 3.065 0.00354 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 607.6 on 49 degrees of freedom
Multiple R-squared: 0.1608, Adjusted R-squared: 0.1437
F-statistic: 9.392 on 1 and 49 DF, p-value: 0.003539
`geom_smooth` with the argument `method="lm"` will add the OLS regression line to your scatterplot.
`se=FALSE` will suppress a confidence band which I will show later.
\[\hat{\texttt{property_crimes}}_i=1238.1+220.8(\texttt{unemploy_rate}_i)\]
The model predicts that states with no unemployment will have a property crime rate of 1238.1 crimes per 100,000, on average.
The model predicts that a one percentage point increase in the unemployment rate is associated with an increase of 220.8 property crimes per 100,000, on average.
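To see how the intercept and slope translate into predictions, here is a small sketch using the rounded coefficients from the equation above. The unemployment rates of 0, 5, and 6 are hypothetical values picked for illustration.

```r
# Rounded coefficients from the fitted equation
b0 <- 1238.1  # predicted property crime rate when unemployment is zero
b1 <- 220.8   # predicted change per one-point increase in unemployment

# Hypothetical unemployment rates
unemploy_rate <- c(0, 5, 6)

# Predicted property crimes per 100,000
predicted <- b0 + b1 * unemploy_rate
predicted

# Moving from 5 to 6 changes the prediction by exactly the slope
diff(predicted[2:3])
```

The prediction at zero is just the intercept, and any two predictions one unit apart differ by the slope.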
Try interpreting these numbers from a regression model where the dependent variable is box office returns (in millions of dollars) and the independent variable is the metascore (from 0 to 100 in “points”).
\[\hat{\texttt{box_office}}_i=4.98+0.77(\texttt{metascore}_i)\]
The model predicts that movies that receive a zero metascore rating will make $4.98 million, on average.
The model predicts that a one point increase in the metascore rating is associated with a $770,000 increase in box office returns, on average.
Try interpreting these numbers from a regression model where the dependent variable is sexual frequency (sexual encounters per year) and the independent variable is age in years.
\[\hat{\texttt{sex}}_i=88.32-0.83(\texttt{age}_i)\]
The model predicts that newborns will have sex 88.32 times per year, on average.
😮 Say what??!!
Let's subtract some constant \(a\) from the variable \(x\):
\[x^*=x-a\]
The value for zero on our new re-centered \(x^*\) will be \(a\) on the original scale.
\[\hat{\texttt{sex}}_i=73.45-0.83(\texttt{age}_i-18)\]
The model predicts that 18 year old individuals have sex 73.45 times per year, on average.
The model predicts that a one year increase in age is associated with 0.83 fewer sexual encounters per year, on average.
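Re-centering can be tried out directly inside the model formula. This is a sketch on the built-in `mtcars` data (an assumption, since the sexual frequency data are not reproduced here), re-centering car weight on 3, a value chosen only for illustration.

```r
# Original model: the intercept is the prediction at wt = 0
model_raw <- lm(mpg ~ wt, data = mtcars)

# Re-centered model: I() lets us subtract a constant inside the formula
model_centered <- lm(mpg ~ I(wt - 3), data = mtcars)

coef(model_raw)
coef(model_centered)

# The new intercept equals the original model's prediction at wt = 3
predict(model_raw, newdata = data.frame(wt = 3))
```

The slope does not change. Only the intercept moves, from \(b_0\) to \(b_0+b_1a\). With the rounded age slope, \(88.32-0.83(18)=73.38\); the 73.45 shown above presumably comes from the unrounded slope.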
I pick a random observation from the dataset and ask you to guess the value of \(y\). What is your best guess?
I pick a random observation from the dataset and tell you its value of \(x\), and then ask you to guess the value of \(y\). What is your best guess?
Red: \(\sum_{i=1}^n (y_i-\bar{y})^2\)
Green: \(\sum_{i=1}^n (y_i-\hat{y}_i)^2\)
Proportion: \(1-\frac{\sum_{i=1}^n (y_i-\hat{y}_i)^2}{\sum_{i=1}^n (y_i-\bar{y})^2}\)
About 16.1% of the variation in property crime rates across states can be accounted for by variation in unemployment rates across states.
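The proportion can be computed by hand from the two sums of squares. The sketch below assumes the built-in `mtcars` data rather than the crimes data and compares the result to the value R reports.

```r
model <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

# Squared error when guessing the mean of y for everyone
ss_total <- sum((y - mean(y))^2)

# Squared error when guessing the model's predicted value
ss_resid <- sum(residuals(model)^2)

# Proportional reduction in error
r_squared <- 1 - ss_resid / ss_total

c(by_hand = r_squared,
  from_summary = summary(model)$r.squared,
  squared_correlation = cor(mtcars$wt, mtcars$mpg)^2)
```

In a bivariate model all three numbers agree, because R-squared is the square of the correlation coefficient.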
The population model is: \[\hat{y}_i=\beta_0+\beta_1(x_i)\]
The null hypothesis of no relationship is given by: \[H_0: \beta_1=0\]
How do we test?
Just look at a `summary` of the model! 😎
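The t value and p-value in the summary can be reproduced from the estimate and its standard error. This sketch plugs in the rounded numbers printed for `unemploy_rate`, so the results should match the summary only up to rounding.

```r
# Values copied from the summary output for unemploy_rate
estimate  <- 220.78
std_error <- 72.04
df_resid  <- 49   # residual degrees of freedom

# How many standard errors is the estimate from the null value of zero?
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

c(t_value = t_stat, p_value = p_value)
```

The p-value is well under 0.05, so we would reject the null hypothesis that \(\beta_1=0\).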
(Intercept) percent_lhs
1693.3291 71.4142
The model predicts that a one percentage point increase in the percent of a state’s population without a high school diploma is associated with 71.4 more property crimes per 100,000, on average.
Why might this be the case?
[1] 0.5000783
[1] 0.8241208
Just add the potential confounder to the model:
\[\hat{\texttt{crime_rate}}_i=b_0+b_1(\texttt{percent lhs}_i)+b_2(\texttt{poverty rate}_i)\]
😮 That's right, you can have more than one independent variable in a linear model. But what does it mean?
(Intercept) percent_lhs poverty_rate
1418.48554 -79.44948 198.28423
Slopes and intercepts are chosen that minimize the sum of the squared residuals, just as for a bivariate OLS regression model.
Because both independent variables are in the model at the same time, the effect of each variable is net of the indirect effect of the other variable.
We can say this in different ways:
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Intercept | 1693.33*** | 1418.49*** | 6010.54*** |
| | (355.99) | (327.76) | (1137.94) |
| Percent no HS | 71.41* | -79.45 | -101.70* |
| | (32.00) | (50.60) | (44.33) |
| Poverty rate | | 198.28*** | 150.80** |
| | | (54.81) | (49.26) |
| Unemployment rate | | | 161.14 |
| | | | (82.13) |
| Median age | | | -125.30*** |
| | | | (28.33) |
| R-squared | 0.09 | 0.29 | 0.51 |
| N | 51 | 51 | 51 |

***p < 0.001; **p < 0.01; *p < 0.05. Standard errors in parentheses.
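Nested models like these are built by adding variables to the right-hand side of the formula with `+`. A minimal sketch, assuming the built-in `mtcars` data in place of the crimes data:

```r
# Bivariate model
model1 <- lm(mpg ~ wt, data = mtcars)

# Add a second independent variable with +
model2 <- lm(mpg ~ wt + hp, data = mtcars)

# Compare the coefficients side by side (NA where a variable is not in the model)
rbind(bivariate = c(coef(model1), hp = NA),
      with_hp   = coef(model2))

# R-squared cannot go down when a variable is added
c(model1 = summary(model1)$r.squared,
  model2 = summary(model2)$r.squared)
```

The coefficient on `wt` will generally differ between the two models, because in the second model it is estimated holding `hp` constant.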
Women report 2.17 fewer sexual encounters per year than men.
Warning
Note that I use the term report here because it's not exactly clear why these numbers would be different. The difference could reflect differences by sexual orientation, or it could just be that either men over-report or women under-report sexual frequency. It could also be sampling error.
\[\texttt{female}_i=\begin{cases} 1 & \text{if female}\\ 0 & \text{otherwise} \end{cases}\]
`tapply` vs. `lm`
There is no need to create indicator variables. Just feed in categorical variables directly:
Use the `relevel` command to change the reference category:
(Intercept) maritalMarried maritalDivorced maritalWidowed
53.455737 -7.010258 -5.576478 -27.573078
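The sketch below ties these pieces together on the built-in `mtcars` data (an assumption, since the survey data are not reproduced here). The `transmission`, `cylinders`, and `cylinders8` variables are hypothetical factors created only for this example.

```r
# Turn two numeric codes into categorical variables (factors)
cars <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual")),
  cylinders    = factor(cyl)
)

# Mean difference the long way, with tapply
group_means <- tapply(cars$mpg, cars$transmission, mean)
mean_diff <- as.numeric(group_means["Manual"] - group_means["Automatic"])

# Same mean difference as a slope: lm builds the indicator variable for us
model_am <- lm(mpg ~ transmission, data = cars)
coef(model_am)

# By default the first level is the reference; relevel picks a different one
cars$cylinders8 <- relevel(cars$cylinders, ref = "8")
model_cyl <- lm(mpg ~ cylinders8, data = cars)
coef(model_cyl)
```

The intercept is always the mean of the reference category, and each slope is the mean difference between that category and the reference.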
In a model, we can calculate mean differences while holding constant other variables.
For example, how much of the differences in sexual frequency by marital status result from differences in age?
These results are very different!
| | Model 1 | Model 2 |
|---|---|---|
| Intercept | 53.46*** | 70.21*** |
| | (0.79) | (0.90) |
| Married | -7.01*** | 7.33*** |
| | (0.99) | (1.03) |
| Divorced | -5.58*** | 10.90*** |
| | (1.21) | (1.25) |
| Widowed | -27.57*** | 4.99** |
| | (1.62) | (1.81) |
| Age | | -0.91*** |
| | | (0.03) |
| R-squared | 0.02 | 0.11 |
| N | 11785 | 11785 |

***p < 0.001; **p < 0.01; *p < 0.05. Standard errors in parentheses. Age centered on 18 years. The reference category for marital status is the one omitted from the table.
(Intercept) nchild genderFemale
25.239547 1.155477 -3.974105
\[\hat{\texttt{wages}}_i=25.24+1.16(\texttt{nchild}_i)-3.97(\texttt{female}_i)\]
What is the relationship between wages and number of children for men and women?
The \(\texttt{female}_i\) variable is an indicator variable that is zero for men, so:
\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i)-3.97(0)\\ \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i) \end{eqnarray*}\]
The \(\texttt{female}_i\) variable is an indicator variable that is one for women, so:
\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 25.24+1.16(\texttt{nchild}_i)-3.97(1)\\ \hat{\texttt{wages}}_i & = & (25.24-3.97)+1.16(\texttt{nchild}_i)\\ \hat{\texttt{wages}}_i & = & 21.27+1.16(\texttt{nchild}_i)\\ \end{eqnarray*}\]
\[\begin{eqnarray*} \hat{\texttt{wages}}_i&=&25.24+1.16(\texttt{nchild}_i)\\ & & -3.97(\texttt{female}_i) \end{eqnarray*}\]
(Intercept) nchild genderFemale nchild:genderFemale
24.719748 1.778860 -2.839198 -1.334728
\[\hat{\texttt{wages}}_i=24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)-1.33(\texttt{nchild}_i)(\texttt{female}_i)\]
What is the relationship between wages and number of children for men and women?
The \(\texttt{female}_i\) variable is an indicator variable that is zero for men, so:
\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(0)\\ & & -1.33(\texttt{nchild}_i)(0)\\ \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i) \end{eqnarray*}\]
The \(\texttt{female}_i\) variable is an indicator variable that is one for women, so:
\[\begin{eqnarray*} \hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(1)\\ & & -1.33(\texttt{nchild}_i)(1)\\ \hat{\texttt{wages}}_i & = & (24.72-2.84)+(1.78-1.33)(\texttt{nchild}_i)\\ \hat{\texttt{wages}}_i & = & 21.88+0.45(\texttt{nchild}_i)\\ \end{eqnarray*}\]
\[\begin{eqnarray*}\hat{\texttt{wages}}_i & = & 24.72+1.78(\texttt{nchild}_i)-2.84(\texttt{female}_i)\\ & & -1.33(\texttt{nchild}_i)(\texttt{female}_i)\\\end{eqnarray*}\]
(Intercept) nchild
21.8805499 0.4441326
| Value | Separate models | Interaction terms |
|---|---|---|
| Intercept | | |
| Men’s wages with no children | $24.72 | $24.72 |
| Women’s wages with no children | $21.88 | |
| Difference in men’s and women’s wages with no children | | -$2.84 |
| Slope | | |
| Men’s return for an additional child | $1.78 | $1.78 |
| Women’s return for an additional child | $0.45 | |
| Difference in men’s and women’s return for an additional child | | -$1.33 |
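The equivalence between separate models and a single model with interaction terms can be checked directly. This is a sketch on the built-in `mtcars` data (an assumption, since the wage data are not reproduced here), with a hypothetical `transmission` factor standing in for gender and `wt` standing in for number of children.

```r
cars <- transform(
  mtcars,
  transmission = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))
)

# wt * transmission expands to wt + transmission + wt:transmission
model_int <- lm(mpg ~ wt * transmission, data = cars)
b <- coef(model_int)
b

# Intercept and slope for the non-reference group, built from the interaction model
manual_intercept <- unname(b["(Intercept)"] + b["transmissionManual"])
manual_slope     <- unname(b["wt"] + b["wt:transmissionManual"])

# Separate model fit only to the non-reference group
model_manual <- lm(mpg ~ wt, data = subset(cars, transmission == "Manual"))

rbind(from_interaction = c(manual_intercept, manual_slope),
      separate_model   = unname(coef(model_manual)))
```

The two rows match. The interaction model gives the same lines as the separate models, and it also reports the differences between groups directly as coefficients.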
| | Model 1 | Model 2 |
|---|---|---|
| Intercept | 24.72*** | 24.79*** |
| | (0.07) | (0.07) |
| number of children | 1.78*** | 1.47*** |
| | (0.05) | (0.05) |
| woman | -2.84*** | -3.13*** |
| | (0.10) | (0.10) |
| woman x number of children | -1.33*** | -1.08*** |
| | (0.07) | (0.07) |
| age | | 0.27*** |
| | | (0.00) |
| R-squared | 0.02 | 0.07 |
| N | 145647 | 145647 |

***p < 0.001; **p < 0.01; *p < 0.05. Standard errors in parentheses. Age centered on 40 years.
An additive model:
(Intercept) genderFemale educationHS Diploma
16.27 -5.06 4.71
educationAA Degree educationBachelors Degree educationGraduate Degree
8.40 17.13 24.63
| | LHS | HS | AA | BA | Grad |
|---|---|---|---|---|---|
| Man | 16.27 | 16.27+4.71 | 16.27+8.40 | 16.27+17.13 | 16.27+24.63 |
| Woman | 16.27-5.06 | 16.27+4.71-5.06 | 16.27+8.40-5.06 | 16.27+17.13-5.06 | 16.27+24.63-5.06 |
| Gender difference | -5.06 | -5.06 | -5.06 | -5.06 | -5.06 |
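The sums in the table can be worked out in a few lines of R. This sketch types in the rounded coefficients from the additive model output above; the object names are illustrative only.

```r
# Rounded coefficients from the additive model
intercept <- 16.27   # men with less than a high school diploma
female    <- -5.06   # gender difference, the same at every education level
education <- c(LHS = 0, HS = 4.71, AA = 8.40, BA = 17.13, Grad = 24.63)

# Predicted wages for each cell of the table
men   <- intercept + education
women <- men + female

rbind(Man = men, Woman = women, `Gender difference` = women - men)
```

Because the model is additive, the gender difference is forced to be the same in every column.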
A multiplicative model:
(Intercept) genderFemale
15.79 -3.84
educationHS Diploma educationAA Degree
4.82 8.58
educationBachelors Degree educationGraduate Degree
18.30 25.82
genderFemale:educationHS Diploma genderFemale:educationAA Degree
-0.41 -0.67
genderFemale:educationBachelors Degree genderFemale:educationGraduate Degree
-2.54 -2.52
| | LHS | HS | AA | BA | Grad |
|---|---|---|---|---|---|
| Man | 15.79 | 15.79+4.82 | 15.79+8.58 | 15.79+18.30 | 15.79+25.82 |
| Woman | 15.79-3.84 | 15.79+4.82-3.84-0.41 | 15.79+8.58-3.84-0.67 | 15.79+18.30-3.84-2.54 | 15.79+25.82-3.84-2.52 |
| Gender difference | -3.84 | -4.25 | -4.51 | -6.38 | -6.36 |
Degree | Gender gap |
---|---|
None | -3.84 |
HS | -3.84-0.41 |
AA | -3.84-0.67 |
BA | -3.84-2.54 |
Grad | -3.84-2.52 |
Degree | Men’s return | Women’s return |
---|---|---|
HS | 4.82 | 4.82-0.41 |
AA | 8.58 | 8.58-0.67 |
BA | 18.30 | 18.30-2.54 |
Grad | 25.82 | 25.82-2.52 |
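The same arithmetic for the multiplicative model, as a sketch using the rounded coefficients from the output above. The object names are illustrative only.

```r
# Rounded coefficients from the multiplicative model
intercept <- 15.79   # men with less than a high school diploma
female    <- -3.84   # gender gap among those with less than a high school diploma
education <- c(LHS = 0, HS = 4.82, AA = 8.58, BA = 18.30, Grad = 25.82)

# Interaction terms: how much the gender gap changes at each education level
female_by_education <- c(LHS = 0, HS = -0.41, AA = -0.67, BA = -2.54, Grad = -2.52)

# Predicted wages for each cell of the table
men   <- intercept + education
women <- intercept + female + education + female_by_education

# Gender gap at each education level
gender_gap <- women - men
round(gender_gap, 2)

# Women's return to each degree, relative to less than a high school diploma
womens_return <- education + female_by_education
round(womens_return, 2)
```

The interaction terms can be read either way: as the change in the gender gap across education levels, or as the difference between women's and men's returns to each degree.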