Sociology 312/412/512, University of Oregon
The primary goal of most social science statistical analysis is to establish whether there is an association between variables and to describe the strength and direction of this association.
We often think about the relationships we observe in data as being causally determined, but the simple measurement of association is insufficient to establish a necessary causal connection between the variables.
The association between two variables could be generated because they are both related to a third variable that is actually the cause.
We may think that one variable causes the other, but the causal relationship could just as well run in the other direction.
The two-way table and comparative barplots
Mean differences and comparative boxplots
The correlation coefficient and scatterplots
The two-way table (or cross-tabulation) gives the joint distribution of two categorical variables.
We can create a two-way table in R using the `table` command, but this time we feed in two different variables. Here is an example using sex and survival on the Titanic:
There were 339 female survivors, 127 female deaths, and so on.
Sex | Survived | Died |
---|---|---|
Female | 339 | 127 |
Male | 161 | 682 |
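The code that produced this table is not shown in these notes. As a minimal sketch, the same two-way table can be rebuilt directly from the four counts above. With the raw passenger data you would instead pass the two variables to `table`, for example `table(titanic$sex, titanic$survival)`, where the data frame and column names are assumptions.

```r
# Rebuild the two-way table from the four counts reported above.
# With the raw data you would call table() on the two variables instead.
tab <- as.table(matrix(c(339, 127,
                         161, 682),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Sex = c("Female", "Male"),
                                       Survival = c("Survived", "Died"))))
tab
```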
A first step in establishing the relationship is to calculate the marginal distributions of the row and column variables. The marginal distributions are simply the distributions of each categorical variable separately. We can calculate these from the `tab` object I created using the `margin.table` command in R:
Note that the option `1` here gives me the row marginal and the option `2` gives me the column marginal.
Sex | Survived | Died | Total |
---|---|---|---|
Female | 339 | 127 | 466 |
Male | 161 | 682 | 843 |
Total | 500 | 809 | 1309 |
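The marginal totals in the table above can be reproduced with a short sketch. It assumes `tab` holds the counts shown earlier and rebuilds it here so the example runs on its own. `addmargins` is a base R helper, not mentioned in the notes, that appends the totals to the table.

```r
# Counts from the two-way table above
tab <- as.table(matrix(c(339, 127,
                         161, 682),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Sex = c("Female", "Male"),
                                       Survival = c("Survived", "Died"))))

row_totals <- margin.table(tab, 1)  # marginal distribution of sex
col_totals <- margin.table(tab, 2)  # marginal distribution of survival
row_totals
col_totals

# Append row and column totals to the table in one step
addmargins(tab)
```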
To get the distribution of survival by gender, divide each row by its row total:
Sex | Survived | Died | Total |
---|---|---|---|
Female | 339/466 | 127/466 | 466 |
Male | 161/843 | 682/843 | 843 |
Sex | Survived | Died | Total |
---|---|---|---|
Female | 0.727 | 0.273 | 1.0 |
Male | 0.191 | 0.809 | 1.0 |
You can use `prop.table` to calculate conditional distributions in R.
Survived Died
Female 0.7274678 0.2725322
Male 0.1909846 0.8090154
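A minimal sketch of the two conditional distributions, assuming `tab` holds the counts from the two-way table (rebuilt here so the example is self-contained). The second argument to `prop.table` chooses which margin each proportion is conditioned on.

```r
# Counts from the two-way table above
tab <- as.table(matrix(c(339, 127,
                         161, 682),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Sex = c("Female", "Male"),
                                       Survival = c("Survived", "Died"))))

# Margin 1: each row sums to 1 (survival conditional on sex)
by_row <- prop.table(tab, 1)
# Margin 2: each column sums to 1 (sex conditional on survival)
by_col <- prop.table(tab, 2)

round(by_row, 3)
round(by_col, 3)
```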
Note the `1` as the second argument in `prop.table`. You must include this to get the distribution of the column variable conditional on the row.
What changed?
With `2` in the `prop.table` command, we are now looking at the distribution of gender conditional on survival.
The code here is identical to that for a simple barplot except for the addition of `facet_wrap`. The `facet_wrap` command allows us to make separate panels of the same graph across the categories of some other variable.
We `group` by sex and also add a `fill` aesthetic that will apply different colors by sex.
We add `position="dodge"` to `geom_bar` so that bars are drawn side-by-side rather than stacked.
We add `fill="gender"` to `labs` so that our legend has a nice title.
# first command drops non-voters
temp <- droplevels(subset(politics,
president!="No Vote"))
tab <- table(temp$educ, temp$president)
# round and multiply prop.table by 100
# to get percents
props <- round(prop.table(tab, 1),3)*100
props
Clinton Trump Other
Less than HS 57.8 35.2 7.0
High school diploma 42.8 51.7 5.5
Some college 42.1 50.3 7.7
Bachelors degree 46.4 44.1 9.5
Graduate degree 65.0 30.4 4.5
ggplot(subset(politics, president!="No Vote"),
aes(x=president, y=after_stat(prop),
group=educ, fill=educ))+
geom_bar(position = "dodge")+
labs(title="presidential choice by education",
x=NULL,
y="percent of education group",
fill="education")+
scale_y_continuous(label=scales::percent)+
scale_fill_brewer(palette="YlGn")
ggplot(subset(politics, president!="No Vote" &
gender!="Other"),
aes(x=president, y=after_stat(prop),
group=educ, fill=educ))+
geom_bar(position = "dodge")+
labs(title="presidential choice by education",
x=NULL,
y="percent of education group",
fill="education")+
scale_y_continuous(label=scales::percent)+
scale_fill_brewer(palette="YlGn")+
facet_wrap(~gender)
Just add a `facet_wrap` to see how education affected presidential voting differently for men and women.
Survived Died
Female 72.75 27.25
Male 19.10 80.90
We could look at the difference (72.75-19.1=53.65), but this can be misleading because as the overall probability approaches either 0% or 100%, the difference must get smaller.
[Figure panels: "38% of passengers survived" and "Roughly 99.2% of passengers survived"]
The odds is the ratio of “successes” to “failures.” Convert probabilities to odds by taking \[\texttt{Odds}=\texttt{probability}/(1-\texttt{probability})\]
If 72.75% of women survived, then the odds of survival for women are \[0.7275/(1-0.7275)=2.67\]
About 2.67 women survived for every woman that died.
If 19.1% of men survived, then the odds of survival for men are \[0.191/(1-0.191)=0.236\]
About 0.236 men survived for every man that died. Alternatively, 0.236 is close to 0.25, so about one man survived for every four that died.
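The conversion from probability to odds can be sketched in a few lines of R. The helper name `odds` is my own choice for illustration, and the probabilities are computed from the counts in the two-way table rather than from the rounded percentages.

```r
# Convert a probability to odds: p / (1 - p)
odds <- function(p) p / (1 - p)

p_women <- 339 / 466  # share of women who survived
p_men   <- 161 / 843  # share of men who survived

odds_women <- odds(p_women)
odds_men   <- odds(p_men)

round(c(women = odds_women, men = odds_men), 3)
```

Because the odds are successes over failures, the same values come directly from the counts: `339/127` for women and `161/682` for men.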
To determine the difference in our odds we take the odds ratio by dividing one of the odds by the other.
\[\texttt{Odds ratio}=\frac{O_1}{O_2}=\frac{2.67}{0.236}=11.31\]
The odds of surviving the Titanic were 11.31 times higher for women than for men.
Sex | Survived | Died |
---|---|---|
Female | **339** | *127* |
Male | *161* | **682** |
Multiply the diagonal bolded values together and divide by the product of the reverse-diagonal italicized values to get the same odds ratio.
\[\frac{339*682}{161*127}=11.31\]
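As a quick check, here is a sketch that computes the odds ratio both ways, first by dividing the two odds and then by the cross-product of the table cells. It assumes `tab` holds the counts from the two-way table, rebuilt here so the example runs on its own.

```r
# Counts from the two-way table above
tab <- as.table(matrix(c(339, 127,
                         161, 682),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Sex = c("Female", "Male"),
                                       Survival = c("Survived", "Died"))))

# Method 1: odds of survival for each sex, then their ratio
odds_by_sex  <- tab[, "Survived"] / tab[, "Died"]
or_from_odds <- unname(odds_by_sex["Female"] / odds_by_sex["Male"])

# Method 2: cross-product of the diagonal over the reverse diagonal
or_cross <- (tab["Female", "Survived"] * tab["Male", "Died"]) /
            (tab["Male", "Survived"] * tab["Female", "Died"])

round(c(from_odds = or_from_odds, cross_product = or_cross), 2)
```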
We just need to add an `x` aesthetic (in this case race) to the plot to get a comparative boxplot.
In this case, I have also used the `reorder` command to reorder my categories so they go from smallest to largest median wage by race. This is not necessary but will add more information to the boxplot.
Use the `tapply` command to get the mean income of respondents separately by who they voted for:
Clinton Trump Other No Vote
79.40396 77.37831 80.31188 60.69635
The mean difference is given by: \[79.40-77.38=2.02\] Clinton voters had a household income about $2000 higher than Trump voters, on average.
Clinton voters had median household incomes $2000 lower than Trump voters. Why are the results different between the mean and median?
The income distribution of Clinton supporters is more right-skewed than that of Trump supporters, so it has a higher mean but lower median. However, the differences are relatively small regardless.
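To see how skew can flip the comparison, here is a small sketch using `tapply` for both the mean and the median. The incomes and group labels are made up purely for illustration and are not the survey data used in these notes.

```r
# Made-up incomes in $1000s (NOT the survey data from these notes)
income <- c(20, 30, 40, 50, 200,  # group A: one very large value (right skew)
            40, 50, 60, 70, 80)   # group B: roughly symmetric
group  <- rep(c("A", "B"), each = 5)

means   <- tapply(income, group, mean)
medians <- tapply(income, group, median)

means["A"] - means["B"]      # positive: A has the higher mean
medians["A"] - medians["B"]  # negative: A has the lower median
```

The single large value pulls the mean of group A up but leaves its median untouched, which is the same pattern described above.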
Using `geom_jitter` instead of `geom_point` will add some randomness to x and y values so that points are not plotted on top of each other. The `width` and `height` arguments can be adjusted for more or less randomness (scale 0-1).
The `alpha` argument will create semi-transparent points (scale 0-1). I have it set very low because of the large number of points, but you should adjust as needed.
I just add a third aesthetic to the `aes` command for `color`. This will color the points by the category of the variable used (in this case, gender).
I am using the viridis color scheme here, but you can adjust the palette if you like.
The correlation coefficient (\(r\)) measures the association between two quantitative variables. The formula is:
\[r=\frac{1}{n-1}\sum^n_{i=1} (\frac{x_i-\bar{x}}{s_x}*\frac{y_i-\bar{y}}{s_y})\]
\((x_i-\bar{x})\) and \((y_i-\bar{y})\): Subtract the mean from each value of x and y to get the distance above or below the mean.
\((\frac{x_i-\bar{x}}{s_x}*\frac{y_i-\bar{y}}{s_y})\): Divide each distance by its standard deviation to put x and y on the same scale, then multiply the standardized x and y values together. The result provides evidence of a negative or positive relationship.
\(\sum^n_{i=1} (\frac{x_i-\bar{x}}{s_x}*\frac{y_i-\bar{y}}{s_y})\): Sum up all the evidence, positive and negative.
\(\frac{1}{n-1}\sum^n_{i=1} (\frac{x_i-\bar{x}}{s_x}*\frac{y_i-\bar{y}}{s_y})\): Divide the result by the sample size minus one to get the final correlation coefficient.
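The steps above can be sketched in R and checked against the built-in `cor` function. The `x` and `y` values are made up for illustration.

```r
# Made-up values for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Step 1: distance of each value above or below its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Step 2: standardize by the standard deviations and multiply
evidence <- (dx / sd(x)) * (dy / sd(y))

# Step 3: sum the evidence; Step 4: divide by n - 1
r <- sum(evidence) / (length(x) - 1)

r
cor(x, y)  # should match the hand calculation
```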