Measuring Association

The two-way table

The two-way table (or cross-tabulation) gives the joint distribution of two categorical variables.

We can create a two-way table in R using the table command, but this time we feed in two variables. Here is an example using sex and survival on the Titanic:

tab <- table(titanic$sex, titanic$survival)
tab

        
         Survived Died
  Female      339  127
  Male        161  682

There were 339 female survivors, 127 female deaths, and so on.

Raw numbers are never enough

Sex	Survived	Died
Female	339	127
Male	161	682

It might seem like the much higher number of male deaths is enough to claim that there is a relationship between gender and survival, but this comparison would be flawed. Why?
There were a lot more male passengers on the Titanic than female passengers. So even if they had the same probability of survival, we would expect to see more male deaths.
We need to compare the proportion of deaths among men to the proportion of deaths among women to make a proper comparison.
Never, ever compare raw numbers directly. Instead, we need to first calculate a conditional distribution using proportions. In this case, I want the distribution of survival conditional on gender.
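
To see why the raw counts mislead, here is a minimal sketch that rebuilds the table from the counts above rather than from the titanic data (the object names are chosen here for illustration). It applies the same overall death rate to both sexes and shows that men would still account for more deaths:

```r
# Counts copied from the two-way table above
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# Overall proportion of passengers who died, ignoring sex
overall_death_rate <- sum(tab[, "Died"]) / sum(tab)

# Deaths we would expect for each sex if both died at that same rate
expected_deaths <- rowSums(tab) * overall_death_rate
round(expected_deaths)

# Observed proportion dying within each sex: the comparison we actually want
observed_death_rate <- tab[, "Died"] / rowSums(tab)
round(observed_death_rate, 3)
```

Even with identical death rates, the larger number of men on board would produce more male deaths, which is why the proportions are the fair comparison.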

Calculate marginal distributions

A first step in establishing the relationship is to calculate the marginal distributions of the row and column variables. The marginal distributions are simply the distributions of each categorical variable separately. We can calculate these from the tab object I created using the margin.table command in R:

margin.table(tab, 1)


Female   Male 
   466    843

margin.table(tab, 2)


Survived     Died 
     500      809

Note that the option 1 here gives me the row marginal and the option 2 gives me the column marginal.
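
As a complementary sketch, base R's addmargins attaches both sets of totals to the table in one step, and prop.table turns a marginal count into a marginal proportion. The table is rebuilt from the counts shown above, and the object names are illustrative:

```r
# Counts copied from the two-way table above
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# Append a "Sum" row and a "Sum" column holding the marginal totals
with_totals <- addmargins(tab)
with_totals

# Marginal distributions expressed as proportions instead of counts
sex_marginal <- prop.table(margin.table(tab, 1))
survival_marginal <- prop.table(margin.table(tab, 2))
round(survival_marginal, 3)
```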

Distribution of survival conditional on sex

Sex	Survived	Died	Total
Female	339	127	466
Male	161	682	843
Total	500	809	1309

To get the distribution of survival by gender, divide each row by its row total:

Sex	Survived	Died	Total
Female	339/466	127/466	466
Male	161/843	682/843	843

Sex	Survived	Died	Total
Female	0.727	0.273	1.0
Male	0.191	0.809	1.0

Read the distribution within the rows:
- 72.7% of women survived and 27.3% of women died.
- 19.1% of men survived and 80.9% of men died.
Men were much more likely to die on the Titanic than women.

Calculating conditional distributions in R

You can use prop.table to calculate conditional distributions in R.

tab <- table(titanic$sex, titanic$survival)
prop.table(tab, 1)

        
          Survived      Died
  Female 0.7274678 0.2725322
  Male   0.1909846 0.8090154

Take note of the 1 as the second argument in prop.table. You must include this to get the distribution of the column variable conditional on the row; without it, prop.table returns each cell as a share of the grand total.
Make sure that the proportions sum up to one within the rows to check yourself.
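
A short sketch of that check, assuming the counts from the table above and illustrative object names. It also confirms that prop.table(tab, 1) matches dividing each row by its row total by hand:

```r
# Counts copied from the two-way table above
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# Divide each row by its row total (the vector recycles down the rows)
by_hand <- tab / rowSums(tab)

# The same calculation using prop.table with margin 1
cond_on_sex <- prop.table(tab, 1)

# Both approaches should agree cell by cell
all.equal(as.vector(by_hand), as.vector(cond_on_sex))

# Each row should sum to one
rowSums(cond_on_sex)
```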

The other conditional distribution

prop.table(tab, 2)

        
          Survived      Died
  Female 0.6780000 0.1569839
  Male   0.3220000 0.8430161

What changed?

Notice that the rows do not sum to one anymore. However, the columns do sum to one.
Because of the 2 in the prop.table command, we are now looking at the distribution of gender conditional on survival.
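
To see which dimension sums to one in each case, here is a minimal sketch built from the counts above (object names are illustrative). It also includes the joint distribution, which is what prop.table returns when no margin is given:

```r
# Counts copied from the two-way table above
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# Distribution of sex conditional on survival: columns sum to one
by_col <- prop.table(tab, 2)
colSums(by_col)
rowSums(by_col)   # these no longer sum to one

# Joint distribution: every cell as a share of all passengers
joint <- prop.table(tab)
sum(joint)        # the whole table sums to one
```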

Comparative barplot by faceting

Code and output for comparative barplot

ggplot(titanic, aes(x=survival, y=..prop..,
                    group=1))+
  geom_bar()+
  scale_y_continuous(label=scales::percent)+
  labs(y="percent surviving", x=NULL,
       title="Distribution of Titanic survival by gender")+
  facet_wrap(~sex)+
  theme_bw()

The code here is identical to that for a simple barplot except for the addition of facet_wrap. The facet_wrap command allows us to make separate panels of the same graph across the categories of some other variable.

Comparative barplot by fill aesthetic

ggplot(titanic, aes(x=survival, y=..prop..,
                    group=sex, fill=sex))+
  geom_bar(position="dodge")+
  scale_y_continuous(label=scales::percent)+
  labs(y="percent surviving", x=NULL,
       title="Distribution of Titanic survival by gender",
       fill="gender")+
  theme_bw()

We group by sex and also add a fill aesthetic that will apply different colors by sex.

We add position="dodge" to geom_bar so that bars are drawn side-by-side rather than stacked.

We add fill="gender" to labs so that our legend has a nice title.
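
Because the proportion is computed within each sex (through group=sex here, or through a separate panel per sex when faceting), the bar heights in both plots are the row-conditional proportions from prop.table(tab, 1). The sketch below lays those values out in long form, one row per bar. It assumes the counts from the table above, and the column names are chosen for illustration:

```r
# Counts copied from the two-way table above
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# One row per bar: the proportion in each survival category within each sex
bar_heights <- as.data.frame(prop.table(tab, 1), responseName = "prop")
names(bar_heights)[1:2] <- c("sex", "survival")
bar_heights

# Within each sex, the bar heights add up to one
tapply(bar_heights$prop, bar_heights$sex, sum)
```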

Presidential choice by education

# first command drops non-voters
temp <- droplevels(subset(politics, 
                          president!="No Vote")) 
tab <- table(temp$educ, temp$president)
# round and multiply prop.table by 100
# to get percents
props <- round(prop.table(tab, 1),3)*100
props

                     
                      Clinton Trump Other
  Less than HS           57.8  35.2   7.0
  High school diploma    42.8  51.7   5.5
  Some college           42.1  50.3   7.7
  Bachelors degree       46.4  44.1   9.5
  Graduate degree        65.0  30.4   4.5

ggplot(subset(politics, president!="No Vote"), 
       aes(x=president, y=..prop.., 
           group=educ, fill=educ))+
  geom_bar(position = "dodge")+
  labs(title="presidential choice by education",
       x=NULL,
       y="percent of education group",
       fill="education")+
  scale_y_continuous(label=scales::percent)+
  scale_fill_brewer(palette="YlGn")

Super fancy three-way table

ggplot(subset(politics, president!="No Vote" &
                gender!="Other"), 
       aes(x=president, y=..prop.., 
           group=educ, fill=educ))+
  geom_bar(position = "dodge")+
  labs(title="presidential choice by education",
       x=NULL,
       y="percent of education group",
       fill="education")+
  scale_y_continuous(label=scales::percent)+
  scale_fill_brewer(palette="YlGn")+
  facet_wrap(~gender)

Just add a facet_wrap to see how education affected presidential voting differently for men and women.

How to compare differences in probabilities?

round(prop.table(table(titanic$sex, titanic$survival), 1)*100,2)

        
         Survived  Died
  Female    72.75 27.25
  Male      19.10 80.90

We could look at the difference (72.75 - 19.10 = 53.65 percentage points), but this can be misleading because as the overall probability approaches either 0% or 100%, the largest possible difference between groups must get smaller.

Titanic

38% of passengers survived

Costa Concordia

Roughly 99.2% of passengers survived
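
To make that ceiling concrete, consider a hypothetical comparison of two equally sized groups. This is an assumption made purely for illustration, since the groups on neither ship were equal in size. The overall survival rate is then the average of the two group rates, which caps how far apart they can be. The helper name below is also hypothetical:

```r
# Largest possible gap between two group survival rates, ASSUMING the two
# groups are equal in size so that overall = (p1 + p2) / 2.
# If overall <= 0.5 the extreme case is p2 = 0, p1 = 2 * overall;
# if overall >= 0.5 the extreme case is p1 = 1, p2 = 2 * overall - 1.
max_difference <- function(overall) {
  2 * pmin(overall, 1 - overall)
}

max_difference(0.38)    # overall rate like the Titanic
max_difference(0.992)   # overall rate like the Costa Concordia
```

Under that assumption the gap could be as large as 76 percentage points at a 38% overall rate, but no more than 1.6 points at 99.2%.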

Calculate the odds

The odds is the ratio of “successes” to “failures.” Convert probabilities to odds by taking $\text{Odds} = \text{probability} / (1 - \text{probability})$

Women

If 72.75% of women survived, then the odds of survival for women are $0.7275 / (1 - 0.7275) = 2.67$

About 2.67 women survived for every woman that died.

Men

If 19.1% of men survived, then the odds of survival for men are $0.191 / (1 - 0.191) = 0.236$

About 0.236 men survived for every man that died. Alternatively, 0.236 is close to 0.25, so about one man survived for every four that died.
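
The same odds can be computed in R. This sketch assumes the counts from the two-way table and uses illustrative object names. It shows that converting the probability gives the same answer as dividing survivors by deaths directly:

```r
# Counts copied from the two-way table of sex by survival
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# Probability of survival within each sex
p_survive <- prop.table(tab, 1)[, "Survived"]

# Odds from the probability: p / (1 - p)
odds_from_p <- p_survive / (1 - p_survive)

# Odds straight from the counts: "successes" divided by "failures"
odds_from_counts <- tab[, "Survived"] / tab[, "Died"]

round(odds_from_p, 3)
round(odds_from_counts, 3)
```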

Calculate the odds ratio

Odds ratio

To determine the difference in our odds we take the odds ratio by dividing one of the odds by the other.

$\text{Odds ratio} = \frac{O_{1}}{O_{2}} = \frac{2.67}{0.236} = 11.31$

The odds of surviving the Titanic were 11.31 times higher for women than for men.

Cross-product method

Sex	Survived	Died
Female	339	127
Male	161	682

Multiply the values on the main diagonal (339 and 682) together and divide by the product of the values on the reverse diagonal (161 and 127) to get the same odds ratio.

$\frac{339 \times 682}{161 \times 127} = 11.31$
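
Both routes to the odds ratio can be checked in R. The sketch below assumes the counts from the table above and uses illustrative object names:

```r
# Counts copied from the two-way table of sex by survival
tab <- as.table(matrix(c(339, 127, 161, 682), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Survived", "Died"))))

# Route 1: compute the odds for each sex, then divide one by the other
odds <- tab[, "Survived"] / tab[, "Died"]
or_from_odds <- unname(odds["Female"] / odds["Male"])

# Route 2: cross-product of the diagonals
or_cross <- (tab["Female", "Survived"] * tab["Male", "Died"]) /
  (tab["Male", "Survived"] * tab["Female", "Died"])

round(c(or_from_odds, or_cross), 2)

# Swapping which group goes in the numerator gives the reciprocal
round(1 / or_cross, 3)
```

Which group goes in the numerator is a choice; the reciprocal expresses the same relationship as the odds of survival for men relative to women.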