table(politics$party)
Democrat Republican Independent Other
1472 1236 1381 148
Sociology 312/412/512, University of Oregon
When we refer to the distribution of a variable, we are referring to how the different values of that variable are distributed across the given observations.
To display the distribution of a categorical variable, we first need to calculate the frequency, which is the number of observations that belong to each possible category. We can do this easily in R with the table command:
Let's convert these frequencies into proportions by dividing through by the total number of observations. We can also do this easily in R by adding the sum command to the previous command:
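As a minimal sketch of this step, the code below uses a small made-up party vector so that it runs on its own. The name party_toy and its ten values are assumptions for illustration, not the politics data.

```r
# Made-up data: 10 respondents (hypothetical, not the real politics dataset)
party_toy <- factor(
  rep(c("Democrat", "Republican", "Independent", "Other"), times = c(4, 2, 3, 1)),
  levels = c("Democrat", "Republican", "Independent", "Other")
)

# Frequency: the number of observations in each category
freq <- table(party_toy)
freq

# Proportion: each frequency divided by the total number of observations
props <- table(party_toy) / sum(table(party_toy))
props
```

With the real data, the same pattern would be applied to politics$party in place of party_toy.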
R also has a built-in function called prop.table that will calculate proportions automatically. We just need to feed the output of the table command into it.
Democrat Republican Independent Other
0.34741562 0.29171584 0.32593816 0.03493038
We can employ this “wrapping” feature of R to do some more tidying up. In this case, I want to use round to limit the number of digits and multiply by 100 to turn my proportions into percents. I also use the sort command to sort values from highest to lowest.
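One way this wrapping might look is sketched below on made-up data. The vector party_toy is hypothetical, and the choice of one decimal place is an assumption.

```r
# Made-up data: 10 respondents (hypothetical, not the real politics dataset)
party_toy <- factor(
  rep(c("Democrat", "Republican", "Independent", "Other"), times = c(4, 2, 3, 1)),
  levels = c("Democrat", "Republican", "Independent", "Other")
)

# Read from the inside out:
# table -> prop.table -> multiply by 100 -> round -> sort from highest to lowest
pct <- sort(round(100 * prop.table(table(party_toy)), 1), decreasing = TRUE)
pct
```

Each function takes the output of the function nested inside it, so the innermost call runs first.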
We will use ggplot to construct graphs.
In a ggplot, multiple commands are linked together with + signs.
The first command, ggplot, takes two arguments. The first argument is the data we want to use (in this case, the politics dataset). The second argument is the aes command, which defines aesthetics for the full plot.
The second command is geom_bar. All plots require some kind of “geometry” command, which in this case makes bars.
scale_y_continuous(labels=scales::percent) causes my proportions on the y-axis to be reported as percents.
The labs command can be used to add nice labeling of axes and to create titles and captions.
theme_bw defines a theme for the overall plot. I prefer theme_bw to the default theme in ggplot.
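The pieces described above could be assembled roughly as follows. This is a sketch under assumptions: politics_toy is a made-up data frame, the axis labels and title are invented, and after_stat(prop) with group=1 is one common way to get proportions on the y-axis, not necessarily the approach used for the original plot.

```r
library(ggplot2)

# Made-up data: 10 respondents (hypothetical, not the real politics dataset)
party_levels <- c("Democrat", "Republican", "Independent", "Other")
politics_toy <- data.frame(
  party = factor(rep(party_levels, times = c(4, 2, 3, 1)), levels = party_levels)
)

# Data and aesthetics first, then geometry, scale, labels, and theme, linked by +
p <- ggplot(politics_toy, aes(x = party, y = after_stat(prop), group = 1)) +
  geom_bar() +                                   # bars for each category
  scale_y_continuous(labels = scales::percent) + # proportions shown as percents
  labs(x = "Party affiliation", y = NULL,
       title = "Distribution of party affiliation (made-up data)") +
  theme_bw()
p
```

Setting group=1 makes the proportions add up to one across all bars rather than within each bar.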
The mean is the balancing point of a distribution. Imagine trying to put a column underneath a histogram so that it does not tip one direction or the other. This balancing point is the mean.
The median is the midpoint of the distribution. At this point, 50% of the observations have lower values, and 50% have higher values.
The mode is the high point of the distribution, or the peak. It is typically much less useful than the other two measures.
The mean (represented mathematically as \(\bar{x}\)) is calculated by taking the sum of the variable divided by the number of observations, or in math speak: \[\bar{x}=\frac{\sum_{i=1}^n x_i}{n}\]
Don’t panic! We will walk through what these symbols mean.
To calculate the mean, we just sum up all the values of \(x\) and divide by the number of observations. The sum command will sum up a variable and the nrow command will give us the number of observations, so:
The mean movie runtime is 106.8 minutes.
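The same calculation can be sketched on a handful of made-up runtimes. The data frame movies_toy and its values are assumptions for illustration and will not reproduce the 106.8 figure.

```r
# Made-up data: five movie runtimes in minutes (hypothetical)
movies_toy <- data.frame(runtime = c(90, 95, 100, 110, 125))

# Sum of the variable divided by the number of observations
mean_by_hand <- sum(movies_toy$runtime) / nrow(movies_toy)
mean_by_hand

# The built-in mean function gives the same answer
mean(movies_toy$runtime)
```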
To find the median, we just need to sort the observations from smallest to largest and pick the exact middle value of the distribution.
With an odd number of movies (4343), the exact midpoint is the 2172nd movie. We can use the sort command to sort the runtimes and then extract the 2172nd value by using square brackets:
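A small sketch of the sort-and-index approach, assuming a made-up vector of five runtimes (runtime_toy is a hypothetical name). With an odd number of observations \(n\), the midpoint is observation \((n+1)/2\).

```r
# Made-up data: five movie runtimes in minutes, deliberately unsorted
runtime_toy <- c(125, 90, 110, 95, 100)

# Position of the middle observation when n is odd
n <- length(runtime_toy)
midpoint <- (n + 1) / 2

# Sort from smallest to largest, then pull out the middle value
median_by_hand <- sort(runtime_toy)[midpoint]
median_by_hand

# The built-in median function gives the same answer
median(runtime_toy)
```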
The quantile command in R will calculate a given percentile.
To calculate the percentile with this command, we need to add a second argument called probs where we feed in a vector of proportions. So if we wanted to calculate the 13th and 76th percentiles of movie runtime:
13% of movies are 90 minutes or shorter and 76% of movies are 116 minutes or shorter.
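For illustration, the call might look like the sketch below. The vector runtime_toy is made up, so its percentiles will differ from the 90 and 116 minutes reported for the real movies.

```r
# Made-up data: eleven movie runtimes in minutes (hypothetical)
runtime_toy <- c(85, 90, 92, 95, 100, 104, 108, 112, 118, 125, 140)

# probs takes proportions between 0 and 1, not percents
q <- quantile(runtime_toy, probs = c(0.13, 0.76))
q
```

The result is a named vector, with each name giving the percentile that was requested.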
If I run the quantile command without the probs argument, I get:
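As a sketch on made-up data (runtime_toy is hypothetical, not the movies dataset), leaving out probs returns the minimum, the 25th, 50th, and 75th percentiles, and the maximum.

```r
# Made-up data: eleven movie runtimes in minutes (hypothetical)
runtime_toy <- c(85, 90, 92, 95, 100, 104, 108, 112, 118, 125, 140)

# Default probs are 0, 0.25, 0.5, 0.75, and 1
five_num <- quantile(runtime_toy)
five_num
```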
The x="" in the aesthetics is not necessary but does create a nicer looking x-axis. Your variable is assigned to y.
You can use fill to determine the color of the box and outlier.color to determine the color of the individual outlier points.
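A boxplot along these lines could be built as in the sketch below. The data frame movies_toy, the two colors, and the axis label are all assumptions chosen for illustration.

```r
library(ggplot2)

# Made-up data: eleven runtimes plus one very long movie (hypothetical)
movies_toy <- data.frame(
  runtime = c(85, 90, 92, 95, 100, 104, 108, 112, 118, 125, 140, 210)
)

# x = "" gives a single unlabeled box; the variable of interest goes on y
p <- ggplot(movies_toy, aes(x = "", y = runtime)) +
  geom_boxplot(fill = "seagreen", outlier.color = "red") +
  labs(x = NULL, y = "Runtime (minutes)") +
  theme_bw()
p
```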
The Interquartile Range (IQR) is the distance between the 25th and 75th percentiles. It can be calculated by the IQR command in R:
The 75th percentile of movie runtime is 21 minutes longer than the 25th percentile.
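To show how the IQR relates to the percentiles, here is a sketch on made-up runtimes (runtime_toy is hypothetical and will not give the 21 minutes found in the real data).

```r
# Made-up data: eleven movie runtimes in minutes (hypothetical)
runtime_toy <- c(85, 90, 92, 95, 100, 104, 108, 112, 118, 125, 140)

# The built-in command
iqr_builtin <- IQR(runtime_toy)
iqr_builtin

# The same value by hand: 75th percentile minus 25th percentile
quartiles <- quantile(runtime_toy, probs = c(0.25, 0.75))
iqr_by_hand <- unname(quartiles[2] - quartiles[1])
iqr_by_hand
```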
Standard deviation (SD) is the most common measure of spread in a variable. Loosely, standard deviation measures the average distance from the mean of all observations. Variance is the squared value of standard deviation.
The standard deviation \((s)\) is calculated with the following formula:
\[s=\sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]
Don’t panic! We will go through it one step at a time.
\((x_i-\bar{x})\): The distance between each observation’s value and the mean. Some of these values are positive and some are negative. If we summed these values up across all observations, the sum would equal zero by definition.
\(\sum_{i=1}^n(x_i-\bar{x})^2\): The sum of the squared distance, sometimes abbreviated SSX.
\(\sum_{i=1}^n(x_i-\bar{x})^2/(n-1)\): Dividing through by \(n-1\), one less than the number of observations, gives us roughly the “average” squared distance from the mean. This number is the variance.
\(\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2/(n-1)}\): We square root to get back to distance.
The average movie is about 16.3 minutes away from the mean movie runtime.
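The steps of the formula can be traced one at a time in R. This is a sketch on five made-up runtimes (runtime_toy is hypothetical), so the result will not match the 16.3 minutes from the real data.

```r
# Made-up data: five movie runtimes in minutes (hypothetical)
runtime_toy <- c(90, 95, 100, 110, 125)

# Step 1: distance between each observation and the mean
deviations <- runtime_toy - mean(runtime_toy)
sum(deviations)  # positive and negative distances cancel out

# Step 2: sum of the squared distances (SSX)
ssx <- sum(deviations^2)

# Step 3: divide by n - 1 to get the variance
variance_by_hand <- ssx / (length(runtime_toy) - 1)

# Step 4: square root to get back to the original units
sd_by_hand <- sqrt(variance_by_hand)
sd_by_hand

# The built-in var and sd commands give the same answers
var(runtime_toy)
sd(runtime_toy)
```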