The first step in turning information into knowledge is to summarize and describe the raw information: the data. In this assignment we explore data on college majors and earnings, specifically the data behind the FiveThirtyEight story “The Economic Guide To Picking A College Major”.
These data originally come from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series. While this is outside the scope of this assignment, if you are curious about how raw data from the ACS were cleaned and prepared, see the code FiveThirtyEight authors used.
We should also note that there are many considerations that go into picking a major. Earnings potential and employment prospects are two of them, and they are important, but they don’t tell the whole story. Keep this in mind as you analyze the data.
In this assignment we will work with the `tidyverse` as usual. In addition, we’ll use the `scales` package for formatting numerical values, and the `fivethirtyeight` package for data. The `tidyverse` and `scales` packages are already installed for you, so you can load them with `library(tidyverse)` and `library(scales)`.
You’ll first need to install the `fivethirtyeight` package by running `install.packages("fivethirtyeight")` in the console once, and then you can load it as usual with `library(fivethirtyeight)`.
Note that these packages are also loaded in your R Markdown document.
The data frame we will be working with today is called `college_recent_grads` and it’s in the `fivethirtyeight` package.
To find out more about the dataset, type the following in your Console: `?college_recent_grads`. A question mark before the name of an object will always bring up its help file. This command must be run in the Console.
`college_recent_grads` is a tidy data frame, with each row representing an observation and each column representing a variable.
To view the data, click on the name of the data frame in the Environment tab.
You can also take a quick peek at your data frame and view its dimensions with the `glimpse` function.
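The call that produced the output below is not shown in this copy of the assignment. A minimal sketch, assuming the `dplyr` package (part of the tidyverse) and the `fivethirtyeight` package are installed, might look like this:

```r
library(dplyr)
library(fivethirtyeight)

# Print one line per variable: its name, its type, and its first few values
glimpse(college_recent_grads)
```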
## Observations: 173
## Variables: 21
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1…
## $ major_code <int> 2419, 2416, 2415, 2417, 2405, 2418, …
## $ major <chr> "Petroleum Engineering", "Mining And…
## $ major_category <chr> "Engineering", "Engineering", "Engin…
## $ total <int> 2339, 756, 856, 1258, 32260, 2573, 3…
## $ sample_size <int> 36, 7, 3, 16, 289, 17, 51, 10, 1029,…
## $ men <int> 2057, 679, 725, 1123, 21239, 2200, 2…
## $ women <int> 282, 77, 131, 135, 11021, 373, 1667,…
## $ sharewomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1…
## $ employed <int> 1976, 640, 648, 758, 25694, 1857, 29…
## $ employed_fulltime <int> 1849, 556, 558, 1069, 23170, 2038, 2…
## $ employed_parttime <int> 270, 170, 133, 150, 5180, 264, 296, …
## $ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, 24…
## $ unemployed <int> 37, 85, 16, 40, 1672, 400, 308, 33, …
## $ unemployment_rate <dbl> 0.018380527, 0.117241379, 0.02409638…
## $ p25th <dbl> 95000, 55000, 50000, 43000, 50000, 5…
## $ median <dbl> 110000, 75000, 73000, 70000, 65000, …
## $ p75th <dbl> 125000, 90000, 105000, 80000, 75000,…
## $ college_jobs <int> 1534, 350, 456, 529, 18314, 1142, 17…
## $ non_college_jobs <int> 364, 257, 176, 102, 4440, 657, 314, …
## $ low_wage_jobs <int> 193, 50, 0, 0, 972, 244, 259, 220, 3…
The description of the variables, i.e. the codebook, is given below.
Header | Description |
---|---|
`rank` | Rank by median earnings |
`major_code` | Major code, FO1DP in ACS PUMS |
`major` | Major description |
`major_category` | Category of major from Carnevale et al |
`total` | Total number of people with major |
`sample_size` | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
`men` | Male graduates |
`women` | Female graduates |
`sharewomen` | Women as share of total |
`employed` | Number employed (ESR == 1 or 2) |
`employed_fulltime` | Employed 35 hours or more |
`employed_parttime` | Employed less than 35 hours |
`employed_fulltime_yearround` | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
`unemployed` | Number unemployed (ESR == 3) |
`unemployment_rate` | Unemployed / (Unemployed + Employed) |
`median` | Median earnings of full-time, year-round workers |
`p25th` | 25th percentile of earnings |
`p75th` | 75th percentile of earnings |
`college_jobs` | Number with job requiring a college degree |
`non_college_jobs` | Number with job not requiring a college degree |
`low_wage_jobs` | Number in low-wage service jobs |
The `college_recent_grads` data frame is a trove of information. Let’s think about some questions we might want to answer with these data:
In the next section we aim to answer these questions.
To find the major with the lowest unemployment rate, all we need to do is sort the data. We use the `arrange` function to do this, and sort by the `unemployment_rate` variable. By default `arrange` sorts in ascending order, which is what we want here, since we’re interested in the major with the lowest unemployment rate.
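The sorting step described above can be sketched as follows. This assumes the `dplyr` and `fivethirtyeight` packages are available; the name `by_unemployment` is introduced here only for illustration and is not part of the original assignment.

```r
library(dplyr)
library(fivethirtyeight)

# Sort majors from lowest to highest unemployment rate
# (ascending order is the default for arrange)
by_unemployment <- college_recent_grads %>%
  arrange(unemployment_rate)

by_unemployment
```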
## # A tibble: 173 x 21
## rank major_code major major_category total sample_size men women
## <int> <int> <chr> <chr> <int> <int> <int> <int>
## 1 53 4005 Math… Computers & M… 609 7 500 109
## 2 74 3801 Mili… Industrial Ar… 124 4 124 0
## 3 84 3602 Bota… Biology & Lif… 1329 9 626 703
## 4 113 1106 Soil… Agriculture &… 685 4 476 209
## 5 121 2301 Educ… Education 804 5 280 524
## 6 15 2409 Engi… Engineering 4321 30 3526 795
## 7 20 3201 Cour… Law & Public … 1148 14 877 271
## 8 120 2305 Math… Education 14237 123 3872 10365
## 9 1 2419 Petr… Engineering 2339 36 2057 282
## 10 65 1100 Gene… Agriculture &… 10399 158 6053 4346
## # … with 163 more rows, and 13 more variables: sharewomen <dbl>,
## # employed <int>, employed_fulltime <int>, employed_parttime <int>,
## # employed_fulltime_yearround <int>, unemployed <int>,
## # unemployment_rate <dbl>, p25th <dbl>, median <dbl>, p75th <dbl>,
## # college_jobs <int>, non_college_jobs <int>, low_wage_jobs <int>
This gives us what we wanted, but not in an ideal form. First, the name of the major barely fits on the page. Second, some of the variables are not that useful (e.g. `major_code`, `major_category`) and some we might want front and center are not easily viewed (e.g. `unemployment_rate`).
We can use the `select` function to choose which variables to display, and in which order:
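A sketch of the extended pipeline, under the same assumptions as before (`dplyr` and `fivethirtyeight` installed); the name `lowest_unemployment` is a hypothetical label used only here:

```r
library(dplyr)
library(fivethirtyeight)

# Sort by unemployment rate, then keep only three columns, in this order
lowest_unemployment <- college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate)

lowest_unemployment
```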
Note how easily we expanded our code by adding another step to our pipeline with the pipe operator, `%>%`.
## # A tibble: 173 x 3
## rank major unemployment_rate
## <int> <chr> <dbl>
## 1 53 Mathematics And Computer Science 0
## 2 74 Military Technologies 0
## 3 84 Botany 0
## 4 113 Soil Science 0
## 5 121 Educational Administration And Supervision 0
## 6 15 Engineering Mechanics Physics And Science 0.00633
## 7 20 Court Reporting 0.0117
## 8 120 Mathematics Teacher Education 0.0162
## 9 1 Petroleum Engineering 0.0184
## 10 65 General Agriculture 0.0196
## # … with 163 more rows
Ok, this is looking better, but do we really need to display all those decimal places in the unemployment variable? Not really!
We can use the `percent()` function from the `scales` package to clean up the display a bit.
college_recent_grads %>%
  arrange(unemployment_rate) %>%
  select(rank, major, unemployment_rate) %>%
  mutate(unemployment_rate = percent(unemployment_rate))
## # A tibble: 173 x 3
## rank major unemployment_rate
## <int> <chr> <chr>
## 1 53 Mathematics And Computer Science 0.0%
## 2 74 Military Technologies 0.0%
## 3 84 Botany 0.0%
## 4 113 Soil Science 0.0%
## 5 121 Educational Administration And Supervision 0.0%
## 6 15 Engineering Mechanics Physics And Science 0.6%
## 7 20 Court Reporting 1.2%
## 8 120 Mathematics Teacher Education 1.6%
## 9 1 Petroleum Engineering 1.8%
## 10 65 General Agriculture 2.0%
## # … with 163 more rows
To answer a question about the highest value of a variable, we need to arrange the data in descending order. For example, if earlier we had been interested in the major with the highest unemployment rate, we would wrap the variable in the `desc` function, which specifies that we want `unemployment_rate` in descending order:
college_recent_grads %>%
  arrange(desc(unemployment_rate)) %>%
  select(rank, major, unemployment_rate)
## # A tibble: 173 x 3
## rank major unemployment_rate
## <int> <chr> <dbl>
## 1 6 Nuclear Engineering 0.177
## 2 90 Public Administration 0.159
## 3 85 Computer Networking And Telecommunications 0.152
## 4 171 Clinical Psychology 0.149
## 5 30 Public Policy 0.128
## 6 106 Communication Technologies 0.120
## 7 2 Mining And Mineral Engineering 0.117
## 8 54 Computer Programming And Data Processing 0.114
## 9 80 Geography 0.113
## 10 59 Architecture 0.113
## # … with 163 more rows
`top_n(3)` at the end of the pipeline.

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value below which 20% of the observations may be found. (Source: Wikipedia)
There are three types of incomes reported in this data frame: `p25th`, `median`, and `p75th`. These correspond to the 25th, 50th, and 75th percentiles of the income distribution of sampled individuals for a given major.
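To make the idea of percentiles concrete, here is a small sketch using base R's `quantile` function. The income values are invented for illustration and do not come from the data set.

```r
# Hypothetical incomes, invented for illustration only
incomes <- c(20000, 30000, 40000, 50000, 60000)

# 25th, 50th (median), and 75th percentiles of the toy incomes
income_quartiles <- quantile(incomes, probs = c(0.25, 0.50, 0.75))

income_quartiles
```

With R's default quantile method, these five evenly spaced values give 30000, 40000, and 50000 as the three percentiles.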
The question we want to answer is “How do the distributions of median income compare across major categories?”. We need to do a few things to answer this question. First, we need to group the data by `major_category`. Then, we need a way to summarize the distributions of median income within these groups. This decision will depend on the shapes of these distributions. So first, we need to visualize the data.
We use the `ggplot()` function to do this. The first argument is the data frame, and the next argument gives the mapping of the variables of the data to the aesthetic (`aes`) elements of the plot.
Let’s start simple and take a look at the distribution of all median incomes, without considering the major categories.
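The plotting code is not shown in this copy of the assignment. A minimal sketch that would produce a histogram of `median` with the default number of bins, assuming `ggplot2` (part of the tidyverse) and `fivethirtyeight` are installed, is given below; the name `median_hist` is introduced only for illustration.

```r
library(ggplot2)
library(fivethirtyeight)

# Map the median income variable to the x-axis and draw a histogram;
# no binwidth is given, so ggplot2 falls back to its default of 30 bins
median_hist <- ggplot(data = college_recent_grads, mapping = aes(x = median)) +
  geom_histogram()

median_hist
```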
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Along with the plot, we get a message:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is telling us that we might want to reconsider the binwidth we chose for our histogram – or more accurately, the binwidth we didn’t specify. It’s good practice to always think in the context of the data and try out a few binwidths before settling on a binwidth. You might ask yourself: “What would be a meaningful difference in median incomes?” $1 is obviously too little, $10000 might be too high.
`geom_histogram` function. So to specify a binwidth of $1000, you would use `geom_histogram(binwidth = 1000)`.

We can also calculate summary statistics for this distribution using the `summarise` function:
college_recent_grads %>%
  summarise(min = min(median), max = max(median),
            mean = mean(median), med = median(median),
            sd = sd(median),
            q1 = quantile(median, probs = 0.25),
            q3 = quantile(median, probs = 0.75))
## # A tibble: 1 x 7
## min max mean med sd q1 q3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 22000 110000 40151. 36000 11470. 33000 45000
Based on the shape of the histogram you created in the previous exercise, determine which of these summary statistics is useful for describing the distribution. Write up your description (remember shape, center, spread, any unusual observations) and include the summary statistic output as well.
Plot the distribution of `median` income using a histogram, faceted by `major_category`. Use the `binwidth` you chose in the earlier exercise.
Now that we’ve seen the shapes of the distributions of median incomes for each major category, we should have a better idea of which summary statistic to use to quantify the typical median income.
`count`, which first groups the data and then counts the number of observations in each category (see below). Add to the pipeline appropriately to arrange the results so that the major category with the fewest observations is on top.

## # A tibble: 16 x 2
## major_category n
## <chr> <int>
## 1 Agriculture & Natural Resources 10
## 2 Arts 8
## 3 Biology & Life Science 14
## 4 Business 13
## 5 Communications & Journalism 4
## 6 Computers & Mathematics 11
## 7 Education 16
## 8 Engineering 29
## 9 Health 12
## 10 Humanities & Liberal Arts 15
## 11 Industrial Arts & Consumer Services 7
## 12 Interdisciplinary 1
## 13 Law & Public Policy 5
## 14 Physical Sciences 10
## 15 Psychology & Social Work 9
## 16 Social Science 9
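The listing above is the kind of result `count` returns: one row per major category with the number of majors in it. A sketch of a call that would produce such a table, assuming `dplyr` and `fivethirtyeight` are installed (the name `category_counts` is hypothetical):

```r
library(dplyr)
library(fivethirtyeight)

# Group by major_category and count the rows (majors) in each group;
# the counts are stored in a column named n
category_counts <- college_recent_grads %>%
  count(major_category)

category_counts
```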
One of the sections of the FiveThirtyEight story is “All STEM fields aren’t the same”. Let’s see if this is true.
First, let’s create a new vector called `stem_categories` that lists the major categories that are considered STEM fields.
stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")
Then, we can use this to create a new variable in our data frame indicating whether a major is STEM or not.
college_recent_grads <- college_recent_grads %>%
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem"))
Let’s unpack this: with `mutate` we create a new variable called `major_type`, which is defined as `"stem"` if the `major_category` is in the vector called `stem_categories` we created earlier, and as `"not stem"` otherwise.
`%in%` is a logical operator. Other logical operators that are commonly used are:
Operator | Operation |
---|---|
`x < y` | less than |
`x > y` | greater than |
`x <= y` | less than or equal to |
`x >= y` | greater than or equal to |
`x != y` | not equal to |
`x == y` | equal to |
`x %in% y` | is an element of (x is found in y) |
`x \| y` | or |
`x & y` | and |
`!x` | not |
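A few of these operators in action, as a small base R sketch. All values and names here are invented for illustration and are not taken from the data set.

```r
# Hypothetical values, invented for illustration only
categories <- c("Engineering", "Arts", "Physical Sciences")
stem <- c("Engineering", "Physical Sciences")
earnings <- c(60000, 30000, 35000)

is_stem <- categories %in% stem   # is each category one of the STEM values?
is_low <- earnings < 36000        # element-wise comparison against a threshold

is_stem & is_low                  # TRUE only where both conditions hold
is_stem | is_low                  # TRUE where at least one condition holds
!is_stem                          # flips TRUE and FALSE
```

Each operator works element by element, so the result is a logical vector of the same length as the inputs.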
We can also use the logical operators to `filter` our data for STEM majors whose median earnings are less than the median of all majors’ median earnings, which we found to be $36,000 earlier.
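The filtering code itself is missing from this copy. One way to write it, assuming `dplyr` and `fivethirtyeight` are installed, is sketched below. To keep the example self-contained it recreates the `major_type` variable first; the name `stem_low_median` is hypothetical.

```r
library(dplyr)
library(fivethirtyeight)

stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

# Label each major as STEM or not, then keep STEM majors
# whose median earnings fall below $36,000
stem_low_median <- college_recent_grads %>%
  mutate(major_type = ifelse(major_category %in% stem_categories, "stem", "not stem")) %>%
  filter(major_type == "stem", median < 36000)

stem_low_median
```

Separating conditions with a comma inside `filter` keeps only rows where all of them are true, which is equivalent to combining them with `&`.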
## # A tibble: 10 x 22
## rank major_code major major_category total sample_size men women
## <int> <int> <chr> <chr> <int> <int> <int> <int>
## 1 93 1301 Envi… Biology & Lif… 25965 225 10787 15178
## 2 98 5098 Mult… Physical Scie… 62052 427 27015 35037
## 3 102 3608 Phys… Biology & Lif… 22060 99 8422 13638
## 4 106 2001 Comm… Computers & M… 18035 208 11431 6604
## 5 109 3611 Neur… Biology & Lif… 13663 53 4944 8719
## 6 111 5002 Atmo… Physical Scie… 4043 32 2744 1299
## 7 123 3699 Misc… Biology & Lif… 10706 63 4747 5959
## 8 124 3600 Biol… Biology & Lif… 280709 1370 111762 168947
## 9 133 3604 Ecol… Biology & Lif… 9154 86 3878 5276
## 10 169 3609 Zool… Biology & Lif… 8409 47 3050 5359
## # … with 14 more variables: sharewomen <dbl>, employed <int>,
## # employed_fulltime <int>, employed_parttime <int>,
## # employed_fulltime_yearround <int>, unemployed <int>,
## # unemployment_rate <dbl>, p25th <dbl>, median <dbl>, p75th <dbl>,
## # college_jobs <int>, non_college_jobs <int>, low_wage_jobs <int>,
## # major_type <chr>