The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

In this assignment we analyze data from the 2016 GSS, using it to estimate values of population parameters of intertest about US adults.11 Smith, Tom W, Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2016 [machine-readable data file] /Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Co-Principal Investigator, Michael Hout; Sponsored by National Science Foundation. -NORC ed.- Chicago: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org.

Getting started

By now you should be familiar with instructions for getting started with a new assignment in RStudio Cloud and setting up your git configuration. If not, you can refer to one of the earlier assignments.

Packages

In this assignment we will work with the following packaes. They should already be installed in your project, and you can load them with the following:

library(tidyverse)

Data

As we mentioned above, we will work with the 2016 GSS data. The public release of this data contains 935 variables (only a few of which we will use today) and 2867 observations. This is not big data in the sense of the “big data” that everyone seems to be obsessed with nowadays. But it is a fairly large dataset that we need to consider how we handle it in our workflow.

The size of the data file we’re working with it 34.3 MB. For perspective, the professor evaluations data in the previous lab was 45KB, which means the GSS data is a little over 750 times the size of the evaluations data. That’s a big difference! GitHub will warn you when pushing files larger than 50 MB, and you will not be allowed to push files larger than 100 MB.22 GitHub Help - Working with large files While our file is smaller than these limits, it’s still large enough to not push to GitHub.

Enter .gitignore! The .gitignore file contains a list of the files you don’t want to to commit to Git or push to GitHub. If you open the .gitignore file in your project, you’ll see that our data file, gss2016.csv, is already listed there.

Click here to download the data. The file is called gss2016.csv.
Create a data folder in your project – in the Files pane, click on New Folder.
Navigate to the data folder you just created and upload the gss2016.csv file.
Note that even though you made a change in your files by adding the data, gss2016.csv does not appear in your Git pane. This is because it’s being ignored by git.

gss <- read_csv("data/gss2016.csv", 
                na = c("", "Don't know",
                       "No answer", "Not applicable"),
                guess_max = 2867) %>%
  select(harass5, emailmin, emailhr, educ, born, 
         polviews, advfront, snapchat, instagrm, wrkstat)

Note that we’re doing two new things here:

New argument: guess_max. If you look in the documentation for read_csv you’ll see that the function uses the first 1,000 observations in a data frame to determine the classes of each variable (column) in the data frame. It so turns out we have some variables in this data frame that have numeric data within the first 1,000 rows, and then something like "8 or more" (numeric in spirit, but character in nature) data in later rows. So without specifically asking R to scan all rows to determine the variable class, we end up with some warnings when loading the data. You could see this for yourself by removing the guess_max argument.
Selecting columns of interest: We know which variables we will be using from the data, so we can just select those and not load the entire dataset. This is a helpful tip for working with large data. You might be wondering – but how would I know ahead of time which variables I’ll be working with. Good question! You probably won’t know. But, once you make up your mind, you can go back and add the select() function so that from that point onwards in your analysis you can benefit from faster computation.

Exercises

Part 1: Harrassment at work

In 2016, the GSS added a new question on harrassment at work. The question is phrased as the following.

Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?

Answers to this question are stored in the harass5 variable in our dataset.

What are the possible responses to this question and how many respondents chose each of these answers?
What percent of the respondents for whom this question is applicable
(i.e. excluding NAs and Does not applys) havebeen harassed by their superiors or co-workers at their job.

✅ ⬆️ Now is a good time to knit your document, and commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Part 2: Time spent on email

The 2016 GSS also asked respondents how many hours and minutes they spend on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.5 hrs, this would be recorded as emailhr = 2 and emailmin = 30.

Create a new variable called email that combines these two variables to reports the number of minutes the respondents spend on email weekly.
Visualize the distribution of this new variable. Find the mean and the median number of minutes respondents spend on email weekly. Is the mean or the median a better measure of the typical amoung of time Americans spend on email weekly? Why?
Create another new variable, snap_insta that is coded as “Yes” if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and “No” if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.
Calculate the percentage of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.
What are the possible responses to the question Last week were you working full time, part time, going to school, keeping house, or what? and how many respondents chose each of these answers? Note that this information is stored in the wrkstat variable.
Fit a model predicting email (number of minutes per week spent on email) from educ (number of years of education), wrkstat, and snap_insta. Interpret the slopes for each of these variables.
Create a predicted values vs. residuals plot for this model. Are there any issues with the model? If yes, describe them.

Part 3: Political views and science research

The 2016 GSS also asked respondents whether they think of themselves as liberal or conservative (polviews) and whether they think science research is necessary and should be supported by the federal government (advfront).

The question on science research is worded as follows:

Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.

And possible responses to this question are Strongly agree, Agree, Disagree, Strongly disagree, Dont know, No answer, Not applicable.

The question on political views is worded as follows:

We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

⊕Note that the levels of this variables are spelled inconsistently: “Extremely liberal” vs. “Extrmly conservative”. Since this is the spelling that shows up in the data, you need to make sure this is how you spell the levels in your code.

And possible responses to this question are Extremely liberal, Liberal, Slightly liberal, Moderate, Slghtly conservative, Conservative, Extrmly conservative. Responses that were originally Don’t know, No answer and Not applicable are already mapped to NAs upon data import.

In a new variable, recode advfront such that Strongly Agree and Agree are mapped to "Yes", and Disagree and Strongly disagree are mapped to "No". The remaining levels can be left as is. Don’t overwrite the existing advfront, instead pick a different, informative name for your new variable.
In a new variable, recode polviews such that Extremely liberal, Liberal, and Slightly liberal, are mapped to "Liberal", and Slghtly conservative, Conservative, and Extrmly conservative disagree are mapped to "Conservative". The remaining levels can be left as is. Make sure that the levels are in a reasonable order. Don’t overwrite the existing polviews, instead pick a different, informative name for your new variable.
Create a visualization that displays the relationship between these two new variables and interpret it.

✅ ⬆️ Now is definitely a good time to knit your document, and commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards, and review the .md document on GitHub to make sure you’re happy with the final state of your work. Then go get some rest!

HW 09 - Exploring the GSS