Learn to explore, visualize, and analyze data to understand natural phenomena, investigate patterns, model outcomes, and make predictions, and do so in a reproducible and shareable manner. Gain experience in data collection, wrangling, and visualization, exploratory data analysis, predictive modeling, and effective communication of results while working on problems and case studies inspired by and based on real-world questions. The course will focus on the R statistical computing language. No statistical or computing background is necessary.
Note: This is the Fall 2019 version of the course. To see the current version, please visit introds.org.
Lectures on Mondays and Wednesdays, workshops on Tuesdays. Access official course information here.
11:10 - 12:00, Appleton Tower LT1
12:10 - 14:00, Murchison House LG12
11:10 - 12:00, Robson Building LT
It is my intent that students from all diverse backgrounds and perspectives be well-served by this course, that students’ learning needs be addressed both in and out of class, and that the diversity that the students bring to this class be viewed as a resource, strength and benefit. It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Your suggestions are encouraged and appreciated. Please let me know ways to improve the effectiveness of the course for you personally, or for other students or student groups.
Furthermore, I would like to create a learning environment for my students that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture). To help accomplish this:
The University takes academic misconduct very seriously and is committed to ensuring that so far as possible it is detected and dealt with appropriately. Find out more about the University’s official policies around academic misconduct here.
Cheating or plagiarising on assignments, lying about an illness or absence, and other forms of academic dishonesty are a breach of trust with classmates and faculty, violate the University policies, and will not be tolerated. Such incidents will result in a 0 grade for all parties involved. Additionally, there may be penalties to your final class grade along with being reported to the School Academic Misconduct Office.
A note on sharing / reusing code: I am well aware that a huge volume of code is available on the web to solve any number of problems. Unless I explicitly tell you not to use something the course’s policy is that you may make use of any online resources (e.g. StackOverflow) but you must explicitly cite where you obtained any code you directly use (or use as inspiration). Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism. On individual assignments you may not directly share code with another student in this class, and on team assignments you may not directly share code with another team in this class. Except for the take home exams, you are welcome to discuss the problems together and ask for advice, but you may not send or make use of code from another team. On the take home exams all communication with classmates is explicitly forbidden.
I will regularly send course announcements via email and Learn, so make sure to check one or the other of these daily. We will be using Piazza to facilitate course communication, particularly around questions and answers. If you have a question outside of class or office hours, first check if it has already been asked on Piazza and, if not, post it there. If you have a question or concern you don’t feel comfortable posting on Piazza, feel free to reach out via email.
Please refrain from texting or using your computer for anything other than coursework during class.
Class time is designed to be as interactive as possible. My role as instructor is to introduce you to new tools and techniques, but it is up to you to take them and make use of them. Programming is a skill that is best learned by doing, so as much as possible you will be working on a variety of tasks and activities throughout each class. Attendance will not be used to determine your mark, but you are expected to attend all lectures and meaningfully contribute to in-class exercises and homework assignments.
Class sessions are recorded, and the recordings will be available on Learn after class. Think of the recordings not as a replacement for attending class, but as a supplement!
For all of the team based assignments in this class you will be randomly assigned to teams of 3 or 4 students - these teams will change after each assignment. You will work in these teams during class and on the homework assignment. For team based assignments, all team members are expected to contribute equally to the completion of each assignment and you will be asked to evaluate your team members after each assignment is due. Failure to adequately contribute to an assignment will result in a penalty to your mark relative to the team’s overall mark.
Students are expected to make use of the provided GitHub repository as their central collaborative platform. Commits to this repository will be used as a metric (one of several) of each team member’s relative contribution for each homework.
Beyond the in-class activities, you will be assigned larger weekly programming tasks throughout the semester. These assignments will be completed individually, and submitted as GitHub repositories. The homework with the lowest score for each student will be dropped.
You will also complete weekly computing labs in teams during the Tuesday workshops. They are designed to be completed in class, and submitted as GitHub repositories. The lab with the lowest score for each student will be dropped.
You will be responsible for the completion of an open ended final project for this course, the goal of which is to tackle an “interesting” problem using the tools and techniques covered in this class. Additional details on the project will be provided as the course progresses. Each team’s work will also be shared with and evaluated by at least one other team at an earlier stage in order to provide feedback in the form of code review. You must complete the final project and be in class to present it in order to pass this course.
These weekly multiple choice quizzes will help you evaluate your learning continuously. The online quiz with the lowest score for each student will be dropped.
These weekly R tutorials are designed to help with your learning. They are optional, and not graded. You can complete them individually, online.
Your overall course grade will be comprised of the following components, and their weights:
Please review the official University and School policies here.
Regrade requests must be made within one week of when the assignment is returned, and must be typed up, printed, and submitted in person to me. These will be honored if points were tallied incorrectly, or if you feel your answer is correct but it was marked wrong. No regrade will be made to alter the number of points deducted for a mistake. There will be no grade changes after the final project presentations.
Please review the official University and School policies here. Late work will not be accepted for homework assignments, labs, and online quizzes. Only the work you have completed by the deadline will be marked for those assignments. The projects will follow the standard University late work penalty of 5% of the maximum obtainable mark per calendar day up to seven calendar days after the deadline. If you intend to submit work late for the project, you must notify the course organizer before the original deadline, and again as soon as the completed work is submitted on GitHub.
Note: If you’ve read this far in the syllabus, email me a puppy or kitten picture! Could be yours, or one you found online.
Showcase your inner data scientist
Pick a dataset, any dataset…
…and do something with it. That is your final project in a nutshell. More details below.
The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
The goal is not to do an exhaustive data analysis, i.e., do not calculate every statistic and procedure you have learned for every variable. Rather, show me that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and the appropriateness of the statistical analysis, should be discussed here.
The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (tidyverse) is required. You do not need to visualize all of the data at once. A single high quality visualization will receive a much higher grade than a large number of poor quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R.
In order for you to have the greatest chance of success with this project, it is important that you choose a manageable dataset. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your dataset must have at least 50 observations and between 10 and 20 variables (exceptions can be made but you must speak with me first). The dataset’s variables should include categorical variables, discrete numerical variables, and continuous numerical variables.
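The size and type requirements above can be checked quickly before you commit to a dataset. The following is a minimal sketch in base R. The helper `check_dataset()` and the toy data frame are hypothetical and made up for illustration only. The sketch does not try to tell discrete from continuous numerical variables, which still needs your own judgement.

```r
# Hypothetical helper: compares a data frame against the stated requirements
# (at least 50 observations, 10 to 20 variables, categorical and numerical columns).
check_dataset <- function(df) {
  is_categorical <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  is_numeric <- vapply(df, is.numeric, logical(1))
  c(
    at_least_50_rows = nrow(df) >= 50,
    vars_10_to_20    = ncol(df) >= 10 && ncol(df) <= 20,
    has_categorical  = any(is_categorical),
    has_numeric      = any(is_numeric)
  )
}

# Toy data frame (invented for this example): 60 rows and 10 columns.
set.seed(1)
toy <- data.frame(
  group = sample(c("a", "b", "c"), 60, replace = TRUE),  # categorical
  count = rpois(60, lambda = 3),                         # discrete numerical
  matrix(rnorm(60 * 8), nrow = 60)                       # 8 continuous columns
)

check_dataset(toy)
```

A `FALSE` in the result is a prompt to look closer, or to talk to me about an exception, rather than an automatic disqualification.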
If you are using a dataset that comes in a format that we haven’t encountered in class, make sure that you are able to load it into R as this can be tricky depending on the source. If you are having trouble ask for help before it is too late.
Note on reusing datasets from class: Do not reuse datasets used in examples,
homework assignments, or labs in the class.
Below is a list of data repositories that might be of interest to browse. You’re not limited to these resources, and in fact you’re encouraged to venture beyond them. But you might find something interesting there:
Section 1 - Introduction: The introduction should introduce your general research question and your data (where it came from, how it was collected, what are the cases, what are the variables, etc.).
Section 2 - Data: Place your data in the /data folder, and add dimensions and codebook to the README in that folder. Then print out the output of glimpse() or skim() of your data frame.
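As a rough illustration of what Section 2 asks for, the sketch below prints the dimensions of a data frame and then its `glimpse()` and `skim()` output. It assumes the dplyr and skimr packages are installed. The `survey` data frame is a made-up stand-in for whatever you load from your /data folder.

```r
library(dplyr)  # provides glimpse() and tibble()
library(skimr)  # provides skim()

# Hypothetical toy data standing in for your own dataset.
survey <- tibble(
  id    = 1:5,
  group = c("a", "b", "a", "c", "b"),
  score = c(3.2, 4.1, 2.8, 5.0, 3.9)
)

dim(survey)      # number of rows and columns, for the README
glimpse(survey)  # one line per variable: name, type, first values
skim(survey)     # summary statistics grouped by variable type
```

Either `glimpse()` or `skim()` is enough for the proposal. `skim()` gives more detail, including missing-value counts.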
Section 3 - Data analysis plan:
Each section should be no more than 1 page (excluding figures). You can check a print preview to confirm length. You will turn in your proposal as your HW 05 in the course.
5 minutes maximum, and each team member should say something substantial.
Prepare a slide deck using the template in your repo. This template uses a package called xaringan, and allows you to make presentation slides using R Markdown syntax. There isn’t a limit to how many slides you can use, just a time limit (5 minutes total). Each team member should get a chance to speak during the presentation. Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”), instead it should convey what choices you made, and why, and what you found.
Before you finalize your presentation, make sure the code in your chunks is hidden with echo = FALSE.
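One common way to apply this to every chunk at once is to set the option globally in the first (setup) chunk of the R Markdown file. This is a minimal sketch using knitr's chunk options. Whether the template in your repo already does this is an assumption you should check.

```r
# Place in the setup chunk at the top of the .Rmd file.
# Code in all later chunks still runs, so results and figures appear,
# but the code itself is left out of the rendered slides.
knitr::opts_chunk$set(echo = FALSE)
```

An individual chunk can still override the global setting by putting `echo = TRUE` in its own chunk header.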
Presentation schedule: Presentations will take place in two shifts during the workshop on Tuesday, 27 Nov (last workshop of the semester). Teams will be assigned to shifts randomly, and within each shift order of presentations will be determined randomly as well. You only need to be there for your shift, and during that hour you get to watch 10 other presentations and provide feedback in the form of peer evaluations. While the schedule will be created randomly, we will make allowances for teams with students who have class in George Square and can’t always make it to workshop on time. If your team meets these criteria let me know asap!
Presentation location: Presentations will be held in JCMB 5327.
Along with your presentation slides, we want you to provide a brief summary of your project in the README of your repository.
This summary should provide information on the dataset you’re using, your research question(s), your methodology, and your findings.
Your project repository should contain the following folders and files:

- presentation.Rmd + presentation.html: Your presentation slides
- README.md: Your summary
- /data/*: Your dataset in csv or RDS format, in the /data folder
- /proposal: Your proposal from earlier in the semester

Style and format do count for this assignment, so please take the time to make sure everything looks good and your data and code are properly formatted.
Hide your code chunks (echo = FALSE) so that your document is neat and easy to read. However, your document should include all your code such that if I re-knit your R Markdown file I should be able to obtain the results you presented. Exception: If you want to highlight something specific about a piece of code, you’re welcome to show that portion.

Total | 100 pts |
---|---|
Presentation | 50 pts |
Summary | 25 pts |
Reproducibility and organization | 10 pts |
Team peer evaluation | 10 pts |
Classmates’ evaluation | 5 pts |
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member out of 10 points. You will additionally report a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than 20% of the work, please provide some explanation. If any individual gets an average peer score indicating that they did less than 10% of the work, this person will receive half the grade of the rest of the group.
There is no late submission / make up for the presentation. You must be in class on the day of the presentation to get credit for it.
The late work policy for the summary is 5% of the maximum obtainable mark per calendar day up to seven calendar days after the deadline. If you intend to submit work late for the project, you must notify the course organizer before the original deadline, and again as soon as the completed work is submitted on GitHub.
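To make the arithmetic of the penalty concrete, here is a small sketch in base R. The function `late_mark()` is hypothetical and is not an official calculator. It assumes that a mark cannot drop below zero, and it returns `NA` for work more than seven days late because the policy text above does not say what happens in that case.

```r
# Hypothetical helper: 5% of the maximum obtainable mark is deducted
# per calendar day late, for up to seven calendar days.
late_mark <- function(mark, max_mark = 100, days_late = 0) {
  if (days_late > 7) {
    # Not covered by the policy text above, so do not guess.
    return(NA_real_)
  }
  penalty <- 0.05 * max_mark * days_late
  max(mark - penalty, 0)  # assumption: marks are floored at zero
}

# A summary worth 72 out of 100, submitted 3 days late:
# penalty = 0.05 * 100 * 3 = 15, so the mark becomes 57.
late_mark(72, max_mark = 100, days_late = 3)
```

Note that the deduction is a share of the maximum mark, not of the mark you earned, so three days late costs 15 points regardless of the original score.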