STAT 19000: Project 3 — Fall 2020
Motivation: `data.frame`s are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a `data.frame`.
Context: In the previous project we got our feet wet, ran our first R code, and learned about accessing data inside vectors. In this project we will continue to reinforce what we’ve already learned and introduce a new, flexible data structure called the `data.frame`.
Scope: R, `data.frame`s, recycling, factors
Questions
Question 1
Read the dataset `/class/datamine/data/disney/splash_mountain.csv` into a `data.frame` called `splash_mountain`. How many columns, or features, are in the dataset? How many rows, or observations?
- R code used to solve the problem.
- The number of columns (features) and rows (observations) in the dataset.
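As a minimal sketch of reading a CSV and checking its size, the example below writes a small made-up file to a temporary location instead of reading the course dataset, so it can run anywhere. The values and the name `toy` are invented for illustration and are not taken from the real file.

```r
# Hypothetical toy data written to a temporary CSV (not the real course file).
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,datetime,SACTMIN,SPOSTMIN",
  "01/01/2015,2015-01-01 07:51:12,NA,5",
  "01/01/2015,2015-01-01 08:02:13,NA,5",
  "01/01/2015,2015-01-01 08:09:12,12,NA"
), toy_path)

# read.csv returns a data.frame; the literal text NA is read as a missing value.
toy <- read.csv(toy_path)

dim(toy)   # number of rows, then number of columns
nrow(toy)  # number of rows (observations)
ncol(toy)  # number of columns (features)
str(toy)   # column names and types at a glance
```

The same calls should apply to `splash_mountain` once it has been read from the path given in the question.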
Question 2
Splash Mountain is a fan favorite ride at Disney World’s Magic Kingdom theme park. `splash_mountain` contains a series of dates and datetimes. For each datetime, `splash_mountain` contains a posted minimum wait time, `SPOSTMIN`, and an actual minimum wait time, `SACTMIN`. What is the average posted minimum wait time for Splash Mountain? What is the standard deviation? Given that `SPOSTMIN` represents the posted minimum wait time for our ride, do our mean and standard deviation make sense? Explain. (You might look ahead to Question 3 before writing the answer to Question 2.)
- R code used to solve this problem.
- The results of running the R code.
- 1-2 sentences explaining why the results do or do not make sense.
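The sketch below shows `mean` and `sd` on a small made-up vector, assuming the column may contain missing values (`NA`). The vector `wait` is hypothetical and the numbers are not from the dataset.

```r
# Made-up wait times in minutes; NA marks a missing reading.
wait <- c(5, 10, 15, NA, 20)

mean(wait)                # NA: a single missing value propagates to the result
mean(wait, na.rm = TRUE)  # 12.5, computed from the four non-missing values
sd(wait, na.rm = TRUE)    # sample standard deviation, about 6.45
```

With `na.rm = TRUE` the missing values are dropped before the statistic is computed, and any other value stored in the vector still counts toward the result.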
Question 3
In Question 2, we got some peculiar values for the mean and standard deviation. If you read the "attractions" tab in the file `/class/datamine/data/disney/touringplans_data_dictionary.xlsx`, you will find that -999 is used as a value in `SPOSTMIN` and `SACTMIN` to indicate that the ride was closed. Recalculate the mean and standard deviation of `SPOSTMIN`, excluding values that are -999. Does this seem to have fixed our problem?
- R code used to solve this problem.
- The result of running the R code.
- A statement indicating whether or not the values look reasonable now.
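One way to exclude a sentinel value is logical indexing, sketched here on a made-up vector. The names `posted` and `keep` are hypothetical, and the inclusion of an `NA` is an assumption about what the real column might contain.

```r
# Made-up posted wait times; -999 is the "ride closed" code.
posted <- c(5, 10, -999, 15, NA, -999, 30)

# TRUE only for observations that are present and not the sentinel.
keep <- !is.na(posted) & posted != -999
posted[keep]        # 5 10 15 30

mean(posted[keep])  # 15
sd(posted[keep])    # spread of the remaining values

# For comparison, the sentinel drags the unfiltered mean far below zero.
mean(posted, na.rm = TRUE)
```

Building the logical vector first makes it easy to check how many observations are kept, for example with `sum(keep)`.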
Question 4
`SPOSTMIN` and `SACTMIN` aren’t the greatest feature/column names. An outsider looking at the `data.frame` wouldn’t be able to immediately get the gist of what they represent. Change `SPOSTMIN` to `posted_min_wait_time` and `SACTMIN` to `actual_wait_time`.
Hint: You can always use hard-coded integers to change names manually. However, if you use `which`, you can get the index of the column name that you would like to change. For `data.frame`s like `splash_mountain`, this is a lot more efficient than manually counting which column is the one with a certain name.
- R code used to solve the problem.
- The output from executing `names(splash_mountain)` or `colnames(splash_mountain)`.
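The hint about `which` can be sketched on a toy `data.frame`. The object `df` and its column names are invented for illustration.

```r
# Toy data.frame with invented column names.
df <- data.frame(id = 1:3, OLDNAME = c(5, 10, 15), other = c("a", "b", "c"))

# which() returns the positions where the comparison is TRUE.
idx <- which(names(df) == "OLDNAME")
idx  # 2

# Assign a new name at that position only.
names(df)[idx] <- "new_name"
names(df)  # "id" "new_name" "other"
```

Looking the position up by name keeps the code correct even if columns are later added or reordered.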
Question 5
Use the `cut` function to create a new vector called `quarter` that breaks the `date` column up by quarter. Use the `labels` argument in the `factor` function to label the quarters "q1", "q2", …, "qX" where X is the last quarter. Add `quarter` as a column named `quarter` in `splash_mountain`. How many quarters are there?
If you have 2 years of data, this will result in 8 quarters: "q1", …, "q8".
Sequential data such as these labels can be generated programmatically rather than typed by hand.
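One way to build sequential labels in base R, offered as a sketch and not necessarily the approach this note originally had in mind, is to combine `paste0` with an integer sequence. The variable names below are hypothetical.

```r
n <- 8  # assumed number of labels, matching the two-year example

# Two equivalent ways to produce "q1", "q2", ..., "q8".
labels_a <- paste0("q", 1:n)
labels_b <- paste0("q", seq_len(n))

labels_a
identical(labels_a, labels_b)  # TRUE
```

`seq_len(n)` is the safer choice when `n` might be zero, because `1:0` counts down to `c(1, 0)` instead of producing an empty sequence.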
- R code used to solve the problem.
- The `head` and `tail` of `splash_mountain`.
- The number of quarters in the new `quarter` column.
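The steps in this question can be sketched on a toy `data.frame`. The dates are invented, and the month/day/year text format is an assumption that may not match the real file, so the `format` string would need adjusting.

```r
# Invented dates stored as text in month/day/year form (an assumed format).
toy <- data.frame(date = c("01/15/2015", "02/20/2015", "05/05/2015",
                           "11/30/2015", "03/01/2016"))

# Convert the text to Date, then bin by calendar quarter.
dates <- as.Date(toy$date, format = "%m/%d/%Y")
by_quarter <- cut(dates, breaks = "quarter")
levels(by_quarter)  # each level is named after the first day of its quarter

# Relabel the levels as q1, q2, ... Passing levels explicitly keeps
# quarters that contain no observations instead of dropping them.
quarter <- factor(by_quarter,
                  levels = levels(by_quarter),
                  labels = paste0("q", seq_len(nlevels(by_quarter))))

toy$quarter <- quarter
head(toy)
nlevels(toy$quarter)  # number of quarters spanned by the dates
```

Without the explicit `levels` argument, `factor` would drop empty quarters, and the number of labels would then no longer match the number of levels.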
Question 5 is intended to be a little more challenging, so we worked through the exact same steps with two other data sets. If you work through these, all you will need to do to solve Question 5 is follow the example and change two things: the data set itself (in the `read.csv` call) and the format of the date.
These examples basically step you through everything in Question 5.
We hope that these are helpful resources for you! We appreciate you very much and we are here to support you! We do not expect you to know how to solve this question on your own yet, because we are just getting started. We sometimes like to include a question like this, in which you get introduced to several new things, and we will dive deeper into these ideas as we push ahead.
Question 6
Please include a statement in Project 3 that says, "I acknowledge that the STAT 19000/29000/39000 1-credit Data Mine seminar will be recorded and posted on Piazza, for participants in this course." or if you disagree with this statement, please consult with us at datamine@purdue.edu for an alternative plan.