STAT 29000: Project 1 — Fall 2020
Motivation: In this project we will jump right into an R review by breaking one larger data-wrangling problem into discrete parts. There is a slight emphasis on writing functions and dealing with strings. By the end of this project we will have greatly simplified a dataset, making it easy to dig into.
Context: We just started the semester and are digging into a large dataset, and in doing so, reviewing R concepts we’ve previously learned.
Scope: data wrangling in R, functions
Make sure to read about and use the template found here, and to read the important information about project submissions here.
You can find useful examples that walk you through relevant material in The Examples Book.
It is highly recommended to read through, search, and explore these examples to help solve problems in this project.
It is highly recommended that you use rstudio.scholar.rcac.purdue.edu/. Simply click on the link and log in using your Purdue account credentials.
We decided to move away from ThinLinc and away from the version of RStudio used last year (desktop.scholar.rcac.purdue.edu). That version of RStudio is known to have some strange issues when running code chunks.
Remember the very useful documentation shortcut ? (a question mark). To use it, simply type ? in the console, followed by the name of the function you are interested in.
You can also look for package documentation by using help(package=PACKAGENAME). For example, to see the documentation for the package ggplot2, we could run:
help(package=ggplot2)
Sometimes it can be helpful to see the source code of a defined function. A function is any chunk of organized code that is used to perform an operation. Source code is the underlying R, C, or C++ code that is used to create the function. To see the source code of a defined function, type the function’s name without the (). For example, if we were curious about what the function Reduce does, we could run:
Reduce
Occasionally this will be less useful, as the resulting code will simply call C code that we can’t see. Other times it will allow you to understand the function better.
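As a small illustration of both cases, the sketch below prints one function written in R and one implemented in compiled code. The choice of sum as the second example is ours, not part of the project.

```r
# Typing a function's name without parentheses prints its definition.
# Reduce is written in R, so its full source is shown.
Reduce

# sum is implemented in compiled code, so only a short stub is printed.
sum

# is.primitive() reports whether a function is one of these built-ins.
is.primitive(sum)
is.primitive(Reduce)
```

When only a stub is printed, the help page (?sum) is usually the better place to learn what the function does.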
Dataset(s)
/class/datamine/data/airbnb
Oftentimes (maybe even the majority of the time) data doesn’t come in one nice file or database. Explore the datasets in /class/datamine/data/airbnb.
Questions
Question 1
You may have noted that, for each country, city, and date, we can find 3 files: calendar.csv.gz, listings.csv.gz, and reviews.csv.gz (for now, we will ignore all files in the "visualisations" folders).
Let’s take a look at the data in each of the three types of files. Pick a country, city, and date, and read the first 50 rows of each of the 3 datasets (calendar.csv.gz, listings.csv.gz, and reviews.csv.gz). Provide 1-2 sentences explaining the type of information found in each, and what variable(s) could be used to join them.
Depending on the country that you pick, the listings and/or the reviews might not display properly in RMarkdown, so you do not need to display the first 50 rows of the listings and/or reviews in your RMarkdown document. It is OK to display only the first 50 rows of the calendar entries.
- Chunk of code used to read the first 50 rows of each dataset.
- 1-2 sentences briefly describing the information contained in each dataset.
- Name(s) of variable(s) that could be used to join them.
To read a compressed csv, simply use the read.csv function:
dat <- read.csv("/class/datamine/data/airbnb/brazil/rj/rio-de-janeiro/2019-06-19/data/calendar.csv.gz")
head(dat)
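The call above reads the whole file. To read only the first 50 rows, one option is the nrows argument of read.csv. The sketch below builds a small compressed file in a temporary location so that it runs anywhere; the toy columns id and value are made up for illustration, and on Scholar you would pass one of the real .csv.gz paths instead.

```r
# Build a small compressed CSV in a temporary location (toy data only).
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
write.csv(data.frame(id = 1:200, value = seq(0.5, 100, by = 0.5)),
          con, row.names = FALSE)
close(con)

# nrows limits how many data rows read.csv() parses.
first_rows <- read.csv(tmp, nrows = 50)
dim(first_rows)
```

Limiting the number of rows also keeps the read fast on the larger files.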
Let’s work towards getting this data into an easier format to analyze. From now on, we will focus on the listings.csv.gz datasets.
Question 2
Write a function called get_paths_for_country that, given a string with the country name, returns a vector with the full paths for all listings.csv.gz files, starting with /class/datamine/data/airbnb/….
For example, the output from get_paths_for_country("united-states") should have 28 entries. Here are the first 5 entries in the output:
[1] "/class/datamine/data/airbnb/united-states/ca/los-angeles/2019-07-08/data/listings.csv.gz"
[2] "/class/datamine/data/airbnb/united-states/ca/oakland/2019-07-13/data/listings.csv.gz"
[3] "/class/datamine/data/airbnb/united-states/ca/pacific-grove/2019-07-01/data/listings.csv.gz"
[4] "/class/datamine/data/airbnb/united-states/ca/san-diego/2019-07-14/data/listings.csv.gz"
[5] "/class/datamine/data/airbnb/united-states/ca/san-francisco/2019-07-08/data/listings.csv.gz"
- Chunk of code for your get_paths_for_country function.
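One possible approach, offered as a sketch and not the required solution, is to search the country's folder recursively with list.files. The base_dir argument is a hypothetical addition so the function can be tried on a small temporary folder tree; its default is the dataset location given above.

```r
# base_dir is a hypothetical extra argument (not required by the question).
get_paths_for_country <- function(country,
                                  base_dir = "/class/datamine/data/airbnb") {
  country_dir <- file.path(base_dir, country)
  # Match file names that are exactly "listings.csv.gz", at any depth.
  list.files(country_dir, pattern = "^listings\\.csv\\.gz$",
             recursive = TRUE, full.names = TRUE)
}

# Demo on a made-up folder tree that mimics the dataset layout.
demo_base <- file.path(tempdir(), "airbnb-demo")
demo_dirs <- file.path(demo_base, "united-states", "ca",
                       c("oakland", "san-diego"), "2019-07-13", "data")
for (d in demo_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, c("listings.csv.gz", "calendar.csv.gz")))
}
demo_paths <- get_paths_for_country("united-states", base_dir = demo_base)
demo_paths
```

Anchoring the pattern with ^ and $ keeps files whose names merely contain "listings.csv.gz" out of the result.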
Question 3
Write a function called get_data_for_country that, given a string with the country name, returns a data.frame containing all of the listings data for that country. Use your previously written function to help you.
- Chunk of code for your get_data_for_country function.
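The core step here is reading several files and stacking them into one data.frame. The sketch below shows one way to do that with lapply and rbind. The helper name combine_csv_files and the toy columns are hypothetical, and the approach assumes every file has the same columns. A get_data_for_country function could pass the output of get_paths_for_country to a helper like this.

```r
# Hypothetical helper: read every file in `paths` and stack the rows.
# Assumes all files share the same column names.
combine_csv_files <- function(paths) {
  tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, tables)
}

# Demo with two small made-up files.
demo_files <- c(tempfile(fileext = ".csv"), tempfile(fileext = ".csv"))
write.csv(data.frame(id = 1:3, host_is_superhost = c("t", "f", "")),
          demo_files[1], row.names = FALSE)
write.csv(data.frame(id = 4:5, host_is_superhost = c("f", "t")),
          demo_files[2], row.names = FALSE)

combined <- combine_csv_files(demo_files)
str(combined)
```

read.csv also accepts the .csv.gz paths directly, so the same helper works on the compressed listings files.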
Question 4
Use your get_data_for_country function to get the data for a country of your choice, and make sure to name the data.frame listings. Take a look at the following columns: host_is_superhost, host_has_profile_pic, host_identity_verified, and is_location_exact. What is the data type for each column? (You can use class, typeof, or str to see the data type.)
These columns would make more sense as logical values (TRUE/FALSE/NA).
Write a function called transform_column that, given a column containing lowercase "t"s and "f"s, transforms it to logical (TRUE/FALSE/NA) values. Note that NA values for these columns appear as blank (""), so we need to be careful when transforming the data. Test your function on the column host_is_superhost.
- Chunk of code for your transform_column function.
- Type of transform_column(listings$host_is_superhost).
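A minimal sketch of such a transformation follows, assuming the column holds only "t", "f", and blank values (possibly stored as a factor). Anything other than "t" or "f" is left as NA. The sample vector is toy data.

```r
transform_column <- function(column) {
  # Convert factors to plain strings so comparisons work on the labels.
  column <- as.character(column)
  # Start with all NA (logical), then fill in the known values.
  result <- rep(NA, length(column))
  result[column %in% "t"] <- TRUE
  result[column %in% "f"] <- FALSE
  result
}

# Toy input, including a blank entry that should become NA.
example <- transform_column(c("t", "f", "", "t"))
example
typeof(example)
```

Starting from NA and filling in only the recognized values means blanks, and any unexpected entries, are never silently turned into FALSE.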
Question 5
Apply your function transform_column to the columns instant_bookable and is_location_exact in your listings data.
Based on your listings data, if you are looking at an instant bookable listing (where instant_bookable is TRUE), would you expect the location to be exact (where is_location_exact is TRUE)? Why or why not?
Make a frequency table, and see how many instant bookable listings have an exact location.
- Chunk of code to get a frequency table.
- 1-2 sentences explaining whether or not we would expect the location to be exact if we were looking at an instant bookable listing.
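A two-way frequency table can be built with table. The sketch below uses a small made-up data.frame in place of the real listings data, so the counts it produces say nothing about the actual dataset.

```r
# Toy stand-in for the listings data (values are made up).
toy_listings <- data.frame(
  instant_bookable  = c(TRUE, TRUE, FALSE, TRUE, FALSE, NA),
  is_location_exact = c(TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Cross-tabulate the two columns; useNA = "ifany" keeps NA as its own level.
freq <- table(instant_bookable  = toy_listings$instant_bookable,
              is_location_exact = toy_listings$is_location_exact,
              useNA = "ifany")
freq

# Row proportions: share of exact locations within each instant_bookable group.
prop.table(freq, margin = 1)
```

The row proportions make it easier to compare the instant bookable group with the rest when the groups differ in size.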