STAT 19000: Project 1 — Spring 2022
Motivation: Last semester, each project was completed in Jupyter Lab from ondemand.brown.rcac.purdue.edu. Although our focus this semester is on using Python to solve data-driven problems, we get to stay in the same environment. In fact, Jupyter Lab is Python-first. Now, instead of using the f2021-s2022-r kernel, select the f2021-s2022 kernel.
Context: In this project we will re-familiarize ourselves with Jupyter Lab and its capabilities. We will also introduce Python and begin learning some of the syntax.
Scope: Python, Jupyter Lab
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/noaa/*.csv
Questions
Question 1
Please review our updated submission guidelines before submitting your project.
Let’s start the semester by doing some basic Python work, and compare and contrast with R.
First, let’s learn how to run R code in our regular, non-R kernel, f2021-s2022. In a cell, run the following code in order to load the extension that allows us to run R code.
%load_ext rpy2.ipython
You can see these instructions here.
Next, in order to actually run R code, we need to place %%R on the first line of every cell where we want to run R code. You can think of this as declaring the code cell as an R code cell. For example, the following will run successfully in our f2021-s2022 kernel.
%%R
my_vector <- c(1,2,3,4,5)
my_vector
Great! Run the cell in your notebook to see the output.
Now, let’s perform the equivalent operations in Python!
my_vector = (1,2,3,4,5)
my_vector
As you can see, the output is essentially the same. However, in Python there are a few "primary" ways you could do this.
my_tuple = (1,2,3,4,5,)
my_tuple
my_list = [1,2,3,4,5,]
my_list
import numpy as np
my_array = np.array([1,2,3,4,5])
my_array
The first two options are built into Python itself, so no import is needed. The first option is a tuple, which is an ordered sequence of values. A tuple is immutable, meaning that once you create it, you cannot change it.
my_tuple = (1,2,3,4,5)
my_tuple[0] = 10 # TypeError: 'tuple' object does not support item assignment
The second option is a list, which is also an ordered sequence of values. A list is mutable, meaning that once you create it, you can change it.
my_list = [1,2,3,4,5]
my_list[0] = 10
my_list # [10,2,3,4,5]
The third option is a numpy array, which is also an ordered sequence of values. A numpy array is mutable, meaning that once you create it, you can change it. In order to use numpy arrays, you must import the numpy package first. numpy is a numerical computation library that is optimized for a lot of the work we will be doing this semester. With that being said, it’s best to learn the basics of Python first, as a lot can be accomplished without using numpy.
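The difference between a tuple and a list can be seen by catching the error that a tuple raises on assignment. The following is a minimal sketch; the name new_tuple and the try/except pattern are illustrative additions, not part of the project.

```python
# A tuple rejects item assignment, while a list accepts it.
my_tuple = (1, 2, 3, 4, 5)
my_list = [1, 2, 3, 4, 5]

try:
    my_tuple[0] = 10           # tuples are immutable, so this raises TypeError
except TypeError as err:
    print("tuple:", err)

my_list[0] = 10                # lists are mutable, so this works
print("list:", my_list)

# To "change" a tuple, build a new tuple instead of modifying the old one.
new_tuple = (10,) + my_tuple[1:]
print("new tuple:", new_tuple)
```

Note that my_tuple itself is unchanged afterwards; only the new tuple holds the updated value.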
For this question, read as much as you can about tuples and lists, and run the examples we provided above.
- Code used to solve this problem.
- Output from running the code.
Question 2
In general, tuples are used when you have a set of known values that you want to store and access efficiently. Lists are used when you want to do the same, but you have the need to manipulate the data within. Most often, lists will be your go-to.
In Python, lists are objects. Objects have methods. Methods are most simply defined as functions that are associated with, and (usually) operate on, the data within the object itself.
Here you can find a list of the list methods. For example, the append method adds an item to the end of a list.
Methods are called using dot notation. The following is an example of using the append method and dot notation to add the number 99 to the end of our list, my_list.
my_list = [1,2,3,4,5]
my_list.append(99)
my_list # [1,2,3,4,5,99]
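To get a feel for dot notation beyond append, here is a short sketch of a few other list methods. The list name temps and its values are made up for illustration.

```python
# A few common list methods, each called with dot notation.
temps = [30, 10, 20]

temps.append(40)      # add to the end              -> [30, 10, 20, 40]
temps.insert(0, 5)    # insert 5 at index 0         -> [5, 30, 10, 20, 40]
temps.sort()          # sort the list in place      -> [5, 10, 20, 30, 40]
last = temps.pop()    # remove and return the last  -> 40
temps.remove(10)      # remove the first 10 found   -> [5, 20, 30]

print(temps, last)
```

Most of these methods change the list in place and return None, so there is no need to assign the result back to the variable; pop is the exception here because it returns the removed value.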
Create a list called my_list with the values 1,2,3,4,5. Then, use the list methods to change my_list to contain the following values, in order: 7,5,4,3,2,1,6. Do not manually set values using indexing; just use the list methods.
- Code used to solve this problem.
- Output from running the code.
my_list = [1,2,3,4,5]
my_list.append(7)
my_list.reverse()
my_list.append(6)
my_list
[7, 5, 4, 3, 2, 1, 6]
Question 3
Great! You may have noticed (or already know) that to get the first value in a list (or tuple) we would do my_list[0]. Recall that in R, we would do my_list[1]. This is because Python has 0-based indexing instead of 1-based indexing. While at first this may be confusing, many people find it much easier to use 0-based indexing than 1-based indexing.
Use indexing to print the values 7,4,2,6 from the modified my_list in the previous question.
Use indexing to print the values in reverse order without using the reverse method.
Use indexing to print the second through 4th values in my_list (5,4,3).
The "jump" feature of Python indexing will be useful here!
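The "jump" is the third part of a slice, written as sequence[start:stop:step]. The sketch below uses a made-up list called letters so that it does not give away the answers for my_list.

```python
# Slices take the form sequence[start:stop:step]; stop is exclusive.
letters = ["a", "b", "c", "d", "e", "f"]

first = letters[0]            # 'a' (indexing starts at 0)
last = letters[-1]            # 'f' (negative indexes count from the end)
middle = letters[1:4]         # ['b', 'c', 'd'] (index 4 is not included)
every_other = letters[::2]    # ['a', 'c', 'e'] (jump of 2)
backwards = letters[::-1]     # ['f', 'e', 'd', 'c', 'b', 'a'] (jump of -1)

print(first, last, middle, every_other, backwards)
```

Leaving start or stop empty means "from the beginning" or "to the end", which is why letters[::2] covers the whole list.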
Relevant topics: indexing
- Code used to solve this problem.
- Output from running the code.
my_list[::2]
my_list[::-1]
my_list[1:4]
[7, 4, 2, 6]
[6, 1, 2, 3, 4, 5, 7]
[5, 4, 3]
Question 4
Great! If you have one takeaway from the previous three questions, it should be that when you see [], think lists. When you see (), think tuples (or generators, but ignore this for now).
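As a quick check of this rule of thumb, the built-in type function reports what each pair of brackets produced. The variable names below are illustrative only.

```python
a_list = [1, 2, 3]                       # square brackets -> list
a_tuple = (1, 2, 3)                      # parentheses -> tuple
a_generator = (x * 2 for x in a_list)    # parentheses around a loop -> generator

print(type(a_list).__name__)        # list
print(type(a_tuple).__name__)       # tuple
print(type(a_generator).__name__)   # generator

# One gotcha: a one-element tuple needs a trailing comma.
single = (5,)        # a tuple containing 5
not_a_tuple = (5)    # just the integer 5
print(type(single).__name__, type(not_a_tuple).__name__)
```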
It’s not a Data Mine project without data. After we get through some basics of Python, we will be primarily working with data using the pandas and numpy libraries. With that being said, there is no reason not to do some work manually in the meantime!
Python does not have the data frame concept in its standard library like R does. This will most likely make things that would be simple to do in R much more complicated in Python.
Okay! Let’s get started with our NOAA weather data. The following is a very small sample of the /depot/datamine/data/noaa/2020.csv dataset.
AE000041196,20200101,TMIN,168,,,S,
AE000041196,20200101,PRCP,0,D,,S,
AE000041196,20200101,TAVG,211,H,,S,
AEM00041194,20200101,PRCP,0,,,S,
AEM00041194,20200101,TAVG,217,H,,S,
AEM00041217,20200101,TAVG,205,H,,S,
AEM00041218,20200101,TMIN,148,,,S,
AEM00041218,20200101,TAVG,199,H,,S,
AFM00040938,20200101,PRCP,23,,,S,
AFM00040938,20200101,TAVG,54,H,,S,
You can read here about what the data means.
- 11 character station ID
- 8 character date in YYYYMMDD format
- 4 character element code (you can see the element codes here in section III)
- value of the data (varies based on the element code)
- 1 character M-flag (10 possible values, see section III here)
- 1 character Q-flag (14 possible values, see section III here)
- 1 character S-flag (30 possible values, see section III here)
- 4 character observation time (HHMM) (0700 = 7:00 AM); may be blank
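To see how these eight fields line up with a row from the sample, the sketch below splits one row on commas and pairs each value with a label. The labels in FIELDS are our own names for the columns, not names defined by the dataset.

```python
# Labels we chose for the eight columns, in the order listed above.
FIELDS = ["station_id", "date", "element", "value",
          "m_flag", "q_flag", "s_flag", "obs_time"]

# One row copied from the sample above.
line = "AE000041196,20200101,PRCP,0,D,,S,"

values = line.split(",")             # a list of 8 strings; blanks become ''
record = dict(zip(FIELDS, values))   # pair each label with its value

print(record["element"], record["value"], record["m_flag"])
print(repr(record["obs_time"]))      # blank observation time
```

Every value comes back as a string, including the numeric value column, so it would need to be converted with int before doing any arithmetic.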
Since we aren’t using the pandas library, we need to use something in order to bring the data into Python. In this case, we will use the csv library, a library used for reading and writing DSV (delimiter-separated values) data.
The official documentation for this library is here.
If you read the first example in the csv.reader section here, you will find the following quick and succinct example.
import csv  # (1)
with open('eggs.csv', newline='') as csvfile:  # (2)
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')  # (3)
    for row in spamreader:  # (4)
        print(', '.join(row))  # (5)
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam
You do not need to understand everything that is happening in this example (yet). With that being said, the following is an explanation for each part.
(1) We are importing the csv library. If we didn’t have this line, the program would fail with a NameError when we try to call csv.reader(…) in the third line.
(2) We are opening the eggs.csv file. This is the file we will be reading. Here, eggs.csv is assumed to be in the same directory where we are running the code. It could just as easily be in a folder called "my_data" in the data depot, in which case we would replace eggs.csv with the absolute path to our file of interest: /depot/datamine/data/my_data/eggs.csv. In addition, we call our opened file csvfile.
(3) Here, we create a csv.reader object called spamreader. This object behaves like a generator: it yields one row at a time, and we can loop through it to get a single row of data at a time.
(4) Here, we are looping through each row of data from the spamreader object. For each loop, we save the data into a variable called row. Specifically, row is a list, where the first value is the first space-separated value in the row, the second is the second space-separated value in the row, etc.
(5) We then use a string method called join on the ", " string, which takes each value in the row and puts a ", " between them. This results in the "Spam, Spam, Spam, …, Baked Beans" that we see in the output.
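This code could also have been written without the with statement, by calling open directly and working with the file object it returns. But then we have to close the file ourselves; otherwise, it could cause issues down the road. The with statement, among other things, handles this automatically for you.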
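As a rough sketch of the version without with, the following opens the file with a plain open call and closes it explicitly. The temporary stand-in for eggs.csv, its two lines, and the try/finally wrapper are assumptions made so the snippet runs anywhere; they are not part of the original example.

```python
import csv
import tempfile
from pathlib import Path

# Create a small stand-in for eggs.csv so this sketch runs anywhere.
path = Path(tempfile.mkdtemp()) / "eggs.csv"
path.write_text("Spam Spam Spam Spam Spam |Baked Beans|\n"
                "Spam |Lovely Spam| |Wonderful Spam|\n")

# Without `with`: open the file, use it, then close it ourselves.
csvfile = open(path, newline="")
lines = []
try:
    spamreader = csv.reader(csvfile, delimiter=" ", quotechar="|")
    for row in spamreader:
        lines.append(", ".join(row))
finally:
    csvfile.close()   # runs even if reading the file raised an error

for line in lines:
    print(line)
```

The try/finally pair is roughly what the with statement does for us behind the scenes, which is why the with version is shorter and harder to get wrong.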
One important part of learning a new language is jumping right in and trying things out! Modify the provided code to read in the 2020.csv file and print the 4th column only.
We do not want you to print out every row of data; that would be a lot and could cause your notebook to crash! Instead, in the line following the print statement, add a break statement so that the loop stops after the first row. In general, we never want to print more than 10 or so lines, maybe 100 at the maximum. When in doubt, just print 10 lines.
- Code used to solve this problem.
- Output from running the code.
import csv
with open('/depot/datamine/data/noaa/2020.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        print(row[3])
        break
168
Question 5
Below we’ve provided you with code that we would like you to fill in. Print the first 10 rows of the data.
import csv
with open('/depot/datamine/data/noaa/2020.csv') as my_file:
    reader = csv.reader(my_file)
    # TODO: create variable to store how many rows we've printed so far
    for row in reader:
        print(row)
        # TODO: increment the variable storing our count, since we've printed a row
        # TODO: if we've printed 10 rows, run the break statement
        break
You will need to indent the break statement so that it runs inside the if statement you add.
If you want to try and solve this another way, Google "enumerate Python" and see if you can figure out how to do this without using the counting variable you create.
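For a sense of what enumerate does, here is a small sketch on a plain list rather than on the file. The list stations reuses a few station IDs from the sample; the names idx and printed are illustrative.

```python
stations = ["AE000041196", "AEM00041194", "AEM00041217", "AEM00041218"]

printed = []
# enumerate pairs each item with a counter that starts at 0.
for idx, station in enumerate(stations):
    if idx == 2:          # idx 0 and 1 are already done, so stop here
        break
    print(idx, station)
    printed.append(station)
```

Because the counter starts at 0, stopping when idx reaches 2 means exactly two items were handled.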
- Code used to solve this problem.
- Output from running the code.
import csv
with open('/depot/datamine/data/noaa/2020.csv') as my_file:
    reader = csv.reader(my_file)
    for ct, row in enumerate(reader):
        print(row)
        if ct == 9:
            break
['AE000041196', '20200101', 'TMIN', '168', '', '', 'S', '']
['AE000041196', '20200101', 'PRCP', '0', 'D', '', 'S', '']
['AE000041196', '20200101', 'TAVG', '211', 'H', '', 'S', '']
['AEM00041194', '20200101', 'PRCP', '0', '', '', 'S', '']
['AEM00041194', '20200101', 'TAVG', '217', 'H', '', 'S', '']
['AEM00041217', '20200101', 'TAVG', '205', 'H', '', 'S', '']
['AEM00041218', '20200101', 'TMIN', '148', '', '', 'S', '']
['AEM00041218', '20200101', 'TAVG', '199', 'H', '', 'S', '']
['AFM00040938', '20200101', 'PRCP', '23', '', '', 'S', '']
['AFM00040938', '20200101', 'TAVG', '54', 'H', '', 'S', '']
Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.