TDM 10200: Project 2 — 2023
Motivation: Pandas will enable us to work with data in Python (in a similar way to the data frames that we learned in R in the fall semester).
Context: This is our second project, and we will continue to introduce some basic data types and go through control flow concepts similar to those we covered in R.
Scope: tuples, lists, if statements, opening files
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/death_records/DeathRecords.csv
Questions
ONE
pandas is an integral tool for various data science tasks in Python. You can read a quick intro here.
Let’s first learn how to create a simple dataframe.
Suppose we have a store that sells only hats, which are blue or white, and shoes, which are red or purple.
Imagine it’s a Saturday and we want to keep track of our customers’ purchases.
Our first sale is a red shoe and a blue hat. The second sale is a purple shoe and a blue hat. The third is a red shoe and a blue hat, the fourth is a purple shoe and a white hat, the fifth is a red shoe and a white hat, and the sixth and seventh sales are each a red shoe and a blue hat.
So it looks a bit like this:
data = {
'shoes':['red', 'purple', 'red', 'purple', 'red', 'red', 'red'],
'hats': ['blue', 'blue', 'blue', 'white', 'white', 'blue', 'blue']
}
-
Create a data set named 'data'
-
Take the data you created and make it into a DataFrame named store
-
Now change the index labels 0-6 to the customer names Jay, Mary, Bill, Chris, Martha, Karen, Rob
Helpful Hint
import pandas as pd
store = pd.DataFrame(data, index=['Jay', 'Mary', 'Bill', 'Chris', 'Martha', 'Karen', 'Rob'])
store
Insider Knowledge
pandas allows you to read data from a CSV (comma-separated values) file. It is a great way to get acquainted with your data, and it gives you the ability to clean, transform, and analyze that data.
The two main components of pandas are the Series and the DataFrame. A Series is one-dimensional (you can think of it as a column of data), and a DataFrame is a table made up of a collection of Series.
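To make the Series and DataFrame relationship concrete, here is a minimal sketch. The inventory table and its column names are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy data (hypothetical, not from the project dataset)
inventory = pd.DataFrame({
    "item": ["hat", "shoe", "scarf"],
    "price": [12.5, 40.0, 8.0],
})

# Selecting one column with [] returns a Series (one-dimensional)
prices = inventory["price"]
print(type(inventory).__name__)  # DataFrame
print(type(prices).__name__)     # Series

# Each column (Series) shares the row index of the DataFrame it came from
print(prices.index.equals(inventory.index))  # True
```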
Notice that the default indexing for our DataFrame starts at 0. In Python, indexing starts at 0, whereas in R, which we used in the fall semester, indexing begins at 1. This is an important fact to remember.
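A short sketch of zero-based indexing, using a made-up list of letters: the first element lives at position 0, so the last of four elements lives at position 3.

```python
import pandas as pd

letters = ["a", "b", "c", "d"]
print(letters[0])  # first element: a
print(letters[3])  # fourth (last) element: d

# A Series built without an explicit index gets labels 0, 1, 2, ...
s = pd.Series(letters)
print(list(s.index))  # [0, 1, 2, 3]
print(s.iloc[0])      # a
```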
-
Code used to answer the question.
-
Result of code.
TWO
Open a new notebook in Jupyter Lab, and select the f2022-2023 kernel.
We want to go ahead and read in the dataset /anvil/projects/tdm/data/death_records/DeathRecords.csv into a pandas DataFrame called myDF.
-
Find the information for the 11th row in the dataframe
-
Find the last five rows of the data frame
-
Find how many rows and columns there are in the entire dataframe
-
Print just the column names
Helpful Hints
.head()
.tail()
.shape
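Here is a rough sketch of how these hints behave on a tiny table. The CSV text, the name toy_df, and the column names are all hypothetical stand-ins so that the example runs anywhere; for the project you would pass the dataset path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the columns are made up
csv_text = """id,age,city
1,34,Lafayette
2,51,Indianapolis
3,27,Chicago
4,68,Lafayette
5,45,Chicago
6,39,Indianapolis
"""

# pd.read_csv accepts a file path or a file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.head(2))        # first 2 rows (the default is 5)
print(toy_df.tail(2))        # last 2 rows (the default is 5)
print(toy_df.shape)          # (rows, columns): (6, 3)
print(list(toy_df.columns))  # ['id', 'age', 'city']

# .iloc selects by position, counting from 0, so position 2 is the 3rd row
third_row = toy_df.iloc[2]
print(third_row["city"])     # Chicago
```

Note that .shape is an attribute, not a method, so it is written without parentheses.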
-
Code used to answer the question.
-
Result of code.
THREE
Let’s look for specific information in our dataframe so we can become a bit more familiar with what it contains.
-
How many people over the age of 52 are on this list?
-
How many males are there compared to females?
-
How many females over the age of 70 are on this list?
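Questions like these are usually answered by filtering rows and counting. Below is a minimal sketch on a toy table; the column names age and sex and the codes "F" and "M" are assumptions for illustration, so check the real column names and values in myDF before adapting it.

```python
import pandas as pd

# Hypothetical toy records; the real dataset's column names and codes may differ
people = pd.DataFrame({
    "age": [25, 61, 47, 80, 33, 72],
    "sex": ["F", "M", "F", "F", "M", "M"],
})

# A comparison on a column gives a boolean Series (True where it holds)
over_40 = people["age"] > 40
print(over_40.sum())  # number of True values: 4

# value_counts tallies how often each distinct value occurs
print(people["sex"].value_counts())

# Combine conditions with & (and); each condition needs its own parentheses
women_over_40 = people[(people["sex"] == "F") & (people["age"] > 40)]
print(len(women_over_40))  # 2
```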
-
Code used to answer the question.
-
Result of code.
FOUR
Now that we have a bit of familiarity with the data, let’s introduce another common Python package called matplotlib.
Let’s create a graphic using this package.
-
Create a graphic that illustrates the number of people who are divorced, married, single, unmarried, or widowed.
-
Create another graphic that illustrates the distribution of the age of the person at the time of death.
Helpful Hint
import matplotlib.pyplot as plt
Insider Knowledge
Matplotlib is a data visualization and plotting library for Python. It provides easy ways to visualize data.
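As one possible starting point, the sketch below draws a bar chart of category counts and a histogram of a numeric column. The pets table and its columns are invented for illustration; a bar chart and a histogram are just one reasonable pair of choices, not the only way to make these graphics.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not taken from the project dataset
pets = pd.DataFrame({
    "kind": ["cat", "dog", "dog", "fish", "cat", "dog"],
    "weight_kg": [4.1, 22.0, 30.5, 0.2, 5.3, 18.7],
})

# One figure with two side-by-side axes
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(9, 3.5))

# Bar chart: one bar per category, height = number of rows in that category
counts = pets["kind"].value_counts()
ax_bar.bar(counts.index, counts.values)
ax_bar.set_xlabel("kind")
ax_bar.set_ylabel("count")

# Histogram: distribution of a numeric column, grouped into bins
ax_hist.hist(pets["weight_kg"], bins=5)
ax_hist.set_xlabel("weight (kg)")
ax_hist.set_ylabel("count")

fig.tight_layout()
```

In a Jupyter notebook the figure appears below the cell; in a plain Python script you would call plt.show() at the end.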
-
Code used to answer the question.
-
Result of code.
FIVE
Now that you are familiar with the data and have an introduction to plotting, create a plot of your choice, to summarize something that you find interesting about the data.
-
Use pandas and your investigative skills to look through the data and find an interesting fact, and then create a graphic that summarizes some of the data from our dataset!
-
Code used to answer the question.
-
Result of code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.