STAT 19000: Project 13 — Spring 2022
Motivation: We covered a lot this year! When dealing with data driven projects, it is useful to explore the data, and answer different questions to get a feel for it. There are always different ways one can go about this. Proper preparation prevents poor performance, in this project we are going to practice using some of the skills you’ve learned, and review topics and languages in a generic way.
Context: We are on the final stretch of two projects where there will be an assortment of "random" questions that may involve various datasets (and languages/tools). We may even ask a question that asks you to use a tool you haven’t used before — but don’t worry, if we do, we will provide you with extra guidance.
Scope: Python, R, bash, unix, computers
Dataset(s)
The following questions will use the following dataset(s):
-
/depot/datamine/data/flights/subset/2008.parquet
-
/depot/datamine/data/coco/unlabeled2017/000000000008.jpg
-
/depot/datamine/data/movies_and_tv/imdb.db
Questions
Answer the questions below using the language of your choice (R, Python, bash, awk, etc.). Don’t feel limited by one language, you can use different languages to answer different questions (maybe one language would be easier to do something in). If you are feeling bold, you can also try answering the questions using all languages! |
Question 1
Read in the file 2008.parquet
into a pandas
dataframe and convert the column DepTime
to a datetime. Print the first 50 converted values. They should match the following.
0 13:43:00 1 11:25:00 2 20:09:00 3 09:03:00 4 14:23:00 5 20:24:00 6 17:53:00 7 06:22:00 8 19:44:00 9 14:53:00 10 20:30:00 11 07:08:00 12 17:49:00 13 12:17:00 14 09:54:00 15 17:58:00 16 22:10:00 17 07:40:00 18 10:11:00 19 16:12:00 20 11:24:00 21 08:24:00 22 21:12:00 23 06:41:00 24 17:13:00 25 14:14:00 26 19:15:00 27 09:29:00 28 12:14:00 29 13:18:00 30 17:35:00 31 09:04:00 32 15:55:00 33 08:07:00 34 09:26:00 35 15:44:00 36 19:31:00 37 22:06:00 38 06:58:00 39 12:54:00 40 13:43:00 41 00:13:00 42 NaT 43 08:40:00 44 20:58:00 45 10:35:00 46 15:29:00 47 17:13:00 48 06:29:00 49 13:15:00 Name: DepTime, dtype: object
The |
The |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Use the PIL
package to loop through the pixels in this image, /depot/datamine/data/coco/unlabeled2017/000000000008.jpg
, and use f-strings to report the percentage of pixels that are "mostly green", "mostly blue", and "mostly red". The output should look like the following.
red: 2.66% green: 9.88% blue: 87.46%
To view the image:
|
These links should be helpful: |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
List the number of titles by year premiered
in the imdb.db
database. Don’t know SQL? That is 100% fine! Read the documentation here, and work with pandas
dataframes.
Can’t figure out how to go through all of the data without having the kernel crash? That’s okay! If you don’t want to do this right now, it is okay to simply give the results for the first 10k movies: sample of expected results for first 10k
type premiered 1892 3 1893 1 1894 6 1895 19 1896 104 1897 37 1898 45 1899 47 1900 82 1901 35 1902 36 1903 57 1904 21 1905 32 1906 41 1907 49 1908 157 1909 306 1910 362 1911 508 1912 600 1913 978 1914 1225 1915 1465 1916 1235 1917 1200 1918 1015 1919 307 1920 15 1921 5 1922 2 1925 4 1936 1 |
If you want to process the entire table of the database, great! The key is to use the chunksize argument. This returns an iterator — something you can loop over. If you set sample of results
premiered 1874.0 1.0 1877.0 1.0 1878.0 2.0 1881.0 1.0 1883.0 1.0 ... 2024.0 66.0 2025.0 14.0 2026.0 9.0 2027.0 6.0 2028.0 3.0 |
Want to use SQL? Okay! You can run sql queries on this database from within a Jupyter Notebook cell. For example:
This section will be helpful! |
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Check out the following two datasets:
-
/depot/datamine/data/okcupid/filtered/users.csv
-
/depot/datamine/data/okcupid/filtered/questions.csv
How many men (as defined by the gender2
column) believe and don’t believe in ghosts? How about women (as defined by the gender2
column)?
-
Code used to solve this problem.
-
Output from running the code.
Question 5
Get the total dollar amount of liquor sold in the /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
dataset.
This dataset is about 3.5 GB in size — this is more than you will be able to load in our Jupyter Notebooks in a |
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connect ion, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |