STAT 29000: Project 1 — Spring 2022
Motivation: Extensible Markup Language, or XML, is an important file format for storing structured data. Even though formats like JSON and CSV tend to be more prevalent, many legacy systems still use XML, and it remains an appropriate format for storing complex data. JSON and CSV are themselves losing some ground as newer formats and serialization methods like Parquet and protobufs become more common.
Context: This is the first in a series of 5 projects on web scraping in Python, with an emphasis on XML.
Scope: Python, XML
Dataset(s)
The following questions will use the following dataset(s):
- /depot/datamine/data/otc/hawaii.xml
Questions
Question 1
Please review our updated submission guidelines before submitting your project.
For this project, you may find the questions and solutions of an old project found here useful.
You should read through the small XML section of the book.
You should read through the 10 minute Pandas tutorial.
One of the challenges of XML is that it can be hard to get a feel for how the data is structured, especially in a large XML file. A good first step is to find the name of the root node. Use the lxml package to find and print the name of the root node.
Interesting! If you took a look at the previous project, you probably weren’t expecting the extra {urn:hl7-org:v3} part in the root node name. This is because the previous project’s dataset didn’t have a namespace! Namespaces in XML are a way to prevent issues where a document may have multiple sets of node names that are identical but have different meanings. The namespaces allow them to exist in the same space without conflict.
Practically, what does this mean? It makes XML parsing ever-so-slightly more annoying to perform. Instead of being able to enter an XPath expression and get elements back, we have to define a namespace as well. This will be made more clear later.
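To make the namespace idea concrete, here is a minimal sketch on a tiny made-up document (not the project dataset). The title element and its text are assumptions chosen only for illustration; the namespace URI is the one mentioned above.

```python
from lxml import etree

# Hypothetical miniature document; only the namespace URI comes from the project text.
xml_bytes = b"""<document xmlns="urn:hl7-org:v3">
  <title>Example label</title>
</document>"""

root = etree.fromstring(xml_bytes)

# lxml reports tags as {namespace-uri}local-name
print(root.tag)

# etree.QName splits the tag into its namespace and local name
qname = etree.QName(root)
print(qname.namespace)
print(qname.localname)
```

The braces in the tag are how lxml shows that an element belongs to a namespace; the local name alone is what you would see in a file without one.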
- Code used to solve this problem.
- Output from running the code.
Question 2
XML can be nested — there can be elements that contain other elements that contain other elements. In the previous question, we identified the root node AND the namespace. Just like in the previous Spring 290 project 1 (linked in the "tip" in question 1), we would like you to find the names of the next "tier" of elements.
This will not be a copy/paste of the previous solution. Why? Because of the namespace!
First, try to use the same method from question 2 of that previous project to find the next tier of names. What happens?
hawaii.xpath("/document") # won't work
hawaii.xpath("{urn:hl7-org:v3}document") # still won't work with the namespace there
How do we fix this? We must define our namespace, and reference it in our XPath expression. For example, the following will work.
hawaii.xpath("/ns:document", namespaces={'ns': 'urn:hl7-org:v3'})
Here, we are passing a dict to the namespaces argument. The key is whatever we want to call the namespace, and the value is the namespace itself. For example, the following would work too.
hawaii.xpath("/happy:document", namespaces={'happy': 'urn:hl7-org:v3'})
So, unfortunately, every time we want to use an XPath expression, we have to prepend namespace: before the name of the element we are looking for. This is a pain, and unfortunately we cannot just define it once and move on.
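The following sketch shows the same behavior on a small made-up document. The child elements (title, author) and the NS variable name are assumptions for illustration. The mapping still has to be passed on every call, but keeping it in one variable cuts down on the repetition.

```python
from lxml import etree

# Hypothetical miniature document standing in for a namespaced XML file.
doc = etree.fromstring(
    b'<document xmlns="urn:hl7-org:v3"><title>Example</title><author/></document>'
)

# Keep the prefix-to-URI mapping in one place and reuse it on each call.
NS = {"ns": "urn:hl7-org:v3"}

# Without a namespace the expression matches nothing.
no_prefix = doc.xpath("/document")

# With the prefix and the namespaces argument, the root is found.
with_prefix = doc.xpath("/ns:document", namespaces=NS)

# The prefix is needed at every step of the path, not only the first.
children = doc.xpath("/ns:document/ns:*", namespaces=NS)

print(no_prefix)
print([el.tag for el in with_prefix])
print([etree.QName(el).localname for el in children])
```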
Okay, given this new information, please find the next "tier" of elements.
There should be 8.
- Code used to solve this problem.
- Output from running the code.
Question 3
Okay, lucky for you, this XML file is not so big! Use the UNIX skills you refined last semester to print the content of the XML file. You can print the entirety in a bash cell if you wish.
You will be able to see that it contains information about a drug of some sort.
Now that you know there are ingredient elements in the XML file, write Python code and an XPath expression to get a list of all of the ingredient elements. Print the list of elements.
When we say "print the list of elements", we mean print the XML of each element in the list. For example, the first element would be:
<ingredient classCode="IACT"> <ingredientSubstance> <code code="O7TSZ97GEP" codeSystem="2.16.840.1.113883.4.9"/> <name>DIBASIC CALCIUM PHOSPHATE DIHYDRATE</name> </ingredientSubstance> </ingredient>
To print an Element object, see the following.
print(etree.tostring(my_element, pretty_print=True).decode('utf-8'))
- Code used to solve this problem.
- Output from running the code.
Question 4
At this point in time you may be wondering how to actually access the bits and pieces of data in the XML file.
There is data between tags.
<name>DIBASIC CALCIUM PHOSPHATE DIHYDRATE</name>
To access such data from the "name" Element (which we will call my_element below) you would do the following.
my_element.text # DIBASIC CALCIUM PHOSPHATE DIHYDRATE
There is also data tucked away in a tag’s attributes.
<code code="O7TSZ97GEP" codeSystem="2.16.840.1.113883.4.9"/>
To access such data from the "code" Element (which we will call my_element below) you would do the following.
my_element.attrib['code'] # O7TSZ97GEP
my_element.attrib['codeSystem'] # 2.16.840.1.113883.4.9
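Putting both kinds of access together, here is a minimal sketch using the fragment shown in Question 3. The namespace is left out to keep the example short, which is an assumption that does not hold for the real file, and the variable names are made up.

```python
from lxml import etree

# Fragment taken from the example element in Question 3, without a namespace.
fragment = b"""<ingredientSubstance>
  <code code="O7TSZ97GEP" codeSystem="2.16.840.1.113883.4.9"/>
  <name>DIBASIC CALCIUM PHOSPHATE DIHYDRATE</name>
</ingredientSubstance>"""

substance = etree.fromstring(fragment)
name_el = substance.find("name")
code_el = substance.find("code")

# Data between tags is available through .text
print(name_el.text)

# Data in attributes is available through the dict-like .attrib
print(code_el.attrib["code"])

# .get() also reads an attribute and accepts a default for missing ones
print(code_el.get("codeSystem"))
print(code_el.get("missing", "n/a"))
```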
The aspect of XML that we are most interested in learning about is XPath expressions. XPath expressions are a clear and effective way to extract elements from an XML document (or HTML document, as when extracting data from a webpage!).
In the previous question you used an XPath expression to find all of the ingredient elements, regardless of where they were or how they were nested in the document. Let’s practice more.
If you look at the XML document, you will see that there are a lot of code attributes. Use lxml and XPath expressions to first extract all elements with a code attribute. Print all of the values of the code attributes.
Repeat the process, but modify your XPath expression so that it only keeps elements that have a code attribute that starts with a capital "C". Print all of the values of the code attributes.
You can use the |
This link may help you when figuring out how to select the elements where the |
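As one possible approach, XPath predicates can test for the presence of an attribute and the XPath function starts-with can test its value. The sketch below uses a made-up document with no namespace; the catalog, entry, and note names are assumptions, and with the real file a namespace prefix would be needed when naming elements.

```python
from lxml import etree

# Hypothetical data: three elements carry a code attribute, one does not.
root = etree.fromstring(
    b'<catalog>'
    b'<entry code="C100"/>'
    b'<entry code="A200"/>'
    b'<entry/>'
    b'<note code="c300"/>'
    b'</catalog>'
)

# Any element, anywhere, that has a code attribute
has_code = root.xpath("//*[@code]")
codes = [el.attrib["code"] for el in has_code]
print(codes)

# Only elements whose code attribute starts with a capital C (case-sensitive)
starts_with_c = root.xpath('//*[starts-with(@code, "C")]')
print([el.attrib["code"] for el in starts_with_c])

# Alternatively, select the attribute values themselves
values = root.xpath('//@code[starts-with(., "C")]')
print(values)
```

Selecting the elements keeps the rest of each element available, while selecting the attribute values directly is shorter when only the values are needed.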
- Code used to solve this problem.
- Output from running the code.
Question 5
The quantity element contains a numerator and a denominator element. Print all of the quantities in the XML file, where a quantity is defined as the value of the value attribute of the numerator element divided by the value of the value attribute of the corresponding denominator element. Lastly, print the unit (part of the numerator element) afterwards.
The results should read as follows:
1.0 1
5.0 g
7.6 mg
5.0 g
4.0 g
230.0 mg
4.0 g
You may need to use the |
You can use the |
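The general pattern of pairing each numerator with its own denominator can be sketched on made-up data as follows. The namespace URI, the doc wrapper element, and the attribute values here are all assumptions for illustration; only the quantity, numerator, and denominator names and the value and unit attributes come from the question.

```python
from lxml import etree

# Hypothetical namespace and data, not taken from the project file.
NS = {"ns": "urn:example:demo"}
root = etree.fromstring(
    b'<doc xmlns="urn:example:demo">'
    b'<quantity><numerator value="15" unit="mg"/><denominator value="2" unit="1"/></quantity>'
    b'<quantity><numerator value="3" unit="g"/><denominator value="1" unit="1"/></quantity>'
    b'</doc>'
)

results = []
for q in root.xpath("//ns:quantity", namespaces=NS):
    # Relative paths (starting with ./) search within this quantity only,
    # so each numerator is paired with its own denominator.
    num = q.xpath("./ns:numerator", namespaces=NS)[0]
    den = q.xpath("./ns:denominator", namespaces=NS)[0]

    # Attribute values are strings, so convert before dividing.
    ratio = float(num.attrib["value"]) / float(den.attrib["value"])
    results.append((ratio, num.attrib["unit"]))
    print(ratio, num.attrib["unit"])
```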
- Code used to solve this problem.
- Output from running the code.
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.