Python to CUAHSI WaterML & WaterOneFlow web service, Pt. 1

CUAHSI HIS has developed cyberinfrastructure tools for inland waters that look pretty impressive (at least for the US, when it comes to datasets that have been ingested). As I’m starting to get involved in this, even if peripherally, I wanted to get direct, hands-on experience with their WaterOneFlow (“WOF”) web service and WaterML data encoding format. For me, this means getting comfortable using Python to interact with this data stream.

Here, I’ll present simple examples of calling WOF SOAP services to retrieve time series from HIS servers. Then, I’ll parse the WaterML data (including metadata), convert the time series to Numpy arrays, and plot them using Matplotlib. Much of this is new to me, so it’s been fun and somewhat headache-inducing at the same time. But I was able to get running quickly thanks to help from Jon Goodall, who provided sample Python code and tips and answered several questions; and David Tarboton, who pointed me to Matlab samples he had already developed for CUAHSI HIS training materials. I’ll point out distinctions and similarities to Jon’s code at the end. I specifically wanted to use Python standard libraries as much as possible (so the code can be easily re-used by others); use natural Pythonic constructs rather than array-looping and verbose object-access syntax; and reproduce most of the workflow that David demonstrates in his Matlab tutorial.

For reference, I’m doing this with Python 2.5 on Windows Vista. Some instructions may not apply to earlier versions of Python.

Getting set to parse XML from WaterOneFlow or WaterML

Once we’re ready to process the WaterML data in memory as XML, using standard XML parsing tools, it shouldn’t matter whether the data came from a remote server via WOF or from a local WaterML file (e.g., downloaded through Jon’s FetchWaterML tool). Parsing from that point on should be identical and generic regardless of data origin. That’s what I’ll show in this section: from data source (WOF or WaterML) to the starting point for XML parsing.

Use of SOAP web services and XML parsing are common programming tasks these days, so there should be plenty of tools out there to make that easy and generic (especially XML, as XML is everywhere). First, web services. Jon pointed me in the right direction, with the suds package (“Suds is a lightweight SOAP python client for consuming Web Services”). This is not part of the standard library but it’s well maintained. So, it’s an easy install, literally: easy_install suds

Throughout this exercise, I will use two different types of time series as examples: USGS-NWIS daily data (“NWISDV”) and sporadic or irregular instantaneous data (“NWISIID”). The first one will be discharge from a stream (Big Rock Creek, California), and the second one nitrate concentration (water quality) from the Mississippi mainstem in Louisiana. The time period requested will be the same, 2000 to 2006.

import suds.client as SC

# NWISDV (Discharge at Big Rock Creek)
wsdlurl = ""
site, variable   = ("NWIS:10263500", "NWIS:00060")
# NWISIID: Nitrate concentration in the Mississippi mainstem
#wsdlurl = ""
#site, variable   = ("NWIS:07373420", "NWIS:00618")

dt_begin, dt_end = ("2000-08-01T00:00:00", "2006-08-01T00:00:00")

# Access the WOF service WSDL, then issue WOF GetValues request
client = SC.Client(wsdlurl)
wmlvalues_resp = client.service.GetValues(site,variable,dt_begin,dt_end)
# Convert response to a plain string
wmlvalues_resp_xml_str = SC.tostr(wmlvalues_resp)

The suds GetValues response is an object, not a string. tostr() converts this to a plain string, the WaterML XML data. Time to parse with ElementTree, ET. Starting with Python 2.5, ElementTree comes as part of the standard library, pre-installed. ElementTree “treats XML data as a lists of lists”, and is widely considered a more intuitive and pythonic way of processing XML, working seamlessly with Python constructs and data types.

Back to our example. We’ll apply ET’s XML function to the XML string to create an “element” object, wmlroot_el, that points to the root or top level of the XML hierarchy:

import xml.etree.ElementTree as ET
wmlroot_el = ET.XML(wmlvalues_resp_xml_str)

Now we’re really ready to parse and extract data and attributes. But before doing that, let’s back up. If we already have a local WaterML file (i.e., suds not needed), we use ET’s parse() function to read the file into an ET “element tree”, then call the getroot method on that element tree to create an element object, wmlroot_el, pointing to the root of the XML hierarchy:

waterml_file = r"C:\data\MyWaterMLfile.xml"  # raw string keeps the backslashes literal
wmletree  = ET.parse(waterml_file)
wmlroot_el = wmletree.getroot()
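
To see that both routes end at the same kind of object, here is a minimal sketch that parses the same XML once from a string and once from a file. The tiny document and the temporary file are made up for illustration (this is not real WaterML), and the code is written for a current Python 3 rather than the 2.5 used in this post.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up stand-in for a WaterML response (not real WaterML)
sample_xml = "<timeSeriesResponse><timeSeries><values/></timeSeries></timeSeriesResponse>"

# Route 1: XML already in memory as a string (e.g. from a web service)
root_from_str = ET.XML(sample_xml)

# Route 2: XML stored in a local file
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(sample_xml)
    tmp_path = tmp.name
try:
    root_from_file = ET.parse(tmp_path).getroot()
finally:
    os.remove(tmp_path)

# Both are Element objects pointing at the same root tag
same_root = (root_from_str.tag == root_from_file.tag)
print(root_from_str.tag, same_root)
```

From here on, any code that takes a root element does not need to know which route produced it.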

We can now get an ET root element object for the WaterML XML data stream, regardless of the source. Neat. But before extracting data, I have to point out a glitch in the NWISIID data stream (not present in the NWISDV stream) that needs to be handled. The queryInfo tag has a child, <note title=”USGS URL”>, that holds a URL (requesting data from the existing USGS system, but that’s beside the point) that includes the “&” character. A bare “&” is not valid in XML text, so it causes errors with the ET parser. I noticed that Jon’s FetchWaterML app replaced & with the equivalent character entity “&amp;”, so I applied this substring replacement on the XML string extracted with suds, wmlvalues_resp_xml_str, before the ET.XML(wmlvalues_resp_xml_str) line described earlier:

if '&' in wmlvalues_resp_xml_str:
    # Only escape if the string has not been escaped already
    if '&amp;' not in wmlvalues_resp_xml_str:
        wmlvalues_resp_xml_str = wmlvalues_resp_xml_str.replace('&', '&amp;')
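
A plain substring replacement is all-or-nothing: it escapes either every & or none, so a string that mixes escaped and bare ampersands is awkward to handle. A slightly more defensive sketch (my own variation, not what FetchWaterML does) escapes only the ampersands that do not already start an entity, using a regular expression. The function name, pattern name and sample string are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Match '&' unless it already begins an entity such as &amp; &#38; or &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(xml_str):
    """Return xml_str with unescaped '&' characters replaced by '&amp;'."""
    return BARE_AMP.sub("&amp;", xml_str)

# Made-up fragment mixing a bare '&' with an already-escaped one
raw = '<note title="USGS URL">http://example.org/q?a=1&b=2&amp;c=3</note>'
fixed = escape_bare_ampersands(raw)

# The fixed string now parses, and the parser turns the entities back into '&'
note_el = ET.XML(fixed)
print(fixed)
print(note_el.text)
```

One caveat of this approach: a bare ampersand that happens to be followed by a name and a semicolon would be mistaken for an entity and left alone.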

From WaterML XML to array data and pretty plots

*Finally*, we’re ready to parse. First, note what we get from print wmlroot_el:
<Element {}timeSeriesResponse at 640a918>
This is a hint that ElementTree adds the waterML namespace as a prefix to all tags. This is the ugliest part of using ElementTree, as we’ll see.
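
A quick way to see this for yourself is to look at the tag attribute of an element. The sketch below uses a made-up namespace URI (http://example.org/waterml, a placeholder, not the real WaterML namespace) and pulls the namespace back out of the root tag, which avoids typing it by hand.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI, for illustration only
sample_xml = (
    '<timeSeriesResponse xmlns="http://example.org/waterml">'
    '<timeSeries/></timeSeriesResponse>'
)
root = ET.XML(sample_xml)

# ElementTree stores the tag as "{namespace-uri}localname"
print(root.tag)

# Split the namespace and the local name out of the qualified tag
if root.tag.startswith("{"):
    ns_uri, local_name = root.tag[1:].split("}", 1)
else:
    ns_uri, local_name = "", root.tag
print(ns_uri, local_name)
```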

You can use a subset of XPath with ET to find one or all tags at a specific hierarchy level. Each individual observation in a WaterML time series is stored in a value tag; in principle, we can read all these values in their original sequence like this:

elvalueall = wmlroot_el.findall('./timeSeries/values/value')

We can then iterate through elvalueall, like any other Python iterable object. Except I neglected the annoying namespace issue that’s somewhat particular to ET. All tags below the root in the WaterML are implicitly qualified with the waterML namespace. To deal with this, we add this namespace prefix to every tag:

wmlxmlns = ""
elvalueall = wmlroot_el.findall('./{%s}timeSeries/{%s}values/{%s}value' %
                                (wmlxmlns, wmlxmlns, wmlxmlns))
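
Here is a self-contained sketch of the namespace-qualified search, run against a tiny made-up document. The namespace URI is a placeholder, not the real WaterML one. The second form, which passes a prefix-to-URI dictionary through the namespaces argument of findall, works in current Python 3; I have not checked it against the ElementTree shipped with Python 2.5, so treat that part as an assumption.

```python
import xml.etree.ElementTree as ET

ns_uri = "http://example.org/waterml"  # placeholder, not the real WaterML namespace
sample_xml = (
    '<timeSeriesResponse xmlns="%s"><timeSeries><values>'
    '<value dateTime="2000-08-01T00:00:00">7.1</value>'
    '<value dateTime="2000-08-02T00:00:00">6.5</value>'
    '</values></timeSeries></timeSeriesResponse>' % ns_uri
)
root = ET.XML(sample_xml)

# Without the namespace, nothing matches
unqualified = root.findall("./timeSeries/values/value")

# Form 1: spell out "{uri}tag" for every step of the path
qualified = root.findall("./{%s}timeSeries/{%s}values/{%s}value" % ((ns_uri,) * 3))

# Form 2: map a prefix to the URI and let findall expand it
prefixed = root.findall("./w:timeSeries/w:values/w:value", namespaces={"w": ns_uri})

print(len(unqualified), len(qualified), len(prefixed))
```

The prefix "w" is arbitrary; it only has to match between the path and the dictionary.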

Now we read the time series into a numpy floating-point array. Numpy is widely used in scientific applications. It provides a generic array object of a single data type (integer, float, string, objects, etc). It enables array-oriented operations that are similar to what Matlab provides. But we’ll take advantage of Python’s list comprehension to iterate over elvalueall in a compact, readable way, to create our numpy time series value array (we need to convert to float because the XML element text is read as a string, by necessity):

import numpy as NY
val_a = NY.asarray([float(val.text) for val in elvalueall])

Numpy must be installed first, but I won’t cover that here. Now we’ll do the same for the date-time value corresponding to each observation value. Except, the date-time is stored as an element attribute (“dateTime“) in ISO 8601 string representation (e.g., “2007-03-04T20:32:17”). Jochen Voss provides a very nice summary of date-time handling and conversions in Python using the standard library. After reading the datetime string values into a list, we’ll use the strptime method of the standard library’s datetime class to convert the ISO date strings into datetime objects, and create a numpy array of datetime objects:

import datetime as DT
dtstr_lst = [val.get('dateTime') for val in elvalueall]
isodtformat = "%Y-%m-%dT%H:%M:%S"
dt_a = NY.asarray([DT.datetime.strptime(dtstr, isodtformat) for dtstr in dtstr_lst])

(The dateutil package has powerful, easy-to-use date-time parsers, but I wanted to stick to the standard library). Element attributes are loaded into a dictionary, and ET provides the get method that’s just like the standard dictionary method. Nice.
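
Putting the value and date-time steps together on a small made-up fragment (no namespace, to keep it short), using only the standard library. Plain lists stand in for the numpy arrays here; wrapping either list in NY.asarray(...) would give arrays like the ones described above.

```python
import datetime as DT
import xml.etree.ElementTree as ET

# Made-up fragment shaped like the values block of a response
sample_xml = (
    '<values>'
    '<value dateTime="2000-08-01T00:00:00">7.1</value>'
    '<value dateTime="2000-08-02T00:00:00">6.5</value>'
    '<value dateTime="2000-08-03T00:00:00">5.7</value>'
    '</values>'
)
value_els = ET.XML(sample_xml).findall("./value")

isodtformat = "%Y-%m-%dT%H:%M:%S"
# Element text is always a string, so convert explicitly to float
vals = [float(el.text) for el in value_els]
# Attributes are read with get(), just like a dictionary
dts = [DT.datetime.strptime(el.get("dateTime"), isodtformat) for el in value_els]

# get() returns None (or a supplied default) for a missing attribute
missing = value_els[0].get("noSuchAttribute")

print(vals)
print(dts[0].isoformat(), missing)
```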

We’re done! But we want to see evidence that something really happened, and pretty pictures, so here’s a brief listing of the val_a and dt_a arrays from the Python interactive prompt:

array([  7.1,   6.5,   5.7, ...,  13. ,  13. ,  12. ])

array([2000-08-01 00:00:00, 2000-08-02 00:00:00, 2000-08-03 00:00:00, ...,
       2006-07-31 00:00:00, 2006-08-01 00:00:00], dtype=object)

Using matplotlib, we create a simple time series plot (but first convert dt_a objects to a form that can be used for nice date formatting of the x-axis, using the date2num function). “pylab” is a sort-of wrapper around matplotlib, but it can get confusing, so I won’t go into the distinctions. Here’s the code and the plot:

import pylab
pylab.plot_date(pylab.date2num(dt_a), val_a, '-o', markersize=2.5)
[Figure: matplotlib time series plot]

Wrap-up, and what’s next

In the second and final part, I’ll create some nice packaging for this code, to make it easier to extract additional time series. I’ll also extract useful metadata from the WaterML file and use it to label the plot. Before closing, I want to get back to Jon’s Python code. While I used ElementTree, Jon relied on minidom and xpath. Nothing wrong with that; it works. The main advantage I see to his approach is that with xpath, elements don’t need to be prefixed with the waterML namespace, so parsing is cleaner. But there are two disadvantages: 1, xpath is part of PyXML, a package that needs to be installed (I ran into an error while trying to install it, something rather rare in my experience); 2, while minidom is part of the standard library, it inherits and imposes the DOM syntax and language, for obvious reasons, and that adds a bit of extra skillset baggage for users. DOM traversal is not clean, straightforward Python.


7 Responses to Python to CUAHSI WaterML & WaterOneFlow web service, Pt. 1

  1. Pingback: Python to CUAHSI WaterML & WaterOneFlow web service, Pt. 2 « Mi estero

  2. Jon Goodall says:

    Nice post, Emilo! etree looks like a very handy XML parser. I need to take a look. Did you try the GetValuesObject method? In principle, this should return an object that is automatically deserialized (i.e. no need to parse xml yourself). Be interesting to see if suds can handle it.

    • emayorga says:

      Jon: Thanks. Actually, I was curious about the *Object methods. I don’t quite know how these objects are mapped into data structures and whether this is a well-defined, generic SOAP scheme that should “just work” in all proper implementations. So, XML seemed more straightforward and transparent in this case. If you can point me to a document that describes this, I can take a look. I started out by trying the GetSites method, but could never get it to fully work, so I gave up and switched to trying GetValues (I also tried GetSitesXml and was getting somewhere before I switched over). Parsing the XML isn’t bad at all once I got past the namespace issue with etree’s XPath.

  3. Pingback: virtual water » Using Python’s suds package to connect to the CUAHSI HIS

  4. Jon Goodall says:

    I did a quick test to see if I could extract the site name from a GetSiteInfo call. It seems to work, at least for this example.

  5. Pingback: virtual water » Querying from Python

  6. Pingback: Python to CUAHSI WaterML & WaterOneFlow web service, Pt. 3 (HIS Central Web Services) « Mi estero
