June 14, 2015

Libraries

When using libraries, I often find myself torn between the standard ones versus importing a firehose like numpy. The former gives me the base upon which I build functions and get the task at hand done, which is often tedious, but offers a sense of accomplishment in the end. Whereas with the latter, all the heavy lifting is done for you by this powerful library. (There are, of course, occasions where I do not hesitate to use numpy, like generating plots for example, where I am less interested in understanding graphics, more in getting results for further realization.) This I suppose rarely crosses a full-time programmer’s or a data scientist’s mind, who may be doing this all day with a key objective of writing error-free code quickly and painlessly.

For someone who programs occasionally — not as my primary work routine, I find developing muscle memory valuable, as it helps me reapply techniques across a variety of tasks. Failing often forces me to pay attention to details, and seek clarity on control flow. I realized this when I was writing a block of code to process vessel motions data from sea-transporting assets recently. The list (or array) looked like this (saved in a file named motions_data.py):

topside_assets = [{'asset': 'A1', 'ves': 'S1', 'rangle': 10.5, 'rperiod': 12.2, 'pangle': 5.7, 'pperiod': 12.2, 'heave_accl': 0.2, 'heave_period': 12.2, 'cargo_wt': 2231.1}, 
    {'asset': 'A2', 'ves': 'P1', 'rangle': 13.7, 'rperiod': 12.2, 'pangle': 7.4, 'pperiod': 12.2, 'heave_accl': 0.2, 'heave_period': 12.2, 'cargo_wt': 2231.1}, 
    ... longlist skipped for brevity ...
    {'asset': 'B1', 'ves': 'M1', 'rangle': 15.3, 'rperiod': 6.9, 'pangle': 4.8, 'pperiod': 7.3, 'heave_accl': 0.2, 'heave_period': 6.9, 'cargo_wt': 1676.7}, 
    {'asset': 'B2', 'ves': 'M1', 'rangle': 15.3, 'rperiod': 6.9, 'pangle': 4.8, 'pperiod': 7.3, 'heave_accl': 0.2, 'heave_period': 6.9, 'cargo_wt': 1383.8}]

I’ve taken a simple CSV file and added a structure to it in the above — with anonymized data, as you can tell. Of course, most people I think won’t bother to do this by hand, and instead read a file, and get it as a list with the same virtual structure in a few lines of code. But, I like seeing data structure beforehand — as a personal preference and where practically possible, to avoid overloading my mind with too many abstractions. In that regard, I find comfort in Rob Pike’s Rule 5 of programming:

Data dominates. If you’ve chosen the right data structures and organized things well, [then] the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

If desired, a dictionary like structure can easily be added to raw csv data using the code below, whose output would look much like the list shown above.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import csv

with open('motions_rawdata.csv', 'rU') as f:
    columns = ['asset', 'ves', 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt']
    for row in csv.DictReader(f, fieldnames=columns):
        print row,','

Using standard library

Now to get mean, standard deviation, and max. values for, say, roll angles from the above data set, here’s a hard way. In this method, I use map(), filter() and reduce() functions from standard library to get results (If you’re interested in Functional Programming, then Mary Rose Cook offers a wonderful primer at Recurse):

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from operator import add
from math import sqrt

# Import motion data
from motions_data import *

# Map and filter roll angles and turn them into a list
roll_angles_topside = map(lambda x: x['rangle'], filter(lambda x: 'rangle' in x, topside_assets))

if len(roll_angles_topside) > 0:
    average_roll_angle_topside = reduce(add, roll_angles_topside) / len(roll_angles_topside)
    vfn = map(lambda x: (x - average_roll_angle_topside)**2, roll_angles_topside)
    variance_roll_angle_topside = reduce(add, vfn) / len(roll_angles_topside)
    print "Mean (roll):", average_roll_angle_topside, "deg"
    print "Max (roll):", max(roll_angles_topside), "deg"
    print "SD (roll)", sqrt(variance_roll_angle_topside), "deg"

Here’s how it works:

  1. On line 10, map() takes an anonymous inline function (via lambda) to pick values corresponding to rangle (roll angle) and form a list subset from the larger motions_data list. The result of line 10 looks like this:

    [10.5, 13.7, ..., 15.3, 15.3]
    

    Notice from above that only values corresponding to roll angles (those with rangle labels) appear in this list — this is via x['rangle'], which would still have output like 'rangle': 10.5. Thereafter, filter() is further used to remove the remaining labels in the list, namely, all instances of rangle. The code on line 10, therefore, is essential in either the standard library method, or the numpy method that we’ll see later.

  2. With if len(roll_angles_topside) > 0:, overflow error is avoided, and we then want to progress further to calculate the average (or mean) of these roll angles. This is done with reduce(), which takes a function (add, an operator in this case) to sum all values in the list, roll_angles_topside. The sum is then divided by the number of values in the list to get mean roll angle.

  3. Determining variance is similar to finding mean, but just a little more complex, as it requires calculating the deviation of each value in the list, deduct it from the mean, square the result, and then convert these in to a list of deviations. This is done by the function, vfn:

    vfn = map(lambda x: (x - average_roll_angle_topside)**2, roll_angles_topside)
    

    The output of vfn looks like this:

    [35.02487603305781, 7.388512396694199, ..., 1.2503305785123868, 1.2503305785123868]
    

    Then as in step 2, we employ reduce() to get variance as below:

    variance_roll_angle_topside = reduce(add, vfn) / len(roll_angles_topside)
    
  4. Lastly, standard deviation is calculated as the square root of the variance, from step 3.

Using Numerical Python

With numpy, the whole code looks like this:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import numpy

# Import motion data
from motions_data import *

# Map and filter roll angles and turn them into a list
roll_angles_topside = map(lambda x: x['rangle'], filter(lambda x: 'rangle' in x, topside_assets))

if len(roll_angles_topside) > 0:
    print "Mean (roll):", numpy.mean(numpy.array(roll_angles_topside)), "deg"
    print "SD (roll):", numpy.std(numpy.array(roll_angles_topside)), "deg"
    print "Max (roll): ", max(roll_angles_topside), "deg"

All the functional acrobatics are performed invisibly by this magical library, and so, just by invoking numpy.mean() and numpy.std() and feeding an array, numpy.array(roll_angles_topside), to these functions, one can determine the mean and standard deviation in one simple call each. So, as one can see, it’s easy to get high on numpy. (Python 3.x has statistics module that caters specifically to these functions.)

Making the script generic

The above examples take a hardcoded motion parameter (rangle) and dataset (topside_assets), so to extend them to allow the user to choose a dataset as well as a motion parameter desired (without having the edit the script), I’ve modified to code below slightly — again for both the methods:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

from operator import add
from math import sqrt

# Import motion data
from motions_data import *

def main():
    print "Available motion datasets: topside_assets, jacket_assets, misc_assets"
    print "Available parameters: 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt'"
    dataset, param = input("Enter dataset, 'param' to use: ")

    params = map(lambda x: x[param], filter(lambda x: param in x, dataset))

    if len(params) > 0:
        mean = reduce(add, params) / len(params)
        vfn = map(lambda x: (x - mean)**2, params)
        variance = reduce(add, vfn) / len(params)
        print "Mean    :", mean
        print "Std.dev.:", sqrt(variance)
        print "Maximum :", max(params)
        print "Minimum :", min(params)
        pass

if __name__ == '__main__':
    main()

To do the same using numpy, the code is as follows:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import numpy

# Import motion data
from motions_data import *

def main():
    print "Available motion datasets: topside_assets, jacket_assets, misc_assets"
    print "Available parameters: 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt'"
    dataset, param = input("Enter dataset, 'param' to use: ")

    params = map(lambda x: x[param], filter(lambda x: param in x, dataset))

    if len(params) > 0:
        print "Mean    :", numpy.mean(numpy.array(params))
        print "Std.dev.:", numpy.std(numpy.array(params))
        print "Maximum :", max(params)
        print "Minimum :", min(params)
        pass

if __name__ == '__main__':
    main()

When run, the output with either method would be as illustrated below:

$ python motion-stats.py
Available motion datasets: topside_assets, jacket_assets, misc_assets
Available parameters: 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt'
Enter dataset, 'param' to use: topside_assets, 'pangle'
Mean    : 6.00454545455
Std.dev.: 1.56973572551
Maximum : 9.2
Minimum : 3.2

Zip (Jul 3, 15)

A couple of days ago, I was stuck with a problem calculating wave lengths: I had three lists (of equal lengths, or number of items) — generated from a tuple, which consisted of significant wave heights (Hs) & peak wave periods (Tp) — both corresponding to 1-month return period, and water depths (d).

# Make filtered lists of Hs, Tp, and d from the chosen dataset
hs_all = map(lambda x: x['hs'], filter(lambda x: 'hs' in x, dataset))
tp_all = map(lambda x: x['tp'], filter(lambda x: 'tp' in x, dataset))
d_all = map(lambda x: x['d'], filter(lambda x: 'd' in x, dataset))

Due to the need to perform depth criteria check — as shallow water non-linearity kicks-in for intermediate and shallow depths, I could not use the map() function. This is because the anonymous function, lambda(), is too simple for multiple conditional checks.

After scouting around built-in functions, I found zip(), and it did not disappoint. Before I proceed to show how I used it, here’s the depth criteria check for wave length calculations (Table 3.3, Deepwater criterion and wave length, Subrata K. Chakrabarti, Handbook of Offshore Engineering, Volume I, 2005):

Wave length table

To get this in code, I calculated wave lengths for all types first, and then mapped them into lists each, Ld_all, Li_all, Ls_all as below:

# Wave length (Deepwater)
if len(tp_all) > 0:
    Ld_all = map(lambda x: g * x**2 / (2 * pi), tp_all)

# Wave length (Intermediate)
if len(Ld_all) > 0:
    Li_all = map(lambda x, y: y * sqrt(tanh(2 * pi * x / y)), d_all, Ld_all)

# Wave length (Shallow)
if len(tp_all) > 0:
    Ls_all = map(lambda x, y: x * ((g * y)**0.5), tp_all, d_all)

Then, to select and populate the appropriate wave length based on criteria, I did the following:

# WAVE LENGTHS
# Populate wavelengths in to a list: L_all

# Start with an empty list L_all
L_all = []

# Call four separate lists using zip(), compare and 
# cherry pick conditionally, and then populate L_all
# list with the chosen ones.
for i, j, k, l in zip(d_all, Ld_all, Li_all, Ls_all):
    if ((i / j) >= 0.5):
        L = j
    if (0.05 < (i / k) < 0.5):
        L = k
    if ((i / l) <= 0.05):
        L = l
    L_all.append(L)

So, the task that I could not do in one line with map() function, I split it into creating an empty list (L_all = []) first, and then populate a criteria-matching wave length for each value in lists.

zip() is a function that lets me perform acrobatics across multiple tuples, and I like it a lot.