June 14, 2015
Libraries
When using libraries, I often find myself torn between the standard ones and importing a firehose like numpy. The former gives me the base upon which I build functions and get the task at hand done, which is often tedious, but offers a sense of accomplishment in the end. With the latter, all the heavy lifting is done for you by a powerful library. (There are, of course, occasions where I do not hesitate to use numpy, like generating plots, for example, where I am less interested in understanding graphics and more in getting results for further realization.) This, I suppose, rarely crosses a full-time programmer's or a data scientist's mind, who may be doing this all day with a key objective of writing error-free code quickly and painlessly.
For someone who programs occasionally, not as a primary work routine, I find developing muscle memory valuable, as it helps me reapply techniques across a variety of tasks. Failing often forces me to pay attention to details and seek clarity on control flow. I realized this recently when I was writing a block of code to process vessel motion data from sea-transported assets. The list (or array) looked like this (saved in a file named motions_data.py):
topside_assets = [{'asset': 'A1', 'ves': 'S1', 'rangle': 10.5, 'rperiod': 12.2, 'pangle': 5.7, 'pperiod': 12.2, 'heave_accl': 0.2, 'heave_period': 12.2, 'cargo_wt': 2231.1},
{'asset': 'A2', 'ves': 'P1', 'rangle': 13.7, 'rperiod': 12.2, 'pangle': 7.4, 'pperiod': 12.2, 'heave_accl': 0.2, 'heave_period': 12.2, 'cargo_wt': 2231.1},
... long list skipped for brevity ...
{'asset': 'B1', 'ves': 'M1', 'rangle': 15.3, 'rperiod': 6.9, 'pangle': 4.8, 'pperiod': 7.3, 'heave_accl': 0.2, 'heave_period': 6.9, 'cargo_wt': 1676.7},
{'asset': 'B2', 'ves': 'M1', 'rangle': 15.3, 'rperiod': 6.9, 'pangle': 4.8, 'pperiod': 7.3, 'heave_accl': 0.2, 'heave_period': 6.9, 'cargo_wt': 1383.8}]
I've taken a simple CSV file and added structure to it in the above, with anonymized data, as you can tell. Of course, most people, I think, won't bother to do this by hand; instead they'd read a file and get it as a list with the same structure in a few lines of code. But I like seeing the data structure beforehand, as a personal preference and where practically possible, to avoid overloading my mind with too many abstractions. In that regard, I find comfort in Rob Pike's Rule 5 of programming:
Data dominates. If you've chosen the right data structures and organized things well, [then] the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
If desired, a dictionary-like structure can easily be added to raw CSV data using the code below, whose output would look much like the list shown above.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv

with open('motions_rawdata.csv', 'rU') as f:
    columns = ['asset', 'ves', 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt']
    for row in csv.DictReader(f, fieldnames=columns):
        print row, ','
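One thing worth noting: csv.DictReader returns every field as a string, so to get numbers like the ones in motions_data.py, the numeric columns need an explicit conversion. Here is a minimal sketch in Python 3 syntax; the read_motions helper and the in-memory sample row are my own illustrative assumptions, not part of the original script:

```python
import csv
import io

# Same columns as in the motions data; all but the first two are numeric.
columns = ['asset', 'ves', 'rangle', 'rperiod', 'pangle', 'pperiod',
           'heave_accl', 'heave_period', 'cargo_wt']
numeric = columns[2:]

def read_motions(f):
    """Read rows from a CSV file object, converting numeric fields to float."""
    rows = []
    for row in csv.DictReader(f, fieldnames=columns):
        for key in numeric:
            row[key] = float(row[key])
        rows.append(row)
    return rows

# An in-memory file standing in for motions_rawdata.csv
sample = "A1,S1,10.5,12.2,5.7,12.2,0.2,12.2,2231.1\n"
assets = read_motions(io.StringIO(sample))
print(assets[0])
```

Without the float() conversion, statistics on the columns would fail (or silently concatenate strings), which is easy to miss when eyeballing the printed rows.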
Using the standard library
Now, to get the mean, standard deviation, and maximum values for, say, roll angles from the above data set, here's a hard way. In this method, I use the map(), filter() and reduce() functions from the standard library to get results. (If you're interested in functional programming, Mary Rose Cook offers a wonderful primer at Recurse.)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from operator import add
from math import sqrt

# Import motion data
from motions_data import *

# Filter and map roll angles and turn them into a list
roll_angles_topside = map(lambda x: x['rangle'], filter(lambda x: 'rangle' in x, topside_assets))

if len(roll_angles_topside) > 0:
    average_roll_angle_topside = reduce(add, roll_angles_topside) / len(roll_angles_topside)
    vfn = map(lambda x: (x - average_roll_angle_topside)**2, roll_angles_topside)
    variance_roll_angle_topside = reduce(add, vfn) / len(roll_angles_topside)
    print "Mean (roll):", average_roll_angle_topside, "deg"
    print "Max (roll):", max(roll_angles_topside), "deg"
    print "SD (roll):", sqrt(variance_roll_angle_topside), "deg"
Here's how it works:

1. The line that builds roll_angles_topside does the selection. filter() runs first, keeping only those dictionaries that actually contain an rangle (roll angle) key, via the test 'rangle' in x. map() then takes an anonymous inline function (via lambda) to pick the bare values with x['rangle'], so no labels remain, and forms a subset list from the larger motions_data list. The result looks like this: [10.5, 13.7, ..., 15.3, 15.3]. This line is essential in both the standard library method and the numpy method that we'll see later.
2. With if len(roll_angles_topside) > 0:, a division-by-zero error is avoided, and we then progress to calculate the average (or mean) of these roll angles. This is done with reduce(), which takes a function (add, an operator in this case) to sum all values in the list roll_angles_topside. The sum is then divided by the number of values in the list to get the mean roll angle.
3. Determining the variance is similar to finding the mean, just a little more involved: subtract the mean from each value in the list, square the result, and collect these squared deviations into a new list. This is done by vfn: vfn = map(lambda x: (x - average_roll_angle_topside)**2, roll_angles_topside). The output of vfn looks like this: [35.02487603305781, 7.388512396694199, ..., 1.2503305785123868, 1.2503305785123868]. Then, as in step 2, we employ reduce() to get the variance: variance_roll_angle_topside = reduce(add, vfn) / len(roll_angles_topside)
4. Lastly, the standard deviation is calculated as the square root of the variance from step 3.
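The four steps above can also be written with list comprehensions and the built-in sum(), which many find easier to read than map()/reduce(). A small sketch in Python 3 syntax, using a short hypothetical list of roll angles in place of the full dataset:

```python
from math import sqrt

# Hypothetical roll angles standing in for roll_angles_topside
roll_angles = [10.5, 13.7, 15.3, 15.3]

if roll_angles:  # an empty list guards against dividing by zero
    mean = sum(roll_angles) / len(roll_angles)
    # Squared deviations, playing the role of the vfn list above
    deviations = [(x - mean) ** 2 for x in roll_angles]
    variance = sum(deviations) / len(deviations)
    sd = sqrt(variance)
    print("Mean (roll):", mean, "deg")
    print("SD (roll):", sd, "deg")
    print("Max (roll):", max(roll_angles), "deg")
```

Same arithmetic, but each step names its intermediate result, so the mean-deviation-variance chain reads top to bottom.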
Using Numerical Python
With numpy, the whole code looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy

# Import motion data
from motions_data import *

# Filter and map roll angles and turn them into a list
roll_angles_topside = map(lambda x: x['rangle'], filter(lambda x: 'rangle' in x, topside_assets))

if len(roll_angles_topside) > 0:
    print "Mean (roll):", numpy.mean(numpy.array(roll_angles_topside)), "deg"
    print "SD (roll):", numpy.std(numpy.array(roll_angles_topside)), "deg"
    print "Max (roll):", max(roll_angles_topside), "deg"
All the functional acrobatics are performed invisibly by this magical library: just by invoking numpy.mean() and numpy.std() and feeding each an array, numpy.array(roll_angles_topside), one can determine the mean and standard deviation in one simple call each. So, as one can see, it's easy to get high on numpy. (Python 3.4 and later ship a statistics module that caters specifically to these functions.)
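For completeness, here is what that statistics module route looks like, again with a short hypothetical list of roll angles. One detail to watch: statistics.pstdev is the population standard deviation and matches numpy.std's default, while statistics.stdev is the sample standard deviation and would give a slightly larger number:

```python
import statistics

# Hypothetical roll angles standing in for roll_angles_topside
roll_angles = [10.5, 13.7, 15.3, 15.3]

print("Mean (roll):", statistics.mean(roll_angles), "deg")
# pstdev: population standard deviation, matching numpy.std's default
print("SD (roll):", statistics.pstdev(roll_angles), "deg")
print("Max (roll):", max(roll_angles), "deg")
```

No third-party dependency, and no need to wrap the list in an array first.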
Making the script generic
The above examples take a hard-coded motion parameter (rangle) and dataset (topside_assets). To extend them so the user can choose a dataset as well as a desired motion parameter (without having to edit the script), I've modified the code slightly, again for both methods:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from operator import add
from math import sqrt

# Import motion data
from motions_data import *

def main():
    print "Available motion datasets: topside_assets, jacket_assets, misc_assets"
    print "Available parameters: 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt'"
    dataset, param = input("Enter dataset, 'param' to use: ")
    params = map(lambda x: x[param], filter(lambda x: param in x, dataset))
    if len(params) > 0:
        mean = reduce(add, params) / len(params)
        vfn = map(lambda x: (x - mean)**2, params)
        variance = reduce(add, vfn) / len(params)
        print "Mean :", mean
        print "Std.dev.:", sqrt(variance)
        print "Maximum :", max(params)
        print "Minimum :", min(params)

if __name__ == '__main__':
    main()
To do the same using numpy, the code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy

# Import motion data
from motions_data import *

def main():
    print "Available motion datasets: topside_assets, jacket_assets, misc_assets"
    print "Available parameters: 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt'"
    dataset, param = input("Enter dataset, 'param' to use: ")
    params = map(lambda x: x[param], filter(lambda x: param in x, dataset))
    if len(params) > 0:
        print "Mean :", numpy.mean(numpy.array(params))
        print "Std.dev.:", numpy.std(numpy.array(params))
        print "Maximum :", max(params)
        print "Minimum :", min(params)

if __name__ == '__main__':
    main()
When run, the output with either method would be as illustrated below:
$ python motionstats.py
Available motion datasets: topside_assets, jacket_assets, misc_assets
Available parameters: 'rangle', 'rperiod', 'pangle', 'pperiod', 'heave_accl', 'heave_period', 'cargo_wt'
Enter dataset, 'param' to use: topside_assets, 'pangle'
Mean : 6.00454545455
Std.dev.: 1.56973572551
Maximum : 9.2
Minimum : 3.2
Zip (Jul 3, 15)
A couple of days ago, I was stuck on a problem calculating wave lengths: I had three lists of equal length, generated from a tuple, consisting of significant wave heights (Hs) and peak wave periods (Tp), both corresponding to a 1-month return period, and water depths (d).
# Make filtered lists of Hs, Tp, and d from the chosen dataset
hs_all = map(lambda x: x['hs'], filter(lambda x: 'hs' in x, dataset))
tp_all = map(lambda x: x['tp'], filter(lambda x: 'tp' in x, dataset))
d_all = map(lambda x: x['d'], filter(lambda x: 'd' in x, dataset))
Due to the need to perform a depth criteria check, as shallow-water non-linearity kicks in for intermediate and shallow depths, I could not use the map() function alone. This is because the anonymous function, lambda, is restricted to a single expression and so is too simple for multiple conditional checks.
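A named function lifts that restriction: it can hold a full if/elif/else chain and still be passed to map() (or used in a comprehension). A small sketch with a hypothetical classify_depth helper; the thresholds mirror the depth ratios used in the wave length code further below:

```python
def classify_depth(d, L):
    """Classify water depth by the d/L ratio (thresholds as in the
    deepwater criterion: >= 0.5 deep, > 0.05 intermediate, else shallow)."""
    ratio = d / L
    if ratio >= 0.5:
        return 'deep'
    elif ratio > 0.05:
        return 'intermediate'
    else:
        return 'shallow'

# Hypothetical depths and wave lengths
depths = [150.0, 40.0, 3.0]
lengths = [200.0, 200.0, 200.0]

# map() happily takes a named function and several iterables at once
labels = list(map(classify_depth, depths, lengths))
print(labels)
```

So the limitation is really lambda's, not map()'s; a def gives map() all the branching it needs.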
After scouting around built-in functions, I found zip(), and it did not disappoint. Before I proceed to show how I used it, here's the depth criteria check for wave length calculations (Table 3.3, Deepwater criterion and wave length, Subrata K. Chakrabarti, Handbook of Offshore Engineering, Volume I, 2005): deepwater when d/L >= 0.5, with L = gT^2 / 2π; intermediate depth when 0.05 < d/L < 0.5, with L = Ld * sqrt(tanh(2πd / Ld)); and shallow when d/L <= 0.05, with L = T * sqrt(gd).
To get this in code, I calculated wave lengths for all three types first, mapping them into the lists Ld_all, Li_all, and Ls_all as below:
# Wave length (Deepwater)
if len(tp_all) > 0:
    Ld_all = map(lambda x: g * x**2 / (2 * pi), tp_all)

# Wave length (Intermediate)
if len(Ld_all) > 0:
    Li_all = map(lambda x, y: y * sqrt(tanh(2 * pi * x / y)), d_all, Ld_all)

# Wave length (Shallow)
if len(tp_all) > 0:
    Ls_all = map(lambda x, y: x * ((g * y)**0.5), tp_all, d_all)
Then, to select and populate the appropriate wave length based on criteria, I did the following:
# WAVE LENGTHS
# Populate wave lengths into a list: L_all
# Start with an empty list L_all
L_all = []
# Zip the four lists together, compare and cherry-pick
# conditionally, and then populate the L_all list with
# the chosen ones.
for i, j, k, l in zip(d_all, Ld_all, Li_all, Ls_all):
    if (i / j) >= 0.5:
        L = j
    if 0.05 < (i / k) < 0.5:
        L = k
    if (i / l) <= 0.05:
        L = l
    L_all.append(L)
So, the task that I could not do in one line with the map() function, I split into two: creating an empty list (L_all = []) first, and then populating it with the criteria-matching wave length for each row of values. zip() is a function that lets me perform acrobatics across multiple tuples, and I like it a lot.
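With a named helper, the zip() loop can even collapse back into a single comprehension. A sketch with hypothetical per-asset values; pick_wavelength is my own illustrative function (note that, unlike the loop above, it falls back to the shallow-water length when neither of the first two criteria matches, so it always returns something):

```python
def pick_wavelength(d, Ld, Li, Ls):
    """Pick the wave length whose depth criterion matches (hypothetical
    helper mirroring the zip() loop; falls back to shallow water)."""
    if d / Ld >= 0.5:
        return Ld
    if 0.05 < d / Li < 0.5:
        return Li
    return Ls

# Hypothetical per-asset values: depths and the three candidate lengths
d_all  = [500.0, 50.0]
Ld_all = [234.0, 234.0]
Li_all = [230.0, 210.0]
Ls_all = [220.0, 200.0]

# zip() pairs the four lists row by row; the helper does the branching
L_all = [pick_wavelength(*row) for row in zip(d_all, Ld_all, Li_all, Ls_all)]
print(L_all)
```

The empty-list-plus-append pattern and the comprehension are equivalent here; the comprehension just hides the bookkeeping.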