4 Jun 2022

Slice lists

Recently, I received a document register consisting of hundreds of line items. Given the structure, it wasn’t easy to sort items, say for a particular month, or about a certain asset-type. The raw data contained over 25 columns with all sorts of information, of which I required only four. So, instead of using grep, I figured I’d slice the raw data with Pandas first.

The demo below runs through the large register and gets me, say, a list of deliverables for April. The script (a) reads a csv file into memory, (b) sorts it by a specific column, (c) selects only the columns I need, (d) writes it all to a new csv file, and (e) also prints this list to screen.

Slice through a large list with Pandas and grep.

Notice items occurring in the register — from the first column of line numbers. This indicates how deliverables to-be-issued (e.g. in April) are all over the place, but easy track them this way. Using pipe (|) and grep with regular expressions at command line, I can do things to the list, e.g., filter stuff out to see only what I need to.

# filter for all reports
python3 mdr.py | grep -i "report"

# filter reports expected in, say, Apr 2022 (when date is in yyyy-mm-dd format)
python3 mdr.py | grep "2022-04-" | grep -i "report"

# get a list of expected deliverables for, say, Apr and May 2022
python3 mdr.py | grep -E "2022\-0[4-5]-"

# get a list of expected deliverables for, say, CG, CS, CX tags
python3 mdr.py | grep -E "\-C[G|S|X]-"

# get a list of expected deliverables for, say, CG, CS, CX, and NZ tags
python3 mdr.py | grep -E "\-C[G|S|X]-|\-NZ-"

# get the above for say Q3 and Q4 of Year 2022
python3 mdr.py | grep -E "\-C[G|S|X]-|\-NZ-" | grep -E "2022\-0[6-9]|1[0-2]-"

And so on. Here’s the script for slicing the raw CSV data.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Query master document register for info
mdr.py 2022 ckunte

"""
import pandas as pd


def main():
    # Read data
    df = pd.read_csv("./mdr.csv")

    # Limit "TITLE" column width
    df["TITLE"] = df["TITLE"].str[:30]

    # Sort data by a specific column
    df.sort_values(by="IFR FORECAST", inplace=True)

    # Select fewer columns to export
    df_sel = df[["TITLE", "NUMBER", "CAT", "IFR FORECAST"]]

    # Export data of selected columns to a csv file
    df_sel.to_csv("./mdr-ifr_fc.csv")

    # Print filtered data to screen
    print(df_sel.to_string())
    pass


if __name__ == "__main__":
    main()

The raw data I am trying to slice has these column labels, viz., TITLE, NUMBER, CAT, etc., which is why they are used in the script.