Slicing lists

Recently, I received a document register consisting of with hundreds of line items. Given the structure, it wasn’t easy to sort items, say for a particular month, or about a certain asset-type. The raw data contained over 25 columns with all sorts of information, of which I required only four. So, instead of using grep, I figured I’d slice the raw data with Pandas first.

The demo below runs through the large register and gets me, say, a list of deliverables for April. The script (a) reads a csv file into memory, (b) sorts it by a specific column, (c) selects only the columns I need, (d) writes it all to a new csv file, and (e) also prints this list to screen.

Slicing through a large list with Pandas and grep.

Notice items occurring in the register — from the first column of line numbers. This indicates how deliverables to-be-issued (e.g. in April) are all over the place, but easy track them this way. Using pipe (|) and grep with regular expressions at command line, I can do things to the list, e.g., filter stuff out to see only what I need to.

# filter for all reports
python3 | grep -i "report"

# filter reports expected in, say, Apr 2022 (when date is in yyyy-mm-dd format)
python3 | grep "2022-04-" | grep -i "report"

# get a list of expected deliverables for, say, Apr and May 2022
python3 | grep -E "2022\-0[4-5]-"

# get a list of expected deliverables for, say, CG, CS, CX tags
python3 | grep -E "\-C[G|S|X]-"

# get a list of expected deliverables for, say, CG, CS, CX, and NZ tags
python3 | grep -E "\-C[G|S|X]-|\-NZ-"

# get the above for say Q3 and Q4 of Year 2022
python3 | grep -E "\-C[G|S|X]-|\-NZ-" | grep -E "2022\-0[6-9]|1[0-2]-"

And so on. Here’s the script for slicing the raw csv data.1

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

Query master document register for info 2022 ckunte

import pandas as pd

def main():
    # Read data
    df = pd.read_csv("./mdr.csv")

    # Limit "TITLE" column width
    df["TITLE"] = df["TITLE"].str[:30]

    # Sort data by a specific column
    df.sort_values(by="IFR FORECAST", inplace=True)

    # Select fewer columns to export
    df_sel = df[["TITLE", "NUMBER", "CAT", "IFR FORECAST"]]

    # Export data of selected columns to a csv file

    # Print filtered data to screen

if __name__ == "__main__":

  1. The raw data I am trying to slice has these column labels, viz., TITLE, NUMBER, CAT, etc., which is why they are used in the script.