Slice lists
Recently, I received a document register consisting of hundreds of line items. Given the structure, it wasn’t easy to sort items, say for a particular month, or about a certain asset-type. The raw data contained over 25 columns with all sorts of information, of which I required only four. So, instead of using grep, I figured I’d slice the raw data with Pandas first.
The demo below runs through the large register and gets me, say, a list of deliverables for April. The script (a) reads a csv file into memory, (b) sorts it by a specific column, (c) selects only the columns I need, (d) writes it all to a new csv file, and (e) also prints this list to screen.
Notice items occurring in the register — from the first column of line numbers. This indicates how deliverables to-be-issued (e.g. in April) are all over the place, but easy track them this way. Using pipe (|) and grep with regular expressions at command line, I can do things to the list, e.g., filter stuff out to see only what I need to.
# filter for all reports
python3 mdr.py | grep -i "report"
# filter reports expected in, say, Apr 2022 (when date is in yyyy-mm-dd format)
python3 mdr.py | grep "2022-04-" | grep -i "report"
# get a list of expected deliverables for, say, Apr and May 2022
python3 mdr.py | grep -E "2022\-0[4-5]-"
# get a list of expected deliverables for, say, CG, CS, CX tags
python3 mdr.py | grep -E "\-C[G|S|X]-"
# get a list of expected deliverables for, say, CG, CS, CX, and NZ tags
python3 mdr.py | grep -E "\-C[G|S|X]-|\-NZ-"
# get the above for say Q3 and Q4 of Year 2022
python3 mdr.py | grep -E "\-C[G|S|X]-|\-NZ-" | grep -E "2022\-0[6-9]|1[0-2]-"
And so on. Here’s the script for slicing the raw CSV data.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Query master document register for info
mdr.py 2022 ckunte
"""
import pandas as pd
def main():
# Read data
df = pd.read_csv("./mdr.csv")
# Limit "TITLE" column width
df["TITLE"] = df["TITLE"].str[:30]
# Sort data by a specific column
df.sort_values(by="IFR FORECAST", inplace=True)
# Select fewer columns to export
df_sel = df[["TITLE", "NUMBER", "CAT", "IFR FORECAST"]]
# Export data of selected columns to a csv file
df_sel.to_csv("./mdr-ifr_fc.csv")
# Print filtered data to screen
print(df_sel.to_string())
pass
if __name__ == "__main__":
main()
The raw data I am trying to slice has these column labels, viz., TITLE
, NUMBER
, CAT
, etc., which is why they are used in the script.