Recently, I received a document register consisting of with hundreds of line items. Given the structure, it wasn’t easy to sort items, say for a particular month, or about a certain asset-type. The raw data contained over 25 columns with all sorts of information, of which I required only four. So, instead of using grep, I figured I’d slice the raw data with Pandas first.
The demo below runs through the large register and gets me, say, a list of deliverables for April. The script (a) reads a csv file into memory, (b) sorts it by a specific column, (c) selects only the columns I need, (d) writes it all to a new csv file, and (e) also prints this list to screen.
Notice items occurring in the register — from the first column of line numbers. This indicates how deliverables to-be-issued (e.g. in April) are all over the place, but easy track them this way. Using pipe (
|) and grep with regular expressions at command line, I can do things to the list, e.g., filter stuff out to see only what I need to.
# filter for all reports python3 mdr.py | grep -i "report" # filter reports expected in, say, Apr 2022 (when date is in yyyy-mm-dd format) python3 mdr.py | grep "2022-04-" | grep -i "report" # get a list of expected deliverables for, say, Apr and May 2022 python3 mdr.py | grep -E "2022\-0[4-5]-" # get a list of expected deliverables for, say, CG, CS, CX tags python3 mdr.py | grep -E "\-C[G|S|X]-" # get a list of expected deliverables for, say, CG, CS, CX, and NZ tags python3 mdr.py | grep -E "\-C[G|S|X]-|\-NZ-" # get the above for say Q3 and Q4 of Year 2022 python3 mdr.py | grep -E "\-C[G|S|X]-|\-NZ-" | grep -E "2022\-0[6-9]|1[0-2]-"
And so on. Here’s the script for slicing the raw csv data.1
#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Query master document register for info mdr.py 2022 ckunte """ import pandas as pd def main(): # Read data df = pd.read_csv("./mdr.csv") # Limit "TITLE" column width df["TITLE"] = df["TITLE"].str[:30] # Sort data by a specific column df.sort_values(by="IFR FORECAST", inplace=True) # Select fewer columns to export df_sel = df[["TITLE", "NUMBER", "CAT", "IFR FORECAST"]] # Export data of selected columns to a csv file df_sel.to_csv("./mdr-ifr_fc.csv") # Print filtered data to screen print(df_sel.to_string()) pass if __name__ == "__main__": main()
The raw data I am trying to slice has these column labels, viz.,
CAT, etc., which is why they are used in the script. ↩