Command Line Provenance

Introduction

cmdline_provenance is a Python package for keeping track of your data processing steps.

It was inspired by the popular NCO and CDO command line tools, which are used to manipulate the data and/or metadata contained in netCDF files (a self-describing file format that is popular in the weather and climate sciences). These tools generate a record of what was executed at the command line, append that record to the history attribute from the input data file, and then set this command log as the history attribute of the output data file.

For example, after using NCO and CDO to select the 2001-2005 time period from a rainfall data file and then delete the long_name file attribute, the command log would look as follows:

Fri Dec 08 10:05:47 2017: ncatted -O -a long_name,pr,d,, rainfall_data_2001-2005.nc
Fri Dec 01 07:59:16 2017: cdo seldate,2001-01-01,2005-12-31 rainfall_data_1850-2005.nc rainfall_data_2001-2005.nc

Following this simple approach to data provenance, it is possible maintain a record of all data processing steps from intial download/creation of the data files to the end result (e.g. a .png image).

cmdline_provenance contains a function for generating command logs in the NCO/CDO format, in addition to simple functions for reading and writing log files.

Installation

$ pip install cmdline-provenance

or

$ conda install cmdline_provenance

Usage

creating a log

Let’s say we have a script ocean_analysis.py, which takes two files as input (an ocean temperature and ocean salinity file) and outputs a single file. These input files could be CSV (.csv) or netCDF (.nc) or some other format, and the output could be another data file (e.g. .csv or .nc) or an image (e.g. .png, .eps). For the sake of example, we will simply use an unspecified format (.fmt) for now.

The script can be run at the command line as follows:

$ python ocean_analysis.py temperature_data.fmt salinity_data.fmt output.fmt

To create a new log that captures this command line entry, we can add a couple of lines to our script to make use of the new_log function:

>>> import cmdline_provenance as cmdprov
>>> my_log = cmdprov.new_log()
>>> print(my_log)
Tue Jun 26 11:24:46 2018: /Users/username/opt/anaconda3/bin/python ocean_analysis.py temperature_data.fmt salinity_data.fmt output.fmt

If our script is available via an online code hosting platform (e.g. GitHub, Bitbucket) or has been published with a DOI, we can include the relevant URL in the command log:

>>> my_log = cmdprov.new_log(code_url='https://doi.org/10.6084/m9.figshare.12143562.v3')
>>> print(my_log)
Tue Jun 26 11:24:46 2018: /Users/username/opt/anaconda3/bin/python ocean_analysis.py temperature_data.fmt salinity_data.fmt output.fmt (https://doi.org/10.6084/m9.figshare.12143562.v3)

If the code you’ve written to produce a given output file is in a Jupyter Notebook rather than a command line script, the new_log function will recognise it’s being executed in a notebook and adjust the log accordingly:

In [1]:  my_log = cmdprov.new_log()
         print(my_log)

Out [1]: Tue Jun 26 11:24:46 2018: /Users/username/opt/anaconda3/bin/jupyter notebook ocean_analysis.ipynb

writing a log to file

If our output file is a self-describing file format (i.e. a format that carries its metadata with it), we can include our new log in the file metadata. For instance, a common convention in weather and climate science is to include the command log in the global history attribute of netCDF data files. If we were using the iris or xarray libraries (for instance) to read and write netCDF files, the process would look something like this:

>>> import iris
>>> import xarray as xr
>>> import cmdline_provenance as cmdprov
>>> my_log = cmdprov.new_log()
...

>>> type(cube)
iris.cube.Cube
>>> cube.attributes['history'] = my_log
>>> iris.save(cube, 'output.nc')

>>> type(ds)
xarray.core.dataset.Dataset
>>> ds.attrs['history'] = my_log
>>> ds.to_netcdf('output.nc')

Similarly, if our output file was a .png file created using matplotlib, we could append the command log to the image metadata:

>>> import matplotlib.pyplot as plt
>>> my_log = cmdprov.new_log()
...
>>> plt.savefig('output.png', metadata={'History': my_log})

The PyAOS Data Carpentry lesson on data provenance covers writing metadata to different image formats in more detail.

If our output file is not a self-describing format (e.g. output.csv), then we can write a separate log file (i.e. a simple text file with the log in it) using the write_log function.

>>> cmdprov.write_log('output.log', my_log)

While it’s not a formal requirement of the write_log function, it’s good practice to make the name of the log file exactly the same as the name of the output file, just with a different file extension (such as .log or .txt).

including logs from input files

In order to capture the complete provenance of the output file, we need to include the command logs from the input files in our new log. We can do this using the infile_logs keyword argument associated with the new_log function.

If our input files are a self-describing format, then similar to the iris example above, we can extract the input file logs from the metadata of the input files:

>>> inlogs = {}
>>> inlogs['temperature_data.nc'] = temperature_cube.attributes['history']
>>> inlogs['salinity_data.nc'] = salinity_cube.attributes['history']
>>> my_log = cmdprov.new_log(infile_logs=inlogs)
>>> print(my_log)
Tue Jun 26 11:24:46 2018: /Users/username/opt/anaconda3/bin/python ocean_analysis.py temperature_data.nc salinity_data.nc output.nc
History of temperature_data.nc:
Tue Jun 26 09:24:03 2018: cdo daymean temperature_data.nc
History of salinity_data.nc:
Tue Jun 26 09:22:10 2018: cdo daymean salinity_data.nc
Tue Jun 26 09:21:54 2018: ncatted -O -a standard_name,so,o,c,"ocean_salinity" salinity_data.nc

If the input files aren’t self-describing, we can use the read_log function to read the log files associated with the input data files (these logs files may have been previously written using the write_log function):

>>> inlogs = {}
>>> inlogs['temperature_data.csv'] = cmdprov.read_log('temperature_data.log')
>>> inlogs['salinity_data.csv'] = cmdprov.read_log('salinity_data.log')
>>> my_log = cmdprov.new_log(infile_logs=inlogs)

For scripts that take many input files, the resulting log files can become very long and unwieldy. To help prevent this, think about ways to avoid repetition. For instance, if you’ve got one input file that contains data from the year 1999-2003 and another equivalent file with data from 2004-2008, it’s probably only necessary to include the log from the 1999-2003 file (i.e. because essentially identical data processing steps were taken to produce both files).