Implementation Details for the "Overwrite" Option

Marcus Fedarko edited this page Sep 9, 2019 · 5 revisions

Background: which auxiliary files this applies to

Note that, by default, the preprocessing script should not generate any auxiliary files: the only output should be a .db file created using Python's sqlite3 module. However, passing certain options to the preprocessing script (-pg, -px, -spqr, -nbdf, -npdf, -sp, etc.) can cause extra ("auxiliary") files to be generated during a run of the script.

We create certain auxiliary files directly from the Python code in the preprocessing script: these include *.gv, *.xdot, *_links, and *_single_links files, as well as the sp_[bubbles|chains|etc].txt files generated by -sp. These auxiliary files are created using Python's os.open() function with the O_EXCL flag set, so on modern computer systems creating these files shouldn't overwrite existing files if -w is not passed.

However, this protection doesn't necessarily extend to auxiliary files generated outside of the Python code (e.g. spqrD.gml or component_D.info files). Those write operations are technically vulnerable to the race condition described under "Details" below, although it's an admittedly uncommon one.

Details

When we call check_file_existence() before creating a new auxiliary file from within save_aux_file() in the preprocessing script, a user or a process could get around this check by creating a file or directory at the checked filepath after check_file_existence() returns but before we start writing to that filepath. This could result in data loss for whoever owns the newly created file/directory, or it could cause this script to fail with an error. In either case, it's not a desirable situation (although it is an uncommon one).
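The window between the check and the write can be illustrated with a small sketch. This is not the script's actual code: the check_file_existence() body and signature here are a hypothetical stand-in, unsafe_save() and its "between" hook are invented for the demonstration, and the hook simply simulates another process creating the file at the worst possible moment.

```python
import os
import tempfile


def check_file_existence(filepath, overwrite):
    """Hypothetical stand-in: raise if filepath exists and overwriting is off."""
    if os.path.exists(filepath) and not overwrite:
        raise IOError(filepath + " already exists and -w was not passed")


def unsafe_save(filepath, text, overwrite=False, between=None):
    """Check-then-write pattern; 'between' simulates a racing process."""
    check_file_existence(filepath, overwrite)
    # The race window: anything can happen to filepath between the check
    # above and the open() call below.
    if between is not None:
        between()
    # A plain open(..., "w") truncates whatever now exists at filepath.
    with open(filepath, "w") as f:
        f.write(text)


demo_dir = tempfile.mkdtemp()
victim = os.path.join(demo_dir, "graph.gv")


def other_process():
    # Simulates someone else creating the file after our check passed.
    with open(victim, "w") as f:
        f.write("someone else's data")


# The check passes (the file does not exist yet), but the write still
# clobbers the file that appeared in the meantime.
unsafe_save(victim, "our data", between=other_process)
with open(victim) as f:
    clobbered = f.read()
```

Here the existence check succeeds, yet the other party's file is silently overwritten, which is exactly the data-loss case described above.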

We avoid this by creating files in save_aux_file() using os.fdopen() wrapped around os.open(), with certain flags set based on whether or not the user passed -w. (save_aux_file() is the one place where MetagenomeScope's preprocessing script directly writes to a file; all other file creation operations are done by other processes, e.g. the SPQR script or pysqlite.) This approach allows us to guarantee that an error will be thrown, and that no data will be erroneously written, if the aforementioned race condition happens.
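A minimal sketch of this pattern follows. The function name save_aux_file_sketch(), its parameters, the exact flag combination, and the 0o644 permission mode are assumptions made for illustration; the real save_aux_file() may differ in its details.

```python
import errno
import os
import tempfile


def save_aux_file_sketch(filepath, text, overwrite=False):
    """Hypothetical sketch: create filepath atomically unless overwrite is set."""
    flags = os.O_CREAT | os.O_WRONLY
    if overwrite:
        # Equivalent of -w being passed: replace any existing contents.
        flags |= os.O_TRUNC
    else:
        # O_EXCL together with O_CREAT makes the existence check and the
        # creation a single atomic operation, so there is no race window.
        flags |= os.O_EXCL
    fd = os.open(filepath, flags, 0o644)
    # Wrap the raw file descriptor in a regular Python file object.
    with os.fdopen(fd, "w") as f:
        f.write(text)


demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "graph.gv")

save_aux_file_sketch(path, "first")

# A second write without overwrite must fail instead of clobbering the file.
refused = False
try:
    save_aux_file_sketch(path, "second")
except OSError as e:
    if e.errno != errno.EEXIST:
        raise
    refused = True

with open(path) as f:
    after_refusal = f.read()

# With overwrite=True the existing file is replaced.
save_aux_file_sketch(path, "third", overwrite=True)
with open(path) as f:
    final = f.read()
```

Because the operating system performs the "does it exist?" test as part of the open call itself, a file that appears at the last moment causes an error (EEXIST) rather than being overwritten.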

(Note that, for NFS, this approach only works "...when using NFSv3 or later on kernel 2.6 or later," according to the open(2) man page as of June 8, 2018. That being said, NFSv3 dates back to June 1995 and the Linux kernel v2.6 dates back to December 2003, so most modern systems shouldn't encounter this race condition.)

The use of os.open() in conjunction with the os.O_EXCL flag in order to prevent the race condition, as well as the background information for this writeup, is based on Adam Dinwoodie (username me_and)'s answer to this Stack Overflow question.