Kicking off analysis pipeline

Danying Shao edited this page Feb 17, 2020 · 1 revision

The sequencer data repository hosts the data generated by the sequencers. It can run on a server separate from both the sequencer and PEGR. PEGR communicates with the repository over SSH using an SSH key file (see https://github.com/seqcode/pegr/wiki/Deployment). The connection settings need to be included in the configuration file (see https://github.com/seqcode/pegr/wiki/Configurations). Within the repository, each sequence run is contained in its own folder. Folder names start with the date, so they sort in chronological order.

PEGR has two settings in the table "Chores" that are used in managing the sequence run analysis: "RunsInQueue" and "PriorRunFolder". After a sequence run is submitted on PEGR, PEGR will append the sequence run ID to "RunsInQueue" (first in, first out). An empty "RunsInQueue" means that there are no sequence runs waiting to be analyzed.
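The first-in, first-out behavior of "RunsInQueue" can be illustrated with a minimal Python sketch. This is not PEGR's code (PEGR is a Grails application): the in-memory deque and the helper names `submit_run`, `has_waiting_runs`, and `peek_next_run` are hypothetical stand-ins for however the setting is actually stored in the "Chores" table.

```python
from collections import deque

# Hypothetical in-memory stand-in for the "RunsInQueue" setting.
runs_in_queue = deque()


def submit_run(run_id):
    """Append a newly submitted sequence run ID to the end of the queue."""
    runs_in_queue.append(run_id)


def has_waiting_runs():
    """An empty queue means no sequence runs are waiting to be analyzed."""
    return len(runs_in_queue) > 0


def peek_next_run():
    """Return the oldest waiting run ID without removing it, or None."""
    return runs_in_queue[0] if runs_in_queue else None
```

Runs leave the queue in the order in which they were submitted, so the run at the front is always the one that has waited longest.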

A cron job is set up to check the sequencer data repository every 15 minutes. It is configured in the file pegr/grails-app/jobs/pegr/WalleJob.groovy and calls WallService. Note that all Quartz jobs can be enabled or disabled per environment in the file pegr/grails-app/conf/QuartzConfig.groovy.

The cron job first checks whether any sequence runs are waiting in the queue and whether the "cegr_run_info.txt" file from the previous run has been removed from the sequencer data repository. If both conditions hold, it finds the next sequence run's folder and matches it with the first sequence run in the waiting queue. It then constructs the file cegr_run_info.txt and the folder cegr_config, and moves them to the sequencer data repository. PEGR then removes this sequence run's ID from the queue. The external data processing pipeline (e.g. Galaxy) removes "cegr_run_info.txt" from the sequencer data repository after analysis of the run is completed.
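One pass of this check can be sketched in Python as follows. This is an illustration under several assumptions, not the logic in WalleJob.groovy: it reads a local directory instead of going over SSH, it assumes cegr_run_info.txt and cegr_config are placed inside the run's own folder, it treats "PriorRunFolder" as the name of the last folder that was matched, and the file content it writes is a placeholder. The function names `next_run_folder` and `poll_once` are hypothetical.

```python
import os

RUN_INFO = "cegr_run_info.txt"
CONFIG_DIR = "cegr_config"


def next_run_folder(repo_root, prior_run_folder):
    """Return the first folder name that sorts after prior_run_folder, or None.

    Relies on folder names starting with the date, so that sorting by name
    is the same as sorting chronologically.
    """
    folders = sorted(
        name for name in os.listdir(repo_root)
        if os.path.isdir(os.path.join(repo_root, name))
    )
    for name in folders:
        if prior_run_folder is None or name > prior_run_folder:
            return name
    return None


def poll_once(repo_root, queue, prior_run_folder):
    """Run one polling pass and return the (possibly updated) prior run folder.

    `queue` is a list of waiting run IDs, oldest first.
    """
    if not queue:
        return prior_run_folder  # no sequence run is waiting

    if prior_run_folder is not None:
        marker = os.path.join(repo_root, prior_run_folder, RUN_INFO)
        if os.path.exists(marker):
            return prior_run_folder  # previous run is still being analyzed

    folder = next_run_folder(repo_root, prior_run_folder)
    if folder is None:
        return prior_run_folder  # the next run folder has not appeared yet

    run_id = queue[0]
    # Placeholder content: the real file format is not described here.
    with open(os.path.join(repo_root, folder, RUN_INFO), "w") as handle:
        handle.write(f"run_id={run_id}\n")
    os.makedirs(os.path.join(repo_root, folder, CONFIG_DIR), exist_ok=True)

    queue.pop(0)  # remove the run from the queue only after the files exist
    return folder
```

Calling `poll_once` repeatedly, for example every 15 minutes, advances through the waiting runs one at a time: a new run is matched only after the pipeline has removed the previous run's cegr_run_info.txt.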