- uses the R packages aroma.affymetrix and DNAcopy
- starts from .CEL files
- outputs total copy number (CN) probe and segment files, allele-specific CN files, B allele frequency (BAF) segment files, and the LOH calculation with its segment files
- for single samples, plots the genome-wide CN landscape and can zoom in on a specific region
- for multiple samples, plots aggregated CN segments by frequency (strip plot) and a heatmap of the CN landscape sorted by hierarchical clustering
perl arrayplotter.pl -in test_file/GSM325151
perl arrayplotter.pl -in test_file/GSM412388
perl multiple_segment_plot.pl -f test_file/multiple_segment_test.tsv -sf test_file/sample_types.tsv -genome hg19
##################################
- extract probe values from .CEL files
- segment with DNAcopy
- clean up segments, adjust baseline
- make plots
- extract metadata from GEOmeta text files
- insert metadata into db
- insert probe and segment data into db
- update db if segments are re-processed
- update dbstats (not tested)
An example directory structure for one series with 2 array experiments, where array1 has been processed with the steps `probe`, `segment` and `reseg`, with results written into the `processed` directory.
working_dir/
├── PlatformInfo
├── ReferenceFile
├── annotationData
│   └── chipTypes
│       └── CytoScanHD_Array
│           └── CytoScanHD_Array.cdf
├── plmData
├── probeData
├── processed
│   └── series1
│       ├── array1
│       │   ├── probes,cn.tsv
│       │   ├── segments,cn,provenance.tsv
│       │   └── segments,cn.tsv
│       └── array2
└── rawData
    └── series1
        └── CytoScanHD_array
            ├── array1.CEL
            └── array2.CEL
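
To make the relationship between steps and files concrete, here is a minimal Python sketch that reports which steps appear to have run for each array of a series. It assumes that the presence of an output file is enough to infer a completed step, which is read off the example tree rather than taken from the pipeline itself. `STEP_MARKERS` and `completed_steps` are hypothetical names.

```python
from pathlib import Path

# File whose presence is taken as evidence that a step has run.
# This mapping is an assumption based on the example tree above.
STEP_MARKERS = {
    "probe": "probes,cn.tsv",
    "segment": "segments,cn.tsv",
    "reseg": "segments,cn,provenance.tsv",
}


def completed_steps(working_dir, series):
    """Map each array folder of a series to the steps whose output file exists."""
    series_dir = Path(working_dir) / "processed" / series
    status = {}
    for array_dir in sorted(p for p in series_dir.iterdir() if p.is_dir()):
        status[array_dir.name] = [
            step
            for step, marker in STEP_MARKERS.items()
            if (array_dir / marker).is_file()
        ]
    return status
```

For the tree above, `completed_steps("working_dir", "series1")` would list all three steps for `array1` and none for `array2`.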
- To segment all the total copy number probe files in one series, run the command below. All files named `probes,cn.tsv` in `working_dir/processed/series1/array.../` will be processed, and the output files `segments,cn.tsv` will be written into the same array folder.
Rscript --vanilla processPipeline_combined.r -w working_dir -s series1 -e segment -p cn
- To do noise filtering on all the total copy number segment files in one series, run the command below. All files named `segments,cn.tsv` in `working_dir/processed/series1/array.../` will be processed. The original `segments,cn.tsv` will be renamed to `segments,cn,provenance.tsv`, and the new segment file will be named `segments,cn.tsv` and written into the same array folder.
Rscript --vanilla processPipeline_combined.r -w working_dir -s series1 -e reseg -p cn
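
The file handling of the `reseg` step can be sketched in Python as follows. This only illustrates the rename-then-write behaviour described above; it is not the pipeline's R code. The function name `reseg_rename` and the `filter_lines` callback are hypothetical, and the actual noise filter is not reproduced. Keeping an already existing provenance file untouched on a second run is a choice made for this sketch, not documented behaviour.

```python
from pathlib import Path


def reseg_rename(array_dir, filter_lines):
    """Keep the original segments as provenance and write the filtered ones.

    filter_lines is any callable taking and returning a list of text lines;
    it stands in for the real noise filter, which is not shown here.
    """
    array_dir = Path(array_dir)
    current = array_dir / "segments,cn.tsv"
    provenance = array_dir / "segments,cn,provenance.tsv"

    lines = current.read_text().splitlines()
    # Assumption: the first original is preserved if reseg is run again.
    if not provenance.exists():
        current.rename(provenance)

    filtered = filter_lines(lines)
    current.write_text("\n".join(filtered) + "\n")
    return provenance, current
```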
A particular array can be selected for this additional noise filtering step with `-a`:
Rscript --vanilla processPipeline_combined.r -w working_dir -s series1 -a array1 -e reseg -p cn
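
When these calls are scripted, for example to loop over several series, it can help to build the argument list in one place. The Python sketch below uses only the flags that appear in the commands above (`-w`, `-s`, `-a`, `-e`, `-p`). The helper name `pipeline_command` is hypothetical, the executable is assumed to be `Rscript`, and `cn` is the only `-p` value shown in this document.

```python
def pipeline_command(working_dir, series, step, probe_type="cn", array=None):
    """Return the argument list for one processPipeline_combined.r call."""
    cmd = [
        "Rscript", "--vanilla", "processPipeline_combined.r",
        "-w", working_dir,
        "-s", series,
    ]
    # -a restricts the run to a single array; without it the whole series is processed.
    if array is not None:
        cmd += ["-a", array]
    cmd += ["-e", step, "-p", probe_type]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script.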
- The metadata directory is `/Volumes/arraymapIncoming/GEOmeta/`.
- The default `workingdir` is `~/aroma/hg19/`. For any step after `probe`, `workingdir` can be anything as long as the samples are in the following structure: `$workingdir/processed/$series/$array`. The `probe` step, however, requires several other subdirectories in the `workingdir`: `rawData`, `annotationData`, `PlatformInfo`, `referenceFile`.
- The default raw data retrieval directory is `$workingdir/rawData`, in the structure `series/array/xxx.CEL`.
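
Before running the `probe` step, the working directory can be checked for the subdirectories listed above. This is a minimal Python sketch with hypothetical names (`PROBE_SUBDIRS`, `missing_probe_dirs`). The folder names are spelled as in the list above, and on a case-sensitive file system they must match the directories on disk exactly.

```python
from pathlib import Path

# Subdirectories the probe step is said to require, spelled as in the list above.
PROBE_SUBDIRS = ("rawData", "annotationData", "PlatformInfo", "referenceFile")


def missing_probe_dirs(workingdir="~/aroma/hg19/"):
    """Return the required subdirectories that do not exist under workingdir."""
    root = Path(workingdir).expanduser()
    return [name for name in PROBE_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all four folders are present. Anything else names what still has to be created or linked.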