diff --git a/jacob_frelinger/hist.png b/jacob_frelinger/hist.png index aab4753..d8a41b8 100644 Binary files a/jacob_frelinger/hist.png and b/jacob_frelinger/hist.png differ diff --git a/jacob_frelinger/jacob_frelinger.rst b/jacob_frelinger/jacob_frelinger.rst index b7802db..4c40513 100644 --- a/jacob_frelinger/jacob_frelinger.rst +++ b/jacob_frelinger/jacob_frelinger.rst @@ -17,10 +17,10 @@ Fcm - A python library for flow cytometry .. class:: abstract Flow cytometry has the ability to measure multiple parameters of a - heterogeneous mix of cells at single cell resolution. This has lead flow + heterogeneous mix of cells at single cell resolution. This has led flow cytometry to become an integral tool in immunology and biology. Most flow cytometry analysis is performed in expensive proprietary software packages, - and few opensource tool exist for working with flow cytometry data. + and few opensource tools exist for working with flow cytometry data. In this paper we present `fcm`, an BSD licensed python library for traditional gating based analysis in addition to newer model based analysis methods. @@ -34,27 +34,30 @@ Introduction .. background on flow -Flow cytometry (FCM) has become an integral tool in immunology and biology due to the -ability of FCM to measure cell properties at the single cell level for -thousands to millions of cells in a high throughput manner. In FCM, -cells are typically labeled with monoclonal antibodies to cell surface -or intracellular proteins. The monoclonal antibodies are conjugated to -different fluorochromes that emit specific wavelengths of light when -excited by lasers. These cells are then streamed single file via a -capillary tube where they may be excited by multiple lasers. Cells -scatter the laser light in different ways depending on their size and -granularity, and excited fluorochromes emit light of characteristic -wavelengths. 
Scattered light is recorded in forward and side scatter detectors, -and specific fluorescent emission light is recorded into separate -channels. Since each fluorescent dye is attached to specific cell markers by -monoclonal antibodies, the intensity of emitted light is a measure of the -number of bound antibodies of that specificity [Herzenberg2006]_ . The -data recorded for each cell is known as an event, although events may -sometimes also represent cell debris or clumps. -Modern instruments can resolve about a dozen fluorescent emissions -simultaneously and hence measure the levels of a dozen different -markers per cell - further increase in resolution is limited by the -spectral overlap (spillover) between fluorescent dyes. +Flow cytometry (FCM) has become an integral tool in immunology and biology due +to the ability of FCM to measure cell properties at the single cell level for +thousands to millions of cells in a high throughput manner. In FCM, cells are +typically labeled with monoclonal antibodies, molecules that bind only to a +specific region of a protein. These monoclonal antibodies are specific for +cell surface or intracellular proteins. The monoclonal antibodies are +conjugated to different fluorochromes that emit specific wavelengths of light +when excited by lasers. These cells are then streamed single file via a +capillary tube where they may be excited by multiple lasers. Cells scatter the +laser light in different ways depending on their size and granularity, and +excited fluorochromes emit light of characteristic wavelengths. Scattered +light is recorded in forward and side scatter detectors, and specific +fluorescent emission light is recorded into separate channels. Since each +fluorescent dye is attached to specific cell markers by monoclonal antibodies, +the intensity of emitted light is a measure of the number of bound antibodies +of that specificity [Herzenberg2006]_ . 
Using multiple monoclonal antibodies +specific to different cell surface markers or intracellular proteins, different +types of cells can be differentiated from each other. The data recorded for +each cell is known as an event, although events may sometimes also represent +cell debris or clumps of cells and debris. Modern instruments can resolve +about a dozen fluorescent emissions simultaneously and hence measure the levels +of a dozen different markers per cell - further increase in resolution is +limited by the spectral overlap (spillover) between fluorescent dyes. With +careful planning panels of seventeen colors are possible [Perfetto2004]_. .. traditional gating based analysis and other tools .. ie the why of fcm @@ -63,47 +66,53 @@ spectral overlap (spillover) between fluorescent dyes. :scale: 50% :align: center :figclass: align-center - + Diagram of how events are recorded in a flow cytometer provided by - lanl.gov :label:`flow` - - -Analysis of FCM data has traditionally relied on expert interpretation -of scatter plots known as dot plots that show the scattered light or -fluorescence intensity for each cell depicted as a point. Expert -operators examine these two dimensional dot plots in sequence and -manually define boundaries around cell subsets of interest in each -projection. The regions demarcated by these boundaries are known as -gates, and the cell subsets of interest may require multiple levels of -gates to identify. Much work is needed train expert operators to -standardize gate placement and minimize variance. Maecker et al [Maecker2005]_ found -a significant source of variability in a multi-center study was due to -variability in gating. New technologies have the potential to greatly -increase the number of simultaneous markers that can be resolved -with FCM. Inductively coupled plasma mass spectrometry [Ornatsky2006]_ replaces the -fluorescent dyes with stable heavy metal isotopes and fluorescent detection -with mass spectrometry. 
This eliminates the spectral overlap (spillover) from + lanl.gov :label:`flow` In this figure monoclonal antibodies specific to CD4 + are used to identify Helper T cells. When the labeled cells pass by the + laser and detector, the laser causes the monoclonal antibodies to fluoresce. + This fluorescent emission is then recorded, and can be used to identify + which cells are Helper T cells. + + +Analysis of FCM data has traditionally relied on expert interpretation of +scatter plots known as dot plots that show the scattered light or fluorescence +intensity for each cell depicted as a point. Expert operators examine these two +dimensional dot plots in sequence and manually define boundaries around cell +subsets of interest in each projection. The regions demarcated by these +boundaries are known as gates, and the cell subsets of interest may require +multiple levels of gates to identify. Much work is needed to train expert +operators to standardize the placement of gates and minimize the variance between +operators. Maecker et al. [Maecker2005]_ found a significant source of +variability in a multi-center study was due to variability in where operators +placed gates. New technologies have the potential to greatly increase the +number of simultaneous markers that can be resolved with FCM. Inductively +coupled plasma mass spectrometry [Ornatsky2006]_ replaces the fluorescent dyes +with stable heavy metal isotopes and fluorescent detection with mass +spectrometry. This eliminates the spectral overlap (spillover) from fluorescent dyes allowing a significantly increased number of markers to be -resolved simultaneously. - -With the increasing number of markers that can be resolved -simultaneously, there has been an increasing interest in automated methods of -cell subset identification. While there is need for such tools, with the -exception of the R BioConductor package, few open source -packages exist for doing both traditional analysis and automated analysis. 
-The majority of open source packages simply extract flow events into -tabular/csv formats, losing all metadata and providing no additional tools for -analysis. `fcm` attempts to resolve this by providing methods for working -with flow data in both gating-based and model-based methods. +resolved simultaneously. A recent study [Newell2012]_ using mass spectrometry +based cytometry produced data sets with over 40 markers. + +With the increasing number of markers that can be resolved simultaneously, +there has been an increasing interest in automated methods of cell subset +identification. While there is need for such tools, with the exception of the R +BioConductor [BioConductor]_ package, few open source packages exist for doing +both traditional analysis and automated analysis. The majority of open source +packages simply extract flow events into tabular/csv formats, losing all +metadata and providing no additional tools for analysis. `fcm` attempts to +resolve this by providing methods for working with flow data in both +gating-based and model-based methods. .. write project goals -The goals in writing `fcm` [fcm]_ are to provide a general-purpose python library for working with -flow cytometry data. Targeted uses include interactive data exploration with -[ipython]_, building pipelines for batch data analysis, and -development of GUI and web based applications. In this paper we will explore -the basics of working with flow cytometry data using `fcm` and how to use fcm -to perform analysis using both gating and model based methods. +The goals in writing `fcm` [fcm]_ are to provide a general-purpose python +library for working with flow cytometry data. Targeted uses include +interactive data exploration with ipython [ipython]_, building pipelines for +batch data analysis, and development of GUI and web based applications. 
In +this paper we will explore the basics of working with flow cytometry data using +`fcm` and how to use fcm to perform analysis using both gating and model based +methods. Loading, compensating and transforming data ------------------------------------------- @@ -134,7 +143,7 @@ In addition to traditional numpy array indexing, the text names of channels can be used to access channels too. .. code-block:: python - + In [1]: import numpy as np In [2]: import fcm @@ -143,21 +152,21 @@ be used to access channels too. In [4]: x.channels[7] Out[4]: 'AViD' - + In [5]: np.all(x[:,7] == x[:,'AViD']) Out[5]: True - + When processing cells and acquiring data, often the emission spectra of -fluorescent dyes overlap with neighboring channels. This spillover of light -needs to be corrected in a process called compensation that attempts -to remove the additional signal from neighboring channels. Using a -compensation matrix that describes the amount of spillover from each channel -into others, `fcm` will by default apply compensation at the time of -loading data, but this default behavior can be suppressed and -compensation performed at a later time if necessary. The spillover or compensation -matrix is typically found in the `FCMdata.notes.text` metadata, and `loadFCS()` will -default to compensating using that matrix if another is not specified. +fluorescent dyes overlaps with neighboring channels. This spillover of light +needs to be corrected in a process called compensation that attempts to remove +the additional signal from neighboring channels. Using a compensation matrix +that describes the amount of spillover from each channel into others, `fcm` +will by default apply compensation at the time of loading data, but this +default behavior can be suppressed and compensation performed at a later time +if necessary. 
The spillover or compensation matrix is typically found in the +`FCMdata.notes.text` metadata, and `loadFCS()` will default to compensating +using that matrix if another is not specified. .. figure:: comp.png @@ -202,7 +211,7 @@ In addition to traditional gates, `fcm` provides additional gate like filters, `DropChannel`, to remove unwanted columns from a view, and `Subsample`, that use a python slice objects to filter events. `FCMdata` objects `gate()` method can be used to apply gate objects in successive manner as it returns the updated -`FCMdata` object allowing chaining of `gate()` calls, like so: +`FCMdata` object allowing chaining of `gate()` calls, like so: .. code-block:: python @@ -216,8 +225,8 @@ which is equivalent to the following three lines of code: FCMdata.gate(g2) FCMdata.gate(g3) -In `fcm`, gating `FCMdata` object does not produce new `FCMdata` objects, but -rather each `FCMdata` object maintains a tree of each gated populations. +In `fcm`, gating an `FCMdata` object does not produce a new `FCMdata` object, but +rather each `FCMdata` object maintains a tree of the gated populations. Moving between nodes of the tree, accomplished by the `FCMdata.visit()` method, selects which events are retured on array lookup, using `numpy`'s efficient indexing to generate views. This allows `FCMdata` objects to contain an entire @@ -234,7 +243,7 @@ analyzing flow data. Model based analysis is an approach to automate and increase reproducibility in the analysis of flow data by the use of statistical models fitted to the data. With the appropriate multivariate statistical models, data fitting can be naturally performed on the full dimensionality, -allowing analysis to scale well with the increasing number of parameters in +allowing analysis to scale with the increasing number of parameters in flow cytometry. Mixture models are one such model based method. 
Mixture models are often chosen due to their ability to use multiple simpler distributions added together to describe a much more complex distribution as seen in figure @@ -255,7 +264,7 @@ simplest method being k-means classification, and more advanced methods based on the use of mixtures of Gaussians for data fitting. The general procedure for fitting a data set to a statistical model consists of creating a `FCMmodel` object containing hyper-parameters, followed by calling its `fit` -method on a collection of (or just one) `FCMdata` objects to generate +method on a collection of (or just one) `FCMdata` objects to generate `ModelResult` objects. Each `ModelResult` object holds the estimated parameters of the statistical model -- a `KMeans` object representing the centroid locations in a k-means model, or a `DPMixture` object representing the estimated @@ -278,13 +287,20 @@ fit data from the [dpmix]_ package, which is capable of using [gpustats]_ to utilize GPU cards for efficient estimation of mixture parameters. The two models are `DPMixtureModel` and `HDPMixtureModel`, describing a truncated Dirichlet process mixture model, and a hierarchical truncated Dirichlet -process mixture model. +process mixture model. `DPMixtureModel` has two methods of estimating parameters of the model for a given dataset, the first using Markov chain monte carlo (MCMC) and the second -using Bayesian expectation maximization (BEM). Sensible defaults for -hyperparameters have been chosen that in our experience perform satisfactorily -on all FCS data samples we have analyzed. +using Bayesian expectation maximization (BEM). The hyperparameters for the +mixture model are values that govern the prior distributions. Similar prior +distributions to those selected here were used in [Cron2013]_ and [Richards2014]_ +and experience suggests these are appropriate for a wide variety of +cytokine panels. 
Furthermore, examples of model parameterization can be found +in the FCM documentation examples section http://pythonhosted.org//fcm/advanced.html#clustering. If they need changing, +hyper-parameters can be changed by setting instance variables associated with +the `DPMixtureModel` or `HDPMixtureModel` objects. + +The hyperparameters for the mixture model are values that govern the prior distributions. Similar prior distributions to those selected here were used in [cite PlosComp paper and Comp Immuno Methods paper etc] and experience suggests these are appropriate for a wide variety of cytokine panels. Furthermore, examples of model parameterization can be found in the FCM documentation examples section [[link]]. .. code-block:: python :linenos: @@ -300,21 +316,21 @@ on all FCS data samples we have analyzed. # 100 iterations dpmodel = stats.DPMixtureModel(10, niter=100, type='BEM') - + # estimate parameters printing every 10 iterations results = dpmodel.fit(data,verbose=10) - + #assign data to components c = results.classify(data) - + # plot data coloring by label pylab.scatter(data[:,0], data[:,1], c=c, s=1, edgecolor='none') pylab.xlabel(data.channels[0]) pylab.ylabel(data.channels[1]) - - + + The above code labels each event by color to the cluster it belongs to as seen in figure :ref:`bem` @@ -328,7 +344,7 @@ hierarchical model that fits all datasets such that component means and covariance are common to all fitted samples but the weights of components are specific for each sample. Since `HDPMixtureModel` estimates multiple datasets simultaneously, a list of `DPMixture` objects is returned corresponding to -each of the `FCMdata` objects passed to `HDPMixureMode.fit()`. +each of the `FCMdata` objects passed to `HDPMixtureModel.fit()`. Visualization ------------- @@ -337,7 +353,7 @@ By using packages like [matplotlib]_ it becomes easy to recreate the typical plots flow cytometry analysts are used to seeing. 
Convenience functions for several common plot types have been included in the `fcm.graphics` sub-package. The common pseudocolor dotplot is handled by the function -`fcm.graphics.pseudocolor()` +`fcm.graphics.pseudocolor()` .. code-block:: python @@ -366,16 +382,16 @@ Another common plot is overlay histograms, which is provided by import fcm.graphics as graph from glob import glob xs =[fcm.loadFCS(x) for x in glob('B6901GFJ-08_*.fcs')] - graph.hist(xs,3, display=True) + graph.hist(xs,'SSC-A', display=True) The code above will produce the histogram seen in figure :ref:`hist` .. figure:: hist.png - Overlay histogram of three samples from the EQAPOL data set. :label:`hist` + Overlay histogram of three samples from the EQAPOL data set, showing the Side Scatter parameter (SSC-A). :label:`hist` More examples of flow cytometry graphics can be seen in the gallery at -http://packages.python.org/fcm/gallery. +http://packages.python.org/fcm/gallery.html Conclusion and future work @@ -390,7 +406,9 @@ Conclusion and future work Currently `fcm` is approaching its 1.0 release, providing a stable API for development and we feel `fcm` is ready for wider usage in the scientific community. -Internally we use `fcm` for EDA for data sets from HIV/AIDS, caner, and +Further examples on how to use `fcm` can be found in the documentation at +http://packages.python.org/fcm/ . +Internally we use `fcm` for EDA for data sets from HIV/AIDS, cancer, and solid-organ transplantation studies. In addition we have developed pipelines for batch analysis of large numbers of FCS files from the Duke Center for AIDS Research, External Quality Assurance Program Oversight Laboratory (EQAPOL), @@ -410,7 +428,7 @@ analyzing the images generated. These technologies will necessitate improved tools to analyze data generated by these newer cytometers. 
Our hope is that `fcm` can meet these needs and continue to grow to address these needs, with specific goals of developing tools to facilitate cross sample comparison and -time series of flow data. +time series of flow data. The next generation of the FCS file standard, Analytical Cytometry Standard, has been proposed, using NetCDF as the format for event storage. @@ -421,7 +439,7 @@ associated xml and image files proposed to be included in the ACS container, adding support for the finalized version of ACS standard should not be difficult. Gating-ML, an XML format proposed with ACS for describing gates and thier placement, has been gaining popularity. We are exploring how best to -implement readers and writers for Gating-ML +implement readers and writers for Gating-ML. Acknowledgements ---------------- @@ -436,33 +454,55 @@ References ---------- .. [fcm] Frelinger J, Richards A, Chan C, http://code.google.com/p/py-fcm/ -.. [Herzenberg2006] Herzenberg LA, Tung J et al (2006), +.. [Herzenberg2006] Herzenberg LA, Tung J et al. (2006), *Interpreting flow cytometry data: a guide for the perplexed*, - Nat Immunol 7(7):681-685 -.. [Maecker2005] Maecker HT, Frey T et al (2007), + Nat Immunol 7(7):681-685 + +.. [Perfetto2004] Perfetto S, Chattopadhyay P, and Roederer M, (2004), + *Seventeen-colour flow cytometry: unravelling the immune system*, + Nature Reviews Immunology 4.8 (2004): 648-655. + +.. [Maecker2005] Maecker HT, Frey T et al. (2007), *Standardization of cytokine flow cytometry assays*, BMC Immunol 6:13 -.. [Ornatsky2006] Ornatsky O, Baranov VI et al (2006), +.. [Ornatsky2006] Ornatsky O, Baranov VI et al. (2006), *Multiple cellular antigent detection by ICP-MS*, J Immunol Methods 308(1-2):68-76 +.. [Newell2012] Newell E, et al. (2012), + *Cytometry by Time-of-Flight Shows Combinatorial Cytokine Expression and Virus-Specific Cell Niches within a Continuum of CD8+ T Cell Phenotypes*, + Immunity, Volume 36, Issue 1, 142-152 -.. 
[ipython] Pérez F, Granger BE, IPython: A System for - Interactive Scientific Computing, Computing in Science and - Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, +.. [Bioconductor] Gentleman R, Carey V, et al. (2004), + *Bioconductor: Open software development for computational biology and bioinformatics*, + Genome Biology, Vol. 5, R80 + + +.. [ipython] Pérez F, Granger BE, + *IPython: A System for Interactive Scientific Computing*, + Computing in Science and Engineering, vol. 9, no. 3, pp. 21-29, May/June 2007, doi:10.1109/MCSE.2007.53. URL: http://ipython.org -.. [Parks2005] Parks, D. R., Roederer, M. and Moore, W. A. (2006), +.. [Parks2005] Parks D, Roederer M, and Moore W, (2006), *A new “Logicle” display method avoids deceptive effects of logarithmic scaling for low signals and compensated data*, Cytometry, 69A: 541–551. doi: 10.1002/cyto.a.20258 .. [dpmix] Cron A, https://github.com/andrewcron/dpmix +.. [Cron2013] Cron A, Gouttefangeas C, Frelinger J, Lin L, Singh SK, et al. (2013), + *Hierarchical Modeling for Rare Event Detection and Cell Subset Alignment across Flow Cytometry Samples*, + PLoS Comput Biol 9(7): e1003130. doi:10.1371/journal.pcbi.1003130 + +.. [Richards2014] Richards A, Staats J, Enzor J, McKinnon K, Frelinger J, Denny T, Weinhold K, Chan C, (2014), + *Setting objective thresholds for rare event detection in flow cytometry*, + Journal of Immunological Methods, Available online 12 April 2014, ISSN 0022-1759 + http://dx.doi.org/10.1016/j.jim.2014.04.002. + .. [gpustats] Cron A and McKinney W, https://github.com/dukestats/gpustats -.. [matplotlib] Hunter JD, (2007), *Matplotlib: A 2D Graphics +.. [matplotlib] Hunter J, (2007), *Matplotlib: A 2D Graphics Environment*, Computing in Science & Engineering 9, 90 (2007) .. [cytostream] Richards A, http://code.google.com/p/cytostream/