Markus Lux edited this page Nov 9, 2016 · 5 revisions

flowLearn is software under development that predicts channel thresholds for gating based only on density information and (very) few manually provided expert gates. To achieve this, it compares and aligns densities so that thresholds can be transferred from one sample to another.

Methodology

An overview of flowLearn is given in the short presentation I held in mid-September 2016. The slides can be found in flowLearn-presentation_2016-09-13.pdf (Bioinformatics drive, under Markus) and should give a starting point.

Essentially, the task is as follows: Given a set of FCS files,

  1. Find a (set of) representative prototype sample(s).
  2. Let an expert manually gate these prototypes as accurately as possible.
  3. Transfer the now-known thresholds to all ungated samples (i.e. let them ''learn'' from the prototypes).
  4. Repeat this for all populations, going downstream.

0: Using machine learning techniques

Before settling on the current alignment approach (see below), we evaluated a number of machine learning methods. They seemed to work quite well; however, alignment still works best. An overview of the tested techniques is shown in slide 11.

1: Finding representative prototypes

This is still the subject of ongoing investigation. A prototype should have the following properties:

  • should not be an outlier
  • should be representative of a subset of samples. Specifically, this means that the densities of the subset should be similar to the prototype's density
  • should have a correct threshold

The slides state that there is apparently no advantage in selecting a correct prototype. This statement is wrong.

Ideas to identify prototype sets

  • by clustering sample densities: Optimally, a clustering based on the alignment distance matrix can be used here. The prototype for a given test sample is then chosen as its cluster's center.
  • by not using a particular sample but a calculated mean density
  • by using the HM&M algorithm for template-based classification
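The clustering idea can be sketched in a few lines of base R. Here, D is assumed to be a precomputed, symmetric matrix of pairwise alignment distances; selectPrototypes is a hypothetical name, and a medoid (the sample with minimal total distance to its cluster) stands in for the ''cluster center'':

```r
# Hypothetical sketch: pick one prototype per cluster from a precomputed
# pairwise alignment-distance matrix D (symmetric, n x n).
selectPrototypes <- function(D, k) {
  # hierarchical clustering on the distance matrix, cut into k clusters
  cl <- cutree(hclust(as.dist(D), method = "average"), k = k)
  sapply(split(seq_len(nrow(D)), cl), function(idx) {
    # medoid: the sample minimizing total distance to its cluster members
    idx[which.min(rowSums(D[idx, idx, drop = FALSE]))]
  })
}
```

For a given test sample, the prototype would then be the medoid of the cluster the sample falls into.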

2: Manual expert gating

Training gates set manually by experts should be channel thresholds, i.e. they should be parallel to the channel axes. If rotation is necessary, it should be set by the expert, too.

At most two channels are gated. In flowLearn, these are denoted channel A and channel B.

3: Threshold transfer (learning)

For each channel, we are given a training density estimated from a parent population, together with at most two thresholds per channel (lower and upper).

The task is to transfer each individual threshold on a density to another test density (slide 6). To accomplish this, both densities have to be compared. We use a technique called Dynamic Time Warping (DTW) to align both densities. As can be seen in slide 13, a red-dashed prototype (or reference) density is aligned to a black-solid test density. Here, each point in the reference is mapped to one point in the test. Given that this mapping works well, thresholds can be transferred easily by using it.

Each DTW alignment comes with an alignment distance which indicates how much warping was necessary in order to transform the test density into the reference density. The more warping, the higher the distance.
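As a minimal, self-contained illustration of the transfer step, here is plain DTW on intensity values in base R (flowLearn itself uses the dtw package with the modifications discussed below; all names here are illustrative):

```r
# Toy DTW alignment between two discretized densities (vectors on a common
# grid), returning the warping path as a two-column matrix of indices.
dtwAlign <- function(ref, test) {
  n <- length(ref); m <- length(test)
  cost <- matrix(Inf, n + 1, m + 1); cost[1, 1] <- 0
  for (i in 1:n) for (j in 1:m) {
    d <- abs(ref[i] - test[j])
    cost[i + 1, j + 1] <- d + min(cost[i, j], cost[i, j + 1], cost[i + 1, j])
  }
  i <- n; j <- m; path <- list(c(i, j))          # backtrack from the end
  while (i > 1 || j > 1) {
    step <- which.min(c(cost[i, j], cost[i, j + 1], cost[i + 1, j]))
    if (step == 1) { i <- i - 1; j <- j - 1 } else if (step == 2) i <- i - 1 else j <- j - 1
    path[[length(path) + 1]] <- c(i, j)
  }
  do.call(rbind, rev(path))
}

# Map a reference threshold onto the test density via the warping path.
transferThreshold <- function(grid, refDens, testDens, refThreshold) {
  path <- dtwAlign(refDens, testDens)
  i <- which.min(abs(grid - refThreshold))       # grid index of the threshold
  grid[path[path[, 1] == i, 2][1]]               # first mapped test index
}
```

With identical densities the path is the diagonal and the threshold is returned unchanged; between similar densities, the threshold moves along with the warped features.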

Furthermore, if multiple prototypes are given for a test density, the thresholds predicted from all prototypes can be combined. This is known as committee learning. Basically, the thresholds are simply averaged, either by mean or by median. In early evaluation results, this improved the prediction performance significantly! A slight modification is an average weighted by alignment distance, which gives thresholds from dissimilar prototype densities a lower weight.
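A sketch of such a committee prediction in base R (the function name and the exact weighting scheme are assumptions, not flowLearn's actual code):

```r
# Combine per-prototype threshold predictions into one committee prediction.
combineThresholds <- function(thresholds, distances = NULL, method = "weighted") {
  if (method == "median") return(median(thresholds))
  if (method == "mean")   return(mean(thresholds))
  # weighted average: a smaller alignment distance means a larger weight,
  # down-weighting thresholds from dissimilar prototype densities
  w <- 1 / (distances + 1e-12)
  sum(w * thresholds) / sum(w)
}
```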

Issues when using DTW

  • If both compared densities are too different, the alignment won't work properly anymore and hence, the threshold transfer will fail, too. Therefore, it is vital to select a good prototype in advance (see step 1).

  • It is necessary to calculate densities with high granularity. We use a default of 512 evenly spaced density features (discretization).

  • There are different modifications and extensions of DTW:

  • Original DTW tries to align both densities by comparing their intensity values. As thresholds often depend on the local slope, we use Derivative DTW, which compares estimated local derivatives instead of intensities. This seems to work slightly better.

  • Another extension is Weighted DTW, which can also be applied to the derivative approach. Here, the larger the x-offset between two aligned points, the larger the DTW distance becomes, so large offsets are eventually treated as misalignments. This approach is probably not useful for the alignment itself, but it should yield DTW distances that better reflect the similarities between densities.

  • DTW offers a number of regularizations to prevent misalignments or overfitting. Often, the best alignment maps one density point of the reference to multiple points in the test density (a singularity). To prevent this, one can use a slanted band around the DTW cost matrix or specific step patterns for the alignment. I evaluated a number of populations in terms of F1 score using a few regularizations and concluded that the stepIds pattern of the dtw package works best. Even though different regularizations apparently didn't impact the performance significantly, I opted for the step pattern resulting in the "most natural looking" alignments from my subjective point of view.

  • Calculating a full matrix of alignment distances (i.e. for selecting prototypes / clustering) might take a very long time, depending on the number of samples! Given 2000 samples for one population, computing this matrix took multiple hours. As the matrix is computed in parallel, I advise using a high-performance computer with many cores.
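The parallel computation of the distance matrix can be sketched with the parallel package (which flowLearn uses); alignmentDistance below is only a cheap stand-in for the actual, expensive DTW distance:

```r
library(parallel)

# Stand-in for the expensive DTW alignment distance between two densities.
alignmentDistance <- function(a, b) sum(abs(a - b))

set.seed(1)
densities <- replicate(6, rnorm(512), simplify = FALSE)  # toy data: 6 "densities"
n <- length(densities)

# Enumerate all unordered pairs and compute their distances in parallel.
pairs <- which(upper.tri(matrix(0, n, n)), arr.ind = TRUE)
cores <- if (.Platform$OS.type == "unix") 2 else 1       # mclapply forks on Unix only
vals <- mclapply(seq_len(nrow(pairs)), function(k) {
  alignmentDistance(densities[[pairs[k, "row"]]], densities[[pairs[k, "col"]]])
}, mc.cores = cores)

# Assemble the symmetric distance matrix.
D <- matrix(0, n, n)
D[upper.tri(D)] <- unlist(vals)
D <- D + t(D)
```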

Workflow

Right now, flowLearn is in an evaluation stage. This means that the code is not yet ready for production. A lot of scripts are aimed towards this workflow and are optimized with respect to testing and reading/writing/caching a lot of evaluation data. Obviously, in a production environment, this wouldn't be the case and flowLearn would be a proper R package with a well-defined API.

Future application

As I see it now, a future application will be a fully integrated gating environment which receives a bunch of FCS files as input and outputs all the gates/proportions/etc. In between, the user interacts with flowLearn.

  1. FCS files from one panel are fed into flowLearn.
  2. flowLearn iteratively picks the next population, going downstream in the analysis.
  3. For each population, flowLearn selects one or more prototypes for the user to gate.
  4. The user is presented an interface in which she can set channel thresholds as gates and rotate the data, if necessary.
  5. flowLearn predicts the remaining samples based on the user's gating.
  6. Go to 2 until all populations have been gated.

Problems / Todos

Misalignments

Misalignments might happen due to different factors. These factors should be taken into account when selecting prototypes:

  • Both compared densities are too different to compare. For example, one density might be an outlier. This is directly reflected in the alignment distance which should be high in such cases.

  • There is a lot of noise in the densities.

  • There is a large x-offset between two aligned peaks. This occasionally happens because of (mostly negative) outlier events which shift the density to the left or right and, as a consequence, make the alignment fail. Example: cd64ppcd16p.png

Wrong training data

For evaluation, we used different data sets. For some populations, the thresholds were not well gated, which obviously manifests itself in bad performance, too.

Confidence values

There should be some sort of confidence value indicating how well a gate could be predicted by flowLearn. We had some ideas about how this could be done, but none of them has worked so far. For example, one could use the alignment distances coupled with other density information, or one could look at how consistently other prototypes predict the same test sample.

Evaluation

flowLearn is evaluated in terms of its capability to identify subsets of cells (i.e. gate populations), which can be compared to a gold standard. This gold standard should optimally be given by manual expert gates. However, we know that these are difficult to obtain. Therefore, we use results obtained by flowDensity. Those results, i.e. the gates, have to be translated to a common format used by flowLearn.

The performance is measured by precision, recall, and F1 score. Given a parent population P, a gold standard subset of cells A of this population P, and a subset B as predicted by flowLearn, it goes as follows:

  • Recall: What's the proportion of cells in A that also exist in B? (i.e. how many cells have been recalled from A)
  • Precision: How accurately (precisely) were the cells in B recalled? (i.e. the more cells there are in B which are not in A, the smaller the precision)
  • F1 score: the harmonic mean of precision and recall

Hence, flowLearn would do a perfect job if precision and recall are both equal to 1.00 and consequently the F1 score is, too.
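Assuming A and B are represented as vectors of cell indices (the names below are illustrative, not flowLearn's API), the measures can be computed as:

```r
# Precision, recall, and F1 for a predicted cell subset B against a gold
# standard subset A (both given as vectors of cell indices).
f1Score <- function(A, B) {
  tp        <- length(intersect(A, B))            # cells found in both subsets
  precision <- tp / length(B)
  recall    <- tp / length(A)
  2 * precision * recall / (precision + recall)   # harmonic mean
}
```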

Usually, all populations of a given panel are then predicted by flowLearn, and median(f1) or mean(f1) indicates the overall performance.

Current results

flowLearn has been tested on populations from two panels: OneStudy whole blood cell (WBC) and IMPC bone marrow. The results can be found in the files onestudy_wbc.pdf and impc_bm.pdf, respectively.

In order to judge the performance of flowLearn, it is important to understand the meaning of these plots. Taking the IMPC results as an example: for each population on the x-axis (mean proportions in parentheses), the median F1 score was evaluated over 10 different runs. In each run, a random set of prototypes was used, with only one prototype per channel. So the top value of each boxplot reflects the performance on a population, given that a good prototype was selected.

Furthermore, in the IMPC results there are some populations which perform worse than others, especially HFA, HFB, HFC, and Plasma cells. I asked Albina about this, and she told me that according to the biologists, the gating for those populations is not finalized and has to be revisited. I think this is an interesting observation: possibly, the flowLearn performance gives a hint about the quality of a gating.

Implementation Details

Used R packages

  • parallel - parallelization
  • FNN - efficient nearest neighbor search
  • dtw - Dynamic Time Warping
  • Rtsne - non-linear dimensionality reduction (i.e. for visualization and clustering)
  • stringr - for string operations

Older ones (files deleted in SVN trunk, have a look into history):

  • kernlab - SVM
  • randomForest - random forests

Code organization

flowLearn is not an R package yet. There are a number of files containing functions related to a few specific sub-functionalities:

File                         Functionality
flowLearn.R                  Includes all other scripts. This is the main script to include if you want to use flowLearn.
helpers.R                    Various methods for printing, plotting, normalizing, etc.
alignment.R                  Everything related to alignment using DTW
predict.R                    Predicting thresholds and selecting prototypes
evaluation.R                 Calculating F1 scores and evaluating given predicted gates
run_evaluation_full_panel.R  Runnable script that evaluates all populations of a panel and generates result plots
train_data.R                 Data structures and methods for storing, reading, and converting training data
convert_train_impc.R         Runnable script which converts flowDensity gatings into flowLearn's training data structures (IMPC BM panel)
convert_train_onestudy.R     Runnable script which converts flowDensity gatings into flowLearn's training data structures (OneStudy WBC panel)

Preparing flowDensity gates for evaluation

In order to generate flowLearn training data from flowDensity, the data first has to be converted into the right format. This is done using the convert_train_*.R scripts. The scripts should run on their own and generate a ''trainingFiles.*'' directory. It contains individual .rds files in which the LearningSet structures for each population are saved. In the subdirectory fcs/, more .rds files for the parent populations of individual populations are stored. In order to not clutter up the hard drive, these files only contain the two channels relevant for gating the specific populations.

Important: To keep things easy, the evaluation scripts depend on a directory called trainingFiles. This is usually just a symbolic link to existing trainingFiles.* directories. Create the symlink with (exemplary for IMPC):

cd flowLearn

ln -s trainingFiles.impc trainingFiles

Alignment functions

There is a global proxy function myDtw() which modularizes the parameters for DTW to ensure consistent use throughout flowLearn. To change DTW parameters, change this function.

Future ToDos and Timeline

  • by Nov. 20, 2016: Modularize the R code into packages: one flowLearn package and one flowPrototype package
  • by Dec. 31, 2016 How to select prototypes?