The software under discussion here is the AutoGate implementation of the paper’s pipeline. Before this pipeline was added, AutoGate supported semi-supervised analysis, starting with its first release in March 2014. The pipeline is mostly programmed in MATLAB.
The performance characteristics of the pipeline steps are as follows.
1) Epp. The Epp method will be published separately, but we summarize it here to aid the reviewers’ understanding.
a) Epp is designed to find subsets based on phenotyping markers and scatter parameters but NOT based on stimulation or intra-cellular markers.
b) Performance impacts come from the number of dimensions, cells and clusters per 2D projection
i) Dimensions: each additional dimension increases the cost quadratically by expanding the number of 2D projections, since D dimensions yield D(D-1)/2 pairings. 11D has 55 2D pairings, 12D has 66, 13D has 78, ..., 30D has 435 and 31D has 465.
ii) Cells: an increase in cell count has a linear impact on the cost of creating DBM’s density grid. Most of the processing after this is unaffected, since it works with the grid. There is also a linear cost related to memory use.
iii) Clusters per 2D projection: each additional cluster that DBM finds impacts performance
(1) Combinatorically for Epp’s contiguity check. Epp checks all 2-way splits of a 2D projection’s clusters to see whether the clusters within each side of the split are contiguous with each other.
(2) Near-combinatorically for computing the Separatrix of contiguous splits.
(3) The combinatoric cost can be illustrated with a 4-cluster projection, which entails the following 7 combinations: (a) 1 + (2 3 4); (b) (1 2) + (3 4); (c) (1 3) + (2 4); (d) (1 4) + (2 3); (e) (1 2 3) + 4; (f) (1 2 4) + 3; (g) (1 3 4) + 2.
iv) The number of clusters per 2D projection is affected by:
(1) Cluster detail level. The 5 steps from very low to very high constitute a progressive narrowing of the bandwidth on the density grid. A wide bandwidth consults a wider neighborhood radius of grid points to compute each grid point’s density.
(2) The diversity found in the dimensions’ measurements, which depends on the markers, their stains (fluorophores/metals), the instrument, and the diversity of cell types in the sample preparations.
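The two growth rates described for Epp above can be checked with a short sketch. It is written in Python rather than MATLAB purely for illustration, and the names count_2d_projections and two_way_splits are hypothetical, not AutoGate identifiers. The first function counts the 2D pairings for a given number of dimensions; the second enumerates the 2-way splits of a projection’s clusters that a contiguity check would have to visit. It assumes cluster labels are unique.

```python
from itertools import combinations


def count_2d_projections(dimensions):
    """Number of distinct 2D pairings among `dimensions` measurements."""
    return dimensions * (dimensions - 1) // 2


def two_way_splits(clusters):
    """List every split of `clusters` into two non-empty sides.

    The first cluster is pinned to the left side so that each split is
    produced once rather than once per mirror image.
    """
    clusters = list(clusters)
    if len(clusters) < 2:
        return []
    first, rest = clusters[0], clusters[1:]
    splits = []
    # `size` is how many of the remaining clusters join `first`; it stops
    # one short of len(rest) so the right side is never empty.
    for size in range(len(rest)):
        for joined in combinations(rest, size):
            left = (first,) + joined
            right = tuple(c for c in rest if c not in joined)
            splits.append((left, right))
    return splits


# Reproduce the 4-cluster illustration from the text.
for left, right in two_way_splits([1, 2, 3, 4]):
    print(left, "+", right)
```

A projection with k clusters has 2^(k-1) - 1 such splits: 7 for 4 clusters, as listed above, and 511 for 10 clusters.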
2) QF match
a) Performance impacts come from the number of dimensions, cells, cell-overlap and merge candidates.
i) Dimensions: unlike the Epp step, the QF match step handles additional dimensions in a linear manner, and only during the initial subtask of adaptive binning and the subtask of calculating distances during the QF dissimilarity computation.
ii) Cells: this impacts performance when it increases the total adaptive bin count for a subset. AutoGate uses slower non-vectorized code for subsets with more than 200,000 cells, because the fast vectorized code risks running out of memory on them. Vectorized programming evaluates all terms of the QF dissimilarity formula with matrix operations instead of scalars and for loops. In MATLAB this has been observed to outperform even a C implementation that uses scalars. MATLAB R2017a may use the SIMD operations of the Intel hardware for vectorized operations. When larger subsets exist, the AutoGate user can choose to skip the slower comparison for large subsets, which effectively removes those subsets from the analysis.
iii) Cell overlap: this can affect the quality of the results more than the run time. Why? When QF match compares 2 groups of subsets, and subsets in either group share cells with other subsets in the same group, QF match has been observed to give misleading results. Thus, AutoGate checks for overlap and warns the user of this situation.
iv) Merge candidates: an increase in merge candidates impacts QF match combinatorically. Its impact on QF match can be costlier than any other impact in the entire pipeline. Specific examples in the following section illustrate this. A high number of merge candidates arises when, between the 2 groups of subsets being matched, one subset in one group matches best with 2 or more subsets in the other group. To resolve the best match, the algorithm must compute the QF dissimilarity for every combination of merge candidates, including single unmerged candidates, and choose the best. The likelihood of long-running combinatorics increases with the difference in the number of subsets between the groups. E.g. matching a group of 8 subsets with a group of 12 subsets is less likely to suffer long merge testing than matching a group of 8 subsets with a group of 52 subsets. A subset “best match” is only ineligible for merge testing if one of the parameters differs by more than 4 standard deviation units.
b) The impact of large merging tasks is addressed by the
i) Software (AutoGate) avoiding the impression of a frozen computer through detailed progress reports, which allow the user to cancel (no computer lock-up) without data loss.
ii) User halting the matching and either (1) cancelling the entire computation or (2) de-selecting merge testing for subsets in which they are less interested, and then continuing the computation. If a user chooses to skip merge testing for one or more subsets, this alters only the compute time and not the correctness of the matching results for the other subsets in which the user remains interested.
iii) User defining additional known subsets if the smaller group of subsets being matched contains non-Epp subsets. The additional subsets need only be density clusters of cells that are not already contained in the group’s subsets. Groups of Epp subsets cannot use this remedy, since Epp already accounts for every cell. Thus, no remedy exists when the smaller match group consists of Epp subsets.
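The combinatoric cost of merge testing described above can be sketched as follows. This is an illustration under stated assumptions, not AutoGate’s code: the names merge_combinations, best_merge and mean_gap are hypothetical, subsets are reduced to plain lists of 1D values, and mean_gap is a toy stand-in for the QF dissimilarity so that the example stays self-contained.

```python
from itertools import combinations


def merge_combinations(candidates):
    """Yield every non-empty combination of merge candidates.

    Single, unmerged candidates are included as combinations of size 1.
    """
    candidates = list(candidates)
    for size in range(1, len(candidates) + 1):
        yield from combinations(candidates, size)


def best_merge(target, candidates, dissimilarity):
    """Return (combination, score) with the lowest dissimilarity to `target`.

    `candidates` maps a subset name to its list of values; `dissimilarity`
    is any callable taking two lists of values (an assumption made here).
    """
    best_combo, best_score = None, float("inf")
    for combo in merge_combinations(candidates):
        # Pool the cells of every subset in this combination.
        merged = [value for name in combo for value in candidates[name]]
        score = dissimilarity(target, merged)
        if score < best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score


def mean_gap(a, b):
    """Toy dissimilarity: absolute difference of the two means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


target = [1, 2, 3, 4]
candidates = {"A": [1, 2], "B": [3, 4], "C": [10, 11]}
print(best_merge(target, candidates, mean_gap))
```

With k merge candidates the loop evaluates 2^k - 1 combinations, so 3 candidates cost 7 dissimilarity computations while 12 candidates cost 4095. This is why de-selecting merge testing for some subsets shortens the run without changing the result for the others.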
3) MDS
AutoGate performs multi-dimensional scaling (MDS) of subset medians with MATLAB’s built-in function cmdscale. MDS is done strictly for the purpose of a quick visualization in conventional MATLAB 2D plots. The MDS computation always takes less time than rendering the visualization, so little needs to be reported on it here. Moreover, since AutoGate’s MDS only considers subset medians, users are encouraged to use the resulting visualization as a quick report on QF matching rather than as a guide to the HiD relatedness of the subsets. For HiD relatedness the primary AutoGate visualization is its Phenogram, which considers all of the data. The paper refers to the Phenogram as the QF tree.
4) QF tree
AutoGate produces a visual dendrogram to express HiD relatedness. The non-visual processing defaults to using the QF dissimilarity metric on the adaptive binning of all of a subset’s data, plus Euclidean distances on the subset medians. AutoGate offers the user other distance/dissimilarity measures. The distance-only alternatives, however, consider only medians and thus carry the same risk of under-informing as AutoGate’s MDS visualization. The majority of the QF tree’s computation cost is the non-visual processing. The visualization invokes MATLAB’s phytree function and then customizes the visual objects output by phytree.
a) QF tree performance is impacted by number of cells and subsets in a linear manner.
b) The impacts are not major since there is no quadratic or combinatorial change in workload when input factors scale up or down.
c) QF trees carry a risk of slowness for the same reason as described previously for QF match: accelerating the QF dissimilarity and Euclidean distance calculations with vectorized programming requires pre-allocating memory in amounts which increase exponentially with larger subsets. Thus, for large subsets, the slower non-vectorized code must be used.
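The trade-off between vectorized and non-vectorized code, described for both QF match and QF tree, can be sketched as follows. The sketch assumes the commonly published quadratic-form distance, sqrt(sum over i,j of a_ij (h_i - f_i)(h_j - f_j)) with a_ij = 1 - d_ij/d_max; the source does not spell out AutoGate’s exact formula. It also assumes both subsets are histogrammed over the same bins, and that the dominant allocation of the vectorized path is the bin-by-bin matrix, which grows with the square of the bin count. The names qf_streaming, qf_materialized and matrix_bytes are hypothetical, and Python lists stand in for MATLAB matrices.

```python
import math


def qf_streaming(h, f, centers):
    """Scalar-and-loop form: recomputes each distance, stores no matrix."""
    n = len(h)
    d_max = max(math.dist(centers[i], centers[j])
                for i in range(n) for j in range(n))
    total = 0.0
    for i in range(n):
        for j in range(n):
            a_ij = 1.0 - math.dist(centers[i], centers[j]) / d_max
            total += a_ij * (h[i] - f[i]) * (h[j] - f[j])
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding


def qf_materialized(h, f, centers):
    """Matrix form: builds the full n-by-n distance table up front."""
    n = len(h)
    dist = [[math.dist(centers[i], centers[j]) for j in range(n)]
            for i in range(n)]
    d_max = max(max(row) for row in dist)
    diff = [hi - fi for hi, fi in zip(h, f)]
    total = sum((1.0 - dist[i][j] / d_max) * diff[i] * diff[j]
                for i in range(n) for j in range(n))
    return math.sqrt(max(total, 0.0))


def matrix_bytes(n_bins, bytes_per_value=8):
    """Memory needed to hold one n-by-n matrix of doubles."""
    return n_bins * n_bins * bytes_per_value


# Three bins with 2D centers; h and f are the two subsets' bin frequencies.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
h = [0.5, 0.3, 0.2]
f = [0.2, 0.3, 0.5]
print(qf_streaming(h, f, centers), qf_materialized(h, f, centers))
print(matrix_bytes(10_000))
```

Both functions return the same value, and both need at least two bins with distinct centers. The difference is that the matrix form must hold every pairwise term at once, which under these assumptions is 800,000,000 bytes for a single 10,000-bin matrix of doubles, whereas the loop form trades that memory for repeated scalar work.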