BinaryClust2: An Integrated Computational Pipeline for Systematic Mining of Mass Cytometry Data to Assist Deep Immune Profiling in Haematological Research

Sun, Jing

Introduction: The key role of the immune system in modulating treatment efficacy and clinical outcomes in haematological diseases has been well recognised through years of flow cytometry studies. Mass cytometry (MC) is a next-generation cytometry platform that simultaneously profiles over 40 protein markers at single-cell resolution, facilitating clinical research on immune monitoring and biomarker discovery. Nonetheless, the high-dimensional data generated present a daunting challenge for clinical investigators. We therefore describe BinaryClust2 (https://github.com/JingAnyaSun/BinaryClust2), an R-based computational framework that integrates state-of-the-art algorithms with a novel cell lineage characterisation method to assist comprehensive exploration of liquid MC data.

Method: Here, we present the application of BinaryClust2 to three example MC datasets: an in-house dataset of peripheral blood mononuclear cells (PBMCs) from 9 myeloproliferative neoplasm (MPN) patients at baseline (~4 million cells), previously published data from 11 MPN patients receiving influenza vaccine (~0.2 million cells), and a published dataset of 59 COVID-19 patients and 23 healthy donors (~2 million cells). BinaryClust2 has a streamlined analytical workflow comprising the following steps (Figure 1A):

Step 1: Quality control, batch effect evaluation and correction

Data quality control is a separate step performed before downstream analysis and includes diagnostic plots and batch effect examination. The CytofRUV and CytoNorm algorithms are available in the pipeline to remove unwanted variation caused by batch effects.

Step 2: Semi-supervised identification of main cell types

BinaryClust2 adopts a knowledge-based semi-supervised approach to predict main cell types. Users provide a simple marker expression matrix of pre-defined cell types, along with FCS files and metadata, to construct a SingleCellExperiment (SCE) object; the embedded algorithm then classifies cell populations automatically, without manual annotation.
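The idea of classifying cells from a marker expression matrix can be illustrated with a minimal Python sketch. It is not the BinaryClust2 implementation (which is written in R and whose internals are not described here). It assumes that each marker is split into negative and positive cells by a simple one-dimensional two-means threshold, and that a cell is assigned to a type only when its binary profile matches exactly one row of the matrix. The marker panel, the cell types, the expression values and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical marker matrix: +1 = must be positive, -1 = must be negative,
# 0 = marker is ignored for that cell type.
MARKERS = ["CD3", "CD19", "CD14"]
CELL_TYPES = {
    "T cells": {"CD3": 1, "CD19": -1, "CD14": -1},
    "B cells": {"CD3": -1, "CD19": 1, "CD14": -1},
    "Monocytes": {"CD3": -1, "CD19": -1, "CD14": 1},
}


def two_means_threshold(values, iterations=20):
    """Split one marker into low/high groups with 1-D two-means clustering.

    Returns the midpoint between the two group means (an assumed
    binarisation rule, chosen only for illustration).
    """
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        cut = (lo + hi) / 2
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        if not low or not high:
            break
        lo, hi = mean(low), mean(high)
    return (lo + hi) / 2


def classify(cells):
    """Label each cell (a dict of marker -> expression) using the matrix."""
    thresholds = {m: two_means_threshold([c[m] for c in cells]) for m in MARKERS}
    labels = []
    for cell in cells:
        # Binarise the cell: +1 above the marker threshold, -1 otherwise.
        state = {m: 1 if cell[m] > thresholds[m] else -1 for m in MARKERS}
        matches = [
            name
            for name, rule in CELL_TYPES.items()
            if all(want == 0 or state[m] == want for m, want in rule.items())
        ]
        # Cells matching no row, or more than one row, stay unclassified.
        labels.append(matches[0] if len(matches) == 1 else "Unclassified")
    return labels


# Toy, already-transformed expression values (made up for the example).
cells = [
    {"CD3": 5.1, "CD19": 0.2, "CD14": 0.1},
    {"CD3": 4.8, "CD19": 0.3, "CD14": 0.4},
    {"CD3": 0.2, "CD19": 4.9, "CD14": 0.3},
    {"CD3": 0.1, "CD19": 0.2, "CD14": 5.5},
    {"CD3": 0.3, "CD19": 0.1, "CD14": 0.2},
]
labels = classify(cells)
print(labels)
```

In this toy input the last cell is negative for every marker, so it matches no row of the matrix and is left unclassified rather than forced into a lineage.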

Step 3: In-depth interrogation and differential testing

Specific populations can then be extracted from the full set of cells and subjected to in-depth exploration using unsupervised algorithms for subpopulation discovery. BinaryClust2 offers the dimensionality reduction tools UMAP and t-SNE, the unsupervised clustering methods PhenoGraph and FlowSOM, and a range of data visualisation functions. For statistical analysis, comparison of cell abundance and functional marker expression across multiple study groups (n>2) is supported via the Kruskal-Wallis test with multiple testing correction and post hoc analysis.
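The statistical step can be sketched in plain Python for the three-group case. This is an illustration, not the code used in BinaryClust2: the Benjamini-Hochberg procedure is an assumption, since the correction method is not named above; the p-value uses the chi-square approximation, which has a closed form only because three groups give two degrees of freedom; and the abundance values are invented. Post hoc analysis is not shown.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_3groups(a, b, c):
    """Tie-corrected H statistic and p-value for exactly three groups."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    # Chi-square survival function with df = 2 is exp(-h / 2).
    return h, math.exp(-h / 2)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented abundances (% of cells) of one cell type in three study groups.
healthy = [1.0, 2.0, 3.0, 4.0]
mild = [5.0, 6.0, 7.0, 8.0]
severe = [9.0, 10.0, 11.0, 12.0]
h_stat, p_value = kruskal_wallis_3groups(healthy, mild, severe)

# One p-value per cell type would be collected, then corrected together.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(h_stat, p_value, adjusted)
```

In practice one test is run per cell type or per functional marker, and the correction is applied across the whole set of resulting p-values rather than to each test in isolation.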

Results: The performance of the semi-supervised classification function was tested independently on the MPN and influenza PBMC datasets. Seven main cell lineages were identified with accuracy comparable to manual gating by human experts (Figure 1B): the average F-measure reached 0.93 and 0.98, respectively. Moreover, taking manual gating as the ground truth, BinaryClust2 outperformed the unsupervised approach FlowSOM in both accuracy (F-measure: 0.93 vs. 0.70) and speed (140 s vs. 339 s) on the MPN dataset, and matched the well-performing semi-supervised approach LDA in accuracy (F-measure: 0.93 vs. 0.93) with a shorter runtime (140 s vs. 595 s), as shown in Table 1. Application to the COVID-19 dataset of Chevrier et al. produced reproducible results and additional discoveries. Thirteen main cell types were characterised; the abundance of B cells, basophils, cDCs, DN T cells, monocytes, neutrophils, NK cells, pDCs and CD8 T cells differed significantly (all P<0.05) among study conditions (healthy, mild COVID-19, severe COVID-19). We also grouped markers reflecting the functional status of immune cells and found that Granzyme B expression was significantly increased in the majority of main immune cell types in COVID-19 patients. PhenoGraph was further applied to neutrophils and monocytes and returned 14 and 16 subsets, respectively.
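For readers unfamiliar with the F-measure used above, the following Python sketch shows how a per-lineage F-measure and its average can be computed against manual gating labels. The unweighted (macro) average is an assumption, as the averaging scheme is not stated here, and the labels are invented toy data rather than values from the study.

```python
def f_measure_per_class(truth, predicted):
    """F-measure (harmonic mean of precision and recall) for each true label."""
    pairs = list(zip(truth, predicted))
    scores = {}
    for label in sorted(set(truth)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Invented labels: manual gating (reference) vs. automated classification.
manual = ["T", "T", "T", "B", "B", "NK"]
automated = ["T", "T", "B", "B", "B", "NK"]

scores = f_measure_per_class(manual, automated)
average_f = sum(scores.values()) / len(scores)  # unweighted (macro) average
print(scores, average_f)
```

Here one T cell is mislabelled as a B cell, which lowers recall for T cells and precision for B cells while leaving NK cells unaffected.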

Conclusion: Overall, BinaryClust2 incorporates experts' prior biological knowledge in a semi-supervised fashion to accurately deconvolute well-defined main cell lineages, while preserving the potential of unsupervised approaches to discover novel cell subsets and providing a user-friendly toolset that lowers the analytical barrier to high-dimensional immune profiling.
