Asterisk * with author name denotes a non-ASH member

2954 MapBatch: Conservative Batch Normalization for Single-Cell RNA-Sequencing Data Enables Discovery of Rare Cell Populations in a Multiple Myeloma Cohort

Program: Oral and Poster Abstracts
Session: 803. Emerging Diagnostic Tools and Techniques: Poster II
Hematology Disease Topics & Pathways:
Artificial Intelligence, Bioinformatics, Diseases, Computational Biology, Emerging Technologies, Myeloid Malignancies, Genomic Profiling, Technology and Procedures, Machine Learning
Sunday, December 12, 2021, 6:00 PM-8:00 PM

Chern Han Yong1*, Shawn Hoon, Ph.D2*, Sanjay De Mel, BSc (Hons), MRCP, FRCPath3*, Stacy Xu, Ph.D4*, Jonathan Adam Scolnick5*, Xiaojing Huo, Ph.D4*, Michael Lovci, Ph.D4*, Wee Joo Chng, MB ChB, PhD, FRCP(UK), FRCPath, FAMS6,7,8 and Limsoon Wong, Ph.D1*

1School of Computing, National University of Singapore, Singapore, Singapore
2Molecular Engineering Lab (MEL), Institute of Molecular and Cell Biology (IMCB), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
3Department of Haematology-Oncology, National University Cancer Institute Singapore, Singapore, Singapore
4Proteona Pte Ltd, Singapore, Singapore
5Healthy Longevity Translational Research Programme, Department of Physiology, National University of Singapore, Singapore, Singapore
6Department of Hematology-Oncology, National University Cancer Institute of Singapore, National University Health System, Singapore, Singapore
7Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
8Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore

Introduction

Many cancers involve rare cell populations that may be found in only a subset of patients. Single-cell RNA sequencing (scRNA-seq) can identify distinct cell populations across multiple samples, with batch normalization used to reduce processing-based effects between samples. However, aggressive normalization obscures rare cell populations, which may be erroneously grouped with other cell types. There is a need for conservative batch normalization that maintains the biological signal necessary to detect rare cell populations.

MapBatch

We designed a batch normalization tool, MapBatch, based on two principles: an autoencoder trained with a single sample learns the underlying gene expression structure of cell types without batch effect; and an ensemble model combines multiple autoencoders, allowing the use of multiple samples for training.

Each autoencoder is trained on one sample, learning a projection into the biological space S representing the real expression differences between cells in that sample (Figure 1a, middle). When other samples are projected into S, the projection reduces expression differences orthogonal to S, while preserving differences along S. The reverse projection transforms the data back into gene space at the autoencoder’s output, without the expression differences orthogonal to S (Figure 1a, right). Since batch-based technical differences are not represented in S, this transformation selectively removes batch effect between samples while preserving biological signal. The autoencoder output thus represents normalized expression data, conditioned on the training sample.
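The projection argument above can be illustrated with a minimal sketch. It assumes a linear autoencoder (equivalent to PCA) as a stand-in for MapBatch's actual network, toy data in which cells vary along two hypothetical biological axes, and a batch shift constructed to be exactly orthogonal to those axes. None of the names or sizes below come from MapBatch itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells, k = 50, 200, 2

# Hypothetical biological axes: k orthonormal directions in gene space.
basis, _ = np.linalg.qr(rng.normal(size=(n_genes, k)))  # shape (n_genes, k)

def make_sample(shift):
    """Cells vary along the biological axes; `shift` is added to every cell."""
    latent = rng.normal(size=(n_cells, k)) * 3.0
    return latent @ basis.T + shift

# Training sample without batch shift.
train = make_sample(np.zeros(n_genes))

# Second sample with a batch shift made orthogonal to the biological axes.
raw_shift = rng.normal(size=n_genes)
batch_shift = raw_shift - basis @ (basis.T @ raw_shift)
other = make_sample(batch_shift)

# "Train" the linear autoencoder: top-k principal directions of the training sample.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
encoder = vt[:k].T  # gene space -> S
decoder = vt[:k]    # S -> gene space

def normalize(x):
    """Project into S, then back into gene space."""
    return (x - mean) @ encoder @ decoder + mean

def residual_shift(x):
    """Size of the mean offset from the training sample along the batch direction."""
    unit = batch_shift / np.linalg.norm(batch_shift)
    return abs((x.mean(axis=0) - train.mean(axis=0)) @ unit)

before = residual_shift(other)
after = residual_shift(normalize(other))
```

In this toy setting `before` equals the norm of the batch shift and `after` is zero up to floating-point error, while `normalize(other)` matches the second sample with the shift subtracted, so differences along S are untouched. Real batch effects are not exactly orthogonal to S, which is why the abstract says the projection reduces rather than eliminates such differences.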

To incorporate multiple samples into training, MapBatch uses an ensemble of autoencoders, each trained with a single sample (Figure 1b). We train with the minimal number of samples necessary to cover the different cell populations in the dataset. We implement regularization using dropout and noise layers, and an a priori feature extraction layer using KEGG gene modules. The autoencoders’ outputs are concatenated for downstream analysis. For visualization and clustering, we use the top principal components of the concatenated outputs. For differential expression (DE), we run DE on the gene matrix output by each model, then take the result with the lowest P-value.
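The two downstream steps described above (principal components of the concatenated outputs, and the lowest P-value across models for DE) might look roughly like the following sketch. The function names, array shapes, and toy values are illustrative assumptions, and the per-model DE test that produces the P-values is not shown.

```python
import numpy as np

def top_principal_components(outputs, n_components):
    """Concatenate per-model outputs gene-wise and return cell scores on the top PCs."""
    concatenated = np.concatenate(outputs, axis=1)  # (cells, models * genes)
    centered = concatenated - concatenated.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def combine_de_pvalues(pvalues_per_model):
    """For each gene, keep the lowest P-value across models and the model that gave it."""
    stacked = np.vstack(pvalues_per_model)  # (models, genes)
    return stacked.min(axis=0), stacked.argmin(axis=0)

# Toy stand-ins for three autoencoder outputs (100 cells x 30 genes each).
rng = np.random.default_rng(0)
outputs = [rng.normal(size=(100, 30)) for _ in range(3)]
scores = top_principal_components(outputs, n_components=10)

# Toy per-gene P-values from two models, for three genes.
pvals = [np.array([0.04, 0.50, 0.001]), np.array([0.20, 0.03, 0.01])]
combined, best_model = combine_de_pvalues(pvals)
```

Here `scores` has one row per cell and one column per retained component, and `combined` holds the per-gene minimum. The abstract does not say whether any multiple-testing correction is applied after taking the minimum, so none is shown.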

To test MapBatch, we generated a synthetic dataset based on 7 batches of publicly available PBMC data. For each batch, we simulated a rare cell population by selecting one of three cell types and up- or down-regulating 40 genes in 0.5%-2% of the cells (Figure 1c). We simulated additional batch effect by scaling each gene in each batch with a scaling factor. Upon visualization and clustering, cells grouped largely by batch (Figure 1d). After batch normalization, cells grouped by cell type rather than batch, and all three perturbed cell populations were successfully delineated (Figure 1e). DE between each perturbed population and its mother cells accurately retrieved the perturbed genes, showing that normalization maintained real expression differences (Figure 1e). In contrast, the three other methods tested, Seurat (Stuart et al., 2019), Harmony (Korsunsky et al., 2019), and Liger (Welch et al., 2019), could recover only a subset of the perturbed populations (Figures 1f-h).
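A simulation of this kind can be sketched as follows. The abstract does not give the fold change, the split between up- and down-regulated genes, the range of scaling factors, or whether the 0.5%-2% is measured against the batch or the cell type. The sketch assumes a 2-fold change, an even split, uniform scaling factors between 0.5 and 2, and a fraction of all cells in the batch. All function names are hypothetical, and random toy counts stand in for the PBMC data.

```python
import numpy as np

def perturb_rare_population(expr, labels, cell_type, rng,
                            fraction=0.01, n_genes=40, fold=2.0):
    """Perturb n_genes in a small random subset of one cell type's cells."""
    expr = expr.astype(float)  # astype returns a copy, so the input is untouched
    candidates = np.flatnonzero(labels == cell_type)
    n_rare = max(1, int(round(fraction * expr.shape[0])))
    rare_cells = rng.choice(candidates, size=n_rare, replace=False)
    genes = rng.choice(expr.shape[1], size=n_genes, replace=False)
    up, down = genes[: n_genes // 2], genes[n_genes // 2:]
    expr[np.ix_(rare_cells, up)] *= fold    # assumed up-regulation
    expr[np.ix_(rare_cells, down)] /= fold  # assumed down-regulation
    return expr, rare_cells, genes

def add_batch_effect(expr, rng, low=0.5, high=2.0):
    """Scale every gene in the batch by its own random factor."""
    scale = rng.uniform(low, high, size=expr.shape[1])
    return expr * scale, scale

# Toy batch: 1000 cells x 200 genes, strictly positive values, three cell types.
rng = np.random.default_rng(0)
expr = rng.poisson(5.0, size=(1000, 200)).astype(float) + 1.0
labels = rng.integers(0, 3, size=1000)

perturbed, rare_cells, genes = perturb_rare_population(expr, labels, cell_type=1, rng=rng)
batch, scale = add_batch_effect(perturbed, rng)
```

With `fraction=0.01`, 10 of the 1000 cells are perturbed, all drawn from cell type 1. Repeating the two calls per batch with a fresh `scale` vector gives each batch its own gene-wise distortion.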

MapBatch identifies rare populations in multiple myeloma (MM)

We used MapBatch to process bone marrow scRNA-seq data from 14 MM samples and 2 healthy controls. After batch normalization, unsupervised clustering identified 20 clusters, which we annotated using MapCell (Koh & Hoon, 2019) (Figures 2a, 2b). We identified 3 small clusters of cells that could not be reliably annotated, comprising less than 1% of total cells and found in only a subset of patients (Figures 2c, 2d). As validation, we observed that these cells formed distinct clusters in individual samples when using their uncorrected expression data, providing evidence that these clusters were driven neither by batch effect nor by MapBatch (Figure 2e).

Conclusion

Batch normalization of scRNA-seq data involves a trade-off between minimizing batch effect and maximizing the remaining biological signal. While most methods lean towards the former, MapBatch maintains more biological signal for downstream analysis, enabling the discovery of cell populations that were previously difficult to find.

Disclosures: Xu: Proteona Pte Ltd: Current Employment. Scolnick: Proteona Pte Ltd: Current holder of individual stocks in a privately-held company. Huo: Proteona Pte Ltd: Ended employment in the past 24 months. Lovci: Proteona Pte Ltd: Current Employment. Chng: Amgen: Honoraria, Research Funding; Abbvie: Honoraria; Janssen: Honoraria, Research Funding; Novartis: Honoraria; Celgene: Honoraria, Research Funding.
