-Author name in bold denotes the presenting author
-Asterisk * with author name denotes a Non-ASH member
4513 Machine Learning Can Outperform Ann Arbor Staging in Predicting Survival in Patients with Diffuse Large B-Cell Lymphoma: Analysis of a Large National Cancer Database

Program: Oral and Poster Abstracts
Session: 627. Aggressive Lymphomas: Clinical and Epidemiological: Poster III
Hematology Disease Topics & Pathways:
artificial intelligence (AI), Technology and Procedures, machine learning
Monday, December 11, 2023, 6:00 PM-8:00 PM

Madhan Srinivasan Kumar, MD1, Veena Gujju, MD2*, Ji Hwan Park, PhD3*, Debra Hogue, M.S.3*, Abdul Rafeh Naqash, MD4* and Taha Mahdi Salih Al-Juhaishi, MD5

1Internal Medicine, Saint Vincent Hospital, Worcester, MA
2Department of Medicine – Section of Hematology and Medical Oncology, Baylor College of Medicine, Houston, TX
3School of Computer Science, University of Oklahoma, Norman, OK
4Hematology and Medical Oncology, TSET Phase 1 Program, University of Oklahoma Health Sciences Center - Stephenson Cancer Center, Oklahoma City, OK
5Hematology and Medical Oncology, Stem Cell Transplantation and Cellular Therapy Program, University of Oklahoma Health Sciences Center - Stephenson Cancer Center, Oklahoma City, OK


Diffuse large B-cell lymphoma (DLBCL) is the most common lymphoma worldwide and usually follows an aggressive clinical course. The Ann Arbor staging system and the International Prognostic Index (IPI), both commonly used in clinical practice for risk stratification, have known limitations. Machine learning (ML) has emerged as a promising tool for more comprehensive and deeper data analysis. We sought to evaluate the ability of ML to predict survival in DLBCL, compared with the Ann Arbor staging system, using a large national database.


We applied the ML algorithm XGBoost to the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) database to predict overall survival (OS) and lymphoma-specific survival (LSS). For prediction analysis, we transformed the survival labels into a simple Boolean format: "alive" was represented as 0, "dead" as 1, and "dead (attributable to this cancer diagnosis)" also as 1. We used one-hot encoding to convert categorical features into binary vectors. The data set was divided into two parts: training (80%) and test (20%). We further split the training set into training and validation subsets using stratified 5-fold cross-validation. Hyperparameter optimization was performed on the validation set. The model drew on a broad range of attributes for its predictions. To understand how each attribute contributed to the predictions, we calculated its importance score in XGBoost.
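The preprocessing steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the field names, the exact vital-status strings, and the helper names (`binarize_status`, `one_hot_encode`, `train_test_indices`, `stratified_k_folds`) are assumptions made for illustration, and the XGBoost fitting step itself is left out.

```python
import random
from collections import defaultdict

# Assumed vital-status strings; the actual SEER field values may differ.
LABEL_MAP = {
    "alive": 0,
    "dead": 1,
    "dead (attributable to this cancer diagnosis)": 1,
}


def binarize_status(status):
    """Map a vital-status string to the 0/1 label described in the text."""
    return LABEL_MAP[status.strip().lower()]


def one_hot_encode(records, categorical_fields):
    """Expand each categorical field into one 0/1 indicator column per level."""
    levels = {f: sorted({r[f] for r in records}) for f in categorical_fields}
    columns = [f"{f}={v}" for f in categorical_fields for v in levels[f]]
    rows = [
        [1 if r[f] == v else 0 for f in categorical_fields for v in levels[f]]
        for r in records
    ]
    return columns, rows


def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def stratified_k_folds(indices, labels, k=5, seed=0):
    """Yield (train, validation) index lists with class balance kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    for i in range(k):
        validation = sorted(folds[i])
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield training, validation
```

In this sketch the 80/20 split is a plain random split, since the text specifies stratification only for the cross-validation step. Hyperparameters would be compared on each validation fold, with the 20% test set kept aside until the final evaluation.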


A total of 64,912 patients with DLBCL were identified and their data were extracted. The majority were Caucasian (78.9%), and the median age group was 60 to 69 years. The model predicted OS and LSS with areas under the curve (AUC) of 0.89 and 0.75, respectively (Figure 1). Factors selected by the model for survival prediction included the presence or absence of B symptoms, treatment status, and disease stage. For both OS and LSS, the model found B symptoms to be the highest contributing factor, with importance scores of 0.205 and 0.167, respectively. Other important factors incorporated by the model included age and stage IV for OS, and stage IV and clinically asymptomatic status for LSS. The least important factors were the location of the primary lymphoma site and the year of diagnosis (Table 1).
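For readers less familiar with the AUC metric reported above, the following minimal sketch shows one way it can be computed, assuming the AUC refers to the receiver operating characteristic curve (the abstract does not say so explicitly). The function name `roc_auc` and the toy labels and scores are hypothetical and are not study data.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = dead, 0 = alive, scores = predicted probability of death.
toy_labels = [0, 1, 0, 1, 1]
toy_scores = [0.2, 0.9, 0.6, 0.5, 0.7]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of patients and is meant only to make the definition concrete; on a data set of this size a rank-based implementation would be the practical choice. An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.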


Machine learning tools can help predict survival in patients with DLBCL and are able to challenge current staging systems. Our results warrant validation in future prospective studies.

Disclosures: No relevant conflicts of interest to declare.
