-Author name in bold denotes the presenting author
-Asterisk * with author name denotes a Non-ASH member

3592 Why Implementing Machine Learning Algorithms in Clinical Routine Is Not a Plug-and-Play Operation: A Simulation Study of a Recently Published ML Algorithm for Acute Leukemia Subtype Diagnosis

Program: Oral and Poster Abstracts
Session: 803. Emerging Tools, Techniques, and Artificial Intelligence in Hematology: Poster II
Hematology Disease Topics & Pathways:
Research, Lymphoid Leukemias, ALL, Acute Myeloid Malignancies, AML, Clinical Practice (Health Services and Quality), APL, Clinical Research, Diseases, Real-world evidence, Lymphoid Malignancies, Myeloid Malignancies
Sunday, December 8, 2024, 6:00 PM-8:00 PM

Christopher M. Sauer, MD, PhD, MPH1,2*, Till Rostalski, MD2*, Jens Kleesiek, MD, PhD2*, Hans Christian Reinhardt, MD3, Felix Nensa, MD2* and Gernot Pucher, MSc1,2*

1Department of Hematology & Stem Cell Transplantation, University Hospital Essen, Essen, Germany
2Institute for AI in Medicine, University Hospital Essen, Essen, Germany
3Department of Hematology and Stem Cell Transplantation, West German Cancer Center and German Cancer Consortium (DKTK partner site Essen), University Hospital Essen, University of Duisburg-Essen, Essen, Germany

Background

Artificial intelligence (AI) and machine learning (ML) algorithms have shown great promise in clinical medicine, offering potential improvements in diagnostic accuracy and patient outcomes. Despite the increasing number of published algorithms, most remain unvalidated in real-world clinical settings. This study aims to simulate the practical implementation of a recently published ML algorithm (Alcazer et al. Lancet Digit Health, 2024; AI-PAL) and to identify potential challenges. AI-PAL was designed to diagnose acute leukemia (AL) from laboratory values alone, thereby potentially decreasing diagnostic turnaround time from days to hours.

Methods

We conducted a stepwise simulation of a clinical implementation of the AI-PAL algorithm at the University Hospital Essen (UME). Using our clinical FHIR research database, we identified all patients with an initial diagnosis of acute leukemia, together with selected differential diagnoses. To reproduce the results published by Alcazer et al. on the extracted UME cohort, we implemented an automated prediction pipeline in R using the publicly available AI-PAL code. The algorithm's performance was assessed using the area under the receiver operating characteristic curve (AUROC). The confidence of predictions was assessed using the AI-PAL cutoff thresholds for positive predictive value (PPV) and negative predictive value (NPV). Based on the predicted probability for each AL type, patients were assigned to confident positive, confident negative, or uncertain diagnosis classes. Classification performance was improved by recalibrating the published thresholds. This research was covered by a waiver from the Medical Ethics Committee of the University of Duisburg-Essen (23-11573-BO).
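The confidence-banding step described above can be sketched as follows. This is an illustrative reimplementation in Python, not the published R code of AI-PAL; the function name and the cutoff values in the example are hypothetical placeholders, not the published thresholds.

```python
# Hedged sketch of assigning a prediction to a confidence band, as described
# in the Methods: a predicted probability above an upper cutoff (chosen to
# meet a target PPV) is a confident positive, below a lower cutoff (chosen to
# meet a target NPV) a confident negative, and anything in between uncertain.
# Cutoff values below are illustrative, not AI-PAL's published thresholds.

def classify_with_confidence(prob: float, upper: float, lower: float) -> str:
    """Map a predicted class probability for one AL subtype to a band."""
    if prob >= upper:
        return "confident positive"
    if prob <= lower:
        return "confident negative"
    return "uncertain"

# Example with illustrative cutoffs of 0.85 (upper) and 0.10 (lower):
print(classify_with_confidence(0.92, 0.85, 0.10))  # confident positive
print(classify_with_confidence(0.40, 0.85, 0.10))  # uncertain
```

Applied per subtype (ALL, AML, APL), this yields the "certainly X" / "certainly not X" / uncertain labels referred to in the Results.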

Results

A total of 20,283 hospital encounters with an AL diagnosis at UME were identified, of which 545 were inpatients with an initial, untreated diagnosis of AL. The frequency of acute leukemia types and the variable distributions differed between the UME and AI-PAL cohorts: for example, AML was more common (78.5% vs. 53.0%) and monocyte counts were lower (0.1 G/L at UME vs. 0.5 G/L in AI-PAL). The AI-PAL algorithm demonstrated significantly lower performance in our simulated clinical implementation than in the published results. The AUROC dropped to 0.667 (95% CI: 0.606-0.725) for acute lymphoblastic leukemia and to 0.710 (95% CI: 0.654-0.762) for acute myeloid leukemia. Based on the certainty thresholds provided in the original publication, not a single ALL case at UME was classified as “certainly ALL” (N=0/104), while 1 ALL patient was classified as “certainly APL” and 9 ALL patients as “certainly AML”. Overall, robustness to differential diagnoses of AL was low, with 11.1% (N=2/18) of mantle cell lymphoma cases and 7.7% (N=3/39) of myelodysplastic syndrome cases being misclassified as AML.

Recalibration of the diagnosis thresholds improved classification certainty. For instance, 15.4% of ALL cases were now classified as “certainly ALL” (previously 0%), and a higher proportion of patients were labelled as “certainly not APL” (86.5%, previously 65.4%) and “certainly not AML” (20.4%, previously 0%).
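One common way to recalibrate such thresholds on a local cohort is to scan candidate cutoffs and pick the least restrictive one that still meets a target PPV on locally labelled data. The sketch below illustrates that generic approach; the function name, the toy data, and the target PPV are hypothetical, and the abstract does not specify the exact recalibration procedure used.

```python
# Illustrative local recalibration: given predicted probabilities and true
# labels (1 = subtype present) from the local cohort, return the smallest
# "confident positive" cutoff whose PPV on that cohort reaches the target.
# This is a generic sketch, not the procedure reported in the abstract.

def recalibrate_upper_cutoff(probs, labels, target_ppv=0.95):
    """Smallest cutoff with PPV >= target_ppv on (probs, labels), else None."""
    for cutoff in sorted(set(probs)):  # scan cutoffs in ascending order
        predicted_pos = [y for p, y in zip(probs, labels) if p >= cutoff]
        if predicted_pos and sum(predicted_pos) / len(predicted_pos) >= target_ppv:
            return cutoff
    return None  # no cutoff reaches the target PPV on this cohort

# Toy example: the cutoff 0.9 is the first to reach a PPV of 0.9
probs = [0.2, 0.4, 0.6, 0.7, 0.9, 0.95]
labels = [0, 0, 1, 0, 1, 1]
print(recalibrate_upper_cutoff(probs, labels, target_ppv=0.9))  # 0.9
```

The analogous scan in descending order over the lower cutoff, targeting NPV, would recalibrate the "confident negative" band.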

Discussion

Based on these findings, it is too early to rely on laboratory values alone to differentiate subtypes of AL. Additional retraining and validation studies are likely required at most institutions before clinical use. The findings underscore the challenges of implementing ML algorithms in clinical practice: despite robust development and validation in research settings, ML models like AI-PAL may require significant adjustment and recalibration to maintain performance in different clinical settings. Our results suggest that clinical decision support algorithms should undergo local performance validation before integration into routine care to ensure reliability and safety. This work demonstrates the need for context-specific adjustments and prospective real-world evaluations.

Disclosures: Sauer: Pacmed: Consultancy; BMS: Honoraria. Reinhardt: CDL Therapeutics GmbH: Current equity holder in private company; Gilead: Research Funding; Merck: Consultancy, Honoraria; Vertex: Consultancy, Honoraria; Novartis: Consultancy, Honoraria; Janssen-Cilag: Consultancy, Honoraria; Roche: Consultancy, Honoraria; AstraZeneca: Consultancy, Honoraria, Research Funding; AbbVie: Consultancy, Honoraria. Nensa: Siemens Healthineers: Research Funding.
