Why Implementing Machine Learning Algorithms in Clinical Routine Is Not a Plug-and-Play Operation: A Simulation Study of a Recently Published ML Algorithm for Acute Leukemia Subtype Diagnosis

Sauer, Christopher M.

Background

Artificial intelligence (AI) and machine learning (ML) algorithms have shown great promise in clinical medicine, offering potential improvements in diagnostic accuracy and patient outcomes. Despite the increasing number of published algorithms, most remain unvalidated in real-world clinical settings. This study aims to simulate the practical implementation of a recently published ML algorithm, AI-PAL (Alcazer et al., Lancet Digit Health, 2024), and to identify potential challenges. AI-PAL was designed to diagnose acute leukemia (AL) from laboratory values alone, thereby potentially decreasing diagnostic turnaround time from days to hours.

Methods

We conducted a stepwise simulation of a clinical implementation of the AI-PAL algorithm at the University Hospital Essen (UME). Using our clinical FHIR research database, we identified all patients with newly diagnosed acute leukemia and selected differential diagnoses. To reproduce the results published by Alcazer et al. for the extracted UME cohort, an automated prediction pipeline using the publicly available code of AI-PAL was implemented in R. The algorithm's performance was assessed using the area under the receiver operating characteristic curve (AUROC). The confidence of predictions was assessed using the AI-PAL cutoff thresholds for positive predictive value (PPV) and negative predictive value (NPV). Based on the predicted probabilities for each AL type, patients were assigned to one of three classes: confident positive diagnosis, confident negative diagnosis, or uncertain diagnosis. To improve classification performance, the published thresholds were recalibrated. A waiver from the Medical Ethics Committee of the University Duisburg Essen (23-11573-BO) is applicable to this research.
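The two assessment steps described above can be illustrated with a minimal Python sketch. It is not the authors' R pipeline or the AI-PAL code: the function names, the example scores, and the cutoff values (0.8 and 0.3) are hypothetical, and the use of inclusive comparisons (`>=`, `<=`) at the cutoffs is an assumption.

```python
def auroc(labels, scores):
    """AUROC computed as the Mann-Whitney probability that a randomly chosen
    positive case scores higher than a randomly chosen negative case
    (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def assign_class(prob, ppv_cutoff, npv_cutoff):
    """Map one predicted probability for a given AL type to a confidence class.

    ppv_cutoff: probability at or above which the diagnosis is treated as
                confident positive (hypothetical value in the example below).
    npv_cutoff: probability at or below which the diagnosis is treated as
                confident negative (hypothetical value in the example below).
    """
    if prob >= ppv_cutoff:
        return "confident positive"
    if prob <= npv_cutoff:
        return "confident negative"
    return "uncertain"


# Hypothetical toy data: 1 = patient has the AL type in question, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]

toy_auroc = auroc(labels, scores)
toy_classes = [assign_class(s, ppv_cutoff=0.8, npv_cutoff=0.3) for s in scores]
```

In this sketch each AL type would be evaluated one-vs-rest, with its own pair of cutoffs; a patient whose probability falls between the two cutoffs stays in the uncertain class.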

Results

A total of 20,283 hospital encounters with an AL diagnosis at UME were identified, of which 545 were inpatients with an initial and untreated diagnosis of AL. The frequency of acute leukemia types and the variable distributions differed between UME and AI-PAL; for example, AML was more common (78.5% vs. 53.0%) and monocyte counts were lower (0.1 G/L at UME vs. 0.5 G/L in AI-PAL). The AI-PAL algorithm demonstrated significantly lower performance in our simulated clinical implementation compared to the published results. The AUROC for acute lymphoblastic leukemia (ALL) dropped to 0.667 (95% CI: 0.606-0.725) and for acute myeloid leukemia (AML) to 0.710 (95% CI: 0.654-0.762). Based on the certainty thresholds provided in the original publication, not a single ALL case at UME was classified as “certainly ALL” (N=0/104), while 1 ALL patient was classified as “certainly APL” and 9 ALL patients as “certainly AML”. Overall, robustness to differential diagnoses of AL types was low, with 11.1% (N=2/18) of mantle cell lymphoma cases and 7.7% (N=3/39) of myelodysplastic syndrome cases being misclassified as AML.

Recalibration of diagnosis thresholds improved classification certainty. For instance, 15.4% of ALL cases were now classified as “certainly ALL” (previously 0%), and a higher proportion of patients were labelled as “certainly not APL” (86.5%, previously 65.4%) and “certainly not AML” (20.4%, previously 0%).
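The abstract does not state how the thresholds were recalibrated. One plausible approach, shown here purely as an assumption, is to pick on local data the lowest probability cutoff at which the positive predictive value still reaches a chosen target. The function name, the toy data, and the target of 0.75 are hypothetical.

```python
def recalibrate_ppv_cutoff(labels, scores, target_ppv):
    """Return the smallest cutoff t such that, among local cases with
    score >= t, the observed PPV is at least target_ppv.
    Returns None if no cutoff reaches the target."""
    best = None
    # Candidate cutoffs are the observed scores, from highest to lowest.
    for t in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(labels, scores) if s >= t]
        ppv = sum(flagged) / len(flagged)  # flagged is never empty here
        if ppv >= target_ppv:
            best = t  # keep lowering the cutoff while the target is met
    return best


# Hypothetical local cohort: 1 = has the AL type, 0 = does not.
local_labels = [1, 1, 0, 1, 0, 0]
local_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

new_cutoff = recalibrate_ppv_cutoff(local_labels, local_scores, target_ppv=0.75)
```

A cutoff for the negative predictive value could be derived in the same way from the lower end of the score range. Cutoffs chosen like this would still need to be checked on data not used to select them, since they are fitted to the local cohort.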

Discussion

Based on these findings, it is too early to rely on laboratory values alone to differentiate subtypes of AL. Additional retraining and validation studies are likely required at most institutions before clinical use. The findings underscore the challenges of implementing ML algorithms in clinical practice. Despite robust development and validation in research settings, ML models like AI-PAL may require significant adjustments and recalibration to maintain performance in different clinical settings. Our results suggest that clinical decision support algorithms should undergo local performance validation before integration into routine care to ensure reliability and safety. This work argues for context-specific adjustments and prospective real-world evaluations.
