Evaluating Physician-AI Interaction for Multiple Myeloma Management: Paving the Path Towards Precision Oncology

Lam, Barbara

Introduction

The gap between clinical trials and real-world care continues to grow in the field of multiple myeloma (MM). Randomized controlled trials (RCTs) cannot account for every patient scenario. Modern clinical decision support systems (CDSSs) that leverage machine learning (ML) models offer an alternative path to precision oncology. However, little work has been done to understand how clinicians reconcile RCT data with ML data, particularly when the results are discordant.

Aim

We designed a CDSS that displays simulated survival and adverse event data from an RCT and an ML model, and conducted a survey study to evaluate how clinicians incorporate the available data to make treatment decisions for 12 patients with MM.

Methods

Participants were presented with varying combinations of RCT and ML results in increasing “tiers” of information for 12 patients (A-L) with MM (Table 1). In tier 1, participants were provided RCT data only. In tier 2, participants were given the outcomes of an ML model, and in tier 3, they were provided with information about how the model was trained and validated. At each tier, participants were asked to select a treatment (“red pill” or “blue pill”), rate their confidence in the selected treatment on a Likert scale from 1 to 10, and, when ML data were available, rate their perceived reliability of the model on a Likert scale from 1 to 10.

Participants were recruited from internal medicine and hematology/oncology departments via email between January and April 2023, and were offered a $50 Amazon gift card as incentive.

We used descriptive statistics to analyze respondent characteristics. For each scenario, we ran paired t-tests to compare the change in confidence and reliability between tier 2 and tier 1 (ML versus RCT data) and between tier 3 and tier 2 (ML data with versus without information about training and validation). We applied a Bonferroni correction to adjust the alpha level for significance. We also ran McNemar’s tests with a Bonferroni correction to assess the difference in the proportion of blue pill selections between tiers and thereby characterize the extent of treatment switching.
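The analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the abstract does not state which McNemar variant was used (the exact binomial form is assumed here), nor how many comparisons entered the Bonferroni correction (12, one per scenario, is assumed). The function names and all ratings and counts below are hypothetical.

```python
from math import comb, sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for ratings given by the same participants at two tiers."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: participants who chose red at the earlier tier and blue at the later tier
    c: participants who chose blue at the earlier tier and red at the later tier
    """
    n = b + c
    if n == 0:
        return 1.0  # nobody switched, so there is no evidence of a change
    k = min(b, c)
    # Under the null hypothesis, switches in either direction are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return alpha / n_tests


# Hypothetical confidence ratings (1 to 10) from six participants for one scenario.
tier1_confidence = [7, 8, 6, 9, 7, 8]
tier2_confidence = [5, 7, 6, 6, 5, 7]
t_stat, df = paired_t_statistic(tier1_confidence, tier2_confidence)

# Hypothetical switching counts: 12 moved from red to blue, 1 moved from blue to red.
p_switch = mcnemar_exact_p(12, 1)

# Assumed number of comparisons: one per scenario.
alpha = bonferroni_alpha(0.05, 12)
print(f"t = {t_stat:.2f} on {df} df; McNemar p = {p_switch:.4f}; adjusted alpha = {alpha:.4f}")
```

The t statistic would still need to be compared against a t distribution with the reported degrees of freedom to obtain a p-value; a statistics package would normally handle that step. A p-value counts as significant only if it falls below the adjusted alpha rather than the nominal 0.05.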

Results

A total of 284 physicians were invited to participate in the study and 32 participated, for a response rate of 11.3%. Half were internal medicine residents and half were hematology/oncology fellows and attendings. A majority were male (72.0%) and white (69.0%), and all were younger than 40 years of age.

For scenarios A-D, the patient met inclusion criteria for the RCT and was well represented in training data for the ML model. Confidence was highest for scenario A, where survival results between the RCT and ML model were concordant (Figure 1). Across scenarios B, C, and D, confidence dropped after seeing the ML results (B: p=0.05, C: p=0.05, D: p=0.36) but increased after participants learned the ML model was trained on patients like theirs (B: p=0.002, C: p=0.17, D: p=0.06). In scenarios C and D, a majority of participants switched treatment after learning the ML model showed no benefit with the red pill (C: p=5.1x10^-4, D: p=1.5x10^-5).

In scenarios E-H, the patient did not meet inclusion criteria for the RCT but was well represented in the training data for the ML model. When the ML model showed worse adverse events (F) or no benefit with the red pill (G, H), the majority of participants switched treatment choice (F: p=0.27, G: p=0.002, H: p=1.9x10^-4). Confidence increased when participants learned the ML model had been trained on representative patients (E: p=0.008, F: p=0.01, G: p=0.18, H: p=2.0x10^-3).

In scenarios I-L, the patient did not meet inclusion criteria for the RCT and was not well represented in the training data for the ML model. There was a decrease in confidence when participants learned this in tier 3 (I: p=0.18, J: p=3.0x10^-4, K: p=0.05, L: p=0.009). However, the majority of participants still switched to the blue pill when the ML model showed no benefit with the red pill (K: p=1.9x10^-4).

Conclusions

Confidence in treatment was highest when RCT and ML findings were concordant. Participants chose treatments based on ML model estimates even before assessing how the ML model was trained or validated. Participants preferred the treatment that demonstrated a survival benefit, regardless of whether it was supported by RCT data or an ML model.
