Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making

Civettini, Ivan

Introduction:

Large Language Models (LLMs) are a form of Artificial Intelligence (AI): by identifying patterns and connections within data, they can predict the most likely words or phrases in specific contexts. Previous studies have indicated that GPT (Generative Pre-trained Transformer; OpenAI) performs well in answering single-choice clinical questions. However, its performance appears less satisfactory on multiple-choice questions and more intricate clinical cases (Cosima et al. 2023 EAO; Cascella et al. 2023 J Med Syst). Notably, no study has evaluated LLM responses in the context of transplantation decision making, a complex process heavily reliant on physician expertise. Additionally, most studies have focused solely on GPT's performance, without considering competing LLMs such as Llama-2 or VertexAI. Our study aims to assess the performance of LLMs in the domain of hematopoietic stem cell transplantation.

Methods:

We modified and anonymized the clinical histories of six hematological patients. An experienced hematologist reviewed and validated these modified clinical histories, which included demographic data, past medical history, hematology disease features (genetic data and MRD when available), treatment responses, adverse events from previous therapies, and potential donor information (related/unrelated, HLA, CMV status).

We presented these clinical cases to six experienced bone marrow transplant physicians from two major JACIE-accredited hospitals and to 11 hematology residents from the University of Milano-Bicocca. The LLMs used for the analysis were GPT-4, VertexAI PaLM 2, and Llama-2 (13b and 70b). The LLMs were configured with different temperature settings, which control the randomness of token selection; temperatures were always kept low to obtain more deterministic responses.
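As an illustration of why low temperatures give more deterministic responses, the following is a minimal sketch of temperature-scaled softmax over next-token scores. The function name, the logits, and the temperature values are hypothetical and chosen only for illustration; the abstract does not report the settings actually used for each model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw token scores into probabilities, scaled by temperature.

    Illustrative helper (not from the study): lower temperatures
    concentrate probability on the highest-scoring token.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
example_logits = [2.0, 1.0, 0.0]
low_temp_probs = softmax_with_temperature(example_logits, 0.2)
high_temp_probs = softmax_with_temperature(example_logits, 1.0)
```

With these hypothetical scores, the top token receives a larger share of the probability mass at temperature 0.2 than at 1.0, which is the sense in which low temperatures make sampling closer to deterministic.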

A triple-blinded survey was conducted using Typeform, in which both senior hematologists and residents provided anonymized responses using personal tokens. The senior hematologists, the residents, and the LLM testers were unaware of the responses provided by the other groups. We defined the consensus answer (CoA) among experts as their most frequent response, and calculated Fleiss' kappa (K) and the overall percentage of agreement (OA) of residents and of LLMs with the CoA. Subsequently, OA and K values for residents and LLMs were compared using t-tests or Mann-Whitney tests with GraphPad v 10.0.1.
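The agreement measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis pipeline: the function names are hypothetical, ties in the consensus answer are resolved by first occurrence (the abstract does not say how ties were handled), and Fleiss' kappa is computed with the standard formula for a fixed number of raters per item.

```python
from collections import Counter


def consensus_answer(expert_answers):
    """Most frequent expert answer; ties go to the first one encountered."""
    return Counter(expert_answers).most_common(1)[0][0]


def overall_agreement(answers, consensus):
    """Percentage of a rater's answers that match the consensus, item by item."""
    matches = sum(a == c for a, c in zip(answers, consensus))
    return 100.0 * matches / len(consensus)


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of rater labels.

    Assumes every item was rated by the same number of raters (>= 2).
    Returns NaN when all ratings fall in one category (kappa undefined).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters assigning category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_categories)
    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data, for illustration only (not study data).
toy_consensus = consensus_answer(["MUD", "haplo", "MUD"])
toy_oa = overall_agreement(["yes", "no", "yes", "yes"], ["yes", "yes", "yes", "no"])
toy_kappa = fleiss_kappa([["a", "a", "a"], ["b", "b", "b"]], ["a", "b"])
```

In this toy data the three raters agree completely on both items while using both categories, so the kappa is 1. The per-question K values in the Results would require the actual survey responses, which are not reproduced here.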

Results:

The results showed perfect agreement among experts in patient transplant eligibility assessment (K=1.0) and substantial agreement in the choice of donors and conditioning regimens (K=0.62 for both questions). Only fair agreement was observed in the estimation of transplant-related mortality (TRM) (K=0.22).

The median OA and K values between residents and the CoA of experts were 76.5% (range 52.9-88.2%) and 0.61 (range 0.4-0.8), respectively. The median OA and K values between LLM answers and the CoA of experts were 58.8% (range 47-71%) and 0.45 (range 0.3-0.61), respectively. The mean OA and K values of residents were significantly higher than those of LLMs (p=0.02). Specifically, residents showed higher median OA and K values in patient eligibility assessment (median OA 100 vs. 83% and K 1 vs. 0.78; p=0.01). However, there was no significant difference in median K for donor choice (0.56 vs. 0.56), conditioning regimen (0.67 vs. 0.33), or TRM evaluation (0.33 vs. 0) (Table 1). The median K values of GPT-4, PaLM 2, Llama-2 13b, and Llama-2 70b were 0.49, 0.53, 0.33, and 0.53, respectively (Figure 1).

Conclusion:

Our study sheds light on the potential and limitations of LLMs in complex hematopoietic stem cell transplantation decision making. While LLMs showed promising results with a median OA of 59%, residents demonstrated superior performance. LLMs performed well in patient eligibility assessment and donor choice but showed shortcomings in conditioning regimen selection and TRM evaluation.

We did not use an expert rating scale to evaluate LLM responses, in order to avoid potential bias. However, it is important to note that the consensus answer, although the most frequent, does not necessarily imply that other responses provided by the experts were incorrect. Therefore, the lower consensus among experts in TRM evaluation, possibly due to the difficulty of precisely calculating TRM in a survey-based evaluation, calls for caution when evaluating resident and LLM answers in this setting.
