-Asterisk * with author name denotes a Non-ASH member

3726 Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making

Program: Oral and Poster Abstracts
Session: 903. Health Services and Quality Improvement – Myeloid Malignancies: Poster II
Hematology Disease Topics & Pathways:
Acute Myeloid Malignancies, AML, artificial intelligence (AI), Clinical Practice (Health Services and Quality), Workforce, Diseases, Myeloid Malignancies, emerging technologies, Technology and Procedures, natural language processing
Sunday, December 10, 2023, 6:00 PM-8:00 PM

Ivan Civettini, MD1,2*, Arianna Zappaterra, MD1,2*, Daniele Ramazzotti, PhD3*, Bianca Maria Granelli, MD1,2*, Giovanni Rindone, MD1,2*, Andrea Aroldi, MD4*, Stefano Bonfanti, MD1,2*, Federica Colombo, MD1,2*, Marilena Fedele, MD2*, Giovanni Grillo5*, Matteo Parma, MD2*, Paola Perfetti, MD2*, Elisabetta Terruzzi, MD2*, Carlo Gambacorti-Passerini, MD2,6 and Fabrizio Cavalca2*

1Department of Medicine and Surgery, University Milano-Bicocca, Monza, Italy
2Hematology Department, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy
3Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
4Hematology Division and Bone Marrow Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy
5Stem Cell Transplantation, Azienda Ospedaliera Ospedale Niguarda Ca' Granda, Milano, Italy
6University Milano-Bicocca, Monza, Italy


Large Language Models (LLMs) are a form of Artificial Intelligence (AI): by identifying patterns and connections within data, they can predict the most likely words or phrases in specific contexts. Previous studies have indicated that GPT (Generative Pre-trained Transformer; OpenAI) performs well in answering single-choice clinical questions. However, its performance seems to be less satisfactory when dealing with multiple-choice questions and more intricate clinical cases (Cosima et al. 2023 EAO; Cascella et al. 2023 J Med Syst). Notably, no study has evaluated LLM responses in the context of transplantation decision making, a complex process heavily reliant on physician expertise. Additionally, most studies have focused solely on GPT's performance, without considering competing LLMs such as Llama-2 or VertexAI. Our study aims to assess the performance of LLMs in the domain of hematopoietic stem cell transplantation.


We modified and anonymized the clinical histories of six hematological patients. An experienced hematologist reviewed and validated these modified clinical histories, which included demographic data, past medical history, hematology disease features (genetic data and MRD when available), treatment responses, adverse events from previous therapies, and potential donor information (related/unrelated, HLA, CMV status).

We presented these clinical cases to six experienced bone marrow transplant physicians from two major JACIE-accredited hospitals and to 11 hematology residents from the University Milano-Bicocca. The LLMs employed for the analysis were GPT-4, VertexAI PaLM 2, and Llama-2 (13b and 70b). The LLMs were configured with different temperature settings, which control the randomness of token selection; temperatures were always kept low to obtain more deterministic responses.
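The effect of the temperature setting mentioned above can be illustrated with a minimal sketch. It assumes the standard temperature-scaled softmax used by most LLM samplers; the abstract does not report the temperature values used, and the token scores below are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens (illustrative only).
logits = [2.0, 1.0, 0.5]

# A low temperature concentrates probability on the top-scoring token,
# so repeated runs are more likely to give the same answer.
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
```

With a low temperature the top-scoring token takes almost all of the probability mass, which is why low settings give more reproducible answers to the same clinical case.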

A triple-blinded survey was conducted using Typeform, in which both senior hematologists and residents provided anonymized responses using personal tokens. The senior hematologists, residents, and LLM testers were unaware of the responses provided by the other groups. We calculated Fleiss' kappa (K) and the overall percentage of agreement (OA) for residents and for LLMs against the consensus answer (CoA) among experts, defined as the experts' most frequent response. Subsequently, OA and K values for residents and LLMs were compared using t-tests or Mann-Whitney tests with GraphPad v10.0.1.
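The agreement measures described above can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis: the abstract does not specify exactly how raters and items were arranged for each statistic, and the function names are hypothetical. Fleiss' kappa is computed with its standard formula, K = (P_bar - P_e) / (1 - P_e).

```python
import math
from collections import Counter


def consensus_answer(expert_answers):
    """Most frequent expert answer for one question.

    Ties are broken by first occurrence (an assumption; the abstract
    does not say how ties were handled).
    """
    return Counter(expert_answers).most_common(1)[0][0]


def overall_agreement(rater_answers, consensus):
    """Percentage of one rater's answers that match the consensus answers."""
    matches = sum(a == c for a, c in zip(rater_answers, consensus))
    return 100.0 * matches / len(consensus)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, averaged over items (P_bar).
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance from the category proportions (P_e).
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return math.nan  # undefined when only one category was ever used
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, `overall_agreement(["yes", "no", "yes", "yes"], ["yes", "yes", "yes", "yes"])` gives 75.0, since three of the four answers match the consensus.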


The results showed perfect agreement among experts in patient transplant eligibility assessment (K=1.0) and substantial agreement in the choice of donors and conditioning regimens (K=0.62 for both questions). Fair agreement was observed in transplant-related mortality (TRM) estimation (K=0.22).
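The verbal labels used here ("substantial", "fair") match the widely used Landis and Koch bands for kappa. The abstract does not name the scale it applied, so the mapping below is an assumption, and `interpret_kappa` is a hypothetical helper.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a Landis and Koch agreement label (assumed scale)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values reported for agreement among experts.
labels = {k: interpret_kappa(k) for k in (1.0, 0.62, 0.22)}
```

Under this scale K=0.62 falls in the "substantial" band and K=0.22 in the "fair" band; the top band is conventionally labelled "almost perfect", which the abstract calls "perfect" for K=1.0.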

The median OA and K value between residents' answers and the experts' CoA were 76.5% (range 52.9-88.2%) and 0.61 (range 0.4-0.8), respectively. The median OA and K value between LLM answers and the experts' CoA were 58.8% (range 47-71%) and 0.45 (range 0.3-0.61), respectively. The mean OA and K value of residents were significantly higher than those of LLMs (p=0.02). Specifically, residents showed higher median OA and K values in patient eligibility assessment (median OA 100% vs. 83% and K 1 vs. 0.78; p=0.01). However, there was no significant difference in median K for donor choice (0.56 vs. 0.56), conditioning regimen (0.67 vs. 0.33), or TRM evaluation (0.33 vs. 0) (Table 1). The median K values of GPT-4, PaLM 2, Llama-2 13b, and Llama-2 70b were 0.49, 0.53, 0.33, and 0.53, respectively (Figure 1).


Our study sheds light on the potential and limitations of LLMs in complex hematopoietic stem cell transplantation decision-making. While LLMs showed promising results, with a median OA of 59%, residents demonstrated superior performance. LLMs performed well in patient eligibility assessment and donor choice but showed shortcomings in conditioning regimen selection and TRM evaluation.

We did not ask experts to rate LLM responses on a scale, in order to avoid potential bias. However, it is important to note that the consensus answer, although the most frequent, does not necessarily imply that the other responses provided by the experts were incorrect. Therefore, the lower consensus among the experts in TRM evaluation, possibly due to the difficulty of precisely calculating TRM in a survey-based evaluation, calls for caution when evaluating resident and LLM answers in this setting.

Disclosures: No relevant conflicts of interest to declare.
