-Author name in bold denotes the presenting author
-Asterisk * with author name denotes a Non-ASH member

2263 Retrieval Augmented Generation for the Detection of Major Bleeding Events in the Electronic Health Record

Program: Oral and Poster Abstracts
Session: 901. Health Services and Quality Improvement: Non-Malignant Conditions Excluding Hemoglobinopathies: Poster I
Hematology Disease Topics & Pathways:
Research, Epidemiology, Clinical Research
Saturday, December 7, 2024, 5:30 PM-7:30 PM

Peter Kaplinsky1*, Rohan Singh, MS1*, Thomas F Fusillo, DO, MS1,2*, Avi Leader, MD3, Jeffrey I. Zwicker, MD1 and Simon Mantha, MD, MPH4

1Memorial Sloan Kettering Cancer Center, New York, NY
2Internal Medicine, Icahn School of Medicine at Mount Sinai, Mount Sinai Morningside/West, New York, NY
3Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY
4Hematology Service, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY

Background: Non-surgical major bleeding (NSMB) is an important clinical event associated with significant morbidity and mortality. Detecting NSMB events in the electronic health record is central to conducting research and quality assessment studies for patients on anticoagulation therapy, as well as in other settings. Determining a patient’s NSMB history by means of manual clinical chart review is limited by time and human resource constraints. Natural language processing (NLP) has emerged as a promising modality for the detection of clinical events in the electronic health record. Transformer NLP models have revolutionized the field and have performed satisfactorily for varied clinical endpoints, including venous thromboembolism, for which this approach has been extensively validated. However, there is a lack of data concerning the use of NLP to detect bleeding events. Training such models generally requires a large annotated dataset, which demands a significant amount of human labor. The use of generative large language models (LLMs) with retrieval augmented generation (RAG) is emerging as a promising approach to generate machine-derived labels without a large human-annotated set. RAG seeks to mitigate hallucinations and incomplete knowledge in LLMs by providing domain-specific context from external sources. We hypothesized that RAG could be used to detect NSMB events in cancer patients.

Methods: Clinical notes for 174 patients seen at Memorial Sloan Kettering Cancer Center (MSKCC) were assessed manually for NSMB using the CEDARS+PINES NLP platform (Mantha et al, 2024). The definition of NSMB was derived from the formulation of the International Society on Thrombosis and Haemostasis (ISTH; Schulman et al, 2005) and optimized for use within a generative LLM prompt. Clinical notes were processed into chunks of 250 characters and vector encoded with the gte-large-en-v1.5 embedding model for retrieval (Li et al, 2023). The vectors for the clinical notes were stored in a Chroma vector database. The ISTH definition of NSMB was split into three components, and for each patient, the top 20 relevant chunks pertaining to each component were retrieved via vector similarity search. These chunks were further filtered with a reranking model to optimize their relevance to NSMB. Finally, they were formatted into a prompt template, which was used as input to the Llama 3 8B LLM (Meta Inc, 2024). To leverage the Chain-of-Thought prompting technique, which elicits reasoning in LLMs, the prompt template included the ISTH definition of NSMB and step-by-step instructions for determining if a patient had NSMB. For a given patient, the LLM outputs a binary label (NSMB detected or no evidence of NSMB) and its rationale.
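
For illustration, a minimal sketch of the retrieval and prompting steps described above is shown below, assuming the chromadb and sentence-transformers Python packages. The 250-character chunking, the gte-large-en-v1.5 embeddings, and the top-20 retrieval per definition component follow the description here; the reranking step is omitted, and names such as run_llama3 and classify_patient, as well as the prompt wording, are hypothetical placeholders rather than the authors' implementation.

```python
# Minimal sketch of the retrieval-augmented pipeline outlined above; an
# illustration, not the authors' code. Assumes the chromadb and
# sentence-transformers packages; run_llama3 is a hypothetical stub for
# whatever backend serves Llama 3 8B.
import chromadb
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 250  # characters per chunk, per the final pipeline configuration
TOP_K = 20        # chunks retrieved per definition component

embedder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)


def chunk_notes(notes: list[str]) -> list[str]:
    """Split each clinical note into fixed-length character chunks."""
    return [note[i:i + CHUNK_SIZE]
            for note in notes
            for i in range(0, len(note), CHUNK_SIZE)]


def build_index(chunks: list[str]):
    """Embed the chunks and store them in an in-memory Chroma collection."""
    collection = chromadb.Client().create_collection(name="patient_notes")
    collection.add(
        ids=[str(i) for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )
    return collection


def retrieve(collection, definition_components: list[str], k: int = TOP_K) -> list[str]:
    """Retrieve the top-k chunks for each component of the NSMB definition."""
    context: list[str] = []
    for component in definition_components:
        hits = collection.query(
            query_embeddings=embedder.encode([component]).tolist(),
            n_results=k,
        )
        context.extend(hits["documents"][0])
    return context


def run_llama3(prompt: str) -> str:
    """Placeholder for Llama 3 8B inference (reranking is omitted in this sketch)."""
    raise NotImplementedError("Plug in a Llama 3 8B inference backend.")


def classify_patient(collection, isth_definition: str, components: list[str]) -> str:
    """Assemble a Chain-of-Thought prompt from the retrieved context and query the LLM."""
    context = "\n".join(retrieve(collection, components))
    prompt = (
        f"Definition of non-surgical major bleeding:\n{isth_definition}\n\n"
        f"Excerpts from the patient's clinical notes:\n{context}\n\n"
        "Following the definition above, reason step by step and state whether "
        "this patient had a non-surgical major bleeding event, then give your rationale."
    )
    return run_llama3(prompt)
```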

Results: 3,676 notes for 174 patients at MSKCC were processed with the RAG pipeline. We optimized model performance by tuning several hyperparameters, including chunk size (250 or 500 characters), LLM prompt strategy, reranking with a cross-encoder or an LLM, and choice of LLM. The final RAG pipeline used the following configuration: chunk size of 250 characters, Llama 3 8B as both the LLM and the reranking model, and in-context learning with Chain-of-Thought prompting. We noted that the performance of RAG was highly sensitive to these hyperparameter choices. With a 14% prevalence of NSMB in our dataset, the model achieved a precision of 72% and a recall of 72%. Our model was computationally efficient, processing all 174 patients in ~15 minutes.
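
As an illustration of how such a hyperparameter search could be organized, a sketch is shown below. The option values come from the description above; entries marked as assumed (for example the alternative prompt strategy) are placeholders for options not named explicitly, and the structure is not the authors' code.

```python
# Illustrative hyperparameter grid for the RAG pipeline. Option values are taken
# from the abstract; entries marked "assumed" are placeholders for options the
# abstract does not name explicitly.
from itertools import product

search_space = {
    "chunk_size": [250, 500],                           # characters per chunk
    "prompt_strategy": ["chain_of_thought", "direct"],  # "direct" is assumed
    "reranker": ["cross_encoder", "llm"],
    "llm": ["llama3_8b"],                               # other candidates not named
}

# Enumerate every combination for evaluation against the annotated reference set.
configurations = [dict(zip(search_space, combo))
                  for combo in product(*search_space.values())]

# Reported best configuration: 250-character chunks, Llama 3 8B for both
# generation and reranking, and in-context learning with Chain-of-Thought prompting.
final_config = {"chunk_size": 250,
                "prompt_strategy": "chain_of_thought",
                "reranker": "llm",
                "llm": "llama3_8b"}
```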

Conclusions: These results highlight the potential of LLMs with RAG for the detection of NSMB in electronic health records. Such an approach could be used to augment or even replace manual chart annotation in clinical research and quality assessment studies, without the need for large, human-annotated training datasets. Further refinement of these models to optimize their precision, recall, and clinical utility is warranted.

Disclosures: Leader: Leo Pharma: Honoraria. Zwicker: Calyx: Consultancy; Regeneron: Consultancy, Research Funding; Parexel: Consultancy; BMS: Consultancy; Med Learning Group: Consultancy; Quercegen: Research Funding; Incyte Corporation: Research Funding; UpToDate: Patents & Royalties; CSL Behring: Other: Personal fees; Sanofi: Other: Personal fees. Mantha: Janssen Pharmaceuticals: Consultancy.

*signifies non-member of ASH