-Author name in bold denotes the presenting author
-Asterisk * with author name denotes a Non-ASH member

1794 Utilizing Large Language Models to Automate Prognostication in Patients with Myelofibrosis: A Retrospective Cohort Study

Program: Oral and Poster Abstracts
Session: 634. Myeloproliferative Syndromes: Clinical and Epidemiological: Poster I
Hematology Disease Topics & Pathways:
Research, Artificial intelligence (AI), Clinical Research, Real-world evidence, Emerging technologies, Technology and Procedures, Study Population, Human, Natural language processing
Saturday, December 7, 2024, 5:30 PM-7:30 PM

Muhammad Ali Khan, MBBS1*, Umair Ayub, PhD1*, Muhammad Umair Anjum, MBBS1*, Salman Ayub Jajja, MBBS1*, Syed Arsalan Naqvi, MBBS1*, Zaryab Bin Riaz2*, Ammad Raina, DO3*, Irbaz Bin Riaz, MD, MMSc1* and Jeanne Palmer, MD4

1Mayo Clinic, Phoenix, AZ
2Rashid Latif Medical College, Lahore, Pakistan
3Canyon Vista Hospital, Sierra Vista, AZ
4Mayo Clinic - Arizona, Scottsdale, AZ

Introduction

Prognostication in myelofibrosis (MF) currently relies on manual assessment of clinical, pathological, and laboratory variables, which is inefficient and error-prone. The potential of large language models (LLMs) and rule-based programming to automate data extraction and prognostication in these patients remains unexplored and merits investigation.

Methods

Patients with MF were identified from a retrospective chart review. Data regarding clinical (age, constitutional symptoms, transfusion requirements), laboratory (hemoglobin, circulating white blood cell, platelet, and blast counts), and pathological (grade of bone marrow fibrosis, chromosomal analyses, cytogenetics) features were extracted from the latest free-text clinical notes and pathology reports using the Generative Pre-trained Transformer 4 (GPT-4) model in a zero-shot setting, with prompting parameters set to generate deterministic responses. Patients were randomly sampled into development (15%) and test (85%) sets, and the documents of patients in the development set were used for iterative prompt engineering. The finalized prompts were then used for data extraction in the test set. Rule-based programming was used for post-processing of the extracted data and for ascertaining prognosis via the Dynamic International Prognostic Scoring System (DIPSS), DIPSS-plus (DIPSS+), and Mutation-enhanced International Prognostic Scoring System-70 (MIPSS-70). Automatically extracted data and prognostic categories were compared with manually generated human annotations. Mean accuracy, precision, and recall with 95% confidence intervals (CI) were computed. Detailed error and sensitivity analyses were conducted to assess performance.
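The rule-based prognostication step can be illustrated with a minimal sketch. The function below is not the study's code; the function name and input layout are assumptions for illustration, while the point values and risk categories follow the published DIPSS criteria (age >65, WBC >25×10⁹/L, blasts ≥1%, and constitutional symptoms each score 1 point; hemoglobin <10 g/dL scores 2).

```python
# Illustrative sketch of rule-based DIPSS post-processing (not the study code).
# Variable names and the call signature are assumptions; the scoring rules and
# risk categories follow the published DIPSS for primary myelofibrosis.

def dipss_score(age, hb_g_dl, wbc_x10e9_l, blasts_pct, constitutional_symptoms):
    """Return (score, risk category) under the Dynamic International
    Prognostic Scoring System (DIPSS) for myelofibrosis."""
    score = 0
    score += 1 if age > 65 else 0
    score += 2 if hb_g_dl < 10 else 0          # hemoglobin carries 2 points
    score += 1 if wbc_x10e9_l > 25 else 0
    score += 1 if blasts_pct >= 1 else 0
    score += 1 if constitutional_symptoms else 0
    if score == 0:
        risk = "Low"
    elif score <= 2:
        risk = "Intermediate-1"
    elif score <= 4:
        risk = "Intermediate-2"
    else:
        risk = "High"
    return score, risk

# Example: 72-year-old with Hb 9.2 g/dL, WBC 28x10^9/L, 2% blasts, night sweats
print(dipss_score(72, 9.2, 28, 2, True))  # (6, 'High')
```

In the study pipeline, functions of this kind would consume the GPT-4-extracted variables after post-processing, with analogous rule sets for DIPSS+ and MIPSS-70.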

Results

The study included 100 patients (development: 15; test: 85) with 100 clinical notes (development: 15; test: 85) and 90 pathology reports (development: 13; test: 77). The mean age of these patients was 70 (SD: 10.6); 55 (55%) patients were men, and 93 (93%) patients were White. In the development set, the mean accuracy, precision, and recall for data extraction from clinical notes were 94% (95% CI: 90-98%), 92% (84-99%), and 99% (97-100%). For data extraction from pathology reports, the mean accuracy, precision, and recall were 96% (93-100%), 98% (92-100%), and 83% (63-100%). For prognostication, the mean accuracy, precision, and recall were 87% (79-94%), 80% (73-87%), and 100% using DIPSS (accuracy: 80%; precision: 73%; recall: 100%), DIPSS+ (accuracy: 93%; precision: 83%; recall: 100%), and MIPSS-70 (accuracy: 87%; precision: 83%; recall: 100%).

In the test set, the mean accuracy, precision, and recall for data extraction from clinical notes were 96% (95% CI: 93-98%), 91% (87-96%), and 98% (97-100%). For data extraction from pathology reports, the mean accuracy, precision, and recall were 95% (92-98%), 93% (87-99%), and 90% (80-100%). For prognostication, the mean accuracy, precision, and recall were 95% (94-96%), 89% (88-90%), and 99% (96-100%) using DIPSS (accuracy: 94%; precision: 89%; recall: 100%), DIPSS+ (accuracy: 95%; precision: 88%; recall: 95%), and MIPSS-70 (accuracy: 94%; precision: 90%; recall: 100%).

Error analyses identified that incorrect extractions by GPT-4 from the clinical notes were related to the circulating blasts and constitutional symptoms variables. Sensitivity analyses excluding these variables improved overall accuracy from 94% to 97% and from 96% to 97% in the development and test sets, respectively. Similarly, incorrect extractions from the pathology reports were related to the CALR gene mutation variables. Sensitivity analyses excluding these variables improved overall accuracy from 96% to 98% and from 95% to 97% in the development and test sets, respectively.

Conclusion

LLMs combined with rule-based programming can accurately extract data from electronic health records and prognosticate patients with MF via DIPSS, DIPSS+, and MIPSS-70. This automated approach could significantly streamline clinical and research workflows, providing an efficient and less resource-intensive alternative to manual prognostication.

Disclosures: No relevant conflicts of interest to declare.

*signifies non-member of ASH