Using Machine Learning to Predict Near-Term Thrombosis Risk in Patients with Polycythemia Vera

Krichevsky, Spencer

Introduction:

Thrombosis (arterial and venous) remains the leading cause of morbidity and mortality for patients (pts) with polycythemia vera (PV) [1-2]. With appropriate therapy, thrombosis risk can be reduced to 1-5% per year (yr) [3], yet some seemingly well-controlled pts still suffer thrombotic events. Because the event rate is this low, a well-powered clinical trial testing anti-thrombotic strategies to improve thrombosis-free survival in PV is impractical: it would require too many pts and too long a study duration. For example, the high-end annual incidence of thrombosis in European LeukemiaNet (ELN) high-risk pts is <5%/yr, so a 2yr study hoping to demonstrate a 50% thrombosis reduction (α=5%, 1-β=80%) requires 1104 pts. Such a study would be vastly more efficient if it enrolled only pts with high near-term thrombotic risk, but these pts cannot currently be identified accurately. To address this, we developed a machine learning (ML) approach to dynamically predict this risk.
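
The sample-size reasoning above can be illustrated with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch under stated assumptions: the abstract does not say which method produced the 1104 figure, so the helper `n_per_arm`, the conversion of 5%/yr into a 2yr cumulative risk of about 10%, and the enriched-cohort risk of 50% are all hypothetical. The output need not match the reported number, which may reflect a time-to-event design or an allowance for dropout.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided comparison of two proportions
    (unpooled normal approximation; an assumed method, not the authors')."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_control - p_treated
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Assumed 2yr cumulative risk of ~10% (5%/yr), halved by the intervention.
eln_total = 2 * n_per_arm(0.10, 0.05)

# Hypothetical enriched cohort with a much higher 2yr risk, also halved.
enriched_total = 2 * n_per_arm(0.50, 0.25)

print(eln_total, enriched_total)
```

The point of the comparison is qualitative: for the same relative reduction, a higher baseline event rate sharply lowers the number of pts required.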

Methods:

After obtaining institutional approval, 526 pts with manually verified PV diagnoses were identified by querying electronic health records and our research database repository. The dataset (4M data elements, 14K hematology clinic visits) includes lab measures, molecular/pathology studies, medications, and pt history.

Data completeness varied, with >90% availability in 200+ features. To assess the error introduced by imputation, we computed the Euclidean distance and Jaccard index between masked pre-imputation data and the corresponding imputed data over 8K iterations. We found that features with <50% missingness could be imputed without dataset corruption.
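
The masking check described above can be sketched as follows. This is an illustration only: the abstract does not name the imputation method, so mean imputation stands in for it here, and the function names, masking fraction, and example values are invented.

```python
import math
import random
from statistics import mean


def masked_imputation_error(values, mask_fraction=0.2, iterations=100, seed=0):
    """Hide a random subset of observed values, impute them, and return the
    mean Euclidean distance between the hidden truth and the imputed values."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values")
    # Always leave at least one value visible to impute from.
    n_mask = min(len(observed) - 1, max(1, int(len(observed) * mask_fraction)))
    distances = []
    for _ in range(iterations):
        hidden_idx = set(rng.sample(range(len(observed)), n_mask))
        kept = [v for i, v in enumerate(observed) if i not in hidden_idx]
        fill = mean(kept)  # mean imputation: a stand-in for the real imputer
        truth = [observed[i] for i in hidden_idx]
        distances.append(math.sqrt(sum((t - fill) ** 2 for t in truth)))
    return mean(distances)


def jaccard_index(a, b):
    """Jaccard index between two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


hematocrit = [45.0, 47.5, None, 44.0, 46.2, None, 48.1, 43.9]  # invented values
error = masked_imputation_error(hematocrit)
print(round(error, 3))
```

Euclidean distance suits continuous features such as lab values, while the Jaccard index suits binary features such as medication flags; repeating the masking over many iterations averages out the effect of any single random mask.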

ML classification models (random forest (RF), boosted models, and support vector machines) were evaluated using AutoGluon. RF emerged as the top performer. Feature engineering, guided by Gini feature importance and clinical feedback, identified 21 training features. A parallelized pipeline optimized 1.6M model combinations (# of trees, node splitting features, max depth, thresholding) via grid search. Several optimization strategies were assessed.
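
To show how a grid over these four dimensions grows multiplicatively, a minimal sketch follows. The grid values, the helper names, and the `toy_score` objective are all hypothetical: the abstract does not list the values searched, and a real run would train and evaluate an RF model for each configuration rather than call a placeholder.

```python
from itertools import product

# Hypothetical search space; the abstract does not list the values used.
GRID = {
    "n_trees": [100, 250, 500],
    "max_features": [3, 5, 7],
    "max_depth": [4, 8, 16, None],
    "threshold": [0.1, 0.2, 0.3, 0.4, 0.5],
}


def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, score_fn):
    """Return the best-scoring configuration and its score."""
    return max(((cfg, score_fn(cfg)) for cfg in iter_configs(grid)),
               key=lambda pair: pair[1])


def toy_score(cfg):
    """Placeholder objective; a real run would train and evaluate a model."""
    return -abs(cfg["threshold"] - 0.3) - abs(cfg["n_trees"] - 250) / 1000


n_configs = sum(1 for _ in iter_configs(GRID))
best_cfg, best_score = grid_search(GRID, toy_score)
print(n_configs, best_cfg)
```

Because each configuration is evaluated independently, the list of configurations can be divided among workers, which is what makes a search of this size practical to parallelize.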

Of 470 pts with clinic visits through Mar 31, 2020, 75% were used for model training (Training Cohort) and 25% for model evaluation (Testing Cohort). Cohort details were previously reported [4]. A prospective Validation Cohort included clinic visits after Apr 1, 2020 and was unseen by trained models.
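
The cohort construction above can be sketched as below. Two details are assumptions, not statements from the abstract: that the 75/25 split was random and made per pt rather than per visit, and that every visit dated after Mar 31, 2020 belongs to the Validation Cohort. The function name and example records are invented.

```python
import random
from datetime import date

CUTOFF = date(2020, 3, 31)  # last visit date for training/testing (from the text)


def split_cohorts(visits, train_fraction=0.75, seed=0):
    """Split (patient_id, visit_date) records into three cohorts.

    Visits through CUTOFF are divided by patient (an assumption) so that no
    patient appears in both training and testing; later visits are held out
    for prospective validation.
    """
    early = [v for v in visits if v[1] <= CUTOFF]
    validation = [v for v in visits if v[1] > CUTOFF]
    patients = sorted({pid for pid, _ in early})
    random.Random(seed).shuffle(patients)
    n_train = round(len(patients) * train_fraction)
    train_ids = set(patients[:n_train])
    training = [v for v in early if v[0] in train_ids]
    testing = [v for v in early if v[0] not in train_ids]
    return training, testing, validation


# Invented records: 8 pts seen before the cutoff, 4 of them seen again after.
visits = [(pid, date(2019, 6, 1)) for pid in range(8)]
visits += [(pid, date(2020, 9, 1)) for pid in range(4)]
training, testing, validation = split_cohorts(visits)
print(len(training), len(testing), len(validation))
```

Splitting by pt rather than by visit avoids leaking one pt's repeated visits across the training and testing sets, and the date cutoff keeps the validation data strictly later than anything the models were trained on.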

A risk calculator designed in RShiny applied models corresponding to user-entered clinical data and computed 1yr thrombosis risk. User-entered data are not stored to protect data privacy.

Results:

Training/Testing pts had a median age of 54, and 64 (14%) had a prior thrombosis at dx (8% venous, 7% arterial). Over a median follow-up of 10 yrs, 159 thromboses occurred in 115 pts (88 venous, 71 arterial). Annual incidence rates (IR) of thrombosis were higher shortly after dx (IR: 4% vs 1%, 2yr cutoff) and following a thrombotic event (IR: 10% vs 2%, 2yr cutoff). The full model (F1=0.91, AUC=0.84) substantially outperformed ELN risk stratification (F1=0.1, AUC=0.39) (Figure 1). High-performing simplified models (>90% testing/validation sensitivity) were applied to the Validation Cohort. The full 21-feature model and high-performing models (Figure 2) showed clear separation for 1yr thrombosis risk prediction. Features of importance, including blood counts, body mass index (BMI), and time since dx, were frequently represented in these models.
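
For readers less familiar with the metrics quoted above, the sketch below computes sensitivity, F1, and AUC from first principles. The labels, scores, and 0.25 threshold are invented, and the abstract does not state which class or averaging scheme its F1 refers to, so this shows only the positive-class definition.

```python
def f1_and_sensitivity(y_true, y_pred):
    """Positive-class F1 and sensitivity (recall) from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity == 0:
        return 0.0, sensitivity
    return 2 * precision * sensitivity / (precision + sensitivity), sensitivity


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 3 pts with a thrombosis within 1yr, 5 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.3, 0.6, 0.2, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.25 else 0 for s in scores]  # hypothetical threshold

f1, sensitivity = f1_and_sensitivity(y_true, y_pred)
model_auc = auc(y_true, scores)
print(round(f1, 3), sensitivity, round(model_auc, 3))
```

Lowering the threshold raises sensitivity at the cost of more false positives, which is why thresholding can be tuned alongside the model's other settings when a sensitivity target such as >90% is imposed.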

Discussion:

Over 1M models were trained on high-volume clinical data to predict near-term thrombosis risk in PV pts, considering intrinsic factors (age, blood type), disease events (time since dx/thrombosis), and short-term changes (BMI) to tailor risk mitigation strategies. Prospective validation demonstrates model generalizability. External validation using data from the REVEAL study [1] and the Danish Registry is ongoing.

Our tool’s ability to identify high-risk pt populations can be applied as an inclusion criterion for prospective clinical trials, reducing the time and pt numbers needed for a well-powered study.

Whereas enrolling by ELN-defined high risk would require >1K pts to demonstrate a 50% reduction in thrombosis rates, enrolling by high near-term risk as defined by our prospectively validated ML model would require only ~80 pts, making such a study feasible.

3186 Using Machine Learning to Predict Near-Term Thrombosis Risk in Patients with Polycythemia Vera