Vrishank Chandrasekhar
United States
Novel Digital Biomarker for Early Pan-Cancer Prognosis Prediction via Free-Text Clinical Narratives
Abstract
Cancer is the second leading cause of death worldwide, affecting 20 million people and causing more than 9.7 million deaths annually. Early prediction of clinical outcomes is crucial for targeted therapy; however, current staging systems are limited in their ability to risk-stratify patients, with low concordance in select cancers. Prior research has demonstrated that free-text clinical narratives may provide prognostically relevant information. Yet existing approaches rely heavily on longitudinal corpora and have long focused on binary classification rather than time-to-event prediction, which makes clinical implementation difficult. This study develops a natural language processing (NLP) framework that models unstructured free-text clinical narratives recorded at initial diagnosis to predict pan-cancer patient prognosis. First, we train an unsupervised FastText model on approximately 2.3 million clinical notes from 56,339 patients, after removing noise, segmenting sentences, and synthesizing clinical terminology. The model learns medically relevant word representations across 686 institutions and generates note-level embeddings, which are then used as covariates for Cox Proportional Hazards (CPH) modeling. Using this framework, we demonstrate that clinical narratives can successfully stratify a diverse cohort of patients across and within cancers without relying on temporal data, achieving concordance indices of 0.72 and 0.77 for survival and recurrence outcomes, respectively. This study highlights the potential of diagnostic free-text clinical narratives as a novel, pan-cancer digital biomarker that can be generalized to virtually any electronic health record (EHR), which could support more targeted treatment plans and improved clinical outcomes.
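
For readers unfamiliar with the concordance metric reported above, the following is a minimal sketch of how a concordance index can be computed from time-to-event data. It assumes Harrell's formulation; the abstract does not state which estimator the study used. The function name, the toy follow-up times, the event flags, and the risk scores are all hypothetical and are not drawn from the study's cohort.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C (illustrative sketch, not the study's evaluation code).

    times       -- observed follow-up time per patient
    events      -- 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b]:
            continue  # tied times are skipped in this simplified version
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk was assigned to the earlier event
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: four patients, one of them censored.
toy_times = [5, 10, 12, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [2.1, 0.3, 1.4, 0.2]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
print(round(toy_c, 2))  # 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk, which gives context for the reported values of 0.72 and 0.77.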
