From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives

Misri, Kazhan, Alexandre, Leo and de la Iglesia, Beatriz (2025) From Free Text to Upper Gastrointestinal Cancer Diagnosis: Fine-Tuning Language Models on Endoscopy and Histology Narratives. In: Proceedings of the 17th International Conference on Knowledge Discovery and Information Retrieval (KDIR 2025). International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K - Proceedings. SciTePress – Science and Technology Publications, pp. 501-508. ISBN 9789897587696

Full text not available from this repository.

Abstract

Clinical free-text reports from endoscopy and histology are a valuable yet underexploited source of information for supporting upper gastrointestinal (GI) cancer diagnosis. Our initial learning task was to classify procedures as cancer-positive or cancer-negative based on downstream registry-confirmed diagnoses. For this, we developed a patient-level dataset of 63,040 endoscopy reports linked with histology data and cancer registry outcomes, allowing supervised learning on real-world clinical data. We fine-tuned two transformer-based models, general-purpose BERT and domain-specific BioClinicalBERT, and evaluated methods to address severe class imbalance, including random minority upsampling and class weighting. BioClinicalBERT combined with upsampling achieved the best recall (sensitivity) of 85% and reduced false negatives compared to BERT’s recall of 78%. Calibration analysis indicated that predicted probabilities were broadly reliable. We also applied SHapley Additive exPlanations (SHAP) to interpret model decisions by highlighting influential clinical terms, fostering transparency and trust. Our findings demonstrate the potential of scalable, interpretable natural language processing models to extract clinically meaningful insights from unstructured narratives, providing a foundation for future retrospective review of cancer diagnosis and clinical decision support tools.
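
The two imbalance-handling strategies named in the abstract, together with the recall metric it reports, can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is not available from this record. The function names, the placeholder report strings, and the toy labels are all hypothetical, and the class-weight formula is the common inverse-frequency heuristic (the same one scikit-learn uses for its "balanced" mode), assumed here for illustration only.

```python
import random
from collections import Counter


def upsample_minority(texts, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    has as many examples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_n = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Draw with replacement; the majority class needs zero extras.
        extra = [rng.choice(pool) for _ in range(majority_n - n)]
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def recall(y_true, y_pred, positive=1):
    """Sensitivity: true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy, made-up data: 8 negative and 2 positive placeholder "reports".
toy_texts = [f"report {i}" for i in range(10)]
toy_labels = [0] * 8 + [1] * 2

up_texts, up_labels = upsample_minority(toy_texts, toy_labels)
weights = balanced_class_weights(toy_labels)
toy_recall = recall([1, 1, 1, 1, 0], [1, 1, 1, 0, 0])
```

In a sketch like this, upsampling would normally be applied to the training split only, so that duplicated minority reports cannot leak into validation or test data. Class weights would instead be passed to the loss function during fine-tuning, leaving the data unchanged.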

Item Type: Book Section
Uncontrolled Keywords: clinical text classification, transformer models, upper gi cancer, software, strategy and management, management of technology and innovation, sdg 3 - good health and well-being
Faculty \ School: Faculty of Science > School of Computing Sciences
Faculty of Medicine and Health Sciences > Norwich Medical School
UEA Research Groups: Faculty of Science > Research Groups > Data Science and AI
Faculty of Science > Research Groups > Health Computing
Faculty of Science > Research Groups > Norwich Epidemiology Centre
Faculty of Medicine and Health Sciences > Research Groups > Norwich Epidemiology Centre
Faculty of Medicine and Health Sciences > Research Groups > Gastroenterology and Gut Biology
Faculty of Medicine and Health Sciences > Research Centres > Metabolic Health
Faculty of Medicine and Health Sciences > Research Centres > Norwich Institute for Healthy Aging
Depositing User: LivePure Connector
Date Deposited: 08 Jun 2026 15:54
Last Modified: 08 Jun 2026 15:54
URI: https://ueaeprints.uea.ac.uk/id/eprint/103321
DOI: 10.5220/0013836200004000
