
HuLP: Human-in-the-Loop for Prognosis

May 10, 2025

Introduction

The researchers at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) address the critical roles of diagnosis and prognosis in oncology, noting the particular complexity and uncertainty associated with prognosis. While diagnosis focuses on detecting the presence of cancerous cells or tumors, prognosis requires predicting the future progression of the disease, including survival time and the likelihood of recurrence. This prediction depends on numerous factors, including tumor characteristics, patient demographics, and treatment efficacy, making accurate prognosis a challenging task for clinicians.

Deep learning models are increasingly being used to support clinicians in prognosis, but current approaches have two significant limitations. First, existing models leave no room for clinicians to intervene, even when a model’s predictions are inaccurate or low-confidence. This lack of clinician input limits clinicians’ ability to refine or enhance the model’s predictions. While some studies have explored human intervention in natural-image applications, such techniques have not yet been widely applied to prognostic models in clinical settings. Clinicians’ expertise and feedback during inference could help improve model performance, much as doctors collaborate and refine diagnoses based on shared knowledge.

The second challenge lies in handling incomplete data and censored patient outcomes, where the exact event time (such as survival duration) is unknown. Missing data can result from incomplete collection, non-compliance, or technical errors, while censored outcomes may arise when patients stop attending follow-up visits, relocate, or withdraw from studies. Current AI research often addresses missing data with basic imputation methods, such as using averages or k-nearest neighbor approaches. However, these methods fall short of the detailed insights that oncologists gain through radiological images, which provide richer information about patients’ conditions. Prognostic models relying solely on electronic health records (EHRs) often fail to capture the complex variability between individuals with similar clinical profiles. For instance, two patients diagnosed with lung cancer and exhibiting identical clinical data can still have significantly different survival outcomes, highlighting the limitations of traditional EHR-based models.

In response to these challenges, the researchers introduce Human-in-the-Loop for Prognosis (HuLP), a deep learning architecture designed to improve the accuracy and interpretability of prognostic models in clinical practice. Inspired by prior work, HuLP aims to integrate human feedback into the model’s inference process, allowing clinicians to intervene and provide valuable insights that can refine prognostic predictions. The model also leverages the integration of radiological images, which offer temporal information about patient conditions that are often overlooked in static clinical datasets, to enhance the overall reliability of cancer prognosis.

HuLP offers two key innovations:

  1. User Interaction and Expert Intervention: HuLP allows clinicians to actively intervene during model inference, offering their expertise to refine the model’s predictions. This enhances the model’s decision-making, particularly in complex prognostic cases where expert knowledge is crucial.
  2. Handling Missing Data and Extracting Rich Representations: HuLP effectively manages missing covariates and outcomes, ensuring reliable prognostic predictions. By using clinical information as intermediate concept labels, it generates richer feature representations, boosting overall accuracy.

Methodology

Figure 1 shows the HuLP model’s architecture, which comprises four main components: the encoder, intervention block, classifier, and prognosticator.

The encoder is a deep neural network that processes an input image and generates a feature embedding, capturing important information from the image. This embedding is designed to learn a shared representation of the input data. It is then passed through several layers that further process the features to create concept embeddings, representing various patient characteristics from the clinical data.
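To make this flow concrete, below is a minimal PyTorch sketch of how such an encoder might map a shared image representation into per-concept embeddings. The layer names and the probability-weighted mixing of positive and negative halves are illustrative assumptions in the style of concept-embedding models; only the embedding sizes (64-dimensional positive/negative halves mixed into a 32-dimensional concept embedding, per the experimental setup below) come from the article.

```python
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Sketch: backbone features -> per-concept embeddings + probabilities."""

    def __init__(self, backbone: nn.Module, feat_dim: int, n_concepts: int,
                 emb_dim: int = 64, concept_dim: int = 32):
        super().__init__()
        self.backbone = backbone  # e.g., a DenseNet-121 feature extractor
        self.pos_heads = nn.ModuleList(
            [nn.Linear(feat_dim, emb_dim) for _ in range(n_concepts)])
        self.neg_heads = nn.ModuleList(
            [nn.Linear(feat_dim, emb_dim) for _ in range(n_concepts)])
        self.scorers = nn.ModuleList(
            [nn.Linear(2 * emb_dim, 1) for _ in range(n_concepts)])
        # Mixes the probability-weighted pos/neg pair into one concept embedding.
        self.mixer = nn.Linear(2 * emb_dim, concept_dim)

    def forward(self, x):
        h = self.backbone(x)  # shared representation of the input image
        embs, probs = [], []
        for pos, neg, score in zip(self.pos_heads, self.neg_heads, self.scorers):
            ph, nh = pos(h), neg(h)  # positive / negative embedding halves
            p = torch.sigmoid(score(torch.cat([ph, nh], dim=-1)))
            # Weight each half by the predicted concept probability.
            embs.append(self.mixer(torch.cat([p * ph, (1 - p) * nh], dim=-1)))
            probs.append(p)
        # (B, n_concepts, concept_dim) embeddings, (B, n_concepts) probabilities.
        return torch.stack(embs, dim=1), torch.cat(probs, dim=-1)
```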

The intervention block is a vital part of HuLP that allows users, such as clinicians, to interact with the model during test time. While the model is being trained, it assigns probabilities to each concept based on the image data. These probabilities represent the likelihood of certain patient characteristics being present. The embeddings are split into two parts: one representing positive and the other representing negative associations. The intervention block allows clinicians to replace these probabilities with their expert knowledge, indicating certainty about a concept’s presence or absence. This input helps the model refine its predictions, improving accuracy and trustworthiness in clinical decision-making.
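As a hedged illustration, test-time intervention can be as simple as overwriting the predicted concept probabilities wherever the clinician supplies an answer; the function and argument names below are illustrative, not from the paper.

```python
import torch

def intervene(pred_probs: torch.Tensor,
              known_mask: torch.Tensor,
              known_value: torch.Tensor) -> torch.Tensor:
    """Overwrite predicted concept probabilities with clinician input.

    pred_probs:  (B, C) model-predicted concept probabilities.
    known_mask:  (B, C) bool, True where the clinician knows the concept.
    known_value: (B, C) float in {0., 1.}, the clinician's answer.
    """
    return torch.where(known_mask, known_value, pred_probs)
```

The corrected probabilities then re-weight the positive and negative embedding halves, so the expert’s certainty propagates into the downstream prognosis.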

The classifier in the HuLP model is a fully connected (FC) layer that aligns each concept embedding with its respective patient characteristic. It encourages each embedding to predict only one concept, applying a softmax and cross-entropy loss to optimize the predictions. The classifier is also designed to handle missing data effectively, as described in the loss section below.

The prognosticator is the final component of the model, responsible for processing the learned concepts and predicting clinical outcomes over time. It does this by passing the concept embeddings through an FC layer, which outputs each patient’s estimated hazard (likelihood of an event, such as death or cancer recurrence). This helps the model predict the progression of health conditions, giving insights into future outcomes based on the input data.
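A minimal sketch of this head, assuming discrete-time hazards over K bins (the experiments use 12 bins for ChAImeleon and 16 for HECKTOR), with the survival curve obtained as the running product of (1 - hazard):

```python
import torch
import torch.nn as nn

class Prognosticator(nn.Module):
    """Sketch: concept embeddings -> per-bin hazards and survival curve."""

    def __init__(self, concept_dim: int, n_concepts: int, n_bins: int):
        super().__init__()
        self.fc = nn.Linear(concept_dim * n_concepts, n_bins)

    def forward(self, concept_embs):                     # (B, C, D)
        hazard = torch.sigmoid(self.fc(concept_embs.flatten(1)))  # (B, K)
        survival = torch.cumprod(1.0 - hazard, dim=1)    # S(t_k) per bin
        return hazard, survival
```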

The loss function in the HuLP model combines two parts: the concept loss (L1) and the prognosis loss (L2).

  1. Concept Loss (L1): This loss is applied to the classifier layer and is calculated using cross-entropy. It ensures that the model correctly predicts the clinical concepts for patients with available data without imputing missing values before training. Averaging the loss over all non-missing covariates avoids the pitfalls of hard imputation and improves accuracy for incomplete datasets.
  2. Prognosis Loss (L2): This loss is applied to the prognosticator and is based on a modified version of the DeepHit loss for survival analysis. It combines two components:
    • Log-Likelihood Loss (loss_LL): This captures the time of the event (e.g., death or recurrence) for uncensored patients and the time of last known survival for censored patients. It uses the estimated hazard and survival function to assess the likelihood of these outcomes.
    • Ranking Loss (loss_rank): This incentivizes the correct ordering of patient survival times, comparing pairs of patients to improve the model’s ability to rank survival probabilities accurately.

The final loss (L_final) is a weighted combination of the concept loss (L1) and prognosis loss (L2), with hyperparameters determining the balance between them. This setup enables HuLP to handle missing data and provide accurate prognosis predictions effectively.
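Under those definitions, a sketch of the full objective might look as follows. The masking in L1 and the DeepHit-style log-likelihood and ranking terms in L2 follow the description above; names such as alpha, beta, and sigma stand in for the paper’s unstated hyperparameters.

```python
import torch
import torch.nn.functional as F

def concept_loss(concept_logits, targets, observed):
    """L1: cross-entropy averaged over observed (non-missing) concepts only.

    concept_logits: (B, C, K_cls) per-concept class logits.
    targets:        (B, C) long class indices (arbitrary where missing).
    observed:       (B, C) bool, False where the covariate is missing.
    """
    ce = F.cross_entropy(concept_logits.transpose(1, 2), targets, reduction="none")
    return (ce * observed).sum() / observed.sum().clamp(min=1)

def prognosis_loss(hazard, survival, time_bin, event, sigma=0.1):
    """L2: DeepHit-style log-likelihood plus a pairwise ranking term.

    hazard, survival: (B, K) per-bin hazards and survival probabilities.
    time_bin:         (B,) long, event or censoring bin per patient.
    event:            (B,) bool, True if the event was observed.
    """
    B = hazard.size(0)
    idx = torch.arange(B)
    h_t = hazard[idx, time_bin].clamp(1e-7, 1 - 1e-7)
    s_t = survival[idx, time_bin].clamp(min=1e-7)
    # Survival up to (but not including) the event bin.
    s_prev = torch.where(time_bin > 0,
                         survival[idx, (time_bin - 1).clamp(min=0)],
                         torch.ones_like(s_t)).clamp(min=1e-7)
    # Events contribute P(event at t) = h_t * S(t-1); censored patients
    # contribute S(t) at the last observed bin.
    ll = torch.where(event, torch.log(h_t) + torch.log(s_prev), torch.log(s_t))
    loss_ll = -ll.mean()
    # Ranking: if i has an earlier observed event than j, i's cumulative
    # risk at t_i should exceed j's risk at that same time.
    risk = 1.0 - survival                   # cumulative incidence, (B, K)
    r_at_ti = risk[:, time_bin].t()         # [i, j] = risk of j at t_i
    r_ii = r_at_ti.diagonal().unsqueeze(1)  # risk of i at its own t_i
    acceptable = event.unsqueeze(1) & (time_bin.unsqueeze(1) < time_bin.unsqueeze(0))
    pairs = torch.exp(-(r_ii - r_at_ti) / sigma) * acceptable
    loss_rank = pairs.sum() / acceptable.sum().clamp(min=1)
    return loss_ll + loss_rank

def final_loss(concept_logits, targets, observed,
               hazard, survival, time_bin, event, alpha=1.0, beta=1.0):
    return (alpha * concept_loss(concept_logits, targets, observed)
            + beta * prognosis_loss(hazard, survival, time_bin, event))
```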

Experimental Setup

Researchers at MBZUAI assessed the prognostic capabilities of the HuLP model by comparing it against conventional benchmarks using two real-world datasets: ChAImeleon and HECKTOR.

The ChAImeleon dataset includes 320 lung cancer CT scans linked to electronic health records (EHR), with clinical features such as age, gender, smoking status, and tumor categories (T, N, M stages). Up to 26% of the data is missing, and 59% is censored. To address this, the researchers combined missing labels into a category labeled “X” and merged cancer sub-stages into parent stages to increase sample sizes. A segmentation model was used to focus on lung areas in the scans.

The HECKTOR dataset comprises 224 CT and PET scans for head-and-neck cancer, also linked to EHR. Clinical features include TNM staging, tobacco and alcohol consumption, and treatment type. This dataset presents challenges with up to 90% missing data and 75% censored outcomes. Features with over 80% missing data were removed, and cancer sub-stages were similarly combined.

For both datasets, scans underwent preprocessing steps such as resampling, cropping, and resizing to standardize the input data, enhancing HuLP’s performance across various data types.
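As a rough illustration of these steps, the snippet below resamples a CT volume to a common voxel spacing, center-crops it, and resizes it to a fixed network input; all target sizes and spacings here are assumptions for the sketch, not values from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume: np.ndarray, spacing: tuple,
               target_spacing=(1.0, 1.0, 1.0),
               crop_shape=(160, 160, 160),
               out_shape=(96, 96, 96)) -> np.ndarray:
    # 1. Resample to a common (here isotropic) voxel spacing.
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    vol = zoom(volume, factors, order=1)
    # 2. Center-crop, padding first if the volume is smaller than the crop.
    pads = [max(c - d, 0) for c, d in zip(crop_shape, vol.shape)]
    vol = np.pad(vol, [(p // 2, p - p // 2) for p in pads])
    starts = [(d - c) // 2 for d, c in zip(vol.shape, crop_shape)]
    vol = vol[tuple(slice(s, s + c) for s, c in zip(starts, crop_shape))]
    # 3. Resize to the network's fixed input shape.
    vol = zoom(vol, [o / d for o, d in zip(out_shape, vol.shape)], order=1)
    return vol.astype(np.float32)
```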

The researchers implemented the HuLP model using DenseNet-121 as the encoder, training it for 100 epochs. Positive and negative embeddings of size 64 were combined into a final concept embedding of size 32. The prognosticator predicted 12 discrete time bins for the ChAImeleon dataset and 16 for HECKTOR, based on quantiles of survival times. The model used a batch size of 32, the AdamW optimizer with a learning rate of 1e-3, and a cosine annealing scheduler. All experiments were carried out in PyTorch.
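A minimal sketch of this setup in PyTorch, using MONAI’s 3D DenseNet-121 as one plausible encoder implementation (the paper names the architecture but not the library, and the feature dimension here is illustrative):

```python
import torch
from monai.networks.nets import DenseNet121  # assumed implementation choice

# out_channels stands in for the encoder's output feature dimension.
encoder = DenseNet121(spatial_dims=3, in_channels=1, out_channels=1024)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... iterate over batches of size 32, compute L_final, backprop, step ...
    scheduler.step()
```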

HuLP’s performance was compared with deep survival methods: DeepHit, Deep-MTLR, and Fusion. DeepHit and Deep-MTLR were chosen for their strong performance in discrete survival tasks, and Fusion was selected as a multimodal baseline due to its success in the HECKTOR competition. DeepHit and Deep-MTLR were tested using mode imputation and FC layers, while Fusion combined imaging features from DenseNet-121 with EHR data through late fusion.

The experiments were conducted using five-fold cross-validation, ensuring a balanced ratio of events and censored patients, and were repeated with two random seeds for reliability.
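A sketch of how such splits can be generated, assuming scikit-learn’s StratifiedKFold on the event indicator so that each fold keeps a similar event/censoring ratio, with two seeds mirroring the repeated runs:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_splits(event: np.ndarray, seeds=(0, 1), n_splits=5):
    """Yield (seed, train_idx, val_idx) with the event ratio preserved."""
    for seed in seeds:
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, val_idx in skf.split(np.zeros(len(event)), event):
            yield seed, train_idx, val_idx
```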

Results

The researchers evaluated the HuLP model using the time-dependent concordance index (C-index) of Antolini et al., computed from the predicted survival curves. The results, summarized in Table 2, highlight the limitations of using electronic health records (EHR) without imaging data, which lack depth and richness, and of using images alone, which leave the model unguided. HuLP consistently outperformed these methods, achieving statistically significant improvements (p-value < 0.05) and remaining competitive with the Fusion model despite the latter’s disjoint learning from EHR and image embeddings.

HuLP’s innovation lies in integrating EHR as an intermediate concept, guiding the model to relevant features and generating rich, disentangled embeddings of clinical information from images. This approach offers two key advantages: it facilitates human expert intervention during inference and enhances robustness to missing data. During testing, ground-truth labels were used for non-missing data while retaining model predictions for missing entries. This integration of user interaction significantly improved prognostic capabilities, yielding an increase of about 0.1 in the C-index on the ChAImeleon dataset, as shown in Table 3.

To assess HuLP’s robustness to missing data, the researchers created an 80:20 train-validation split stratified by gender, ensuring that each validation patient shared an identical or similar EHR profile with at least one other patient and had at least one entry missing. They randomly masked training EHR entries to simulate various missing-data scenarios, as sketched below. The findings showed that HuLP outperformed conventional imputation methods, such as mode, kNN, and MICE, particularly at low levels of missingness. Improvements were less pronounced at high missingness, likely due to inadequate feedback from the model, but HuLP still maintained competitive results against the baselines.
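A hedged sketch of that masking step: each observed training covariate is independently dropped with probability p to simulate a given missingness level (the function name and rates are illustrative):

```python
import torch

def mask_covariates(observed: torch.Tensor, p: float) -> torch.Tensor:
    """observed: (B, C) bool mask of available EHR entries.
    Returns a new mask with extra entries dropped at rate p."""
    drop = torch.rand_like(observed, dtype=torch.float) < p
    return observed & ~drop
```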

Discussion

The HuLP model represents a significant innovation in prognosis by being the first to incorporate human interaction and intervention for known concepts. This approach is especially valuable in prognosis, where predicting future outcomes can be more challenging than diagnosing current conditions. Unlike traditional methods that treat human experts as passive users, HuLP empowers clinicians to engage with the model actively. This collaborative dynamic fosters a synergistic relationship, enabling clinicians to refine the model’s concept predictions based on their expertise. By allowing such adjustments, HuLP enhances the accuracy of prognostic assessments and improves the model’s interpretability and reliability, ultimately fostering greater confidence in clinical decision-making.

Furthermore, HuLP addresses the issue of missing data with a tailored methodology that outperforms conventional imputation techniques like mode, kNN, and MICE. While these traditional methods often oversimplify complex clinical datasets and may introduce bias, HuLP leverages neural networks to handle the intricacies of missing data better. HuLP implicitly imputes missing covariates based on imaging features during testing rather than relying on simplistic hard imputation. This approach aligns more closely with clinician workflows, enhancing the reliability and trustworthiness of prognostic assessments.

Conclusion

In conclusion, this paper introduces HuLP (Human-in-the-Loop for Prognosis), a groundbreaking approach that enables clinicians to interact with and intervene in model predictions during testing. This innovation significantly enhances the reliability and interpretability of prognostic models in clinical settings. By effectively extracting meaningful representations from imaging data and managing missing covariates and outcomes, HuLP demonstrates superior performance across two medical datasets. Future research should prioritize validating HuLP in real clinical environments with actual clinical inputs and further exploring the usability of its disentangled feature embeddings.

