Published: May 18, 2025
Introduction
Researchers at MBZUAI are addressing the significant challenge of accurately identifying cancerous tumors through medical imaging. The complexity of the disease, combined with the need to interpret various imaging modalities, complicates the diagnostic process, often leading to delays and inconsistencies. There is a growing need for advanced Deep Neural Network (DNN) models that can automate tumor identification with greater accuracy to streamline this process.
It is well established that combining CT and PET imaging can improve tumor segmentation by using the structural detail from CT scans and the metabolic insight from PET scans. However, PET scans are often unavailable because of their high cost and because they are not always clinically indicated. Traditional approaches to integrating the two modalities either assume that both scans are always available or require an increase in model parameters, which is computationally inefficient.
Transformer-based models, such as UNETR, have emerged as powerful approaches for segmentation tasks, outperforming CNNs by capturing long-range dependencies in the data. However, adapting pre-trained models for multi-modal inputs like CT and PET remains challenging.
To address this, the researchers propose PEMMA (Parameter-Efficient Multi-Modal Adaptation), a novel approach that builds on Low-Rank Adaptation (LoRA) techniques. LoRA introduces low-rank updates to the transformer’s attention layers without retraining the entire model, enabling efficient adaptation of pre-trained CT models to handle additional imaging modalities such as PET. This method optimizes performance while minimizing computational overhead, making it a promising solution for multi-modal tumor segmentation.
Contributions
- Efficient Multimodal Integration: The researchers proposed a method that efficiently incorporates new imaging modalities into existing models while minimizing cross-modal interference.
- Flexible Fine-tuning: Their approach demonstrates the feasibility of flexibly fine-tuning the model, even when only one modality, such as CT or PET, is available.
- Knowledge Retention: The method effectively retains previously learned knowledge when adapting the model to new data, ensuring that the model builds on past training without losing accuracy.
Problem Statement: The researchers at MBZUAI began with a pre-trained transformer-based tumor segmentation model designed to process CT scans and produce segmentation masks. The main objective was to upgrade this model so that it could incorporate both CT and PET scans, aiming for better segmentation accuracy.
The key challenge was to ensure that the adapted model could efficiently integrate PET scan data without significantly increasing the number of parameters compared to the original CT model. The adaptation had to minimize cross-modal entanglement, allowing the model to be fine-tuned independently using either CT or PET scans without negatively impacting the knowledge learned from the other modality.
Standard Adaptation Methods
Early Fusion: A common approach for multi-modal CT and PET segmentation is to treat both as different channels of the same input, creating a new input space that combines the two modalities. In transformer-based models like UNETR, this requires two key architectural changes: replacing the uni-modal patch embedding layer with a multi-modal one and modifying the skip connection between the input and decoder to accommodate the increased number of channels. This allows the model to be trained with both CT and PET data.
While this approach is parameter-efficient, adding only a few parameters to handle the extra channel, it has a key limitation: features from the CT and PET scans become entangled, making it difficult to later fine-tune the model with only one modality without affecting the other. Three strategies can be used to initialize the newly introduced PET-related parameters: random initialization, zero initialization, or cross-modal initialization, in which the pre-trained CT model weights are reused to guide the integration of PET information.
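As an illustration, the snippet below is a minimal sketch, not the authors' code, of how a pre-trained single-channel (CT) patch embedding could be expanded into a two-channel (CT+PET) one, with the three initialization strategies made explicit. In UNETR the patch embedding is effectively a strided 3D convolution, so a plain `nn.Conv3d` stands in for it here.

```python
import torch
import torch.nn as nn

# Minimal early-fusion sketch (illustrative, not the authors' implementation):
# build a 2-channel CT+PET patch embedding from a pre-trained 1-channel CT one.
def early_fusion_embed(ct_embed: nn.Conv3d, init: str = "cross_modal") -> nn.Conv3d:
    fused = nn.Conv3d(
        in_channels=2,                       # CT + PET as two input channels
        out_channels=ct_embed.out_channels,  # keep the same embedding dimension
        kernel_size=ct_embed.kernel_size,
        stride=ct_embed.stride,
    )
    with torch.no_grad():
        fused.weight[:, 0:1] = ct_embed.weight       # CT channel keeps pre-trained weights
        if init == "cross_modal":
            fused.weight[:, 1:2] = ct_embed.weight   # reuse CT weights for the PET channel
        elif init == "zero":
            fused.weight[:, 1:2] = 0.0               # PET channel starts with no effect
        # init == "random": the PET channel keeps its default random initialization
        if ct_embed.bias is not None:
            fused.bias.copy_(ct_embed.bias)
    return fused
```

The decoder's input skip connection would need the same channel expansion, as noted above.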
Late Fusion: Another approach is to train separate CT and PET scan models and combine their outputs at the mask level. This provides flexibility when only one modality is available during training or inference. However, this method doubles the parameters and may not deliver optimal segmentation accuracy consistently.
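For comparison, late fusion keeps two independent networks and merges their predictions only at the output. A minimal sketch, assuming the two models produce segmentation logits of identical shape:

```python
import torch

# Late-fusion sketch (illustrative): combine per-modality predictions at the
# mask level, here by simply averaging the two sets of segmentation logits.
def late_fusion(ct_model, pet_model, ct: torch.Tensor, pet: torch.Tensor) -> torch.Tensor:
    ct_logits = ct_model(ct)     # (B, classes, D, H, W) prediction from the CT-only model
    pet_logits = pet_model(pet)  # same shape, from the PET-only model
    return 0.5 * (ct_logits + pet_logits)
```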
Proposed Adaptation Method: PEMMA
To retain the strengths of these standard adaptation methods while avoiding their weaknesses, the researchers at MBZUAI introduce a novel framework for efficiently adapting a uni-modal model into a multi-modal one, leveraging the inherent modularity of transformer architectures. This method, referred to as Parameter-Efficient Multi-Modal Adaptation (PEMMA), draws inspiration from visual prompt tuning (VPT) and low-rank adaptation (LoRA). PEMMA facilitates the efficient integration of the PET modality into a pre-existing CT-based model while maintaining parameter efficiency and avoiding cross-modal entanglement.
PEMMA consists of three key components:
- Modality Introduction via Prompts: The PET modality is introduced as a set of visual prompts (context tokens) by adding a new PET patch embedding layer to the pre-trained CT model. This layer generates PET patch tokens processed through the subsequent transformer blocks, similar to how visual prompts function in VPT. This approach enables the model to retain PET information without significantly altering the overall architecture.
- Attention Layer Fine-Tuning through LoRA: Rather than fine-tuning all parameters of the transformer encoder, PEMMA focuses on updating only the attention layers using LoRA matrices. In LoRA, updates to the attention layer’s key and value weight matrices are decomposed into two low-rank matrices, allowing for parameter-efficient updates without affecting most of the model’s weights. This ensures efficient integration of the newly introduced PET information with minimal parameter overhead.
- Parallel Input Skip Layers: Unlike the standard early fusion approach, where the CT skip connection layer is replaced with a combined CT+PET skip layer, PEMMA introduces an additional parallel skip connection for the PET modality. The outputs of the CT and PET skip layers are combined linearly, ensuring that the two modalities remain disentangled. This design minimizes cross-modal entanglement, allowing the model to be fine-tuned on one modality later without negatively impacting the other.
When a PET image is passed through the new PET patch embedding layer, it produces a set of PET patch tokens. These tokens are processed alongside the CT tokens by the transformer encoder blocks. Although the decoder handles only the CT tokens, the self-attention mechanism ensures that PET information is distilled into the CT tokens, mirroring the behavior of VPT, where contextual prompts influence the other tokens but are discarded before the final output.
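The sketch below illustrates this prompting mechanism under simplifying assumptions (the token shapes, module names, and plain `nn.Conv3d` PET embedding are placeholders, not the authors' code): PET tokens are appended to the CT tokens for the encoder and dropped before decoding.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """PET patches act as visual prompts for a frozen CT transformer encoder."""
    def __init__(self, ct_patch_embed: nn.Module, encoder_blocks: nn.ModuleList,
                 embed_dim: int, patch_size=(16, 16, 16)):
        super().__init__()
        self.ct_patch_embed = ct_patch_embed   # pre-trained, kept frozen; assumed to return (B, N, D)
        self.pet_patch_embed = nn.Conv3d(      # newly added, trainable
            1, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.blocks = encoder_blocks           # pre-trained transformer blocks

    def forward(self, ct: torch.Tensor, pet: torch.Tensor) -> torch.Tensor:
        ct_tok = self.ct_patch_embed(ct)                                  # (B, N, D)
        pet_tok = self.pet_patch_embed(pet).flatten(2).transpose(1, 2)    # (B, N, D)
        x = torch.cat([ct_tok, pet_tok], dim=1)   # append PET tokens as prompts
        for blk in self.blocks:
            x = blk(x)                            # self-attention mixes CT and PET information
        return x[:, : ct_tok.shape[1]]            # the decoder receives only the CT tokens
```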
The LoRA framework modifies only the key and value matrices for the attention layers using low-rank updates while freezing all other transformer parameters. This enables the model to incorporate the PET modality with minimal additional learnable parameters.
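A minimal LoRA layer, shown below as a sketch rather than the authors' implementation, wraps a frozen linear projection and adds a trainable low-rank update; in PEMMA only the key and value projections of each attention block would be wrapped this way.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # the pre-trained weights stay frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

Because `lora_B` starts at zero, the wrapped layer initially behaves exactly like the pre-trained one, and only the small `lora_A`/`lora_B` matrices are updated during adaptation.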
To prevent cross-modal entanglement, PEMMA introduces a new PET-specific skip connection that operates in parallel with the original CT skip connection. The outputs from these two layers are combined, allowing the model to utilize both modalities during training and inference effectively. Notably, using parallel skip layers enables the model to be fine-tuned later using only one modality without degrading performance on the other modality.
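One simple way to realize this, sketched below with illustrative module names, is to keep the frozen CT skip layer and add a parallel, trainable PET skip layer whose output is summed with it; feeding zeros for an absent modality leaves the other path untouched.

```python
import torch
import torch.nn as nn

class ParallelInputSkip(nn.Module):
    """Original CT input-skip layer plus a parallel PET skip layer, combined linearly."""
    def __init__(self, ct_skip: nn.Module, pet_skip: nn.Module):
        super().__init__()
        self.ct_skip = ct_skip    # pre-trained CT skip connection (frozen)
        self.pet_skip = pet_skip  # newly added PET skip connection (trainable)

    def forward(self, ct: torch.Tensor, pet: torch.Tensor) -> torch.Tensor:
        # Summation is one simple linear combination; zeros can stand in for a missing modality.
        return self.ct_skip(ct) + self.pet_skip(pet)
```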
Flexible Training and Inference Strategy
The PEMMA framework introduces three new sets of parameters: the PET patch embedding parameters, the LoRA parameters, and the PET skip-layer parameters. Both CT and PET training data are required when transitioning from a uni-modal to a multi-modal model. During this phase, all of the newly introduced parameters are learned, while the parameters of the pre-trained uni-modal model remain entirely frozen. The resulting multi-modal model therefore consists of the original uni-modal parameters plus the new parameters, amounting to only a marginal increase in the total parameter count.
Once the multi-modal model has been established, only the LoRA parameters need to be adjusted when new data becomes available for updating the model. All other parameters, including those of the pre-trained model and the newly added PET embedding and skip-layer parameters, can remain frozen. This design offers significant flexibility: the multi-modal model can be fine-tuned using one or both modalities. If a modality is unavailable during training, its corresponding input can simply be set to zero, and the model continues to function effectively.
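The snippet below sketches these two fine-tuning regimes under assumed parameter names (`pet_patch_embed`, `lora_`, and `pet_skip` are illustrative, not taken from the released code): the initial adaptation trains all newly added parameters, later updates touch only the LoRA parameters, and a missing modality is replaced by zeros.

```python
import torch
import torch.nn as nn

def set_trainable(model: nn.Module, phase: str) -> None:
    """Freeze/unfreeze parameters for the two PEMMA fine-tuning phases (names assumed)."""
    new_param_keys = ("pet_patch_embed", "lora_", "pet_skip")
    for name, p in model.named_parameters():
        if phase == "adapt":        # uni-modal -> multi-modal adaptation
            p.requires_grad = any(k in name for k in new_param_keys)
        elif phase == "update":     # subsequent updates on new data
            p.requires_grad = "lora_" in name

def fill_missing(x, reference: torch.Tensor) -> torch.Tensor:
    """If a modality is unavailable, feed zeros of the expected shape instead."""
    return x if x is not None else torch.zeros_like(reference)
```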
In terms of inference, the PEMMA framework is similarly adaptable. The multi-modal model can utilize both modalities effectively when available, enhancing segmentation accuracy. However, it can also operate using only a single modality if needed. While this may result in some degradation in segmentation performance compared to using both modalities, it ensures that the model remains robust and functional under varying conditions. This flexibility makes PEMMA a practical solution for real-world applications where data availability may fluctuate.
Experimental Set-up
Pre-processing: The dataset used in this study is publicly available on the MICCAI 2022 HEad and neCK TumOR (HECKTOR) challenge website. It consists of 522 samples distributed across various centers and scanner types, as outlined in Table 1. To enhance the training process, four random crops of size 96×96×96 are extracted from each scan. These augmentations are designed to diversify the training data and provide a more comprehensive representation.
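A preprocessing pipeline along these lines can be written with MONAI transforms; the dictionary keys "ct", "pet", and "label" below are assumptions about how the data is organized, not details from the paper.

```python
from monai.transforms import (Compose, LoadImaged, EnsureChannelFirstd,
                              RandSpatialCropSamplesd)

# Sketch of the crop-based augmentation: four random 96x96x96 crops per scan.
train_transforms = Compose([
    LoadImaged(keys=["ct", "pet", "label"]),
    EnsureChannelFirstd(keys=["ct", "pet", "label"]),
    RandSpatialCropSamplesd(
        keys=["ct", "pet", "label"],
        roi_size=(96, 96, 96),
        num_samples=4,       # four random crops extracted from each scan
        random_size=False,   # keep the crop size fixed
    ),
])
```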
Implementation Details: The researchers use PyTorch version 2.1.0 with the MONAI library. All models are trained for a maximum of 18,000 steps, with the best-performing model selected based on the highest average Dice score on the validation set. Training uses the AdamW optimizer with a learning rate of 1e-4, a weight decay of 1e-5, and a batch size of 2 across all experiments. The entire pipeline runs on a single Nvidia A6000 RTX GPU with 48GB of memory.
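For reference, the reported optimizer settings translate into the following sketch; the `model` here is only a placeholder, and the loss function and learning-rate schedule are not specified in the text.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(2, 2, kernel_size=3, padding=1)   # stand-in for the segmentation model

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],  # only the un-frozen parameters
    lr=1e-4,
    weight_decay=1e-5,
)

MAX_STEPS = 18_000   # training budget; the checkpoint with the best validation Dice is kept
BATCH_SIZE = 2
```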
Results and Discussion
This study utilizes four centers from the HECKTOR dataset (CHUM, CHUV, CHUP, and CHUS) to pre-train an initial uni-modal model on CT scans. The model is then adapted to both modalities in a multi-modal adaptation setting using data from the MDA center. The method is tested on joint (CT+PET) and single-modality inputs, and its performance is compared with the standard adaptation approaches (early and late fusion).
The results show that PEMMA achieves performance comparable to early fusion while being roughly 12 times more efficient in terms of trainable parameters. To evaluate adaptation to new data, CT and PET scans from two additional centers (HGJ and HMR) are introduced under two training scenarios: CT only and both modalities. PEMMA significantly outperforms the early and late fusion techniques, with increases of +19% and +28% in average Dice score on the new datasets.
These findings emphasize LoRA’s effectiveness in integrating additional modality information while reducing the need for extensive retraining, cutting trainable parameters by 92%. The proposed approach also proves beneficial in continual learning settings, where minimal parameter updates are required, facilitating effective adaptation to new tasks without catastrophic forgetting. Overall, the study successfully achieves parameter efficiency and minimizes cross-modality entanglement.
Conclusion
This study presents the Parameter-Efficient Multi-Modal Adaptation (PEMMA) method, designed to enhance a segmentation model’s adaptability and efficiency across diverse data sources and modalities without requiring extensive retraining or significantly increasing the parameter count. The technique enables the model to retain prior knowledge while learning from new data modalities, thereby improving its capacity to process evolving multi-modal information.
The authors aim to evaluate this method with other medical imaging modalities, such as MRI, and broaden their research to encompass a variety of datasets. Additional assessments will be crucial to determine the model’s reliability and flexibility across a more comprehensive array of medical imaging scenarios, ultimately leading to more effective diagnostic tools.
An intriguing direction for future research is the exploration of Parameter-Efficient Multi-disease Adaptation, which would allow deep neural networks (DNNs) to adapt efficiently and effectively when trained on data from different diseases. This approach could significantly enhance the adaptation framework’s clinical applicability and utility in real-world medical settings.