Meet MindEye2, an AI that can now read your mind! Its shared-subject modelling approach enables fMRI-to-image reconstruction with just 1 hour of data. Let’s take a look at how it works!
Highlights:
- Medical AI Research Center (MedARC) announced MindEye2, the successor to MindEye1.
- It marks a substantial advance in fMRI-to-image reconstruction by introducing the concept of shared-subject modelling.
- It delivers a significant improvement in decoding brain activity into images.
MindEye2 Explained
Advances in reconstructing visual perception from brain activity have been remarkable, yet their practical applicability has so far been limited.
This is primarily because these models are typically trained individually for each subject, demanding extensive functional magnetic resonance imaging (fMRI) training data, often dozens of hours per person, to achieve satisfactory results.
However, MedARC’s latest study demonstrates high-quality reconstructions with just one hour of fMRI training data:
“Announcing MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data! 🧠👁️ High-quality reconstructions of images from fMRI using 1 hour of training data vs. past work using 30-40 hrs/person! arXiv: https://t.co/X5pYSF7qIE | project page: https://t.co/haGNjyCUXM”
— MedARC (@MedARC_AI), March 19, 2024
MindEye2 presents a novel functional alignment method to overcome these challenges. It involves pretraining a shared-subject model, which can then be fine-tuned using limited data from a new subject and generalized to additional data from that subject.
This strategy achieves reconstruction quality comparable to that of a single-subject model trained with 40 times more training data.
They pre-train their model using seven subjects’ data, then fine-tune it on a minimal dataset from a new subject.
MedARC’s research paper explained their innovative functional alignment technique, which involves linearly mapping all brain data to a shared-subject latent space, succeeded by a shared non-linear mapping to the CLIP (Contrastive Language-Image Pre-training) image space.
Subsequently, they fine-tune Stable Diffusion XL to accept CLIP latents as inputs instead of text, facilitating the mapping from CLIP space to pixel space.
This methodology enhances generalization across subjects with limited training data, achieving state-of-the-art image retrieval and reconstruction metrics compared to single-subject approaches.
The MindEye2 Pipeline
MindEye2 utilizes a single model trained via pretraining and fine-tuning, mapping brain activity to the embedding space of pre-trained deep-learning models. During inference, these brain-predicted embeddings are input into frozen image generative models for translation to pixel space.
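To make the idea concrete, here is a toy sketch of that inference path in PyTorch. The voxel count, the pooled 1664-dimensional embedding, and the frozen_generator stand-in are illustrative placeholders, not MindEye2’s actual components.

```python
# Toy sketch: brain activity -> predicted embedding -> frozen generator -> pixels.
# Everything here is a placeholder; a pooled 1664-d embedding stands in for the
# full OpenCLIP token grid used by MindEye2.
import torch
import torch.nn as nn

brain_to_embedding = nn.Linear(15000, 1664)            # trained brain-to-embedding map (placeholder)

def frozen_generator(embedding: torch.Tensor) -> torch.Tensor:
    """Stand-in for the frozen image generative model (embedding -> image)."""
    return torch.rand(embedding.shape[0], 3, 1024, 1024)

@torch.no_grad()
def reconstruct(voxels: torch.Tensor) -> torch.Tensor:
    embedding = brain_to_embedding(voxels)             # brain-predicted embedding
    return frozen_generator(embedding)                 # frozen model decodes to pixel space

images = reconstruct(torch.randn(1, 15000))            # -> shape (1, 3, 1024, 1024)
```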
The reconstruction strategy involves pretraining the model with data from 7 subjects (30-40 hours each), followed by fine-tuning with data from an additional held-out subject.
Single-subject models were trained or fine-tuned on a single 8xA100 80GB GPU node for 150 epochs with a batch size of 24. Multi-subject pretraining used a batch size of 63 (9 samples per subject). Training employed Hugging Face Accelerate and DeepSpeed Stage 2 with CPU offloading.
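As a rough illustration of that setup, the sketch below wires a placeholder model into Hugging Face Accelerate with a DeepSpeed ZeRO Stage 2 plugin and CPU offloading. The model, synthetic data, loss, and learning rate are assumptions for demonstration only, not the MindEye2 training code.

```python
# Sketch: training with Hugging Face Accelerate + DeepSpeed ZeRO Stage 2 (CPU offload).
# The model, data, loss, and learning rate are illustrative placeholders.
import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=2, offload_optimizer_device="cpu")
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

model = torch.nn.Linear(4096, 1664)                    # stand-in for the MindEye2 pipeline
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(240, 4096), torch.randn(240, 1664)),
    batch_size=24,                                     # single-subject batch size from the paper
)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for epoch in range(150):                               # 150 epochs, as reported
    for voxel_latents, clip_targets in dataloader:
        loss = torch.nn.functional.mse_loss(model(voxel_latents), clip_targets)
        accelerator.backward(loss)                     # Accelerate routes this through DeepSpeed
        optimizer.step()
        optimizer.zero_grad()
```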
The MindEye2 pipeline is shown in the following image:
The schematic of MindEye2 begins with training the model using data from 7 subjects in the Natural Scenes Dataset, followed by fine-tuning on a held-out subject with limited data. Ridge regression maps fMRI activity to a shared-subject latent space.
An MLP backbone and diffusion prior generate OpenCLIP ViT-bigG/14 embeddings, utilized by SDXL unCLIP for image reconstruction. The reconstructed images undergo refinement with base SDXL.
Submodules retain low-level information and aid retrieval tasks. Snowflakes represent frozen models for inference, while flames indicate actively trained components.
Shared-Subject Functional Alignment
To accommodate diverse brain structures, MindEye2 employs an initial alignment step using subject-specific ridge regression. Unlike anatomical alignment methods, it maps flattened fMRI activity patterns to a shared-subject latent space.
MedARC said the following about it:
“The key innovation was to pretrain a latent space shared across multiple people. This reduced the complexity of the task since we could now train our MindEye2 model from a good starting point.”
Each subject has a separate linear layer for this mapping, ensuring robust performance in various settings. The model pipeline remains shared across subjects, allowing flexibility for new data collection without predefined image sets.
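Below is a minimal sketch of this alignment idea, with made-up voxel counts and L2 weight decay standing in for the ridge penalty; in the actual pipeline, the per-subject linear maps are trained jointly with the shared model.

```python
# Sketch: one linear layer per subject maps that subject's flattened fMRI voxels
# into a shared 4096-d latent space; everything downstream is shared across subjects.
# Voxel counts and the weight decay (acting as the ridge penalty) are illustrative.
import torch
import torch.nn as nn

SHARED_DIM = 4096
voxel_counts = {"subj01": 15000, "subj02": 14200}      # brains differ, so input sizes differ

subject_align = nn.ModuleDict({
    name: nn.Linear(n_voxels, SHARED_DIM) for name, n_voxels in voxel_counts.items()
})
shared_backbone = nn.Sequential(                        # stand-in for the shared MLP backbone
    nn.Linear(SHARED_DIM, SHARED_DIM), nn.GELU(), nn.Linear(SHARED_DIM, SHARED_DIM)
)

optimizer = torch.optim.AdamW(
    list(subject_align.parameters()) + list(shared_backbone.parameters()),
    lr=1e-4, weight_decay=1e-2,                         # L2 penalty plays the role of the ridge term
)

voxels = torch.randn(4, voxel_counts["subj01"])         # a batch of trials from subj01
shared_latent = subject_align["subj01"](voxels)         # (4, 4096), now in the shared space
features = shared_backbone(shared_latent)               # shared processing continues from here
```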
Backbone, Diffusion Prior, & Submodules
In MindEye2, brain activity patterns are first mapped to a shared-subject space with 4096 dimensions. Then, they pass through an MLP backbone with 4 residual blocks. These representations are further transformed into a 256×1664-dimensional space of OpenCLIP ViT-bigG/14 image token embeddings.
Concurrently, they are processed through a diffusion prior and two MLP projectors for retrieval and low-level submodules.
Unlike MindEye1, MindEye2 uses OpenCLIP ViT-bigG/14, adds a low-level MLP submodule, and employs three losses from the diffusion prior, retrieval submodule, and low-level submodule.
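The sketch below shows what such a residual MLP backbone could look like. The 4096-dimensional input and the 256×1664 output follow the dimensions described above, while the block layout, normalization, and dropout are assumptions rather than the exact architecture.

```python
# Sketch: residual MLP backbone from the 4096-d shared latent to the 256 x 1664
# OpenCLIP ViT-bigG/14 token embedding space. Block internals are assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int = 4096, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)                          # residual connection

class Backbone(nn.Module):
    def __init__(self, dim=4096, n_blocks=4, tokens=256, token_dim=1664):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(n_blocks)])
        self.to_clip = nn.Linear(dim, tokens * token_dim)   # large final projection
        self.tokens, self.token_dim = tokens, token_dim

    def forward(self, shared_latent):                   # (batch, 4096)
        h = self.blocks(shared_latent)
        return self.to_clip(h).view(-1, self.tokens, self.token_dim)  # (batch, 256, 1664)

backbone = Backbone()                                   # note: the projection alone is ~1.7B params
clip_tokens = backbone(torch.randn(2, 4096))            # -> torch.Size([2, 256, 1664])
```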
Image Captioning
To predict image captions from brain activity, they first convert the predicted OpenCLIP ViT-bigG/14 embeddings from the diffusion prior into CLIP ViT-L/14 space. These embeddings are then fed into a pre-trained Generative Image-to-Text (GIT) model, a method previously shown to work well with brain activity data.
Since there was no existing GIT model compatible with OpenCLIP ViT-bigG/14 embeddings, they independently trained a linear model to convert them to CLIP ViT-L/14 embeddings. This step was crucial for compatibility.
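As an illustration, such a converter could be fit by least squares on paired embeddings of the same images. The pooled 1664- and 768-dimensional shapes and the closed-form fit below are assumptions; the paper only states that a linear model was trained for this conversion.

```python
# Sketch: fit a linear map from OpenCLIP ViT-bigG/14 embeddings to CLIP ViT-L/14
# embeddings using paired embeddings of the same images. Shapes and the
# least-squares fit are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
bigG_embeds = rng.standard_normal((10000, 1664))        # placeholder bigG image embeddings
vitL_embeds = rng.standard_normal((10000, 768))         # placeholder ViT-L/14 embeddings of the same images

# Ordinary least squares: W minimizes ||bigG_embeds @ W - vitL_embeds||^2
W, *_ = np.linalg.lstsq(bigG_embeds, vitL_embeds, rcond=None)

converted = bigG_embeds @ W                             # these would be fed to the GIT captioner
```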
Caption prediction from brain activity enhances decoding approaches and assists in refining image reconstructions to match desired semantic content.
Fine-tuning Stable Diffusion XL for unCLIP
CLIP aligns images and text in a shared embedding space, while unCLIP maps from that embedding space back to pixel space, generating variations of the original image. Unlike prior unCLIP models, this one aims to faithfully reproduce both the low-level structure and the high-level semantics of the reference image.
To achieve this, the team fine-tunes the Stable Diffusion XL (SDXL) model with cross-attention layers conditioned solely on image embeddings from OpenCLIP ViT-bigG/14, omitting text conditioning because it degraded fidelity.
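Conceptually, this amounts to cross-attention whose keys and values come from image-token embeddings rather than text tokens. The pure-PyTorch sketch below illustrates only that idea; it is not the SDXL implementation, and the dimensions are illustrative.

```python
# Conceptual sketch: cross-attention conditioned on image embeddings instead of
# text tokens. Not the SDXL code; dimensions are illustrative.
import torch
import torch.nn as nn

class ImageConditionedCrossAttention(nn.Module):
    def __init__(self, latent_dim: int = 1280, cond_dim: int = 1664, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            latent_dim, heads, kdim=cond_dim, vdim=cond_dim, batch_first=True
        )

    def forward(self, unet_latents, image_tokens):
        # Queries come from the U-Net latents; keys/values come from the
        # OpenCLIP ViT-bigG/14 image tokens (replacing the text-encoder tokens).
        out, _ = self.attn(unet_latents, image_tokens, image_tokens)
        return unet_latents + out

layer = ImageConditionedCrossAttention()
latents = torch.randn(1, 64, 1280)                      # flattened spatial latent tokens
image_tokens = torch.randn(1, 256, 1664)                # brain-predicted bigG token embeddings
conditioned = layer(latents, image_tokens)              # -> torch.Size([1, 64, 1280])
```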
Model Inference
The reconstruction pipeline starts with the diffusion prior’s predicted OpenCLIP ViT-bigG/14 image latents, which are fed into SDXL unCLIP to produce initial pixel images. These may look distorted (“unrefined”) because the mapping into bigG space is imperfect.
To improve realism, the unrefined reconstructions are passed through base SDXL for image-to-image translation, guided by MindEye2’s predicted captions. This refinement skips the initial 50% of the denoising diffusion timesteps and enhances image quality without substantially affecting quantitative image metrics.
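In the diffusers library, a comparable refinement step could look like the sketch below, where strength=0.5 roughly corresponds to skipping the first half of the denoising schedule. The model ID, file names, and caption are placeholders, not the authors’ exact settings.

```python
# Sketch: refine an "unrefined" reconstruction by running it through base SDXL
# in image-to-image mode, guided by the predicted caption. strength=0.5 roughly
# means the first half of the denoising timesteps is skipped.
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

unrefined = Image.open("unrefined_reconstruction.png").convert("RGB")  # hypothetical file
predicted_caption = "a person walking a dog on a beach"                # would come from the GIT captioner

refined = pipe(
    prompt=predicted_caption,
    image=unrefined,
    strength=0.5,                     # start denoising halfway through the schedule
).images[0]
refined.save("refined_reconstruction.png")
```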
Evaluation of MindEye2
MedARC utilized the Natural Scenes Dataset (NSD), an fMRI dataset containing responses from 8 subjects who viewed images for 3 seconds each (750 per scanning session) over 30-40 hours of scanning spread across separate sessions. While most images were unique to each subject, around 1,000 were seen by all.
They adopted the standard NSD train/test split, with the shared images serving as the test set. Model performance was evaluated across various metrics averaged over the 4 subjects who completed all sessions. The test set comprised the 1,000 shared images, while training samples totalled 30,000, selected chronologically to ensure generalization to held-out test sessions.
fMRI-to-Image Reconstruction
MindEye2’s performance on the full NSD dataset demonstrates state-of-the-art results across various metrics, surpassing previous approaches and even its own predecessor, MindEye1.
Interestingly, while refined reconstructions generally outperform unrefined ones, subjective preferences among human raters suggest a nuanced interpretation of reconstruction quality.
These findings highlight the effectiveness of MindEye2’s advancements in shared-subject modelling and training procedures. Further evaluations and comparisons reinforce the superiority of MindEye2 reconstructions, demonstrating its potential for practical applications in fMRI-to-image reconstruction.
The image below shows reconstructions from different model approaches using 1 hour of training data from NSD.
- Image Captioning: MindEye2’s predicted image captions are compared to previous approaches, including UniBrain and Ferrante, using various metrics such as ROUGE, METEOR, CLIP, and Sentence Transformer. MindEye2 consistently outperforms previous models across most metrics, indicating superior captioning performance and high-quality image descriptions derived from brain activity.
- Image/Brain Retrieval: Image retrieval metrics assess the extent of detailed image information captured in fMRI embeddings. MindEye2 enhances MindEye1’s retrieval performance, achieving nearly perfect scores on benchmarks from previous studies. Even when trained with just 1 hour of data, MindEye2 maintains competitive retrieval performance.
- Brain Correlation: To evaluate reconstruction fidelity, MedARC uses encoding models to predict brain activity from the reconstructions (an idea sketched below). This method offers insights beyond traditional image metrics, assessing alignment independently of the stimulus image. “Unrefined” reconstructions often perform best, indicating that refinement may compromise brain alignment while enhancing perceptual qualities.
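The sketch below illustrates the brain-correlation idea with a placeholder encoding model and random data; the real evaluation relies on pretrained voxel-wise encoding models and measured fMRI responses.

```python
# Sketch of the brain-correlation metric: an encoding model predicts voxel activity
# from a reconstruction's image features, and that prediction is correlated with the
# measured fMRI response to the original stimulus. All data here is random filler.
import numpy as np

def encoding_model(image_features: np.ndarray) -> np.ndarray:
    """Placeholder: maps image features (e.g. CLIP) to predicted voxel activity."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((image_features.shape[-1], 15000))
    return image_features @ W

recon_features = np.random.default_rng(1).standard_normal((1, 768))     # features of the reconstruction
measured_voxels = np.random.default_rng(2).standard_normal((1, 15000))  # actual fMRI response

predicted_voxels = encoding_model(recon_features)
brain_corr = np.corrcoef(predicted_voxels.ravel(), measured_voxels.ravel())[0, 1]
print(f"brain correlation: {brain_corr:.3f}")
```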
How MindEye2 Beats Its Predecessor, MindEye1
MindEye2 improves upon its predecessor, MindEye1, in several ways:
- Pretraining on data from multiple subjects and fine-tuning on the target subject, rather than independently training the entire pipeline per subject.
- Mapping from fMRI activity to a richer CLIP space and reconstructing images using a fine-tuned Stable Diffusion XL unCLIP model.
- Integrating high- and low-level pipelines into a single pipeline using submodules.
- Predicting text captions for images to guide the final image reconstruction refinement.
These enhancements enable the following main contributions of MindEye2:
- Achieving state-of-the-art performance across image retrieval and reconstruction metrics using the full fMRI training data from the Natural Scenes Dataset – a large-scale fMRI dataset collected at ultra-high-field (7T) strength at the Center for Magnetic Resonance Research (CMRR) at the University of Minnesota.
- Enabling competitive decoding performance with only 2.5% of a subject’s full dataset (equivalent to 1 hour of scanning) through a novel multi-subject alignment procedure.
The image below shows MindEye2 vs. MindEye1 reconstructions from fMRI brain activity using varying amounts of training data. The MindEye2 results are clearly better, demonstrating the major improvement brought by the new approach:
Conclusion
In conclusion, MindEye2 revolutionizes fMRI-to-image reconstruction by introducing the concepts of shared-subject modelling and innovative training procedures. With recent research showing communication between two AI models, we can say there is a lot in store for us!