MOSAIC: A scalable framework for fMRI dataset aggregation and modeling of human vision

Benjamin Lahner¹,²,³,⁴, Mayukh Deb⁵,⁶, Apurva Ratan Murty⁵,⁶, Aude Oliva¹

¹ Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA.

² Department of Ophthalmology, Byers Eye Institute, Stanford University School of Medicine, Stanford University, Stanford, CA, USA.

³ Stanford Bio-X, Stanford University, Stanford, CA, USA.

⁴ Wu Tsai Neurosciences Institute, Stanford University, Stanford, CA, USA.

⁵ Cognition and Brain Science, School of Psychology, Georgia Tech, Atlanta, GA, USA.

⁶ Computational Cognition, Georgia Tech, Atlanta, GA, USA.

8 datasets · 93 subjects · 162,839 stimuli · 430,007 fMRI-stimulus pairs

Why MOSAIC?

Human fMRI neuroscience needs massive scale to keep up with modern deep learning and produce generalizable results. Isolated fMRI experiments cannot get us there. MOSAIC unifies existing (and future) fMRI datasets with a shared preprocessing pipeline and a cross-dataset test/train data split. Now researchers can train bigger brain models, test results across datasets, and contribute their own datasets to shape the future of MOSAIC and computational neuroscience.

Get started with MOSAIC in seconds!

Use our Python package to access eight of the largest fMRI datasets in just a few lines of code!

pip install mosaic-dataset

import mosaic

filenames = mosaic.download(
    names_and_subjects={
        "NaturalScenesDataset": [1],
    },
    folder="./MOSAIC",
)

Alternatively, you can browse the S3 bucket to download data manually or use the AWS command line interface.

aws s3 ls --no-sign-request s3://mosaicfmri/
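For scripted access without the AWS CLI, a similar listing can be sketched with Python's standard library alone. This is an illustration under stated assumptions, not part of the mosaic package. It assumes the bucket permits anonymous listing through S3's ListObjectsV2 REST endpoint at the virtual-hosted URL shown below (the region-less hostname may redirect if the bucket lives in another region). The helper names parse_listing and list_bucket are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 ListObjectsV2 responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Assumption: the bucket from the CLI command above is reachable at this
# virtual-hosted-style URL and allows anonymous (unsigned) listing.
BUCKET_URL = "https://mosaicfmri.s3.amazonaws.com/"


def parse_listing(xml_payload):
    """Return (prefixes, objects) from a ListObjectsV2 XML response.

    prefixes: list of folder-like common prefixes.
    objects:  list of (key, size_in_bytes) tuples.
    """
    root = ET.fromstring(xml_payload)
    prefixes = [
        el.findtext("s3:Prefix", namespaces=S3_NS)
        for el in root.findall("s3:CommonPrefixes", S3_NS)
    ]
    objects = [
        (
            el.findtext("s3:Key", namespaces=S3_NS),
            int(el.findtext("s3:Size", namespaces=S3_NS)),
        )
        for el in root.findall("s3:Contents", S3_NS)
    ]
    return prefixes, objects


def list_bucket(prefix="", delimiter="/"):
    """Fetch one page (at most 1000 entries) of the public bucket listing."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    with urllib.request.urlopen(f"{BUCKET_URL}?{query}", timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling list_bucket() with no arguments would return the top-level prefixes and objects, roughly mirroring the aws s3 ls output. Only the first page of results is fetched; a complete listing would need to follow the response's continuation token.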

Abstract

Recent large-scale vision fMRI datasets have been invaluable resources to the vision neuroscience community for their deep sampling of individual subjects and diverse stimulus sets. However, practical limitations to the number of subjects, stimuli, and trials that can be collected prevent individual fMRI datasets from reaching the scale necessary for modern modeling approaches and robust conclusions. Here, we introduce MOSAIC (Meta-Organized Stimuli And fMRI Imaging data for Computational modeling), an fMRI dataset aggregation framework designed to leverage the richness of individual datasets for computationally intensive modeling and robust tests of generalization. MOSAIC is composed of eight large-scale vision fMRI datasets totaling 93 subjects, 430,007 fMRI-stimulus pairs, and 162,839 naturalistic and artificial stimuli. A shared fMRI preprocessing pipeline and a filtered test-train split minimize dataset-specific confounds and test-set leakage when aggregating the datasets. Crucially, additional datasets can be integrated into MOSAIC post hoc, allowing MOSAIC to evolve according to the community's interests. We use MOSAIC to show that perceptually diverse stimulus sets consistently improve decoding accuracy and stability, carrying implications for future fMRI stimulus set design. We then jointly train brain-optimized encoding models across subjects and datasets to predict fMRI activity of all of visual cortex and even the whole brain. In silico functional localizer experiments performed on these digital twin models can recover subject-specific category-selective cortical regions, thereby validating our approach. Together, MOSAIC provides a scalable and community-driven solution to build robust, larger-scale models of human vision.
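To make the idea of a leakage-filtered split concrete, here is a minimal sketch that assumes the simplest possible criterion: a stimulus is dropped from the pooled training set if its content hash matches any stimulus in any dataset's test set. This is an illustrative assumption and not the MOSAIC procedure, which is described in the paper and may use a different similarity criterion. The function name, dataset names, and toy byte strings are all hypothetical.

```python
import hashlib


def filter_train_against_test(datasets):
    """Pool train/test splits across datasets, dropping leaked train items.

    datasets maps a dataset name to {"train": {...}, "test": {...}}, where
    each split maps a stimulus ID to the raw stimulus bytes.
    Returns (train_ids, test_ids) as sorted lists of (dataset, stimulus_id).
    """

    def digest(data):
        return hashlib.sha256(data).hexdigest()

    # Collect the hash of every test stimulus from every dataset.
    test_hashes = set()
    test_ids = []
    for name, splits in datasets.items():
        for stim_id, data in splits["test"].items():
            test_hashes.add(digest(data))
            test_ids.append((name, stim_id))

    # Keep a train stimulus only if it matches no test stimulus anywhere.
    train_ids = [
        (name, stim_id)
        for name, splits in datasets.items()
        for stim_id, data in splits["train"].items()
        if digest(data) not in test_hashes
    ]
    return sorted(train_ids), sorted(test_ids)


# Hypothetical toy data: "dog" and "bird" each appear in one dataset's
# train split and the other dataset's test split.
toy = {
    "DatasetA": {"train": {"a1": b"cat", "a2": b"dog"}, "test": {"a3": b"bird"}},
    "DatasetB": {"train": {"b1": b"bird", "b2": b"fish"}, "test": {"b3": b"dog"}},
}
train_ids, test_ids = filter_train_against_test(toy)
```

Exact hashing only catches byte-identical duplicates. Catching resized or re-encoded copies of the same image would need a perceptual similarity measure, which this sketch deliberately leaves out.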

Acknowledgements

Thank you to Amazon's AWS Open Data Sponsorship Program for hosting the data. The work was funded by a Multidisciplinary University Research Initiative (MURI) award from the Army Research Office (grant No. W911NF-23-1-0277) to A.O. and a Pathway to Independence Award from the National Institutes of Health (NIH) (award No. R00EY032603) to N.A.R.M.

Citation

MOSAIC: A scalable framework for fMRI dataset aggregation and modeling of human vision
Benjamin Lahner, Mayukh Deb, N. Apurva Ratan Murty, Aude Oliva
bioRxiv 2025.11.28.690060; doi: https://doi.org/10.64898/2025.11.28.690060