Recent discussions on scaling up have sparked widespread debate, with some claiming that "scaling up is dead." We argue, however, that high-quality data is the true key to effective scaling, particularly textbook-grade knowledge corpora. In our recent work, we introduce a novel multimodal knowledge corpus: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining. Leveraging a vast collection of online instructional videos, we extract keyframes and their corresponding audio transcriptions to construct a coherent, interleaved multimodal pretraining dataset.
This dataset is organized into a series of "textbooks" spanning subjects such as mathematics, physics, and chemistry, enabling VLMs to learn from a more coherent, interleaved mix of text and images.
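To make the video-to-textbook idea concrete, here is a minimal sketch of one way such an interleaving step could look. It is not the paper's actual pipeline: it assumes fixed-interval keyframe sampling with OpenCV and transcription with openai-whisper, and the function names (`extract_keyframes`, `build_interleaved_sample`) are hypothetical.

```python
"""Sketch: turn one instructional video into a time-ordered image/text stream."""
import os
import cv2        # pip install opencv-python
import whisper    # pip install openai-whisper


def extract_keyframes(video_path: str, out_dir: str, interval_sec: float = 5.0):
    """Sample one frame every `interval_sec` seconds; return (timestamp, path) pairs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * interval_sec), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            ts = idx / fps
            path = os.path.join(out_dir, f"frame_{idx:06d}.jpg")
            cv2.imwrite(path, frame)
            frames.append((ts, path))
        idx += 1
    cap.release()
    return frames


def transcribe(video_path: str):
    """Return Whisper segments: dicts with 'start', 'end', and 'text' keys."""
    model = whisper.load_model("base")
    return model.transcribe(video_path)["segments"]


def build_interleaved_sample(video_path: str, out_dir: str):
    """Merge keyframes and transcript segments into one timestamp-sorted sequence."""
    events = [("image", ts, path) for ts, path in extract_keyframes(video_path, out_dir)]
    events += [("text", seg["start"], seg["text"].strip()) for seg in transcribe(video_path)]
    events.sort(key=lambda e: e[1])  # interleave strictly by time of appearance
    return [{"type": kind, "time": ts, "content": content} for kind, ts, content in events]


if __name__ == "__main__":
    sample = build_interleaved_sample("lecture.mp4", "keyframes/")
    for item in sample[:10]:
        print(item["type"], round(item["time"], 1), str(item["content"])[:60])
```

In practice, a corpus-scale pipeline would add steps this sketch omits (keyframe deduplication, ASR cleanup, grouping videos by subject into "textbook" documents), but the core idea is the same: align visual and spoken content by timestamp so text and images arrive interleaved, as they do in a lecture.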
Paper:
Code: