ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
Figure 1: Example of a Vietnamese multimodal exam question from ViExam, combining text and visual elements.
Figure 2: Our three-stage data curation process: (1) Sourcing raw exam PDFs and converting them to images, (2) an automated pipeline to detect and classify questions, and (3) a final manual verification loop by native Vietnamese speakers.
We conduct the first comprehensive evaluation of vision-language model (VLM) performance on Vietnamese multimodal educational assessments, addressing a critical gap in understanding how these models generalize beyond English-dominant training data. We introduce ViExam, a benchmark of 2,548 multimodal questions spanning seven academic domains: Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test.
Our evaluation shows that current VLMs face substantial challenges in this cross-lingual multimodal setting. State-of-the-art VLMs achieve only 57.74% mean accuracy, while open-source models reach just 27.70%, both of which are well below average human performance (66.54%). Only the reasoning-focused VLM o3 surpasses the human average with 74.07%, but it still lags far behind human best performance (99.60%).
We further analyze strategies for improving cross-lingual performance. Surprisingly, cross-lingual prompting (using English instructions while keeping the Vietnamese multimodal content) fails to improve outcomes and even reduces accuracy by about 1 percentage point for state-of-the-art VLMs. In contrast, human-in-the-loop collaboration provides modest gains, boosting accuracy by approximately 5 percentage points.
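To make the two prompting conditions concrete, the sketch below shows one plausible way to construct them. The instruction wording, the `build_prompt` helper, and the chat-message layout are illustrative assumptions, not the authors' actual prompts; only the design (English vs. Vietnamese instruction, with the exam question always given as the original Vietnamese image) follows the setup described above.

```python
# Hypothetical sketch of the two prompting conditions; the exact
# instruction text and message schema are assumptions for illustration.

VI_INSTRUCTION = (
    "Hãy trả lời câu hỏi trắc nghiệm sau bằng cách chọn một đáp án "
    "(A, B, C hoặc D)."
)
EN_INSTRUCTION = (
    "Answer the following multiple-choice question by selecting one "
    "option (A, B, C, or D)."
)

def build_prompt(question_image_ref: str, cross_lingual: bool) -> list[dict]:
    """Build a VLM chat message.

    In the cross-lingual condition the instruction is English;
    otherwise it is Vietnamese. The multimodal exam content
    (question text and diagram, rendered as an image) is always
    the original Vietnamese, matching the setting in the paper.
    """
    instruction = EN_INSTRUCTION if cross_lingual else VI_INSTRUCTION
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                # The Vietnamese multimodal question is left untouched.
                {"type": "image", "image": question_image_ref},
            ],
        }
    ]
```

Only the instruction language differs between conditions, so any accuracy gap isolates the effect of cross-lingual prompting.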
These results highlight both the limitations of today's VLMs and the need for more inclusive multimodal datasets that extend beyond English. By releasing ViExam as an open benchmark, we aim to facilitate research on cross-lingual multimodal reasoning and support advances in educational technology for low-resource languages.
Code and data are available at: https://vi-exam.github.io