Demystifying the Visual Quality Paradox in Multimodal Large Language Models

This research explores a **"visual-quality paradox"** in Multimodal Large Language Models (MLLMs): **higher human-perceived image quality does not always lead to better MLLM performance**, and degraded images can sometimes improve results on complex reasoning tasks. The study attributes this to **degradations potentially sharpening MLLM attention on semantically relevant features**, as evidenced by relative-attention and logit-lens analyses. Furthermore, **conventional image restoration methods often fail to enhance MLLM performance** because they prioritize human-centric visual aesthetics over the features MLLMs actually rely on. To address this, the authors propose **Visual-Quality Test-Time Tuning (VQ-TTT)**, a lightweight adaptation module that dynamically modulates input image quality and fine-tunes the shallow layers of the vision encoder to align with MLLM task-specific preferences. VQ-TTT delivers **consistent performance gains with minimal computational overhead**, suggesting that MLLMs need adaptive, model-aligned image processing rather than universally "clean" inputs.
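
To make the mechanics concrete, here is a minimal, hypothetical PyTorch sketch of the VQ-TTT idea: at test time, tune only a lightweight input-quality modulator plus the shallowest vision-encoder blocks while everything else stays frozen. The toy encoder, the per-channel affine modulator, the entropy-minimization proxy objective, and all names (`QualityModulator`, `vq_ttt`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class QualityModulator(nn.Module):
    """Lightweight, learnable per-channel filter that modulates input image quality."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return image * self.scale + self.shift


class ToyVisionEncoder(nn.Module):
    """Patch embedding plus transformer blocks, standing in for a ViT-style vision encoder."""

    def __init__(self, dim: int = 256, depth: int = 6, patch: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        for block in self.blocks:
            x = block(x)
        return x.mean(dim=1)  # pooled visual feature


def vq_ttt(image, encoder, modulator, head, shallow_layers=2, steps=3, lr=1e-4):
    """Test-time tune only the modulator and the first `shallow_layers` encoder blocks."""
    # Freeze the full model, then unfreeze the lightweight parts.
    for p in list(encoder.parameters()) + list(head.parameters()):
        p.requires_grad_(False)
    trainable = list(modulator.parameters())
    for block in encoder.blocks[:shallow_layers]:
        for p in block.parameters():
            p.requires_grad_(True)
            trainable.append(p)

    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(steps):
        feats = encoder(modulator(image))
        logits = head(feats)  # stand-in for the frozen MLLM answer head
        probs = logits.softmax(dim=-1)
        # Assumed proxy objective: minimize prediction entropy on this single input.
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return encoder(modulator(image))


if __name__ == "__main__":
    encoder, modulator = ToyVisionEncoder(), QualityModulator()
    head = nn.Linear(256, 10)  # toy task head
    image = torch.rand(1, 3, 224, 224)
    feats = vq_ttt(image, encoder, modulator, head)
    print(feats.shape)  # torch.Size([1, 256])
```

Restricting the trainable set to the modulator and a couple of shallow blocks is what keeps the per-input overhead small: only a handful of parameters are updated at test time while the rest of the MLLM remains frozen.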
