Demystifying the Visual Quality Paradox in Multimodal Large Language Models

This research explores a **"visual-quality paradox"** in Multimodal Large Language Models (MLLMs): **higher human-perceived image quality does not always lead to better MLLM performance**, and degraded images can sometimes improve results on complex reasoning tasks. The study attributes this to **degradations potentially sharpening MLLM attention on semantically relevant features**, as evidenced by relative-attention and logit-lens analyses. Furthermore, **conventional image restoration methods often fail to enhance MLLM performance** because they prioritize human-centric visual aesthetics over the features MLLMs actually rely on. To address this, the authors propose **Visual-Quality Test-Time Tuning (VQ-TTT)**, a lightweight adaptation module that dynamically modulates input image quality and fine-tunes the shallow layers of the vision encoder to align with MLLM task-specific preferences. VQ-TTT delivers **consistent performance gains with minimal computational overhead**, suggesting that MLLMs need adaptive, model-aligned image processing rather than universally "clean" inputs.
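
To make the mechanics concrete, here is a minimal, hypothetical PyTorch sketch of the VQ-TTT idea: at test time, tune only a lightweight input-quality modulator plus the shallowest vision-encoder blocks while everything else stays frozen. The toy encoder, the per-channel affine modulator, the entropy-minimization proxy objective, and all names (`QualityModulator`, `vq_ttt`, etc.) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class QualityModulator(nn.Module):
    """Lightweight, learnable per-channel filter that modulates input image quality."""

    def __init__(self, channels: int = 3):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return image * self.scale + self.shift


class ToyVisionEncoder(nn.Module):
    """Patch embedding plus transformer blocks, standing in for a ViT-style vision encoder."""

    def __init__(self, dim: int = 256, depth: int = 6, patch: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        for block in self.blocks:
            x = block(x)
        return x.mean(dim=1)  # pooled visual feature


def vq_ttt(image, encoder, modulator, head, shallow_layers=2, steps=3, lr=1e-4):
    """Test-time tune only the modulator and the first `shallow_layers` encoder blocks."""
    # Freeze the full model, then unfreeze the lightweight parts.
    for p in list(encoder.parameters()) + list(head.parameters()):
        p.requires_grad_(False)
    trainable = list(modulator.parameters())
    for block in encoder.blocks[:shallow_layers]:
        for p in block.parameters():
            p.requires_grad_(True)
            trainable.append(p)

    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(steps):
        feats = encoder(modulator(image))
        logits = head(feats)  # stand-in for the frozen MLLM answer head
        probs = logits.softmax(dim=-1)
        # Assumed proxy objective: minimize prediction entropy on this single input.
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return encoder(modulator(image))


if __name__ == "__main__":
    encoder, modulator = ToyVisionEncoder(), QualityModulator()
    head = nn.Linear(256, 10)  # toy task head
    image = torch.rand(1, 3, 224, 224)
    feats = vq_ttt(image, encoder, modulator, head)
    print(feats.shape)  # torch.Size([1, 256])
```

Restricting the trainable set to the modulator and a couple of shallow blocks is what keeps the per-input overhead small: only a handful of parameters are updated at test time while the rest of the MLLM remains frozen.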
