TIST 2026 Journal Article
Clue and Context Fusion for Sarcasm Detection with Large Multimodal Models
- Qiuyu Li
- Yushan Pan
- Ding Wang
- Wei Wang
- Xiaowei Huang
- Zhijie Xu
Detecting sarcasm in social media is fundamentally different from general VLM benchmarks: it is a pragmatic contradiction problem in which the literal signal in one modality is intentionally misaligned with the intended meaning, while dominant pre-training (e.g., CLIP-style contrastive agreement) biases models toward modality alignment rather than incongruity detection. We present SCARF, a contradiction-aware framework that equips large multimodal models with explicit sarcasm cues and context-sensitive retrieval. SCARF constructs coarse scene cues and fine-grained localized evidence via tag-constrained QA, then distills them together with visual tokens into a [FUSION] control vector for the LLM; a label-contrastive retriever supplies type- and context-matched exemplars, and a local multi-view encoder surfaces micro-cues. With the same backbone and training data, SCARF attains 87.92% Acc/86.67% F1 on MMSD2.0 and 77.14% Acc/76.44% F1 zero-shot on XDMSD, outperforming a comparably fine-tuned LLaVA-1.5. Ablations show that sarcasm clue fusion is the main driver of the gains, and that tag-constrained QA improves rationale grounding and reduces hallucinations.
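The abstract describes distilling clue embeddings and visual tokens into a single [FUSION] control vector. A minimal, purely illustrative sketch of that idea is below; the pooling strategy, dimensions, and function names are our assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_clues(clue_embs: np.ndarray, visual_tokens: np.ndarray,
               W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: distill clue embeddings (coarse scene cues +
    fine localized evidence from tag-constrained QA) and visual tokens
    into one control vector in the LLM's hidden space."""
    clue_summary = clue_embs.mean(axis=0)        # pool clue embeddings
    visual_summary = visual_tokens.mean(axis=0)  # pool patch-level visual tokens
    joint = np.concatenate([clue_summary, visual_summary])
    return np.tanh(W @ joint + b)                # learned projection (assumed)

d_clue, d_vis, d_llm = 64, 64, 32
clues = rng.normal(size=(5, d_clue))    # e.g., one scene cue + localized evidence
vis = rng.normal(size=(196, d_vis))     # ViT-style patch tokens
W = rng.normal(size=(d_llm, d_clue + d_vis)) * 0.05
b = np.zeros(d_llm)

fusion_vec = fuse_clues(clues, vis, W, b)
print(fusion_vec.shape)  # (32,)
```

In the actual framework this vector would condition the LLM's generation; here it is only meant to make the "distill clues + visual tokens into one control vector" step concrete.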