Arrow Research

Author name cluster

Ziyan Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

AAAI Conference 2026 · System Paper

KnowPilot: Your Knowledge-Driven Copilot for Domain Tasks

  • Zekun Xi
  • Ziyan Jiang
  • Yujie Bao
  • Zhenqian Xu
  • Shumin Deng

Despite the rapid advancement of generative agents, their deployment in real-world industry scenarios often encounters challenges due to a lack of domain-specific knowledge. To address this gap, we present KnowPilot, a domain-specific knowledge-augmented agent system. KnowPilot is an open-source framework that integrates task-specific priors, explicit knowledge, and experiential knowledge to enhance agent performance in specialized applications. It combines knowledge retrieval from structured repositories with a memory system that captures expert experience through human–AI interaction.
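
The abstract describes the architecture only at a high level; the sketch below illustrates the general retrieval-plus-memory pattern it names. Every class, function, and string here is a hypothetical stand-in, not the actual KnowPilot API.

```python
# Hypothetical sketch of a retrieval-plus-memory agent step; all names
# are illustrative stand-ins, not the KnowPilot codebase.
from dataclasses import dataclass, field


@dataclass
class ExperienceMemory:
    """Stores expert feedback captured during human-AI interaction."""
    entries: list[str] = field(default_factory=list)

    def record(self, feedback: str) -> None:
        self.entries.append(feedback)

    def recall(self, task: str, k: int = 3) -> list[str]:
        # Naive recall: the most recent entries mentioning the task.
        hits = [e for e in self.entries if task.lower() in e.lower()]
        return hits[-k:]


def answer(task: str, knowledge_base: dict[str, str],
           memory: ExperienceMemory) -> str:
    """Combine explicit knowledge retrieval with recalled experience."""
    facts = knowledge_base.get(task, "no matching entry")
    experience = memory.recall(task)
    return f"task={task} | knowledge={facts} | experience={experience}"


memory = ExperienceMemory()
memory.record("invoice parsing: prefer the vendor field over free text")
print(answer("invoice parsing", {"invoice parsing": "use schema v2"}, memory))
```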

TMLR Journal 2026 · Journal Article

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

  • Jialin Yang
  • Dongfu Jiang
  • Tony He
  • Sherman Siu
  • Yuxuan Zhang
  • Disen Liao
  • Zhuofeng Li
  • Huaye Zeng

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: (1) generation tasks, producing structured output from natural language prompts, and (2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models such as o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
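
Format adherence, one of the two metric families named in the abstract, is easy to illustrate: does the model's output even parse as the requested format? The stdlib-only sketch below shows what such a check could look like for two non-renderable formats; it is an assumed illustration, not the benchmark's actual metric code.

```python
# Sketch of a format-adherence check in the spirit of StructEval's
# format metrics (an assumption, not the benchmark's implementation).
import csv
import io
import json


def adheres(output: str, fmt: str) -> bool:
    """Return True if `output` parses as the requested format."""
    try:
        if fmt == "json":
            json.loads(output)
        elif fmt == "csv":
            rows = list(csv.reader(io.StringIO(output)))
            # Require a rectangular table: every row has the same width.
            widths = {len(r) for r in rows if r}
            return len(widths) == 1
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, csv.Error):
        return False


print(adheres('{"name": "StructEval"}', "json"))  # True
print(adheres("a,b\n1,2\n3", "csv"))              # False (ragged rows)
```

Structural correctness would then be a second, stricter layer on top of this kind of parse check, comparing the parsed structure against a reference.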

TMLR Journal 2026 · Journal Article

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

  • Rui Meng
  • Ziyan Jiang
  • Ye Liu
  • Mingyi Su
  • Xinyi Yang
  • Yuepeng Fu
  • Can Qin
  • Raghuveer Thirukovalluru

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embedding models such as VLM2Vec, E5-V, and GME focus predominantly on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation systems. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering, spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 not only achieves strong performance on the newly introduced video and document retrieval tasks but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
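
The practical payoff of a unified embedding space is that retrieval across modalities reduces to a single ranking. Below is a toy sketch of that idea, with random vectors standing in for real model outputs and made-up item names; in practice each vector would come from encoding a text query, a video, or a document page with the same embedding model.

```python
# Toy sketch of cross-modality retrieval in one shared embedding space,
# the use case VLM2Vec-V2 targets. Vectors are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Pretend corpus: one embedding per item, regardless of its modality.
corpus = {
    "video:cooking_demo": rng.normal(size=dim),
    "doc:annual_report_p3": rng.normal(size=dim),
    "image:cat_photo": rng.normal(size=dim),
}
query = rng.normal(size=dim)  # stand-in for an encoded text query


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Because all modalities share one space, retrieval is a single ranking.
ranked = sorted(corpus, key=lambda k: cosine(query, corpus[k]), reverse=True)
print(ranked)
```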

ICLR Conference 2025 · Conference Paper

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

  • Jiacheng Chen
  • Tianhao Liang
  • Sherman Siu
  • Zhengqing Wang
  • Kai Wang 0068
  • Yubo Wang 0019
  • Yuansheng Ni
  • Ziyan Jiang

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, addressing the highly heterogeneous daily use cases of end users. Our objective is a set of high-quality data samples that covers a highly diverse and rich range of multimodal tasks while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multiple-choice questions (as in MMMU, MM-Bench, and MMT-Bench), we embrace a wide range of output formats such as numbers, phrases, code, LaTeX, coordinates, JSON, and free-form text. To accommodate these formats, we developed over 40 metrics to evaluate the tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
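
Supporting many output formats means routing each task to a format-appropriate metric. The sketch below shows what such dispatch could look like; the metric functions are hypothetical simplifications, not the suite's actual 40+ metrics.

```python
# Illustrative metric dispatch for heterogeneous output formats, in the
# spirit of MEGA-Bench's per-format metrics (names are hypothetical).
import json
import math


def score_number(pred: str, ref: str, rel_tol: float = 1e-3) -> float:
    try:
        return float(math.isclose(float(pred), float(ref), rel_tol=rel_tol))
    except ValueError:
        return 0.0


def score_json(pred: str, ref: str) -> float:
    try:
        return float(json.loads(pred) == json.loads(ref))
    except json.JSONDecodeError:
        return 0.0


METRICS = {
    "number": score_number,
    "json": score_json,
    "phrase": lambda p, r: float(p.strip().lower() == r.strip().lower()),
}


def evaluate(task_format: str, pred: str, ref: str) -> float:
    return METRICS[task_format](pred, ref)


print(evaluate("number", "3.1416", "3.14159"))      # 1.0 within tolerance
print(evaluate("json", '{"a": 1}', '{ "a" : 1 }'))  # 1.0, structures match
```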

ICLR Conference 2025 · Conference Paper

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

  • Ziyan Jiang
  • Rui Meng
  • Xinyi Yang 0002
  • Semih Yavuz
  • Yingbo Zhou 0002
  • Wenhu Chen

Embedding models play a crucial role in a variety of downstream tasks, including semantic similarity, information retrieval, and clustering. While there has been a surge of interest in developing universal text embedding models that generalize across tasks (e.g., MTEB), progress in learning universal multimodal embedding models has been comparatively slow, despite their importance and practical applications. In this work, we explore the potential of building universal multimodal embeddings capable of handling a broad range of downstream tasks. Our contributions are twofold: (1) we propose MMEB (Massive Multimodal Embedding Benchmark), which covers four meta-tasks (classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training datasets and 16 evaluation datasets spanning both in-distribution and out-of-distribution tasks, and (2) VLM2Vec (Vision-Language Model → Vector), a framework that transforms any vision-language model into an embedding model through contrastive training on MMEB. Unlike previous models such as CLIP and BLIP, which encode text and images independently without task-specific guidance, VLM2Vec can process any combination of images and text while incorporating task instructions to generate a fixed-dimensional vector. We develop a series of VLM2Vec models based on state-of-the-art VLMs, including Phi-3.5-V, LLaVA-1.6, and Qwen2-VL, and evaluate them on MMEB. With LoRA tuning, VLM2Vec achieves a 10% to 20% improvement over existing multimodal embedding models on MMEB's evaluation sets. Our findings reveal that VLMs are surprisingly strong embedding models.
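
The abstract does not spell out the objective beyond "contrastive training"; the standard choice for frameworks of this kind is an InfoNCE-style loss with in-batch negatives, where each (instruction + query) embedding is pulled toward its paired target and pushed away from the other targets in the batch. A generic sketch of that loss, not necessarily the paper's exact recipe:

```python
# Generic InfoNCE-style contrastive loss with in-batch negatives, the
# standard objective behind embedding frameworks like VLM2Vec (a sketch,
# not the paper's exact training setup).
import torch
import torch.nn.functional as F


def contrastive_loss(query_emb: torch.Tensor,
                     target_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch negatives: row i of queries matches row i of targets."""
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature    # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))  # diagonal entries are the positives
    return F.cross_entropy(logits, labels)


# Toy batch: 4 query embeddings paired with 4 target embeddings.
loss = contrastive_loss(torch.randn(4, 16), torch.randn(4, 16))
print(loss.item())
```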

NeurIPS Conference 2024 · Conference Paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

  • Yubo Wang
  • Xueguang Ma
  • Ge Zhang
  • Yuansheng Ni
  • Abhranil Chandra
  • Shiguang Guo
  • Weiming Ren
  • Aaran Arulraj

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) benchmark have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates some of the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy of 16% to 33% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% on MMLU to just 2% on MMLU-Pro. Additionally, we found that models using Chain-of-Thought (CoT) reasoning achieved better performance on MMLU-Pro than with direct answering, in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark for tracking progress in the field.
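
The prompt-sensitivity claim boils down to a simple statistic: how much one model's score moves across prompt templates. The abstract does not specify the exact measure, so the sketch below uses standard deviation over hypothetical per-prompt accuracies as one plausible reading.

```python
# Sketch of a prompt-sensitivity statistic: the spread of one model's
# accuracy across prompt templates. Scores here are made-up placeholders,
# and std dev is an assumed choice of spread measure.
import statistics

scores_mmlu = [0.71, 0.75, 0.68, 0.73, 0.70]      # hypothetical
scores_mmlu_pro = [0.45, 0.46, 0.44, 0.45, 0.46]  # hypothetical


def sensitivity(scores: list[float]) -> float:
    """Report variability as the standard deviation across prompts."""
    return statistics.stdev(scores)


print(f"MMLU spread:     {sensitivity(scores_mmlu):.3f}")
print(f"MMLU-Pro spread: {sensitivity(scores_mmlu_pro):.3f}")
```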