Author name cluster

Jiayan Qiu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

AAAI 2026 · Conference Paper

Remodeling Semantic Relationships in Vision-Language Fine-Tuning

  • Xiangyang Wu
  • Liu Liu
  • Baosheng Yu
  • Jiayan Qiu
  • Zhenwei Shi

Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, leading to suboptimal performance. To address this problem, we propose a method that improves multimodal alignment and fusion based on both semantics and relationships. Specifically, we first extract multilevel semantic features from different levels of the vision encoder to capture more visual cues about the relationships. Then, we learn to project the vision features so as to group related semantics, which are more likely to share relationships. Finally, we fuse the visual features with the textual ones using inheritable cross-attention, globally removing redundant visual relationships by discarding vision-language feature pairs with low correlation. We evaluate the proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.
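The pruned fusion step can be sketched concretely. Below is a minimal PyTorch sketch of the idea as the abstract describes it: project vision tokens into semantic groups, score each token against the pooled text feature, discard low-correlation pairs, and let the text attend to what remains. All module names, shapes, the pooling choice, and the keep ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RelationAwareFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, keep_ratio=0.5):
        super().__init__()
        self.group_proj = nn.Linear(dim, dim)  # groups related semantics (assumption)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio           # fraction of vision tokens retained

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (B, Nv, D); text_tokens: (B, Nt, D)
        v = self.group_proj(vision_tokens)
        # Correlate each vision token with the pooled text feature.
        t = text_tokens.mean(dim=1, keepdim=True)             # (B, 1, D)
        corr = torch.cosine_similarity(v, t, dim=-1)          # (B, Nv)
        k = max(1, int(self.keep_ratio * v.shape[1]))
        idx = corr.topk(k, dim=1).indices                     # high-correlation tokens
        v_kept = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, v.shape[-1]))
        # Cross-attention: text queries attend only to the retained vision tokens.
        fused, _ = self.attn(text_tokens, v_kept, v_kept)
        return fused                                          # (B, Nt, D)
```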

AAAI 2026 · Conference Paper

Understanding Interaction as You Need: Intention-Driven Pedestrian Behavior Prediction

  • Hang Yu
  • Yansen Yu
  • Jiayan Qiu

Prediction of pedestrian behavior is crucial for autonomous driving systems and intelligent transportation. Conventional methods predict behavior based solely on either the pedestrian's intention or the distance-related interactions between the pedestrian and its surroundings. However, these methods overlook the association between intention and interaction, which should be aligned with each other, thus leading to suboptimal predictions. To solve this problem, we propose to predict behavior by learning the association between intention and interaction, enabling the two to mutually enhance each other during prediction. Specifically, we first predict the short-term intentions of all objects, including the target pedestrian and its surroundings. Then, instead of using distance-related interactions, we predict the interactions by learning the correlated intentions. Finally, the intention-driven interactions refine the initial intention prediction, ensuring alignment between intention and interaction for behavior prediction. We evaluate our method on two downstream tasks, pedestrian trajectory prediction and pedestrian intention estimation, and show that it outperforms all existing methods.
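The intention-driven interaction step admits a compact sketch. The following PyTorch snippet illustrates the described flow under stated assumptions: encode each agent's history into an initial intention, derive interaction weights from intention correlation rather than spatial distance, and refine the intentions with the resulting context. Module names, feature sizes, and the state format are hypothetical.

```python
import torch
import torch.nn as nn

class IntentionInteraction(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Encodes each agent's past states (x, y, vx, vy) into an intention vector.
        self.intent_enc = nn.GRU(input_size=4, hidden_size=dim, batch_first=True)
        self.refine = nn.Linear(2 * dim, dim)

    def forward(self, histories):
        # histories: (num_agents, T, 4) for the pedestrian and its surroundings
        _, h = self.intent_enc(histories)               # h: (1, num_agents, dim)
        intent = h.squeeze(0)                           # initial short-term intentions
        # Interaction weights come from intention correlation, not distance.
        attn = torch.softmax(intent @ intent.T, dim=-1)
        context = attn @ intent                         # intention-driven interaction
        # Refine the initial intentions with the interaction context.
        return self.refine(torch.cat([intent, context], dim=-1))
```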

ICML 2025 · Conference Paper

Controllable Data Generation with Hierarchical Neural Representations

  • Sheyang Tang
  • Xiaoyu Xu
  • Jiayan Qiu
  • Zhou Wang

Implicit Neural Representations (INRs) represent data as continuous functions using the parameters of a neural network, where data information is encoded in the parameter space. Therefore, modeling the distribution of such parameters is crucial for building generalizable INRs. Existing approaches learn a joint distribution of these parameters via a latent vector to generate new data, but such a flat latent often fails to capture the inherent hierarchical structure of the parameter space, leading to entangled data semantics and limited control over the generation process. Here, we propose a Controllable Hierarchical Implicit Neural Representation (CHINR) framework, which explicitly models conditional dependencies across layers in the parameter space. Our method consists of two stages: in Stage-1, we construct a Layers-of-Experts (LoE) network, where each layer modulates distinct semantics through a unique latent vector, enabling disentangled and expressive representations. In Stage-2, we introduce a Hierarchical Conditional Diffusion Model (HCDM) to capture conditional dependencies across layers, allowing for controllable and hierarchical data generation at various semantic granularities. Extensive experiments across different modalities demonstrate that CHINR improves generalizability and offers flexible hierarchical control over the generated content.
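The Stage-1 idea, one latent vector modulating each layer, can be sketched briefly. The PyTorch snippet below is a minimal illustration under assumptions (sine activations, FiLM-style scaling, 2-D coordinates in, RGB out); the Stage-2 hierarchical diffusion over the per-layer latents is omitted.

```python
import torch
import torch.nn as nn

class LayerwiseModulatedINR(nn.Module):
    def __init__(self, dim=128, num_layers=3, latent_dim=32):
        super().__init__()
        self.inp = nn.Linear(2, dim)  # 2-D coordinates in (assumption)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        # One latent per layer -> a per-layer scale (FiLM-style modulation).
        self.mods = nn.ModuleList([nn.Linear(latent_dim, dim) for _ in range(num_layers)])
        self.out = nn.Linear(dim, 3)  # e.g. RGB values (assumption)

    def forward(self, coords, latents):
        # coords: (N, 2); latents: one (latent_dim,) vector per layer
        h = torch.sin(self.inp(coords))
        for layer, mod, z in zip(self.layers, self.mods, latents):
            h = torch.sin(layer(h) * mod(z))  # each layer modulated by its own latent
        return self.out(h)
```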

NeurIPS 2025 · Conference Paper

Dynamic Shadow Unveils Invisible Semantics for Video Outpainting

  • Ruilin Li
  • Hang Yu
  • Jiayan Qiu

Conventional video outpainting methods primarily focus on maintaining coherent textures and visual consistency across frames. However, they often fail to handle dynamic scenes with complex object or camera motion, leading to temporal incoherence and visible flickering artifacts across frames. This is primarily because they lack instance-aware modeling to accurately separate and track individual object motions throughout the video. In this paper, we propose a novel video outpainting framework that explicitly takes shadow-object pairs into consideration to enhance the temporal and spatial consistency of instances, even when they are temporarily invisible. Specifically, we first track the shadow-object pairs across frames and predict the instances in the scene, unveiling the spatial regions of invisible instances. Then, these predictions guide instance-aware optical flow completion, unveiling the temporal motion of invisible instances. Next, these spatiotemporal instance guidances steer the video outpainting process. Finally, a video-aware discriminator enhances the alignment between dynamic shadows and the extended semantics in the scene. Comprehensive experiments underscore the superiority of our approach, which outperforms existing state-of-the-art methods on widely recognized benchmarks.
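The four-stage pipeline reads naturally as a sequence of guidance steps. The sketch below lays out that control flow only; every function body is a placeholder stub (assumptions for illustration), since the actual tracking, flow completion, and outpainting models are the paper's contribution.

```python
import numpy as np

def track_shadow_object_pairs(frames):
    """Stub: associate each object mask with its shadow mask across frames."""
    return [{"object": None, "shadow": None} for _ in frames]

def predict_invisible_regions(frames, pairs):
    """Stub: infer the spatial extent of instances outside the visible frame."""
    return [np.zeros(f.shape[:2], dtype=bool) for f in frames]

def complete_instance_flow(frames, regions):
    """Stub: instance-aware optical-flow completion over the predicted regions."""
    return [np.zeros((*f.shape[:2], 2), dtype=np.float32) for f in frames]

def outpaint(frames, regions, flows):
    """Stub: extend each frame, guided by instance regions and motion."""
    return frames

def video_outpainting(frames):
    pairs = track_shadow_object_pairs(frames)           # 1: shadow-object pairs
    regions = predict_invisible_regions(frames, pairs)  # 2: invisible instance regions
    flows = complete_instance_flow(frames, regions)     # 3: temporal motion guidance
    return outpaint(frames, regions, flows)             # 4: guided outpainting
```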

NeurIPS 2025 · Conference Paper

SIGMA: Refining Large Language Model Reasoning via Sibling-Guided Monte Carlo Augmentation

  • Yanwei Ren
  • Haotian Zhang
  • Fuxiang Wu
  • Jiayan Qiu
  • Jiaxing Huang
  • Baosheng Yu
  • Liu Liu

Enhancing large language models by simply scaling up datasets has begun to yield diminishing returns, shifting the spotlight to data quality. Monte Carlo Tree Search (MCTS) has emerged as a powerful technique for generating high-quality chain-of-thought data, yet conventional approaches typically retain only the top-scoring trajectory from the search tree, discarding sibling nodes that often contain valuable partial insights, recurrent error patterns, and alternative reasoning strategies. This unconditional rejection of non-optimal reasoning branches may waste vast amounts of informative data across the search tree. We propose SIGMA (Sibling-Guided Monte Carlo Augmentation), a novel framework that reintegrates these discarded sibling nodes to refine LLM reasoning. SIGMA forges semantic links among sibling nodes along each search path and applies a two-stage refinement: a critique model identifies overlooked strengths and weaknesses across the sibling set, and a revision model conducts text-based backpropagation to refine the top-scoring trajectory in light of this comparative feedback. By recovering and amplifying the underutilized but valuable signals from non-optimal reasoning branches, SIGMA substantially improves reasoning trajectories. On the challenging MATH benchmark, our SIGMA-tuned 7B model achieves 54.92% accuracy using only 30K samples, outperforming state-of-the-art models trained on 590K samples. This result highlights that sibling-guided optimization not only reduces data usage but also substantially boosts LLM reasoning.
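The two-stage refinement loop can be sketched in a few lines. In the snippet below, the tree structure and the `critique_model`/`revision_model` callables are illustrative stand-ins (in practice they would be LLM calls), and the greedy best-child walk is an assumption about how the top-scoring trajectory is read out of the tree.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str            # one reasoning step
    score: float         # MCTS value estimate
    children: list = field(default_factory=list)

def sigma_refine(root, critique_model, revision_model):
    refined, node = [], root
    while node.children:
        best = max(node.children, key=lambda c: c.score)
        siblings = [c for c in node.children if c is not best]
        # Stage 1 (critique): contrast the chosen step with its discarded siblings.
        feedback = critique_model(best.text, [s.text for s in siblings])
        # Stage 2 (revise): text-based backpropagation rewrites the step
        # in light of the comparative feedback.
        refined.append(revision_model(best.text, feedback))
        node = best
    return refined
```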