Arrow Research search

Author name cluster

Keze Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers (21)

AAAI Conference 2026 Conference Paper

3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale

  • Yijia Fan
  • jusheng zhang
  • Kaitong Cai
  • Jing Yang
  • Jian Wang
  • Keze Wang

Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. Specifically, during training, our proposed dynamic attention policy (DAP) employs the Hierarchical Attention Fusion (HAF) module to represent the alignment as learnable fine-grained token-to-point attentions. To optimize these attentions across different tasks and geometric hierarchies, our DAP further exploits Monte Carlo tree search to dynamically calibrate HAF attention weights via a hybrid reward signal and further enhances the alignment between textual descriptions and local 3D geometry. During inference, our 3DAlign-DAER introduces an Efficient Retrieval Strategy (ERS) to leverage efficient hierarchical searching in large-scale embedding spaces, outperforming traditional methods (e.g., KNN) in accuracy and efficiency. Furthermore, to facilitate text-3D alignment research and train our 3DAlign-DAER, we construct Align3D-2M, a large-scale dataset featuring 2M text-3D pairs, to provide sufficient fine-grained cross-modal annotations. Extensive and comprehensive experiments demonstrate the superior performance of our 3DAlign-DAER on diverse benchmarks.
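
The abstract gives no implementation details for the Efficient Retrieval Strategy; the sketch below only illustrates the generic coarse-to-fine idea (probe a few cluster centroids, then score exact embeddings within those clusters) that such hierarchical search relies on. The two-level index, the toy centroid assignment, and all names are assumptions, not the authors' ERS.

    # Illustrative coarse-to-fine retrieval over a large embedding table.
    # The two-level structure and all names are assumptions, not the paper's ERS.
    import numpy as np

    def build_index(embeddings, n_clusters=64, seed=0):
        """Assign each embedding to its nearest randomly chosen centroid (toy quantization step)."""
        rng = np.random.default_rng(seed)
        centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
        assign = np.argmax(embeddings @ centroids.T, axis=1)   # coarse cluster assignment
        return centroids, assign

    def hierarchical_search(query, embeddings, centroids, assign, n_probe=4, k=5):
        """Score only the n_probe closest clusters instead of every embedding."""
        probe = np.argsort(-(centroids @ query))[:n_probe]     # coarse step
        cand = np.where(np.isin(assign, probe))[0]             # candidate pool
        scores = embeddings[cand] @ query                       # fine step
        return cand[np.argsort(-scores)[:k]]

    emb = np.random.randn(10000, 128).astype(np.float32)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    centroids, assign = build_index(emb)
    print(hierarchical_search(emb[42], emb, centroids, assign))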

AAAI Conference 2026 Conference Paper

Cost-Effective Communication: An Auction-based Method for Language Agent Interaction

  • Yijia Fan
  • jusheng zhang
  • Kaitong Cai
  • Jing Yang
  • Chengpei Tang
  • Jian Wang
  • Keze Wang

Multi-agent systems (MAS) built on large language models (LLMs) often suffer from inefficient "free-for-all" communication, leading to exponential token costs and low signal-to-noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that "free" communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, our DALA regards inter-agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, our DALA intrinsically encourages agents to produce concise, informative messages while filtering out low-value communication. Extensive and comprehensive experiments demonstrate that our economically driven DALA achieves new state-of-the-art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. Note that this is accomplished with remarkable efficiency, i.e., our DALA uses only 6.25 million tokens, a fraction of the resources consumed by current state-of-the-art methods on GSM8K. Further analysis reveals that our DALA cultivates the emergent skill of strategic silence, effectively adapting its communication strategies from verbosity to silence in a dynamic manner via resource constraints.
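
As a rough illustration of the auction mechanism described above, the toy sketch below has each agent bid the predicted value density (value per token) of its draft message, and only the highest bidders speak within a shared token budget. The value scores, the budget, and all names are illustrative assumptions rather than DALA's learned bidding policy.

    # Toy centralized auction over agent messages: bid = predicted value per token,
    # and only the highest bidders speak until a shared token budget is exhausted.
    # The value predictor and all names are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Bid:
        agent: str
        message: str
        predicted_value: float          # e.g., output of a learned value head

        @property
        def value_density(self) -> float:
            return self.predicted_value / max(len(self.message.split()), 1)

    def run_auction(bids, token_budget=50):
        winners, spent = [], 0
        for bid in sorted(bids, key=lambda b: b.value_density, reverse=True):
            cost = len(bid.message.split())
            if spent + cost <= token_budget:
                winners.append(bid.agent)
                spent += cost
        return winners

    bids = [
        Bid("planner", "Split the task into retrieval and verification subtasks.", 0.9),
        Bid("critic", "I think maybe we could possibly consider many different long options here.", 0.3),
        Bid("solver", "Answer: 42.", 0.8),
    ]
    print(run_auction(bids))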

AAAI Conference 2026 Conference Paper

HiVA: Self-organized Hierarchical Variable Agent via Goal-driven Semantic-Topological Evolution

  • Jinzhou Tang
  • jusheng zhang
  • Qinhan Lv
  • Sidi Liu
  • Jing Yang
  • Chengpei Tang
  • Keze Wang

Autonomous agents play a crucial role in advancing Artificial General Intelligence, enabling problem decomposition and tool orchestration through Large Language Models (LLMs). However, existing paradigms face a critical trade-off. On one hand, reusable fixed workflows require manual reconfiguration upon environmental changes; on the other hand, flexible reactive loops fail to distill reasoning progress into transferable structures. We introduce Hierarchical Variable Agent (HiVA), a novel framework modeling agentic workflows as self-organized graphs with the Semantic-Topological Evolution (STEV) algorithm, which optimizes hybrid semantic-topological spaces using textual gradients as discrete-domain surrogates for backpropagation. The iterative process comprises Multi-Armed Bandit-infused forward routing, diagnostic gradient generation from environmental feedback, and coordinated updates that co-evolve individual semantics and topology for collective optimization in unknown environments. Experiments on dialogue, coding, long-context Q&A, mathematical, and agentic benchmarks demonstrate improvements of 5-10% in task accuracy and enhanced resource efficiency over existing baselines, establishing HiVA's effectiveness in autonomous task execution.

AAAI Conference 2026 Conference Paper

LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

  • jusheng zhang
  • Ningyuan Liu
  • Yijia Fan
  • Zihao Huang
  • Qinglin Zeng
  • Kaitong Cai
  • Jian Wang
  • Keze Wang

Large language models (LLMs) often generate hallucinated content lacking factual or contextual grounding, hindering their reliability in critical applications. Traditional methods like supervised fine-tuning and reinforcement learning from human feedback are data-intensive and computationally expensive, while static parameter editing struggles with context-dependent errors and catastrophic forgetting. To overcome these limitations, we introduce LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning (HRL) problem. LLM-CAS trains an agent to learn a sophisticated policy, dynamically selecting optimal, temporary neuron perturbations during inference based on the immediate context. This learned, policy-driven approach provides greater adaptability than prior dynamic methods that rely on heuristic or pre-defined adjustments. As a result, LLM-CAS achieves significant performance gains across various LLMs, improving accuracy by 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on TruthfulQA's MC1 score, thereby outperforming static methods like ITI and CAA, as well as the dynamic SADI framework. This context-aware, efficient approach promises enhanced reliability for LLMs in high-stakes domains, with future potential for multimodal extensions.
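
The following toy sketch only illustrates the general idea of a temporary, context-conditioned activation perturbation chosen at inference time and discarded afterwards; the tiny network, the candidate set, and the greedy selection stand in for the paper's learned hierarchical policy and are purely assumptions.

    # Toy illustration of a temporary activation perturbation: a policy picks one
    # hidden unit and an offset, applied for a single forward pass and then
    # discarded. The tiny network and the greedy policy are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((2, 8))

    def forward(x, perturbation=None):
        h = np.tanh(W1 @ x)
        if perturbation is not None:
            unit, delta = perturbation
            h = h.copy()
            h[unit] += delta              # temporary edit; weights are never changed
        return W2 @ h

    def choose_perturbation(x, candidates, score):
        """Greedy stand-in for the learned policy: pick the candidate with the best score."""
        return max(candidates, key=lambda p: score(forward(x, p)))

    x = rng.standard_normal(4)
    candidates = [(u, d) for u in range(8) for d in (-0.5, 0.5)]
    best = choose_perturbation(x, candidates, score=lambda out: out[0])   # toy reward
    print(best, forward(x, best))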

AAAI Conference 2026 Conference Paper

ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation

  • Zhuojie Yang
  • Wentao Wan
  • Keze Wang

Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities, while a key factor influencing the effectiveness of this paradigm is the quality of the generated multi-step reasoning data. To generate high-quality reasoning data, many recent methods generate synthetic reasoning paths and filter them based on final answer correctness, often overlooking flaws in intermediate reasoning steps. To enhance the verification of intermediate reasoning steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code-based validation is restricted to code or mathematical tasks, and reasoning engines require a well-structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts. In these tasks, synthetic data still lack reliable checks for verifying each reasoning step. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step-wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine-grained, step-level validation, facilitating the construction of high-quality multi-step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, our ORACLE consistently outperforms strong baselines on multiple models.
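
As a minimal illustration of step-level symbolic checking, the sketch below accepts a proposed reasoning step only if it follows from already-established facts by a known rule (modus ponens here); the rule base, the fact encoding, and the generator stub are assumptions, not ORACLE's engine or prompting template.

    # Sketch of step-level filtering: proposed (fact-level) conclusions are only
    # accepted when a known rule fires on already-established facts.
    # The rule base and the string encoding are illustrative assumptions.
    def verify_step(known_facts, rules, conclusion):
        """Accept `conclusion` only if some rule (antecedent -> consequent) derives it."""
        for antecedent, consequent in rules:
            if antecedent in known_facts and consequent == conclusion:
                return True
        return conclusion in known_facts

    def build_chain(facts, rules, proposed_steps):
        accepted, known = [], set(facts)
        for step in proposed_steps:               # e.g., steps produced by an LLM prompt
            if verify_step(known, rules, step):
                accepted.append(step)
                known.add(step)
        return accepted

    facts = ["socrates_is_a_man"]
    rules = [("socrates_is_a_man", "socrates_is_mortal")]
    proposed = ["socrates_is_mortal", "socrates_is_immortal"]   # second step is rejected
    print(build_chain(facts, rules, proposed))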

AAAI Conference 2026 Conference Paper

RaCoT: Plug-and-Play Contrastive Example Generation Mechanism for Enhanced LLM Reasoning Reliability

  • Kaitong Cai
  • jusheng zhang
  • Yijia Fan
  • Jing Yang
  • Keze Wang

Retrieval-Augmented Generation (RAG) faces a core bottleneck with knowledge-sparse and semantically ambiguous long-tail queries, where retrieval noise distorts reasoning and necessitates costly post-processing. To tackle this, we propose RaCoT (Retrieval-aware Contrastive-of-Thought), a novel framework that shifts contrastive thinking to the pre-retrieval stage. By automatically generating a semantically adjacent yet differently answered contrastive question and extracting a Δ-Prompt to capture their key differences, RaCoT guides the model to proactively focus on the "critical details that determine answer divergence." This approach allows it to suppress semantic interference within a single retrieval pass, overcoming the theoretical bottleneck of single-vector queries that struggle to simultaneously encode signals for what to attend to and what to ignore. On six authoritative benchmarks, including PopQA and TriviaQA-unfiltered, RaCoT outperforms strong baselines like RankRAG and Self-RAG by 0.9-2.4 percentage points. It exhibits superior robustness, with a performance drop of only 8.6% in adversarial tests, far surpassing the over 15% degradation in other methods. Furthermore, its low latency (3.12s) and token overhead (11.54) place it on the accuracy-efficiency Pareto frontier, while ablation studies validate the necessity of each component. Ultimately, RaCoT reframes the RAG paradigm from "post-hoc context cleaning" to "a priori shaping of discriminative reasoning," offering an efficient and robust path toward reliable AI systems for real-time, resource-constrained deployments.
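
A minimal sketch of the pre-retrieval contrastive step described above might look as follows: generate a nearby question with a different answer, diff the two questions, and prepend the difference as guidance before retrieval. The ask_llm stub, the prompt wording, and the word-level diff are assumptions, not the paper's Δ-Prompt construction.

    # Sketch of the pre-retrieval contrastive step: produce a nearby question with a
    # different answer, diff the two questions, and prepend the difference as guidance.
    # `ask_llm` is a stub; the prompt wording is an assumption, not the paper's template.
    import difflib

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in an actual LLM client here")

    def delta_prompt(query: str) -> str:
        contrast = ask_llm(
            "Write a question that is nearly identical to the following but has a "
            f"different answer:\n{query}"
        )
        diff = [tok[2:] for tok in difflib.ndiff(query.split(), contrast.split())
                if tok.startswith(("+ ", "- "))]
        return (f"Question: {query}\n"
                f"Contrastive question: {contrast}\n"
                f"Focus on the details that change the answer: {' '.join(diff)}")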

AAAI Conference 2026 Conference Paper

Top-Down Semantic Refinement for Image Captioning

  • jusheng zhang
  • Kaitong Cai
  • Jing Yang
  • Jian Wang
  • Chengpei Tang
  • Keze Wang

Large Vision-Language Models (VLMs) face an inherent contradiction in image captioning: their powerful single-step generation capabilities often lead to a myopic decision-making process. This makes it difficult to maintain global narrative coherence while capturing rich details, a limitation that is particularly pronounced in tasks that require multi-step and complex scene description. To overcome this fundamental challenge, we redefine image captioning as a goal-oriented hierarchical refinement planning problem, and further propose a novel framework, named Top-Down Semantic Refinement (TDSR), which models the generation process as a Markov Decision Process (MDP). However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. By incorporating a visual-guided parallel expansion and a lightweight value network, our TDSR reduces the call frequency to the expensive VLM by an order of magnitude without sacrificing planning quality. Furthermore, an adaptive early stopping mechanism dynamically matches computational overhead to the image's complexity. Extensive experiments on multiple benchmarks, including DetailCaps, COMPOSITIONCAP, and POPE, demonstrate that our TDSR, as a plug-and-play module, can significantly enhance the performance of existing VLMs (e.g., LLaVA-1.5, Qwen2.5-VL) by achieving state-of-the-art or highly competitive results in fine-grained description, compositional generalization, and hallucination suppression.

NeurIPS Conference 2025 Conference Paper

3D-Agent: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation

  • jusheng zhang
  • Yijia Fan
  • Zimo Wen
  • Jian Wang
  • Keze Wang

Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation is a critical task that, compared to 2D annotation, faces challenges such as spatial complexity, occlusion, and viewpoint inconsistency. Existing methods relying on single models often struggle with these issues. In this paper, we introduce Tri-MARF, a novel framework that integrates tri-modal inputs (i.e., 2D multi-view images, text descriptions, and 3D point clouds) with multi-agent collaboration to enhance the 3D annotation process. Our Tri-MARF consists of three specialized agents: a vision-language model agent that generates multi-view descriptions, an information aggregation agent that selects optimal descriptions, and a gating agent that aligns text descriptions with 3D geometries for more refined captioning. Extensive experiments on the Objaverse-LVIS, Objaverse-XL, and ABO datasets demonstrate the superiority of our Tri-MARF, which achieves a CLIPScore of 88.7 (compared to 78.6–82.4 for other SOTA methods), retrieval accuracy of 45.2/43.8 (ViLT R@5), and an impressive throughput of 12,000 objects per hour on a single NVIDIA A100 GPU.
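
The toy pipeline below only mirrors the three agent roles described above (per-view captioning, aggregation, geometry-aware refinement); the stand-in implementations and interfaces are assumptions, not the Tri-MARF system.

    # Toy three-stage annotation pipeline mirroring the agent roles described above.
    # Each stand-in implementation is an illustrative assumption, not the paper's code.
    def vision_language_agent(view_name: str) -> str:
        return f"A chair seen from the {view_name}."            # stand-in for a VLM caption

    def aggregation_agent(captions: list[str]) -> str:
        return max(captions, key=len)                            # toy: keep the most detailed caption

    def gating_agent(caption: str, num_points: int) -> str:
        detail = "high-resolution" if num_points > 50_000 else "coarse"
        return f"{caption} ({detail} scan, {num_points} points)"

    def annotate_object(views: list[str], num_points: int) -> str:
        captions = [vision_language_agent(v) for v in views]
        return gating_agent(aggregation_agent(captions), num_points)

    print(annotate_object(["front", "top", "left side"], 80_000))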

NeurIPS Conference 2025 Conference Paper

CF-VLM: CounterFactual Vision-Language Fine-tuning

  • jusheng zhang
  • Kaitong Cai
  • Yijia Fan
  • Jian Wang
  • Keze Wang

Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose the CounterFactual Vision-Language Fine-tuning Model (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model’s sensitivity to minimal but critical causal edits. Extensive experiments demonstrate that CF-VLM consistently outperforms strong baselines and state-of-the-art methods on compositional reasoning and generalization benchmarks. Furthermore, it shows promise in mitigating visual hallucinations, indicating improved factual consistency. Our CF-VLM provides a robust foundation for deploying VLMs in high-stakes, real-world scenarios requiring reliable reasoning and interpretability.

NeurIPS Conference 2025 Conference Paper

GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

  • jusheng zhang
  • Yijia Fan
  • Wenjun Lin
  • Ruiqi Chen
  • Haoyi Jiang
  • Wenhao Chai
  • Jian Wang
  • Keze Wang

We propose GAM-Agent, a game-theoretic multi-agent framework for enhancing vision-language reasoning. Unlike prior single-agent or monolithic models, GAM-Agent formulates the reasoning process as a non-zero-sum game between base agents—each specializing in visual perception subtasks—and a critical agent that verifies logic consistency and factual correctness. Agents communicate via structured claims, evidence, and uncertainty estimates. The framework introduces an uncertainty-aware controller to dynamically adjust agent collaboration, triggering multi-round debates when disagreement or ambiguity is detected. This process yields more robust and interpretable predictions. Experiments on four challenging benchmarks—MMMU, MMBench, MVBench, and V*Bench—demonstrate that GAM-Agent significantly improves performance across various VLM backbones. Notably, GAM-Agent boosts the accuracy of small-to-mid scale models (e.g., Qwen2.5-VL-7B, InternVL3-14B) by 5–6%, and still enhances strong models like GPT-4o by up to 2–3%. Our approach is modular, scalable, and generalizable, offering a path toward reliable and explainable multi-agent multimodal reasoning.
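
A toy version of the uncertainty-aware controller described above could run one round of agent answers and trigger further debate rounds only when agents disagree or report low confidence; the thresholds, the voting rule, and the lambda agents below are illustrative assumptions.

    # Toy uncertainty-aware controller: collect (answer, confidence) pairs and only
    # run extra debate rounds when agreement or confidence is too low.
    # Thresholds, agents, and the debate step are illustrative assumptions.
    from collections import Counter

    def controller(agents, question, max_rounds=3, agree_ratio=0.75, min_conf=0.6):
        context = ""
        for _ in range(max_rounds):
            answers = [agent(question, context) for agent in agents]   # (answer, confidence)
            votes = Counter(a for a, _ in answers)
            top, count = votes.most_common(1)[0]
            confident = all(c >= min_conf for _, c in answers)
            if count / len(answers) >= agree_ratio and confident:
                return top
            # Disagreement or low confidence: expose all claims for the next debate round.
            context = " | ".join(f"{a} (conf {c:.2f})" for a, c in answers)
        return top

    agents = [lambda q, ctx: ("cat", 0.9), lambda q, ctx: ("cat", 0.8), lambda q, ctx: ("dog", 0.4)]
    print(controller(agents, "What animal is in the image?"))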

ICML Conference 2025 Conference Paper

KABB: Knowledge-Aware Bayesian Bandits for Dynamic Expert Coordination in Multi-Agent Systems

  • Jusheng Zhang
  • Zimeng Huang
  • Yijia Fan
  • Ningyuan Liu
  • Mingyan Li
  • Zhuojie Yang
  • Jiawei Yao
  • Jian Wang 0100

As scaling large language models faces prohibitive costs, multi-agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduce Knowledge-Aware Bayesian Bandits (KABB), a novel framework that enhances multi-agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a customized knowledge distance model for deep semantic understanding, a dual-adaptation mechanism for continuous expert optimization, and a knowledge-aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates KABB achieves an optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low in multi-agent coordination.
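
As a rough sketch of knowledge-aware Thompson Sampling, the code below keeps a Beta posterior per expert and discounts each sampled value by a distance between the query and the expert's profile; the Euclidean distance, the exponential discount, and the class interface are assumptions rather than KABB's knowledge distance model.

    # Minimal knowledge-aware Thompson Sampling sketch: Beta posteriors per expert,
    # with sampled values discounted by a query-to-expert knowledge distance.
    # The distance function and the exponential discount are illustrative assumptions.
    import numpy as np

    class KnowledgeAwareTS:
        def __init__(self, expert_profiles):
            self.profiles = expert_profiles                      # one embedding per expert
            self.alpha = np.ones(len(expert_profiles))
            self.beta = np.ones(len(expert_profiles))

        def select(self, query_embedding, rng):
            samples = rng.beta(self.alpha, self.beta)
            distance = np.linalg.norm(self.profiles - query_embedding, axis=1)
            return int(np.argmax(samples * np.exp(-distance)))   # knowledge-discounted sample

        def update(self, expert, success):
            self.alpha[expert] += success
            self.beta[expert] += 1 - success

    rng = np.random.default_rng(0)
    profiles = rng.standard_normal((4, 16))
    ts = KnowledgeAwareTS(profiles)
    expert = ts.select(profiles[2] + 0.1 * rng.standard_normal(16), rng)
    ts.update(expert, success=1)
    print(expert)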

NeurIPS Conference 2025 Conference Paper

MAT-Agent: Adaptive Multi-Agent Training Optimization

  • jusheng zhang
  • Kaitong Cai
  • Yijia Fan
  • Ningyuan Liu
  • Keze Wang

We propose a novel collaborative multi-agent optimization framework for adaptive training in multi-label image classification, fundamentally advancing beyond static decision rules and isolated automation. Our method deploys a set of distributed, task-specific agents, each responsible for dynamically orchestrating critical training components—including data augmentation, optimization methods, learning rate schedules, and loss functions—according to evolving visual-semantic relationships and training states. Each agent employs an advanced non-stationary multi-armed bandit algorithm, integrating both ε-greedy and upper confidence bound strategies, to judiciously balance exploration with exploitation throughout the training lifecycle. A hierarchical composite reward mechanism synergizes overall classification accuracy, rare class recognition, and training stability, fostering both independent optimization and implicit collaborative behavior among agents. The framework further leverages refined techniques such as dual-rate exponential moving average smoothing and structured mixed-precision training to enhance robustness and computational efficiency. Extensive experiments across benchmarks including Pascal VOC, COCO, Yeast, and Mediamill demonstrate that our approach achieves superior mean average precision and rare-class F1 scores compared to state-of-the-art methods, while also exhibiting rapid convergence and remarkable cross-domain generalization. Our results indicate that collaborative multi-agent adaptive optimization offers a scalable and principled solution for self-optimizing deep learning in complex multi-label scenarios.
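
The sketch below shows one plausible form of a non-stationary bandit that mixes ε-greedy exploration with a UCB bonus over exponentially discounted statistics, where each arm stands for a training choice (augmentation, schedule, loss); the constants, the discounting scheme, and the interface are illustrative assumptions, not MAT-Agent's algorithm.

    # Minimal non-stationary bandit: epsilon-greedy exploration plus a UCB bonus
    # computed from exponentially discounted reward sums and counts.
    # All constants and the discounting scheme are illustrative assumptions.
    import math
    import random

    class NonStationaryBandit:
        def __init__(self, n_arms, epsilon=0.1, discount=0.95, c=1.0):
            self.epsilon, self.discount, self.c = epsilon, discount, c
            self.counts = [1e-6] * n_arms        # discounted pull counts
            self.sums = [0.0] * n_arms           # discounted reward sums

        def select(self):
            if random.random() < self.epsilon:
                return random.randrange(len(self.counts))
            total = sum(self.counts)
            ucb = [s / n + self.c * math.sqrt(math.log(total + 1.0) / n)
                   for s, n in zip(self.sums, self.counts)]
            return max(range(len(ucb)), key=ucb.__getitem__)

        def update(self, arm, reward):
            # Discount every arm so old evidence fades (non-stationarity), then add the new pull.
            self.counts = [n * self.discount for n in self.counts]
            self.sums = [s * self.discount for s in self.sums]
            self.counts[arm] += 1.0
            self.sums[arm] += reward

    bandit = NonStationaryBandit(n_arms=3)
    arm = bandit.select()
    bandit.update(arm, reward=0.7)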

NeurIPS Conference 2025 Conference Paper

MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

  • Zimeng Huang
  • Jinxin Ke
  • Xiaoxuan Fan
  • Yufeng Yang
  • Yang Liu
  • Liu Zhonghan
  • Zedi Wang
  • Junteng Dai

Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.

NeurIPS Conference 2025 Conference Paper

Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

  • Haijing Liu
  • Zhiyuan Song
  • Hefeng Wu
  • Tao Pu
  • Keze Wang
  • Liang Lin

Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.

AAAI Conference 2025 Conference Paper

SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks

  • Wentao Wan
  • Zhuojie Yang
  • Yongcan Chen
  • Chenglin Luo
  • Ruilin Wang
  • Kehao Cai
  • Nan Kang
  • Liang Lin

Deductive reasoning is a crucial logical capability that assists us in solving complex problems based on existing knowledge. Although augmented by Chain-of-Thought prompts, Large Language Models (LLMs) might not follow the correct reasoning paths. How to enhance the deductive reasoning abilities of LLMs while leveraging their extensive built-in knowledge for various reasoning tasks remains an open question. Attempting to mimic the human deductive reasoning paradigm, we propose a multi-stage Syllogistic-Reasoning Framework of Thought (SR-FoT) that enables LLMs to perform syllogistic deductive reasoning to handle complex knowledge-based reasoning tasks. Our SR-FoT begins by interpreting the question and then uses the interpretation and the original question to propose a suitable major premise. It proceeds by generating and answering minor premise questions in two stages to match the minor premises. Finally, it guides LLMs to use the previously generated major and minor premises to perform syllogistic deductive reasoning to derive the answer to the original question. Extensive and thorough experiments on knowledge-based reasoning tasks have demonstrated the effectiveness and advantages of our SR-FoT.
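
A minimal sketch of the staged prompting flow described above (interpret the question, propose a major premise, obtain the minor premise, then conclude syllogistically) could chain LLM calls as below; ask_llm is a stub and every prompt wording is an assumption, not SR-FoT's templates.

    # Sketch of a staged syllogistic prompting flow. `ask_llm` is a stub and the
    # prompt wordings are illustrative assumptions, not the paper's templates.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in an actual LLM client here")

    def sr_fot_style_answer(question: str) -> str:
        interpretation = ask_llm(f"Interpret what this question is really asking:\n{question}")
        major = ask_llm(f"Question: {question}\nInterpretation: {interpretation}\n"
                        "State a general rule (major premise) that could answer it.")
        minor_q = ask_llm(f"Major premise: {major}\nWhat fact about the question's subject "
                          "must be checked to apply this rule? Phrase it as a question.")
        minor = ask_llm(minor_q)
        return ask_llm(f"Major premise: {major}\nMinor premise: {minor}\n"
                       f"Apply syllogistic reasoning to answer: {question}")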

AAAI Conference 2024 Conference Paper

Adaptive Prompt Routing for Arbitrary Text Style Transfer with Pre-trained Language Models

  • Qingyi Liu
  • Jinghui Qin
  • Wenxuan Ye
  • Hao Mou
  • Yuxuan He
  • Keze Wang

Recently, arbitrary text style transfer (TST) has made significant progress with the paradigm of prompt learning. In this paradigm, researchers often design or search for a fixed prompt for any input. However, existing evidence shows that large language models (LLMs) are prompt-sensitive and it is sub-optimal to apply the same prompt to any input for downstream TST tasks. Besides, the prompts obtained by searching are often unreadable and unexplainable to humans. To address these issues, we propose an Adaptive Prompt Routing (APR) framework to adaptively route prompts from a human-readable prompt set for various input texts and given styles. Specifically, we first construct a candidate prompt set of diverse and human-readable prompts for the target style. This set consists of several seed prompts and their variants paraphrased by an LLM. Subsequently, we train a prompt routing model to select the optimal prompts efficiently according to inputs. The adaptively selected prompt can guide the LLMs to perform a precise style transfer for each input sentence while maintaining readability for humans. Extensive experiments on 4 public TST benchmarks over 3 popular LLMs (with parameter sizes ranging from 1.5B to 175B) demonstrate that our APR achieves superior style transfer performances, compared to the state-of-the-art prompt-based and fine-tuning methods. The source code is available at https://github.com/DwyaneLQY/APR
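
As a toy illustration of routing an input to one prompt from a human-readable candidate set, the sketch below scores candidates with a bag-of-words overlap; a real router would be a trained model, and the candidate prompts and similarity function here are assumptions, not APR's routing model.

    # Toy prompt router: score every human-readable candidate prompt against the
    # input and pick the best one. The bag-of-words overlap is an illustrative
    # stand-in for a trained routing model.
    from collections import Counter

    CANDIDATE_PROMPTS = [
        "Rewrite the sentence politely while keeping its meaning:",
        "Rewrite the sentence in a formal, professional tone:",
        "Rewrite the sentence so it sounds upbeat and positive:",
    ]

    def similarity(text: str, prompt: str) -> float:
        wa, wb = Counter(text.lower().split()), Counter(prompt.lower().split())
        return sum((wa & wb).values()) / (1 + len(text.split()))

    def route_prompt(text: str) -> str:
        return max(CANDIDATE_PROMPTS, key=lambda p: similarity(text, p))

    sentence = "send me the report now, this is a formal request"
    print(route_prompt(sentence) + " " + sentence)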

AAAI Conference 2024 Conference Paper

Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation

  • Hui Fu
  • Zeqing Wang
  • Ke Gong
  • Keze Wang
  • Tianshui Chen
  • Haojie Li
  • Haifeng Zeng
  • Wenxiong Kang

Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/

AAAI Conference 2024 Conference Paper

NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning

  • Linsheng Chen
  • Guangrun Wang
  • Liuchun Yuan
  • Keze Wang
  • Ken Deng
  • Philip H.S. Torr

Neural Radiance Fields (NeRF) have garnered remarkable success in novel view synthesis. Nonetheless, the task of generating high-quality images for novel views persists as a critical challenge. While the existing efforts have exhibited commendable progress, capturing intricate details, enhancing textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics warrant further focused attention and advancement. In this work, we propose NeRF-VPT, an innovative method for novel view synthesis to address these challenges. Our proposed NeRF-VPT employs a cascading view prompt tuning paradigm, wherein RGB information gained from preceding rendering outcomes serves as instructive visual prompts for subsequent rendering stages, with the aspiration that the prior knowledge embedded in the prompts can facilitate the gradual enhancement of rendered image quality. NeRF-VPT only requires sampling RGB data from previous stage renderings as priors at each training stage, without relying on extra guidance or complex techniques. Thus, our NeRF-VPT is plug-and-play and can be readily integrated into existing methods. By conducting comparative analyses of our NeRF-VPT against several NeRF-based approaches on demanding real-scene benchmarks, such as Realistic Synthetic 360, Real Forward-Facing, Replica dataset, and a user-captured dataset, we substantiate that our NeRF-VPT significantly elevates baseline performance and proficiently generates more high-quality novel view images than all the compared state-of-the-art methods. Furthermore, the cascading learning of NeRF-VPT introduces adaptability to scenarios with sparse inputs, resulting in a significant enhancement of accuracy for sparse-view novel view synthesis. The source code and dataset are available at https://github.com/Freedomcls/NeRF-VPT.

ICRA Conference 2021 Conference Paper

Continuous Transition: Improving Sample Efficiency for Continuous Control Problems via MixUp

  • Junfan Lin
  • Zhongzhan Huang
  • Keze Wang
  • Xiaodan Liang
  • Weiwei Chen
  • Liang Lin

Although deep reinforcement learning (RL) has been successfully applied to a variety of robotic control tasks, it is still challenging to apply it to real-world tasks due to the poor sample efficiency. Attempting to overcome this shortcoming, several works focus on reusing the collected trajectory data during the training by decomposing them into a set of policy-irrelevant discrete transitions. However, their improvements are somewhat marginal since i) the amount of the transitions is usually small, and ii) the value assignment only happens in the joint states. To address these issues, this paper introduces a concise yet powerful method to construct Continuous Transition, which exploits the trajectory information by leveraging the potential transitions along the trajectory. Specifically, we propose to synthesize new transitions for training by linearly interpolating the consecutive transitions. To keep the constructed transitions authentic, we also develop a discriminator to guide the construction process automatically. Extensive experiments demonstrate that our proposed method achieves a significant improvement in sample efficiency on various complex continuous robotic control problems in MuJoCo and outperforms the advanced model-based/model-free RL methods. The source code is available.
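
A minimal sketch of interpolating two consecutive transitions into a synthetic one with a MixUp-style coefficient is shown below; the Beta-distributed coefficient and the dictionary format are assumptions, and the authenticity discriminator from the paper is omitted.

    # Minimal sketch of mixing two consecutive transitions (s, a, r, s') into a
    # synthetic one with a MixUp-style coefficient. The Beta parameters and the
    # dict format are illustrative assumptions; the paper's discriminator is omitted.
    import numpy as np

    def continuous_transition(t0, t1, rng, alpha=0.8):
        """t0, t1: consecutive transitions as dicts of arrays/scalars."""
        lam = rng.beta(alpha, alpha)
        mix = lambda a, b: lam * np.asarray(a) + (1 - lam) * np.asarray(b)
        return {key: mix(t0[key], t1[key]) for key in ("state", "action", "reward", "next_state")}

    rng = np.random.default_rng(0)
    t0 = {"state": [0.0, 1.0], "action": [0.1], "reward": 1.0, "next_state": [0.5, 1.2]}
    t1 = {"state": [0.5, 1.2], "action": [0.3], "reward": 0.0, "next_state": [0.9, 1.4]}
    print(continuous_transition(t0, t1, rng))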

IJCAI Conference 2018 Conference Paper

Convolutional Memory Blocks for Depth Data Representation Learning

  • Keze Wang
  • Liang Lin
  • Chuangjie Ren
  • Wei Zhang
  • Wenxiu Sun

Compared to natural RGB images, data captured by 3D/depth sensors (e.g., Microsoft Kinect) have different properties, e.g., less discriminable in appearance due to lacking color/texture information. Applying convolutional neural networks (CNNs) on these depth data would lead to unsatisfying learning efficiency, i.e., requiring large amounts of annotated training data for convergence. To address this issue, this paper proposes a novel memory network module, called Convolutional Memory Block (CMB), which empowers CNNs with the memory mechanism on handling depth data. Different from the existing memory networks that store long/short-term dependency from sequential data, our proposed CMB focuses on modeling the representative dependency (correlation) among non-sequential samples. Specifically, our CMB consists of one internal memory (i.e., a set of feature maps) and three specific controllers, which enable a powerful yet efficient memory manipulation mechanism. In this way, the internal memory, being implicitly aggregated from all previous inputted samples, can learn to store and utilize representative features among the samples. Furthermore, we employ our CMB to develop a concise framework for predicting articulated pose from still depth images. Comprehensive evaluations on three public benchmarks demonstrate significant superiority (about 6%) of our framework over all the compared methods. More importantly, thanks to the enhanced learning efficiency, our framework can still achieve satisfying results using 50% less training data.

IROS Conference 2018 Conference Paper

Embedding Temporally Consistent Depth Recovery for Real-time Dense Mapping in Visual-inertial Odometry

  • Hui Cheng
  • Zhuoqi Zheng
  • Jinhao He
  • Chongyu Chen
  • Keze Wang
  • Liang Lin

Dense mapping is always the desire of simultaneous localization and mapping (SLAM), especially for the applications that require fast and dense scene information. Visual-inertial odometry (VIO) is a light-weight and effective solution to fast self-localization. However, VIO-based SLAM systems have difficulty in providing dense mapping results due to the spatial sparsity and temporal instability of the VIO depth estimations. Although there have been great efforts on real-time mapping and depth recovery from sparse measurements, the existing solutions for VIO-based SLAM still fail to preserve sufficient geometry details in their results. In this paper, we propose to embed depth recovery into VIO-based SLAM for real-time dense mapping. In the proposed method, we present a subspace-based stabilization scheme to maintain the temporal consistency and design a hierarchical pipeline for edge-preserving depth interpolation to reduce the computational burden. Numerous experiments demonstrate that our method can achieve an accuracy improvement of up to 49.1 cm compared to state-of-the-art learning-based methods for depth recovery and reconstruct sufficient geometric details in dense mapping when only 0.07% depth samples are available. Since a simple CPU implementation of our method already runs at 10-20 fps, we believe our method is very favorable for practical SLAM systems with critical computational requirements.