Arrow Research search

Author name cluster

Bin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

96 papers
2 author rows

Possible papers

96

JBHI Journal 2026 Journal Article

Attention-Enhanced Temporal and Spatial Feature Extraction Network for ADHD Diagnosis based on fMRI

  • Dandan Li
  • Zhenyu Zhao
  • Jiangyang Hao
  • Xingwang Dong
  • Yating Zhang
  • Jie Xiang
  • Bin Wang

Attention Deficit Hyperactivity Disorder (ADHD) is a prevalent neurodevelopmental disorder, and accurate diagnosis is critical for ensuring timely intervention. The integration of deep learning and fMRI can effectively explore the abnormal spatiotemporal features of ADHD. However, existing deep learning models are unable to fully capture the temporal dependence and spatial consistency of fMRI data, primarily due to insufficient modeling of multi-scale temporal dependencies and the lack of explicit interaction between static and dynamic brain networks, resulting in poor diagnostic performance for ADHD. To comprehensively and efficiently extract the spatiotemporal features of fMRI signals, we propose an Attention-Enhanced Spatiotemporal Feature Extraction Network (AE-STEN), which comprises a Temporal Cross-scale Convolutional Attention Module (TCAM), a Spatial Collaborative Attention-Guided Graph Representation Module (SCGRM), and a Spatial-Temporal KAN Network (STKAN). TCAM is designed to capture short- and long-term dependencies in fMRI time series by jointly modeling local transient fluctuations and global temporal dependencies. SCGRM effectively extracts consistent spatial features from both dynamic and static fMRI data by explicitly modeling their collaborative interaction rather than treating them independently. STKAN integrates the extracted spatiotemporal features for final classification. Experiments on the ADHD-200 dataset, involving 747 subjects across seven sites, demonstrate that AE-STEN achieves a classification accuracy of up to 76.06% ± 0.65%. Moreover, AE-STEN identifies brain regions associated with ADHD consistent with clinical findings, indicating strong interpretability and highlighting the model's potential for clinical application.

AAAI Conference 2026 Conference Paper

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

  • Jiajun Jiao
  • Haowei Zhu
  • Puyuan Yang
  • Jianghui Wang
  • Ji Liu
  • Ziqiong Liu
  • Dong Li
  • Yuejian Fang

Diffusion models have achieved remarkable success in image and video generation. However, their inherently multi-step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large language models (LLMs) for automated acceleration code generation and evaluation. First, we present DiffBench, a comprehensive benchmark that implements a three-stage automated evaluation pipeline across diverse diffusion architectures, optimization combinations, and deployment scenarios. Second, we propose DiffAgent, an agent that generates optimal acceleration strategies and code for arbitrary diffusion models. DiffAgent employs a closed-loop workflow in which a planning component and a debugging component iteratively refine the output of a code generation component, while a genetic algorithm extracts performance feedback from the execution environment to guide subsequent code refinements. We provide a detailed explanation of the DiffBench construction and the design principles underlying DiffAgent. Extensive experiments show that DiffBench offers a thorough evaluation of generated code and that DiffAgent significantly outperforms existing LLMs in producing effective diffusion acceleration strategies.

AAAI Conference 2026 Conference Paper

Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos

  • Zhixin Xu
  • Hengyu Zhou
  • Yuan Liu
  • Wenhan Xue
  • Hao Pan
  • Wenping Wang
  • Bin Wang

Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 3D Gaussian Splatting have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays, frame rate discrepancies, or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4D Gaussian Splatting (4DGS) reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera's time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy can be integrated as a plug-and-play module into existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that this approach effectively processes temporally misaligned videos and significantly enhances baseline methods.

TMLR Journal 2026 Journal Article

Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

  • Ge Zhang
  • Mohammad Ali Alomrani
  • Hongjian Gu
  • Jiaming Zhou
  • Yaochen Hu
  • Bin Wang
  • Qun Liu
  • Mark Coates

Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework for relational reasoning that decomposes the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a reasoning graph that identifies crucial entities, relations, and attributes within the context. Subsequently, PoT identifies query-relevant reasoning paths within the graph, facilitating downstream reasoning of potential answers. Experimental evaluations across four relational reasoning datasets demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (up to 21.3%) without requiring fine-tuning or extensive LLM calls. Furthermore, unlike prior neuro-symbolic methods, PoT exhibits improved resilience against LLM extraction errors and input ambiguity by leveraging the compositional nature of graphs.

AAAI Conference 2026 Conference Paper

Large Language Models Struggle with Unreasonability in Math Problems

  • Jingyuan Ma
  • Damai Dai
  • Zihang Yuan
  • Rui Li
  • Weilin Luo
  • Bin Wang
  • Qun Liu
  • Lei Sha

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues, models frequently proceed as if the problem is well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements. Based on extensive experiments covering 19 LLMs, we find that even state-of-the-art general models like GPT-4o struggle on UMP. While reasoning models such as DeepSeek-R1 demonstrate a higher sensitivity to unreasonable inputs, this often comes at the cost of generating overly long and meaningless responses that fail to converge. We further find that prompting and fine-tuning enhance the detection of unreasonable inputs, with minor and acceptable trade-offs, making them practical solutions in this challenging setting.

AAAI Conference 2026 Conference Paper

Learning Structurally Stabilized Representations for Lossless DNA Storage

  • Ben Cao
  • Xue Li
  • Tiantian He
  • Bin Wang
  • Shihua Zhou
  • Xiaohu Wu
  • Qiang Zhang

This paper presents Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for lossless DNA data storage. In contrast to existing learning-based methods, RSRL is inspired by both error-correction coding and structural biology. Specifically, RSRL first learns the representations for the subsequent storage from the binary data transformed by the Reed-Solomon codec (RS code). Then, the representations are masked by an RS-code-informed mask to focus on correcting the burst errors occurring in the learning process. The synergy of RS masks and graph attention enables active error localization, breaking through the limitations of traditional passive error correction. With the decoded representations with error corrections, a novel biologically stabilized loss is formulated to regularize the data representations to possess stable single-stranded structures. By incorporating these novel strategies, RSRL can learn highly durable, dense, and lossless representations for subsequent storage tasks in DNA sequences. The proposed RSRL has been compared with a number of baselines in real-world tasks of multi-type data storage. The experimental results obtained demonstrate that RSRL can store diverse types of data with much higher information density and durability, and much lower error rates.

AAAI Conference 2026 Conference Paper

RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection

  • Yixin Yang
  • Qingxiu Dong
  • Linli Yao
  • Fangwei Zhu
  • Weilin Luo
  • Bin Wang
  • Zhifang Sui

Data selection for instruction tuning is crucial for improving the performance of large language models (LLMs) while reducing training costs. In this paper, we propose Refined Contribution Measurement with In-Context Learning (RICo), a novel gradient-free method that quantifies the fine-grained contribution of individual samples to both task-level and global-level model performance. RICo enables more accurate identification of high-contribution data, leading to better instruction tuning. We also introduce a lightweight selection paradigm trained on RICo scores, enabling scalable data selection with strictly linear inference complexity. Extensive experiments on 3 LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of RICo. Remarkably, on LLaMA3.1-8B, models trained on 15% of RICo-selected data outperform those trained on the full dataset by 5.42 percentage points and exceed the best performance of widely used selection methods by 1.48 percentage points. We further analyze high-contribution samples selected by RICo, which show both diverse tasks and appropriate difficulty levels, rather than merely the most difficult cases.

AAAI Conference 2026 Conference Paper

SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs

  • Zhongren Dong
  • Bin Wang
  • Jing Han
  • Haotian Guo
  • Xiaojun Mo
  • Yimin Cao
  • Zixing Zhang

Neural speech codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs our proposed Semantic Anchoring mechanism. This design strategically decouples the quantization of semantic content and acoustic details. The semantic anchoring is achieved via a lightweight projector that aligns acoustic features with a frozen, large-scale mHuBERT codebook, injecting linguistic priors while guaranteeing full codebook utilization. Sequentially, for acoustic details, a residual activation module with SimVQ enables a single-layer quantizer (acoustic path) to faithfully recover fine-grained information. At just 1.5 kbps, SACodec establishes a new state of the art by excelling in both fidelity and semantics: subjective listening tests confirm that its reconstruction quality is perceptually highly comparable to ground-truth audio, while its tokens demonstrate substantially improved semantic richness in downstream tasks. This work suggests that assigning specialized semantic quantizers to distinct information streams offers an effective path to reconcile the long-standing trade-off between fidelity, semantics, and modeling simplicity in low-bitrate speech tokenization.

AAAI Conference 2026 Conference Paper

SEFEL: A Simple Yet Effective Framework for Fast Event Linking

  • Yinan Liu
  • Ziyang Zhang
  • Bin Wang
  • Xiaochun Yang

Event linking aims to associate event mentions in text with their corresponding entries in a knowledge base (KB). This task aids text understanding, benefiting downstream tasks (e.g., question answering), and expands the KB with new event knowledge mentioned in text. Existing event linking approaches usually adopt a retrieve-and-rank framework, which suffers from high computational costs and relies on hand-crafted rules, thereby limiting generalization. Some entity linking methods can also be applied to this task directly, but they too perform poorly. In this paper, we propose SEFEL, an end-to-end, argument-aware event representation-based event linking framework that unifies the modeling of both in-KB and out-of-KB scenarios. To further enhance linking performance, we propose a contrastive learning module to refine the learned embeddings of events and event mentions. Experimental results demonstrate that SEFEL improves accuracy by at least 3.59 (in-KB) and 21.5 (out-of-KB) compared with baselines, while its inference speed is more than 38 times faster than baselines, showcasing its accuracy and efficiency.

AAAI Conference 2026 Conference Paper

Self-Improving Sparse Retrieval Through Heuristic Representation Refinement and Representation-Focused Learning

  • Xiaojing Li
  • Bin Wang
  • Xiaochun Yang
  • Meng Luo

Learnable sparse retrieval (LSR) models encode texts into high-dimensional sparse representations, supporting token-level expansion beyond the original text and addressing the vocabulary mismatch problem in traditional bag-of-words retrieval. However, in the absence of representation-level supervision, these representations usually overemphasize irrelevant tokens while neglecting truly relevant ones. We term this phenomenon the Representation Hallucination problem in LSR models, a critical bottleneck impeding accurate retrieval. To address this challenge, we introduce SiRe, a self-improving training framework for sparse retrieval that integrates two core strategies: Heuristic Representation Refinement and Representation-Focused Learning. Specifically, SiRe first identifies and corrects representation hallucinations in the outputs of the current LSR model using heuristic methods. The resulting representations serve as the primary supervision signals, guiding a pretrained language model (e.g., BERT) to mitigate the problem directly at the representation level. This process can be iterated, enabling progressive model improvement. Extensive experiments on both in-domain and out-of-domain benchmarks show that SiRe produces higher-quality sparse representations, significantly enhancing retrieval performance over strong baselines.

AAAI Conference 2026 Conference Paper

TrajAgg: Dual-Scale Feature Aggregation with Hybrid Training for Trajectory Similarity Computation in Free Space

  • Xiao Zhang
  • Xingyu Zhao
  • Yuan Cao
  • Bin Wang
  • Guiyuan Jiang
  • Yanwei Yu

With the widespread use of location-tracking technologies, large volumes of trajectory data are continuously generated. Trajectory similarity computation is a core task in trajectory mining with broad applications. However, existing methods still face two key challenges: (1) the difficulty of balancing efficiency and representation quality, and (2) the reliance on a single training paradigm, which limits the ability to capture both pairwise similarity and batch-level coherence. To address these challenges, we propose a trajectory similarity computation framework named TrajAgg. Specifically, our framework incorporates a novel Aggregation Transformer that efficiently aggregates GPS and grid features through two stages of direct interaction and enhances the expressiveness of the resulting trajectory embeddings. In addition, by integrating two distinct training paradigms, our model captures both fine-grained pairwise relationships and global structural consistency. We further analyze its effectiveness from the perspective of mutual information. Extensive experiments on three publicly available datasets show that TrajAgg consistently outperforms state-of-the-art baselines. Our method achieves average improvements of 15.11%, 16.49%, 10.41%, and 40.15% in HR@1 under four distance measures across three datasets, respectively.

JBHI Journal 2026 Journal Article

Trifocal Transformer: Connection-Mask-Residual Focused Attention Network for Brain Disease Diagnosis

  • Bin Wang
  • Jiarui Liang
  • Chuyang Ye
  • Ting Yan
  • Miaomiao Liu
  • Tianyi Yan

Functional magnetic resonance imaging (fMRI) allows the observation of brain functional connectivity patterns. Attention-based diagnostic models have been widely applied in fMRI data for brain disease diagnosis. However, the global attention mechanism of the Transformer faces challenges in adaptively identifying and focusing on significant brain regions and connections relevant to disease diagnosis while reducing attention to non-relevant regions and connections in fMRI data, as well as the degradation problem of the attention mechanism, thereby limiting the improvement in diagnostic accuracy. To address these problems, we propose a connection-mask-residual focused attention network (Trifocal Transformer) based on fMRI data for brain disease diagnosis. In the Trifocal Transformer, a Connection Focus Module is developed to simulate brain functional connectivity, thereby enhancing the attention mechanism's ability to focus on significant regions and connections relevant to disease diagnosis. To mitigate the potential negative impact of non-focused regions in the attention map, a learnable Mask Focus Module is designed to adaptively reduce attention to non-relevant regions and connections. To address the degradation of the attention mechanism's focusing ability, we establish Residual Focus Connections between the attention maps, which reinforce the focusing effect across layers and ensure stable attention to significant features. Comprehensive experimental results demonstrate that the Trifocal Transformer achieves superior diagnostic accuracies of 74.1% and 71.2% on the ADHD-200 and ABIDE I datasets, respectively. Furthermore, our method reveals potentially disease-related regions of interest (ROIs), providing a new neuroimaging perspective for brain disease diagnosis and treatment.

JBHI Journal 2025 Journal Article

A Home-based Dual-mode Upper Limb Rehabilitation System: Teleoperation Mode and Bilateral Mode with sEMG and IMU

  • He Li
  • Shuxiang Guo
  • Ruijie He
  • Bin Wang
  • Mingchao Ding

Upper limb hemiplegia is a common functional disorder among stroke patients, significantly affecting their quality of life. To address this issue, robot-assisted upper limb rehabilitation training has emerged as a new therapeutic approach, breaking through the time and space limitations of traditional rehabilitation. To this end, a home-based dual-mode upper limb rehabilitation system is built, including a teleoperation mode based on a cloud server and a bilateral mode with fusion of Surface Electromyography (sEMG) and Inertial Measurement Unit (IMU) data. In the telerehabilitation mode, patients can receive professional guidance and regular training at home, greatly enhancing the accessibility of rehabilitation services. Experiments are conducted through a cloud server with the master side in Beijing (China) and the slave side in three different cities. The slave side is controlled by the master side, and the contact force is sent back to the master side. In the bilateral mode, the intention of continuous movements across subjects can be accurately predicted via the fusion of sEMG and IMU, improving the naturalness of human-robot interaction. In subject-independent modeling, the Root Mean Square Error (RMSE) under fusion showed a relative decrease of 15.0329% (p < 10⁻⁴) compared to IMU data alone, and a significantly greater reduction of 61.9376% (p < 10⁻⁴) in comparison with sEMG data alone. The robot-assisted upper limb exoskeleton, cloud-based teleoperation, and bilateral training based on sEMG and IMU collectively form a new rehabilitation system, representing part of the future rehabilitation trend.

IJCAI Conference 2025 Conference Paper

Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling

  • Rui Qin
  • Qijie Wang
  • Ming Sun
  • Haowei Zhu
  • Chao Zhou
  • Bin Wang

Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in super-resolution (SR) tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like SR. In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2–3.0 and outperforming current acceleration methods with only half the number of steps.

TIST Journal 2025 Journal Article

Adaptive Intention Learning for Session-Based Recommendation

  • Qingbo Zhang
  • Xiaochun Yang
  • Hao Chen
  • Bin Wang
  • Zhu Sun
  • Xiangmin Zhou

In recent years, session-based recommender systems (SRSs) have emerged as a significant research focus within the recommendation field. Capturing user intentions to infer user interest accordingly has proven to be effective in enhancing the accuracy of SRSs. However, existing techniques assume that all sessions have the same number of intentions or that the items in one category belonging to the same session reflect the same intention. In real applications, such as e-commerce, sessions may have different numbers of intentions, and the same type of items in a session may correspond to different intentions. As a result, existing techniques cannot guarantee high-quality user interest prediction. In this article, we propose a novel Adaptive Intention Learning Network (AILN) to capture an adaptive number of intentions for each session, thereby enhancing the accuracy of user interest inference. Specifically, we design an intention evaluation network (IEN) to evaluate whether a subsequence of a session corresponds to a valid intention, and an intention generation network (IGN) to learn the representation of a valid intention. By checking each subsequence of a session, IEN and IGN enable the incremental learning of a session-specific intention hierarchy (IH) to store valid intentions of the session. To reduce the cost of building the IH, we propose a pruning strategy that exploits the intention validity to avoid unnecessary evaluation. The representative intentions are selected from IH and input into a designed interest predictor to infer the user interest. Experimental results on two real-world datasets demonstrate the superiority of our proposed AILN.

IJCAI Conference 2025 Conference Paper

DGraFormer: Dynamic Graph Learning Guided Multi-Scale Transformer for Multivariate Time Series Forecasting

  • Han Yan
  • Dongliang Chen
  • Guiyuan Jiang
  • Bin Wang
  • Lei Cao
  • Junyu Dong
  • Yanwei Yu

Multivariate time series forecasting is a critical focus across many fields. Existing transformer-based models have overlooked the explicit modeling of inter-variable correlations. Similarly, the graph-based methods have also failed to address the dynamic nature of multivariate correlations and the noise in correlation modeling. To overcome these challenges, we propose a novel Dynamic Graph Learning Guided Multi-Scale Transformer (DGraFormer) for multivariate time series forecasting. Specifically, our method consists of two main components: dynamic correlation-aware graph learning (DCGL) and a multi-scale temporal transformer (MTT). The former aims to capture dynamic correlations across different time windows, filters out noise, and selects key weights to guide the aggregation of relevant feature representations. The latter can effectively extract temporal patterns from patch data at varying scales. Finally, the proposed method can capture rich local correlation graph structures and multi-scale global temporal features. Experimental results demonstrate that DGraFormer significantly outperforms existing state-of-the-art models on ten real-world datasets, achieving the best performance across multiple evaluation metrics. The source code of our model is available at https://anonymous.4open.science/r/DGraFormer.

NeurIPS Conference 2025 Conference Paper

Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

  • Hengyuan Cao
  • Yutong Feng
  • Biao Gong
  • Yijing Tian
  • Yunhong Lu
  • Chuang Liu
  • Bin Wang

Video generative models can be regarded as world simulators due to their ability to capture dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which utilizes the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to address the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays the foundation for future unified generative models across visual modalities. The project page is https://dra-ctrl-2025.github.io/DRA-Ctrl/.

NeurIPS Conference 2025 Conference Paper

Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

  • Zichen Wen
  • Shaobo Wang
  • Yufa Zhou
  • Junyuan Zhang
  • Qintong Zhang
  • Yifeng Gao
  • Zhaorun Chen
  • Bin Wang

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.

AAAI Conference 2025 Conference Paper

IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

  • Bin Wang
  • Chunyu Xie
  • Dawei Leng
  • Yuhui Yin

In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models will be released.

AAAI Conference 2025 Conference Paper

LLM4RSR: Large Language Models as Data Correctors for Robust Sequential Recommendation

  • Yatong Sun
  • Xiaochun Yang
  • Zhu Sun
  • Yan Wang
  • Bin Wang
  • Xinghua Qu

Sequential Recommenders (SRs) are trained to predict the next item as the target given its preceding items as the input, assuming every input-target pair is matched and is reliable for training. However, users can be induced by external distractions to click on items inconsistent with their true preferences, resulting in unreliable training instances with mismatched input-target pairs. To resist unreliable data, researchers attempt to develop Robust SRs (RSRs). However, our data analysis unveils that existing RSRs are data-driven. That is, for most instances formed by infrequently co-occurring items, existing RSRs are uncertain about their reliability. To fill this gap, we propose a generic framework -- LLM4RSR (Large Language Models for Robust Sequential Recommendation) -- to semantically complement data-driven RSRs by correcting uncertain instances into reliable ones based on LLMs' semantic comprehension of items beyond co-occurrence. In this way, RSRs can be re-trained with the corrected data for better accuracy. This is a selective knowledge distillation procedure, where the LLM acts as a teacher guiding student RSRs via uncertain instances. To align LLMs with the data correction task and mitigate inherent hallucinations, we equip the LLM with profile, plan, and memory modules, which are automatically optimized via textual gradient descent, eliminating the need for human effort and expertise. Experiments on four real-world datasets spanning eight backbones verify the generality, effectiveness, and efficiency of LLM4RSR.

NeurIPS Conference 2025 Conference Paper

LogicTree: Improving Complex Reasoning of LLMs via Instantiated Multi-step Synthetic Logical Data

  • Zehao Wang
  • Lin Yang
  • Jie Wang
  • Kehan Wang
  • Hanzhu Chen
  • Bin Wang
  • Jianye Hao
  • Defu Lian

Despite their remarkable performance on various tasks, Large Language Models (LLMs) still struggle with logical reasoning, particularly in complex and multi-step reasoning processes. Among various efforts to enhance LLMs' reasoning capabilities, synthesizing large-scale, high-quality logical reasoning datasets has emerged as a promising direction. However, existing methods often rely on predefined templates for logical reasoning data generation, limiting their adaptability to real-world scenarios. To address this limitation, we propose LogicTree, a novel framework for efficiently synthesizing multi-step logical reasoning datasets that excel in both complexity and instantiation. By iteratively searching for applicable logic rules based on structural pattern matching to perform backward deduction, LogicTree constructs multi-step logic trees that capture complex reasoning patterns. Furthermore, we employ a two-stage LLM-based approach to instantiate various real-world scenarios for each logic tree, generating consistent real-world reasoning processes that carry contextual significance. This helps LLMs develop generalizable logical reasoning abilities across diverse scenarios rather than merely memorizing templates. Experiments on multiple benchmarks demonstrate that our approach achieves an average improvement of 9.4% in accuracy on complex logical reasoning tasks.

IROS Conference 2025 Conference Paper

Long-Distance Delivery of Collective Cell Microrobots Driven by Mobile Magnetic Actuation System

  • Yimin Sun
  • Ying Cao
  • Haoyu Zhang
  • Bin Wang
  • Qijun Yang
  • Mingxue Cai
  • Tiantian Xu
  • Qianqian Wang

Collective microrobots enable controlled batch delivery, showing promising application in the biomedical field. However, significant challenges remain in achieving long-distance delivery of collective microrobots in dynamic environments. This study proposes a magnetic actuation strategy for delivering collective cell microrobots in flowing conditions. A magnetic actuation method is developed, and a mobile actuation system with multiple coils coordination is designed to generate spatially isotropic magnetic fields. Experiments of delivering collective microrobots are conducted in flowing conditions, including downstream and upstream with an average flow velocity up to 8.84 mm/s. Results demonstrate that the proposed actuation strategy enhances driving performance in dynamic environments, achieving long-distance delivery of collective microrobots (over 548 mm). The final access rate of microrobots reaches 90.63% and 94.79% in upstream and downstream conditions, respectively. Our strategy provides an efficient control method for delivering collective microrobots, showing potential for targeted delivery in biomedical applications.

IJCAI Conference 2025 Conference Paper

Non-collective Calibrating Strategy for Time Series Forecasting

  • Bin Wang
  • Yongqi Han
  • Minbo Ma
  • Tianrui Li
  • Junbo Zhang
  • Feng Hong
  • Yanwei Yu

Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish the rule of thumb for designing the golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to elaborating and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model's learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting models, regardless of their specific architectures. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1).

NeurIPS Conference 2025 Conference Paper

OmniTry: Virtual Try-On Anything without Masks

  • Yutong Feng
  • Linlin Zhang
  • Hengyuan Cao
  • Yiming Chen
  • Xiaoduan Feng
  • Jian Cao
  • Yuxiong Wu
  • Bin Wang

Virtual Try-ON (VTON) is a practical and widely-applied task, for which most existing works focus on clothes. This paper presents OmniTry, a unified framework that extends VTON beyond garments to encompass any wearable objects, e.g., jewelry and accessories, in a mask-free setting for more practical application. When extending to various types of objects, data curation is challenging for obtaining paired images, i.e., the object image and the corresponding try-on result. To tackle this problem, we propose a two-staged pipeline: For the first stage, we leverage large-scale unpaired images, i.e., portraits with any wearable items, to train the model for mask-free localization. Specifically, we repurpose the inpainting model to automatically draw objects in suitable positions given an empty mask. For the second stage, the model is further fine-tuned with paired images to transfer the consistency of object appearance. We observed that the model after the first stage shows quick convergence even with few paired samples. OmniTry is evaluated on a comprehensive benchmark consisting of 12 common classes of wearable objects, with both in-shop and in-the-wild images. Experimental results suggest that OmniTry shows better performance on both object localization and ID-preservation compared with existing methods. The code, model weights, and evaluation benchmark of OmniTry are available at https://omnitry.github.io/.

NeurIPS Conference 2025 Conference Paper

ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection

  • Haowei Zhu
  • Tianxiang Pan
  • Rui Qin
  • Jun-Hai Yong
  • Bin Wang

The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content–position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions during sampling. We further propose region-aligned cross-attention to enforce spatial–semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales.

AAAI Conference 2025 Conference Paper

Reverse Distribution Based Video Moment Retrieval for Effective Bias Elimination

  • Lingdu Kong
  • Xiaochun Yang
  • Tieying Li
  • Bin Wang
  • Xiangmin Zhou

Video Moment Retrieval (VMR) aims to identify a temporal segment in an untrimmed video that best matches a given textual query. Bias in VMR is a critical issue, where a model can achieve favorable results even when disregarding the video input. Existing evaluation methods, such as Resplitting, have attempted to address bias by creating out-of-distribution (OOD) datasets. However, these methods provide an incomplete definition of bias and do not quantify it. To this end, we provide a comprehensive definition of bias in VMR, encompassing both data bias and model bias. In addition, our evaluation metrics better quantify the magnitude of these biases. To address both data and model biases comprehensively, we introduce Reverse Distribution based VMR (ReDis-VMR). This novel approach dynamically generates datasets with inverse distributions tailored to different models based on Gaussian kernel estimation. As a result, it enables a more accurate evaluation of model performance. Building on ReDis-VMR, we further propose the Dynamic Expandable Adjustment (DEA) pipeline. DEA incrementally expands the model structure to enhance its focus on video and text features, and it incorporates a fair loss to minimize the influence of concentrated data distributions. The experimental results on bias ratio demonstrate that our ReDis method achieves state-of-the-art performance in bias elimination, while the results on moment retrieval confirm the effectiveness of our DEA framework across three evaluation methods, two datasets, and three baselines.

NeurIPS Conference 2025 Conference Paper

ROSE: Remove Objects with Side Effects in Videos

  • Chenxuan Miao
  • Yutong Feng
  • Jianshu Zeng
  • Zixiang Gao
  • Hantang Liu
  • Yunfeng Yan
  • Donglian Qi
  • Xi Chen

Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirrors. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on various kinds of side-effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.

AAAI Conference 2025 Conference Paper

Spatiotemporal-aware Trend-Seasonality Decomposition Network for Traffic Flow Forecasting

  • Lingxiao Cao
  • Bin Wang
  • Guiyuan Jiang
  • Yanwei Yu
  • Junyu Dong

Traffic prediction is critical for optimizing travel scheduling and enhancing public safety, yet the complex spatial and temporal dynamics within traffic data present significant challenges for accurate forecasting. In this paper, we introduce a novel model, the Spatiotemporal-aware Trend-Seasonality Decomposition Network (STDN). This model begins by constructing a dynamic graph structure to represent traffic flow and incorporates novel spatio-temporal embeddings to jointly capture global traffic dynamics. The representations learned are further refined by a specially designed trend-seasonality decomposition module, which disentangles the trend-cyclical component and seasonal component for each traffic node at different times within the graph. These components are subsequently processed through an encoder-decoder network to generate the final predictions. Extensive experiments conducted on real-world traffic datasets demonstrate that STDN achieves superior performance with remarkable computational efficiency. Furthermore, we have released a new traffic dataset named JiNan, which features unique inner-city dynamics, thereby enriching the scenario comprehensiveness in traffic prediction evaluation.

AAAI Conference 2025 Conference Paper

Stability and Generalization of Zeroth-Order Decentralized Stochastic Gradient Descent with Changing Topology

  • Xiaolin Hu
  • Zixuan Gong
  • Gengze Xu
  • Wei Liu
  • Jian Luan
  • Bin Wang
  • Yong Liu

Zeroth-order (ZO) optimization as the gradient-free method has become a powerful tool when the first-order gradient is unavailable or expensive to obtain, especially in decentralized learning scenarios where data and computational resources are distributed across multiple clients. There have been many efforts to analyze the optimization convergence rate of zeroth-order decentralized stochastic gradient descent (ZO-DSGD) algorithms. However, the generalization of these methods has not been well studied. In this paper, we provide a generalization analysis of ZO-DSGD with changing topology, where the clients run zeroth-order SGD with local data and communicate with each other according to time-varying topology. We systematically analyze the generalization error in convex, strongly convex, and non-convex cases. The obtained results in the convex and strongly convex cases with zeroth-order oracles recover the results of SGD. Moreover, the generalization bounds derived in non-convex cases align with that of DSGD. To capture the influence of communication topology on the generalization performance, we analyze local generalization bounds concerning local models held at different clients. The obtained results reflect the influence of the number of clients, local sample size, and topology on the generalization error. To the best of our knowledge, this is the first work that provides a generalization analysis of zeroth-order decentralized stochastic gradient descent methods and recovers the results of SGD.

IJCAI Conference 2025 Conference Paper

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

  • Xinhao Yao
  • Hongjin Qian
  • Xiaolin Hu
  • Gengze Xu
  • Wei Liu
  • Jian Luan
  • Bin Wang
  • Yong Liu

Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we explore two remarkable phenomena related to the attention mechanism during the fine-tuning of LLMs (where Wq, Wk, and Wv denote the weights of the query, key, and value layers, respectively). The first phenomenon, termed “Unequal Importance of Attention Matrices”, highlights the impact of fine-tuning different weight matrices. It shows that optimizing the Wv matrix yields significantly better performance than optimizing the Wk matrix. Fine-tuning only the Wq and Wv matrices is computationally efficient while delivering results comparable to, or even better than fine-tuning all three matrices (Wq, Wk, and Wv). The second phenomenon, “Attention Matrices with Customized Learning Rate Lead to Better Convergence”, emphasizes the importance of assigning distinct learning rates to these matrices. Specifically, a higher learning rate for the Wv matrix compared to Wq and Wk accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving algorithms in LLMs fine-tuning.

AAAI Conference 2025 Conference Paper

Towards Ship License Plate Recognition in the Wild: A Large Benchmark and Strong Baseline

  • Baolong Liu
  • Ruiqing Yang
  • Roukai Huang
  • Wenhao Xu
  • Xin Pan
  • Chuanhuang Li
  • Bin Wang
  • Xun Wang

The paper targets the challenging task of Ship License Plate (SLP) recognition. Existing methods for SLP recognition are hampered by the scarcity of large and publicly available datasets, leading to evaluations on small and non-representative datasets. To alleviate this, we have built a large dataset, called SLP34K, which consists of 34,385 images collected by an intelligent traffic surveillance system. The dataset is carefully annotated by hand with text labels and attributes, and exhibits high data diversity owing to the multiple installation locations and long capture period of the cameras. Additionally, we propose a simple yet effective SLP recognition baseline method. The baseline is equipped with a strong visual encoder that benefits from initial pre-training via self-supervised learning, followed by further refinement through our devised semantic enhancement module. Extensive experiments on SLP34K verify the effectiveness of our proposed baseline. Moreover, while our baseline is designed for SLP recognition, it can also be used for common scene text recognition and achieves state-of-the-art performance on seven mainstream scene text recognition datasets.

AAAI Conference 2025 Conference Paper

Walk Wisely on Graph: Knowledge Graph Reasoning with Dual Agents via Efficient Guidance-Exploration

  • Zijian Wang
  • Bin Wang
  • Haifeng Jing
  • Huayu Li
  • Hongbo Dou

In recent years, multi-hop reasoning has been widely studied for knowledge graph (KG) reasoning due to its efficacy and interpretability. However, previous multi-hop reasoning approaches are subject to two primary shortcomings. First, agents struggle to learn effective and robust policies in the early phase due to sparse rewards. Second, these approaches often falter on specific datasets like sparse knowledge graphs, where agents are required to traverse lengthy reasoning paths. To address these problems, we propose a multi-hop reasoning model with dual agents based on hierarchical reinforcement learning (HRL), which is named FULORA. FULORA tackles the above reasoning challenges by eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks on the simplified knowledge graph to provide stage-wise hints for the low-level agent walking on the original knowledge graph. In this framework, the low-level agent optimizes a value function that balances two objectives: (1) maximizing return, and (2) integrating efficient guidance from the high-level agent. Experiments conducted on three real-world knowledge graph datasets demonstrate that FULORA outperforms RL-based baselines, especially in the case of long-distance reasoning.

NeurIPS Conference 2024 Conference Paper

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

  • Haowei Zhu
  • Dehua Tang
  • Ji Liu
  • Mingjie Lu
  • Jintu Zheng
  • Jinzhang Peng
  • Dong Li
  • Yu Wang

Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and extensive computational costs to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to utilize the similarity of features across adjacent denoising stages to reduce computational costs through simple and static strategies. However, these strategies cannot fully harness the potential of the similar feature patterns across adjacent timesteps. In this work, we propose a novel pruning method that derives an efficient diffusion model via a more intelligent and differentiable pruner. At the core of our approach is casting the model pruning process into a SubNet search process. Specifically, we first introduce a SuperNet based on standard diffusion via adding some backup connections built upon the similar features. We then construct a plugin pruner network and design optimization losses to identify redundant computation. Finally, our method can identify an optimal SubNet through few-step gradient optimization and a simple post-processing procedure. We conduct extensive experiments on various diffusion models including the Stable Diffusion series and DiTs. Our DiP-GO approach achieves a 4.4× speedup for SD-1.5 without any loss of accuracy, significantly outperforming the previous state-of-the-art methods.

ICML Conference 2024 Conference Paper

Distributed Bilevel Optimization with Communication Compression

  • Yutong He
  • Jie Hu 0022
  • Xinmeng Huang
  • Songtao Lu
  • Bin Wang
  • Kun Yuan 0001

Stochastic bilevel optimization tackles challenges involving nested optimization structures. Its fast-growing scale nowadays necessitates efficient distributed algorithms. In conventional distributed bilevel methods, each worker must transmit full-dimensional stochastic gradients to the server every iteration, leading to significant communication overhead and thus hindering efficiency and scalability. To resolve this issue, we introduce the first family of distributed bilevel algorithms with communication compression. The primary challenge in algorithmic development is mitigating bias in hypergradient estimation caused by the nested structure. We first propose C-SOBA, a simple yet effective approach with unbiased compression and provable linear speedup convergence. However, it relies on strong assumptions on bounded gradients. To address this limitation, we explore the use of moving average, error feedback, and multi-step compression in bilevel optimization, resulting in a series of advanced algorithms with relaxed assumptions and improved convergence properties. Numerical experiments show that our compressed bilevel algorithms can achieve a 10× reduction in communication overhead without severe performance degradation.

NeurIPS Conference 2024 Conference Paper

Distribution-Aware Data Expansion with Diffusion Models

  • Haowei Zhu
  • Ling Yang
  • Jun-Hai Yong
  • Hongzhi Yin
  • Jiawei Jiang
  • Meng Xiao
  • Wentao Zhang
  • Bin Wang

The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion techniques include image transformation and image synthesis methods. Transformation-based methods introduce only local variations, leading to limited diversity. In contrast, synthesis-based methods generate entirely new content, greatly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its capability to generate distribution-consistent samples, significantly improving data expansion tasks. DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data. Furthermore, our approach consistently outperforms existing synthesis-based techniques and demonstrates compatibility with widely adopted transformation-based augmentation methods. Additionally, the expanded dataset exhibits robustness across various architectural frameworks.

NeurIPS Conference 2024 Conference Paper

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

  • Xiaoyi Dong
  • Pan Zhang
  • Yuhang Zang
  • Yuhang Cao
  • Bin Wang
  • Linke Ouyang
  • Songyang Zhang
  • Haodong Duan

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 × 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 × 1600) and beyond. Concurrently, considering that ultra-high resolution may not be necessary in all scenarios, it supports a wide range of diverse resolutions from 336 pixels to 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the training image aspect ratios while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 × 336), leading to dynamic training resolution from 336 pixels to 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks.

IJCAI Conference 2024 Conference Paper

Multi-Relational Graph Attention Network for Social Relationship Inference from Human Mobility Data

  • Guangming Qin
  • Jianpeng Qi
  • Bin Wang
  • Guiyuan Jiang
  • Yanwei Yu
  • Junyu Dong

Inferring social relationships from human mobility data holds significant value in real-life spatio-temporal applications, which inspires the development of a series of graph-based methods for inferring social relationships. Despite their effectiveness, we argue that previous methods either rely solely on direct relations between users, neglecting valuable user mobility patterns, or have not fully harnessed the indirect interactions, thereby struggling to capture users' mobility preferences. To address these issues, in this work, we propose the Multi-Relational Graph Attention Network (MRGAN), a novel graph attention network, which is able to explicitly model indirect relations and effectively capture their different impact. Specifically, we first extract a multi-relational graph from the heterogeneous mobility graph to explicitly model the direct and indirect relations, and then utilize influence attention and cross-relation attention to further capture the different influence between users, and the different importance of relations for each user. Comprehensive experiments on three real-world mobile datasets demonstrate that the proposed model significantly outperforms state-of-the-art models in predicting social relationships between users. The source code of our model is available at https://github.com/qinguangming1999/MRGAN_IJCAI.

NeurIPS Conference 2024 Conference Paper

PhyRecon: Physically Plausible Neural Scene Reconstruction

  • Junfeng Ni
  • Yixin Chen
  • Bohan Jing
  • Nan Jiang
  • Bin Wang
  • Bo Dai
  • Puhao Li
  • Yixin Zhu

We address the issue of physical implausibility in multi-view neural reconstruction. While implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, limiting their utility in domains requiring rigorous physical accuracy. This lack of plausibility stems from the absence of physics modeling in existing methods and their inability to recover intricate geometrical structures. In this paper, we introduce PHYRECON, the first approach to leverage both differentiable rendering and differentiable physics simulation to learn implicit surface representations. PHYRECON features a novel differentiable particle-based physical simulator built on neural implicit representations. Central to this design is an efficient transformation between SDF-based implicit representations and explicit surface points via our proposed Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Additionally, PHYRECON models both rendering and physical uncertainty to identify and compensate for inconsistent and inaccurate monocular geometric priors. The physical uncertainty further facilitates physics-guided pixel sampling to enhance the learning of slender structures. By integrating these techniques, our model supports differentiable joint modeling of appearance, geometry, and physics. Extensive experiments demonstrate that PHYRECON significantly improves the reconstruction quality. Our results also exhibit superior physical stability in physical simulators, with at least a 40% improvement across all datasets, paving the way for future physics-based applications.

ICRA Conference 2024 Conference Paper

RoboKeyGen: Robot Pose and Joint Angles Estimation via Diffusion-based 3D Keypoint Generation

  • Yang Tian
  • Jiyao Zhang
  • Guowei Huang 0002
  • Bin Wang
  • Ping Wang
  • Jiangmiao Pang
  • Hao Dong 0003

Estimating robot pose and joint angles is significant in advanced robotics, enabling applications like robot collaboration and online hand-eye calibration. However, the introduction of unknown joint angles makes prediction more complex than simple robot pose estimation, due to its higher dimensionality. Previous methods either regress 3D keypoints directly or utilise a render&compare strategy. These approaches often falter in terms of performance or efficiency and grapple with the cross-camera gap problem. This paper presents a novel framework that bifurcates the high-dimensional prediction task into two manageable subtasks: 2D keypoints detection and lifting 2D keypoints to 3D. This separation promises enhanced performance without sacrificing the efficiency innate to keypoint-based techniques. A vital component of our method is the lifting of 2D keypoints to 3D keypoints. Common deterministic regression methods may falter when faced with uncertainties from 2D detection errors or self-occlusions. Leveraging the robust modeling potential of diffusion models, we reframe this issue as a conditional 3D keypoints generation task. To bolster cross-camera adaptability, we introduce the Normalised Camera Coordinate Space (NCCS), ensuring alignment of estimated 2D keypoints across varying camera intrinsics. Experimental results demonstrate that the proposed method outperforms the state-of-the-art render&compare method and achieves higher inference speed. Furthermore, the tests accentuate our method’s robust cross-camera generalisation capabilities. We intend to release both the dataset and code in https://nimolty.github.io/Robokeygen/.

AAAI Conference 2024 Conference Paper

VIGC: Visual Instruction Generation and Correction

  • Bin Wang
  • Fan Wu
  • Xiao Han
  • Jiahui Peng
  • Huaping Zhong
  • Pan Zhang
  • Xiaoyi Dong
  • Weijia Li

The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes and struggles to capture image details. A practical solution to this problem would be to utilize the available multimodal large language models to generate instruction data for vision-language tasks. However, it's worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC

AAAI Conference 2024 Conference Paper

W2P: Switching from Weak Supervision to Partial Supervision for Semantic Segmentation

  • Fangyuan Zhang
  • Tianxiang Pan
  • Jun-Hai Yong
  • Bin Wang

Current weakly-supervised semantic segmentation (WSSS) techniques concentrate on enhancing class activation maps (CAMs) with image-level annotations. Yet, the emphasis on producing these pseudo-labels often overshadows the pivotal role of training the segmentation model itself. This paper underscores the significant influence of noisy pseudo-labels on segmentation network performance, particularly in boundary regions. To address these issues, we introduce a novel paradigm: Weak to Partial Supervision (W2P). At its core, W2P categorizes the pseudo-labels from WSSS into two unique supervisions: trustworthy clean labels and uncertain noisy labels. Next, our proposed partially-supervised framework adeptly employs these clean labels to rectify the noisy ones, thereby promoting the continuous enhancement of the segmentation model. To further optimize boundary segmentation, we incorporate a noise detection mechanism that specifically preserves boundary regions while eliminating noise. During the noise refinement phase, we adopt a boundary-conscious noise correction technique to extract comprehensive boundaries from noisy areas. Furthermore, we devise a boundary generation approach that assists in predicting intricate boundary zones. Evaluations on the PASCAL VOC 2012 and MS COCO 2014 datasets confirm our method's impressive segmentation capabilities across various pseudo-labels.

AAAI Conference 2023 Conference Paper

BERT-ERC: Fine-Tuning BERT Is Enough for Emotion Recognition in Conversation

  • Xiangyu Qin
  • Zhiyu Wu
  • Tingting Zhang
  • Yanran Li
  • Jian Luan
  • Bin Wang
  • Li Wang
  • Jinshi Cui

Previous works on emotion recognition in conversation (ERC) follow a two-step paradigm, which can be summarized as first producing context-independent features via fine-tuning pretrained language models (PLMs) and then analyzing contextual information and dialogue structure information among the extracted features. However, we discover that this paradigm has several limitations. Accordingly, we propose a novel paradigm, i.e., exploring contextual information and dialogue structure information in the fine-tuning step, and adapting the PLM to the ERC task in terms of input text, classification structure, and training strategy. Furthermore, we develop our model BERT-ERC according to the proposed paradigm, which improves ERC performance in three aspects, namely suggestive text, fine-grained classification module, and two-stage training. Compared to existing methods, BERT-ERC achieves substantial improvement on four datasets, indicating its effectiveness and generalization capability. Besides, we also set up the limited resources scenario and the online prediction scenario to approximate real-world scenarios. Extensive experiments demonstrate that the proposed paradigm significantly outperforms the previous one and can be adapted to various scenes.

NeurIPS Conference 2023 Conference Paper

Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization

  • Jinxin Liu
  • Hongyin Zhang
  • Zifeng Zhuang
  • Yachen Kang
  • Donglin Wang
  • Bin Wang

In this work, we decouple the iterative bi-level offline RL (value estimation and policy extraction) from the offline training phase, forming a non-iterative bi-level paradigm and avoiding the iterative error propagation over two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization (value estimation) in training, while performing outer-level optimization (policy extraction) in testing. Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts like reward-conditioned policy: (q1) What information should we transfer from the inner level to the outer level? (q2) What should we pay attention to when exploiting the transferred information for safe/confident outer-level optimization? (q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP (design from policies), which fully answers the above questions. Specifically, in the inner level, DROP decomposes offline data into multiple subsets and learns an MBO score model (a1). To keep exploitation of the score model safe in the outer level, we explicitly learn a behavior embedding and introduce a conservative regularization (a2). During testing, we show that DROP permits deployment adaptation, enabling an adaptive inference across states (a3). Empirically, we evaluate DROP on various tasks, showing that DROP gains comparable or better performance compared to prior methods.

AAAI Conference 2023 Conference Paper

Dialogue Rewriting via Skeleton-Guided Generation

  • Chunlei Xin
  • Hongyu Lin
  • Shan Wu
  • Xianpei Han
  • Bo Chen
  • Wen Dai
  • Shuai Chen
  • Bin Wang

Dialogue rewriting aims to transform multi-turn, context-dependent dialogues into well-formed, context-independent text for most NLP systems. Previous dialogue rewriting benchmarks and systems assume a fluent and informative utterance to rewrite. Unfortunately, dialogue utterances from real-world systems are frequently noisy and with various kinds of errors that can make them almost uninformative. In this paper, we first present Real-world Dialogue Rewriting Corpus (RealDia), a new benchmark to evaluate how well current dialogue rewriting systems can deal with real-world noisy and uninformative dialogue utterances. RealDia contains annotated multi-turn dialogues from real scenes with ASR errors, spelling errors, redundancies and other noises that are ignored by previous dialogue rewriting benchmarks. We show that previous dialogue rewriting approaches are neither effective nor data-efficient to resolve RealDia. Then this paper presents Skeleton-Guided Rewriter (SGR), which can resolve the task of dialogue rewriting via a skeleton-guided generation paradigm. Experiments show that RealDia is a much more challenging benchmark for real-world dialogue rewriting, and SGR can effectively resolve the task and outperform previous approaches by a large margin.

IJCAI Conference 2023 Conference Paper

Efficient Multi-View Inverse Rendering Using a Hybrid Differentiable Rendering Method

  • Xiangyang Zhu
  • Yiling Pan
  • Bailin Deng
  • Bin Wang

Recovering the shape and appearance of real-world objects from natural 2D images is a long-standing and challenging inverse rendering problem. In this paper, we introduce a novel hybrid differentiable rendering method to efficiently reconstruct the 3D geometry and reflectance of a scene from multi-view images captured by conventional hand-held cameras. Our method follows an analysis-by-synthesis approach and consists of two phases. In the initialization phase, we use traditional SfM and MVS methods to reconstruct a virtual scene roughly matching the real scene. Then in the optimization phase, we adopt a hybrid approach to refine the geometry and reflectance, where the geometry is first optimized using an approximate differentiable rendering method, and the reflectance is optimized afterward using a physically-based differentiable rendering method. Our hybrid approach combines the efficiency of approximate methods with the high-quality results of physically-based methods. Extensive experiments on synthetic and real data demonstrate that our method can produce reconstructions with similar or higher quality than state-of-the-art methods while being more efficient.

NeurIPS Conference 2023 Conference Paper

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

  • Yao Mu
  • Qinglong Zhang
  • Mengkang Hu
  • Wenhai Wang
  • Mingyu Ding
  • Jun Jin
  • Bin Wang
  • Jifeng Dai

Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.

IJCAI Conference 2023 Conference Paper

Low-Confidence Samples Mining for Semi-supervised Object Detection

  • Guandu Liu
  • Fangyuan Zhang
  • Tianxiang Pan
  • Jun-Hai Yong
  • Bin Wang

Reliable pseudo labels from unlabeled data play a key role in semi-supervised object detection (SSOD). However, the state-of-the-art SSOD methods all rely on pseudo labels with high confidence, which ignore valuable pseudo labels with lower confidence. Additionally, the insufficient excavation for unlabeled data results in an excessively low recall rate thus hurting the network training. In this paper, we propose a novel Low-confidence Samples Mining (LSM) method to utilize low confidence pseudo labels efficiently. Specifically, we develop an additional pseudo information mining (PIM) branch on account of low-resolution feature maps to extract reliable large area instances, the IoUs of which are higher than small area ones. Owing to the complementary predictions between PIM and the main branch, we further design self-distillation (SD) to compensate for both in a mutually learning manner. Meanwhile, the extensibility of the above approaches enables our LSM to apply to Faster-RCNN and Deformable-DETR respectively. On the MS-COCO benchmark, our method achieves 3.54% mAP improvement over state-of-the-art methods under 5% labeling ratios.

AAAI Conference 2023 Conference Paper

Poisoning with Cerberus: Stealthy and Colluded Backdoor Attack against Federated Learning

  • Xiaoting Lyu
  • Yufei Han
  • Wei Wang
  • Jingkai Liu
  • Bin Wang
  • Jiqiang Liu
  • Xiangliang Zhang

Are Federated Learning (FL) systems free from backdoor poisoning with the arsenal of various defense strategies deployed? This is an intriguing problem with significant practical implications regarding the utility of FL services. Despite the recent proliferation of poisoning-resilient FL methods, our study shows that carefully tuning the collusion between malicious participants can minimize the trigger-induced bias of the poisoned local model from the poison-free one, which plays the key role in delivering stealthy backdoor attacks and circumventing a wide spectrum of state-of-the-art defense methods in FL. In our work, we instantiate the attack strategy by proposing a distributed backdoor attack method, namely Cerberus Poisoning (CerP). It jointly tunes the backdoor trigger and controls the poisoned model changes on each malicious participant to achieve a stealthy yet successful backdoor attack against a wide spectrum of defensive mechanisms of federated learning techniques. Our extensive study on 3 large-scale benchmark datasets and 13 mainstream defensive mechanisms confirms that Cerberus Poisoning poses a significantly severe threat to the integrity and security of federated learning practices, despite the proliferation of robust federated learning methods.

NeurIPS Conference 2023 Conference Paper

Theoretically Guaranteed Bidirectional Data Rectification for Robust Sequential Recommendation

  • Yatong Sun
  • Bin Wang
  • Zhu Sun
  • Xiaochun Yang
  • Yan Wang

Sequential recommender systems (SRSs) are typically trained to predict the next item as the target given its preceding (and succeeding) items as the input. Such a paradigm assumes that every input-target pair is reliable for training. However, users can be induced to click on items that are inconsistent with their true preferences, resulting in unreliable instances, i.e., mismatched input-target pairs. Current studies on mitigating this issue suffer from two limitations: (i) they discriminate instance reliability according to models trained with unreliable data, yet without theoretical guarantees that such a seemingly contradictory solution can be effective; and (ii) most methods can only tackle either unreliable input or targets but fail to handle both simultaneously. To fill the gap, we theoretically unveil the relationship between SRS predictions and instance reliability, whereby two error-bounded strategies are proposed to rectify unreliable targets and input, respectively. On this basis, we devise a model-agnostic Bidirectional Data Rectification (BirDRec) framework, which can be flexibly implemented with most existing SRSs for robust training against unreliable data. Additionally, a rectification sampling strategy is devised and a self-ensemble mechanism is adopted to reduce the (time and space) complexity of BirDRec. Extensive experiments on four real-world datasets verify the generality, effectiveness, and efficiency of our proposed BirDRec.

AAAI Conference 2022 System Paper

A Trend-Driven Fashion Design System for Rapid Response Marketing in E-commerce

  • Lianghua Huang
  • Yu Liu
  • Bin Wang
  • Pan Pan
  • Rong Jin

Fashion is the way we express ourselves and has grown into one of the largest industries in the world. Despite the significant evolution of the fashion industry over the past decades, it is still a great challenge to respond to the diverse preferences of a large number of different consumers in time and accurately. To deal with the problem, we present an innovative demonstration of a trend-driven fashion design system using deep generative modeling, which enables automatic fashion design and editing based on trend reports. Our system consists of three components, including trend-driven fashion design, interactive fashion editing, and popularity estimation. The system offers a unified framework for mass-production of fashion designs that conform to the trend, which helps businesses better respond to market demands.

JBHI Journal 2022 Journal Article

Automatic Coronary Artery Segmentation of CCTA Images With an Efficient Feature-Fusion-and-Rectification 3D-UNet

  • Along Song
  • Lisheng Xu
  • Lu Wang
  • Bin Wang
  • Xiaofan Yang
  • Bu Xu
  • Benqiang Yang
  • Stephen E. Greenwald

Automatic coronary artery segmentation is of great value in diagnosing coronary disease. In this paper, we propose an automatic coronary artery segmentation method for coronary computerized tomography angiography (CCTA) images based on a deep convolutional neural network. The proposed method consists of three steps. First, to improve the efficiency and effectiveness of the segmentation, a 2D DenseNet classification network is utilized to screen out the non-coronary-artery slices. Second, we propose a coronary artery segmentation network based on the 3D-UNet, which is capable of extracting, fusing and rectifying features efficiently for accurate coronary artery segmentation. Specifically, in the encoding process of the 3D-UNet network, we adapt the dense block into the 3D-UNet so that it can extract rich and representative features for coronary artery segmentation. In the decoding process, 3D residual blocks with feature rectification capability are applied to improve the segmentation quality further. Third, we introduce a Gaussian weighting method to obtain the final segmentation results. This operation can highlight the more reliable segmentation results at the center of the 3D data blocks while weakening the less reliable segmentations at the block boundary when merging the segmentation results of spatially overlapping data blocks. Experiments demonstrate that our proposed method achieves a Dice Similarity Coefficient (DSC) value of 0.826 on a CCTA dataset constructed by us. The code of the proposed method is available at https://github.com/alongsong/3D_CAS.
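The Gaussian weighting step described above — trusting block centers more than block boundaries when merging overlapping 3D predictions — can be sketched in a few lines. This is an illustrative reconstruction under assumed shapes and an assumed sigma; the authors' released code at the linked repository is the authoritative version:

```python
import numpy as np

def gaussian_weight(shape, sigma_scale=0.25):
    """Separable 3D Gaussian weight volume peaking at the block centre
    (sigma_scale is an assumed hyperparameter, not taken from the paper)."""
    grids = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    w = np.ones(shape)
    for g, s in zip(grids, shape):
        c, sigma = (s - 1) / 2.0, max(s * sigma_scale, 1e-6)
        w *= np.exp(-((g - c) ** 2) / (2 * sigma ** 2))
    return w

def merge_blocks(blocks, starts, out_shape, block_shape):
    """Gaussian-weighted average of spatially overlapping block predictions."""
    acc, norm = np.zeros(out_shape), np.zeros(out_shape)
    w = gaussian_weight(block_shape)
    for pred, (z, y, x) in zip(blocks, starts):
        sl = (slice(z, z + block_shape[0]),
              slice(y, y + block_shape[1]),
              slice(x, x + block_shape[2]))
        acc[sl] += w * pred   # centre voxels get the most weight
        norm[sl] += w
    return acc / np.maximum(norm, 1e-8)

# Two constant-valued 4x4x4 blocks overlapping along the last axis:
out = merge_blocks([np.zeros((4, 4, 4)), np.ones((4, 4, 4))],
                   [(0, 0, 0), (0, 0, 2)], (4, 4, 6), (4, 4, 4))
```

In the overlap region, the merged value blends the two predictions in proportion to each block's centre-weighted confidence, avoiding the hard seams that plain averaging or overwriting would produce.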

AAAI Conference 2022 Conference Paper

Bi-CMR: Bidirectional Reinforcement Guided Hashing for Effective Cross-Modal Retrieval

  • Tieying Li
  • Xiaochun Yang
  • Bin Wang
  • Chong Xi
  • Hanzhong Zheng
  • Xiangmin Zhou

Cross-modal hashing has attracted considerable attention for large-scale multimodal data. Recent supervised cross-modal hashing methods using multi-label networks utilize the semantics of multi-labels to enhance retrieval accuracy, where label hash codes are learned independently. However, all these methods assume that label annotations reliably reflect the relevance between their corresponding instances, which is not true in real applications. In this paper, we propose a novel framework called Bidirectional Reinforcement Guided Hashing for Effective Cross-Modal Retrieval (Bi-CMR), which exploits bidirectional learning to relieve the negative impact of this assumption. Specifically, in the forward learning procedure, we highlight the representative labels and learn the reinforced multi-label hash codes by intra-modal semantic information, and further adjust the similarity matrix. In the backward learning procedure, the reinforced multi-label hash codes and adjusted similarity matrix are used to guide the matching of instances. We construct two datasets with explicit relevance labels that reflect the semantic relevance of instance pairs based on two benchmark datasets. Bi-CMR is evaluated by conducting extensive experiments over these two datasets. Experimental results prove the superiority of Bi-CMR over four state-of-the-art methods in terms of effectiveness.

NeurIPS Conference 2022 Conference Paper

DOMINO: Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

  • Yao Mu
  • Yuzheng Zhuang
  • Fei Ni
  • Bin Wang
  • Jianyu Chen
  • Jianye Hao
  • Ping Luo

Adapting to the changes in transition dynamics is essential in robotic applications. By learning a conditional policy with a compact context, context-aware meta-reinforcement learning provides a flexible way to adjust behavior according to dynamics changes. However, in real-world applications, the agent may encounter complex dynamics changes. Multiple confounders can influence the transition dynamics, making it challenging to infer accurate context for decision-making. This paper addresses such a challenge by decomposed mutual information optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical trajectories while minimizing the state transition prediction error. Our theoretical analysis shows that DOMINO can overcome the underestimation of the mutual information caused by multi-confounded challenges via learning disentangled context and reduce the demand for the number of samples collected in various environments. Extensive experiments show that the context learned by DOMINO benefits both model-based and model-free reinforcement learning algorithms for dynamics generalization in terms of sample efficiency and performance in unseen environments.

AAAI Conference 2022 Conference Paper

Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition

  • Hao Liu
  • Bin Wang
  • Zhimin Bao
  • Mobai Xue
  • Sheng Kang
  • Deqiang Jiang
  • Yinsong Liu
  • Bo Ren

We introduce Perceiving Stroke-Semantic Context (PerSec), a new approach to self-supervised representation learning tailored for Scene Text Recognition (STR) task. Considering scene text images carry both visual and semantic properties, we equip our PerSec with dual context perceivers which can contrast and learn latent representations from low-level stroke and high-level semantic contextual spaces simultaneously via hierarchical contrastive learning on unlabeled text image data. Experiments in un- and semi-supervised learning settings on STR benchmarks demonstrate our proposed framework can yield a more robust representation for both CTC-based and attention-based decoders than other contrastive learning methods. To fully investigate the potential of our method, we also collect a dataset of 100 million unlabeled text images, named UTI-100M, covering 5 scenes and 4 languages. By leveraging hundred-million-level unlabeled data, our PerSec shows significant performance improvement when fine-tuning the learned representation on the labeled data. Furthermore, we observe that the representation learned by PerSec presents great generalization, especially under few labeled data scenes.

IJCAI Conference 2022 Conference Paper

Self-supervised Learning and Adaptation for Single Image Dehazing

  • Yudong Liang
  • Bin Wang
  • Wangmeng Zuo
  • Jiaying Liu
  • Wenqi Ren

Existing deep image dehazing methods usually depend on supervised learning with a large number of hazy-clean image pairs which are expensive or difficult to collect. Moreover, dehazing performance of the learned model may deteriorate significantly when the training hazy-clean image pairs are insufficient and are different from real hazy images in applications. In this paper, we show that exploiting a large-scale training set and adapting to real hazy images are two critical issues in learning effective deep dehazing models. Under the depth guidance estimated by a well-trained depth estimation network, we leverage the conventional atmospheric scattering model to generate massive hazy-clean image pairs for the self-supervised pre-training of the dehazing network. Furthermore, self-supervised adaptation is presented to adapt the pre-trained network to real hazy images. A learning-without-forgetting strategy is also deployed in self-supervised adaptation by combining self-supervision and model adaptation via contrastive learning. Experiments show that our proposed method performs favorably against the state-of-the-art methods, and is quite efficient, i.e., handling a 4K image in 23 ms. The codes are available at https://github.com/DongLiangSXU/SLAdehazing.
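The conventional atmospheric scattering model the abstract leverages is I = J·t + A·(1 − t), with transmission t = exp(−β·d) derived from the estimated depth d. A minimal sketch of synthesizing a hazy image from a clean image and a depth map under this model (the β and airlight values below are illustrative, not the paper's settings):

```python
import numpy as np

def synthesize_haze(clean, depth, beta=1.0, airlight=0.9):
    """Atmospheric scattering model: I = J * t + A * (1 - t),
    where the per-pixel transmission is t = exp(-beta * depth).

    clean: (H, W, 3) image in [0, 1]; depth: (H, W) scene depth."""
    t = np.exp(-beta * depth)[..., None]   # transmission, broadcast over channels
    return clean * t + airlight * (1.0 - t)

# Nearby pixels keep the clean radiance; distant pixels fade toward the airlight.
clean = np.full((2, 2, 3), 0.2)
depth = np.array([[0.0, 0.0], [10.0, 10.0]])
hazy = synthesize_haze(clean, depth)
```

Pairing each synthesized hazy image with its clean counterpart yields the massive hazy-clean training pairs used for self-supervised pre-training.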

AAAI Conference 2022 Conference Paper

Semi-supervised Object Detection with Adaptive Class-Rebalancing Self-Training

  • Fangyuan Zhang
  • Tianxiang Pan
  • Bin Wang

While self-training achieves state-of-the-art results in semi-supervised object detection (SSOD), it severely suffers from foreground-background and foreground-foreground imbalances in SSOD. In this paper, we propose an Adaptive Class-Rebalancing Self-Training (ACRST) with a novel memory module called CropBank to alleviate these imbalances and generate unbiased pseudo-labels. Besides, we observe that both self-training and data-rebalancing procedures suffer from noisy pseudo-labels in SSOD. Therefore, we contribute a simple yet effective two-stage pseudo-label filtering scheme to obtain accurate supervision. Our method achieves competitive performance on MS-COCO and VOC benchmarks. When using only 1% labeled data of MS-COCO, our method achieves 17.02 mAP improvement over the supervised method and 5.32 mAP gains compared with state-of-the-art methods.

IJCAI Conference 2021 Conference Paper

Does Every Data Instance Matter? Enhancing Sequential Recommendation by Eliminating Unreliable Data

  • Yatong Sun
  • Bin Wang
  • Zhu Sun
  • Xiaochun Yang

Most sequential recommender systems (SRSs) predict next-item as target for each user given its preceding items as input, assuming that each input is related to its target. However, users may unintentionally click on items that are inconsistent with their preference. We empirically verify that SRSs can be misguided with such unreliable instances (i.e., targets mismatch inputs). This inspires us to design a novel SRS By Eliminating unReliable Data (BERD) guided with two observations: (1) unreliable instances generally have high training loss; and (2) high-loss instances are not necessarily unreliable but uncertain ones caused by blurry sequential patterns. Accordingly, BERD models both loss and uncertainty of each instance via a Gaussian distribution to better distinguish unreliable instances; meanwhile an uncertainty-aware graph convolution network is exploited to assist in mining unreliable instances by lowering uncertainty. Extensive experiments on four real-world datasets demonstrate the superiority of our proposed BERD.

AAAI Conference 2021 Conference Paper

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

  • Binbin Xie
  • Jinsong Su
  • Yubin Ge
  • Xiang Li
  • Jianwei Cui
  • Junfeng Yao
  • Bin Wang

Code generation aims to automatically generate a piece of code given an input natural language utterance. Currently, among dominant models, it is treated as a sequence-to-tree task, where a decoder outputs a sequence of actions corresponding to the pre-order traversal of an Abstract Syntax Tree. However, such a decoder only exploits the preorder-traversal-based preceding actions, which are insufficient to ensure correct action predictions. In this paper, we first thoroughly analyze the context modeling difference between neural code generation models with decodings based on different traversals (preorder vs. breadth-first), and then propose to introduce a mutual learning framework to jointly train these models. Under this framework, we continuously enhance both models via mutual distillation, which involves synchronous executions of two one-to-one knowledge transfers at each training step. More specifically, we alternately choose one model as the student and the other as its teacher, and require the student to fit the training data and the action prediction distributions of its teacher. By doing so, both models can fully absorb the knowledge from each other and thus could be improved simultaneously. Experimental results and in-depth analysis on several benchmark datasets demonstrate the effectiveness of our approach. We release our code at https://github.com/DeepLearnXMU/CGML.
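The mutual distillation objective described here — each student fits the gold actions while matching its teacher's action prediction distribution, with roles swapped each step — can be sketched as a combined loss. This is a hedged reconstruction: the weighting `alpha` and the exact KL direction are assumptions, not details from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mutual_distill_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Cross-entropy on the gold actions plus KL(teacher || student).
    At each training step the two decoders swap student/teacher roles.

    *_logits: (N, A) action logits; targets: (N,) gold action indices."""
    p_s, p_t = softmax(student_logits), softmax(teacher_logits)
    ce = -np.log(p_s[np.arange(len(targets)), targets] + 1e-12).mean()
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean()
    return ce + alpha * kl

# With identical logits the KL term vanishes, leaving pure cross-entropy.
logits = np.array([[2.0, 0.5, -1.0]])
same = mutual_distill_loss(logits, logits, np.array([0]))
```

Because the KL term is non-negative and zero only when the distributions agree, each model is pulled toward its counterpart's predictions without being allowed to drift from the training data.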

NeurIPS Conference 2021 Conference Paper

Model-Based Reinforcement Learning via Imagination with Derived Memory

  • Yao Mu
  • Yuzheng Zhuang
  • Bin Wang
  • Guangxiang Zhu
  • Wulong Liu
  • Jianyu Chen
  • Ping Luo
  • Shengbo Li

Model-based reinforcement learning aims to improve the sample efficiency of policy learning by modeling the dynamics of the environment. Recently, the latent dynamics model is further developed to enable fast planning in a compact space. It summarizes the high-dimensional experiences of an agent, which mimics the memory function of humans. Learning policies via imagination with the latent model shows great potential for solving complex tasks. However, only considering memories from the true experiences in the process of imagination could limit its advantages. Inspired by the memory prosthesis proposed by neuroscientists, we present a novel model-based reinforcement learning framework called Imagining with Derived Memory (IDM). It enables the agent to learn policy from enriched diverse imagination with prediction-reliability weight, thus improving sample efficiency and policy robustness. Experiments on various high-dimensional visual control tasks in the DMControl benchmark demonstrate that IDM outperforms previous state-of-the-art methods in terms of policy robustness and further improves the sample efficiency of the model-based method.

AAAI Conference 2021 Conference Paper

Reasoning in Dialog: Improving Response Generation by Context Reading Comprehension

  • Xiuying Chen
  • Zhi Cui
  • Jiayi Zhang
  • Chen Wei
  • Jianwei Cui
  • Bin Wang
  • Dongyan Zhao
  • Rui Yan

In multi-turn dialog, utterances do not always take the full form of sentences (Carbonell 1983), which naturally makes understanding the dialog context more difficult. However, it is essential to fully grasp the dialog context to generate a reasonable response. Hence, in this paper, we propose to improve the response generation performance by examining the model's ability to answer a reading comprehension question, where the question is focused on the omitted information in the dialog. Enlightened by the multi-task learning scheme, we propose a joint framework that unifies these two tasks, sharing the same encoder to extract the common and task-invariant features with different decoders to learn task-specific features. To better fuse information from the question and the dialog history in the encoding part, we propose to augment the Transformer architecture with a memory updater, which is designed to selectively store and update the history dialog information so as to support downstream tasks. For the experiment, we employ human annotators to write and examine a large-scale dialog reading comprehension dataset. Extensive experiments are conducted on this dataset, and the results show that the proposed model brings substantial improvements over several strong baselines on both tasks. In this way, we demonstrate that reasoning can indeed help better response generation and vice versa. We release our large-scale dataset for further research.

AAAI Conference 2021 Conference Paper

Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning

  • Yu Liu
  • Lianghua Huang
  • Pan Pan
  • Bin Wang
  • Yinghui Xu
  • Rong Jin

This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level classifier. The overall framework is a replica of a supervised classification model, where semantic classes (e.g., dog, bird, and ship) are replaced by instance IDs. However, scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges including 1) the large-scale softmax computation; 2) the slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes that can be noisy. This work presents several novel techniques to handle these difficulties. First, we introduce a hybrid parallel training framework to make large-scale training feasible. Second, we present a raw-feature initialization mechanism for classification weights, which we assume offers a contrastive prior for instance discrimination and can clearly speed up convergence in our experiments. Finally, we propose to smooth the labels of a few hardest classes to avoid optimizing over very similar negative pairs. While being conceptually simple, our framework achieves competitive or superior performance compared to state-of-the-art unsupervised approaches, i.e., SimCLR, MoCo v2, and PIC under the ImageNet linear evaluation protocol and on several downstream visual tasks, verifying that full instance classification is a strong pretraining technique for many semantic visual tasks.
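
The core pretext task (every image is its own class) together with raw-feature weight initialization and hardest-negative label smoothing can be illustrated with a minimal numpy sketch. All dimensions, the temperature, and the smoothing amount here are invented toy values, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N "images", each treated as its own class (instance discrimination).
N, D = 8, 16
features = rng.normal(size=(N, D))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Raw-feature initialization: classifier weights start as the normalized
# instance features themselves, acting as a contrastive-style prior.
weights = features.copy()

def smoothed_targets(i, k=2, eps=0.1):
    """One-hot target for instance i, with eps mass spread over the k
    most similar (hardest) negative instances."""
    sims = weights @ features[i]
    sims[i] = -np.inf                   # exclude the positive itself
    hardest = np.argsort(sims)[-k:]     # k most similar negatives
    t = np.zeros(N)
    t[i] = 1.0 - eps
    t[hardest] = eps / k
    return t

def instance_loss(i, temperature=0.07):
    """Cross-entropy of the instance classifier against smoothed targets."""
    logits = weights @ features[i] / temperature
    m = logits.max()
    logp = logits - m - np.log(np.exp(logits - m).sum())  # log-softmax
    return -(smoothed_targets(i) * logp).sum()

losses = [instance_loss(i) for i in range(N)]
```

In the real setting the softmax runs over millions of instance IDs, which is why the paper needs hybrid parallelism; the sketch only shows the loss structure.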

IJCAI Conference 2020 Conference Paper

An Iterative Multi-Source Mutual Knowledge Transfer Framework for Machine Reading Comprehension

  • Xin Liu
  • Kai Liu
  • Xiang Li
  • Jinsong Su
  • Yubin Ge
  • Bin Wang
  • Jiebo Luo

The lack of sufficient training data in many domains poses a major challenge to the construction of domain-specific machine reading comprehension (MRC) models with satisfying performance. In this paper, we propose a novel iterative multi-source mutual knowledge transfer framework for MRC. As an extension of the conventional knowledge transfer with one-to-one correspondence, our framework focuses on the many-to-many mutual transfer, which involves synchronous executions of multiple many-to-one transfers in an iterative manner. Specifically, to update a target-domain MRC model, we first consider other domain-specific MRC models as individual teachers, and employ knowledge distillation to train a multi-domain MRC model, which is differentially required to fit the training data and match the outputs of these individual models according to their domain-level similarities to the target domain. After being initialized by the multi-domain MRC model, the target-domain MRC model is fine-tuned to match both its training data and the output of its previous best model simultaneously via knowledge distillation. Compared with previous approaches, our framework can continuously enhance all domain-specific MRC models by enabling each model to iteratively and differentially absorb the domain-shared knowledge from others. Experimental results and in-depth analyses on several benchmark datasets demonstrate the effectiveness of our framework.
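
The many-to-one step (distilling several domain teachers into one student, weighted by each teacher's domain-level similarity to the target) can be sketched as a single loss function. The shapes, temperature, and mixing weight below are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def multi_teacher_kd_loss(student_logits, true_label, teacher_logits_list,
                          domain_sims, alpha=0.5, T=2.0):
    """Cross-entropy on the gold label plus a KL term to each teacher,
    weighted by that teacher's similarity to the target domain."""
    ce = -np.log(softmax(student_logits)[true_label])
    p_student = softmax(student_logits, T)
    sims = np.asarray(domain_sims, dtype=float)
    sims = sims / sims.sum()            # normalize similarity weights
    kd = 0.0
    for w, t_logits in zip(sims, teacher_logits_list):
        p_teacher = softmax(t_logits, T)
        kd += w * np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    return (1 - alpha) * ce + alpha * (T ** 2) * kd

# Toy usage: two teachers, the first from a more similar domain.
loss = multi_teacher_kd_loss([2.0, 0.5, -1.0], 0,
                             [[1.5, 0.2, -0.5], [0.3, 1.0, 0.0]],
                             domain_sims=[0.8, 0.2])
```

The iterative part of the framework would simply re-run this distillation round-robin over domains, refreshing teachers each round.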

NeurIPS Conference 2020 Conference Paper

Graph Geometry Interaction Learning

  • Shichao Zhu
  • Shirui Pan
  • Chuan Zhou
  • Jia Wu
  • Yanan Cao
  • Bin Wang

While numerous approaches have been developed to embed graphs into either Euclidean or hyperbolic spaces, they do not fully utilize the information available in graphs, or lack the flexibility to model intrinsic complex graph geometry. To utilize the strength of both Euclidean and hyperbolic geometries, we develop a novel Geometry Interaction Learning (GIL) method for graphs, a well-suited and efficient alternative for learning abundant geometric properties in graphs. GIL captures more informative internal structural features with low dimensions while maintaining conformal invariance of each space. Furthermore, our method endows each node with the freedom to determine the importance of each geometry space via a flexible dual feature interaction learning and probability assembling mechanism. Promising experimental results are presented for five benchmark datasets on node classification and link prediction tasks.

AAAI Conference 2020 Conference Paper

GSSNN: Graph Smoothing Splines Neural Networks

  • Shichao Zhu
  • Lewei Zhou
  • Shirui Pan
  • Chuan Zhou
  • Guiying Yan
  • Bin Wang

Graph Neural Networks (GNNs) have achieved state-of-the-art performance in many graph data analysis tasks. However, they still suffer from two limitations for graph representation learning. First, they exploit non-smoothing node features, which may result in suboptimal embeddings and degenerated performance for graph classification. Second, they only exploit neighbor information but ignore global topological knowledge. Aiming to overcome these limitations simultaneously, in this paper, we propose a novel, flexible, and end-to-end framework, Graph Smoothing Splines Neural Networks (GSSNN), for graph classification. By exploiting smoothing splines, which are widely used to learn smooth fitting functions in regression, we develop an effective feature smoothing and enhancement module, Scaled Smoothing Splines (S3), to learn graph embeddings. To integrate global topological information, we design a novel scoring module, which exploits closeness, degree, as well as self-attention values, to select important node features as knots for the smoothing splines. These knots can be potentially used for interpreting classification results. In extensive experiments on biological and social datasets, we demonstrate that our model achieves state-of-the-art results and that GSSNN is superior in learning more robust graph representations. Furthermore, we show that the S3 module is easily plugged into existing GNNs to improve their performance.

IJCAI Conference 2020 Conference Paper

Overcoming Language Priors with Self-supervised Learning for Visual Question Answering

  • Xi Zhu
  • Zhendong Mao
  • Chunxiao Liu
  • Peng Zhang
  • Bin Wang
  • Yongdong Zhang

Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., what color is the banana?) based on the high-frequency answers (e.g., yellow) while ignoring image contents. Existing approaches tackle this problem by creating delicate models or introducing additional visual annotations to reduce question dependency and strengthen image dependency. However, they are still subject to the language prior problem since the data biases have not been fundamentally addressed. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, and then propose a self-supervised auxiliary task to utilize the balanced data to assist the VQA model in overcoming language priors. Our method can compensate for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method achieves state-of-the-art performance, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark VQA-CP v2. In other words, we can increase the performance of annotation-based methods by 16% without using external annotations. Our code is available on GitHub.

IJCAI Conference 2020 Conference Paper

Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

  • Cong Fei
  • Bin Wang
  • Yuzheng Zhuang
  • Zongzhang Zhang
  • Jianye Hao
  • Hongbo Zhang
  • Xuewu Ji
  • Wulong Liu

Generative adversarial imitation learning (GAIL) has shown promising results by taking advantage of generative adversarial nets, especially in the field of robot learning. However, the requirement of isolated single-modal demonstrations limits the scalability of the approach to real-world scenarios such as autonomous vehicles' demand for a proper understanding of human drivers' behavior. In this paper, we propose a novel multi-modal GAIL framework, named Triple-GAIL, that is able to learn skill selection and imitation jointly from both expert demonstrations and continuously generated experiences, for data augmentation purposes, by introducing an auxiliary selector. We provide theoretical guarantees on the convergence to optima for both the generator and the selector. Experiments on real driver trajectories and real-time strategy game datasets demonstrate that Triple-GAIL can better fit multi-modal behaviors close to the demonstrators and outperforms state-of-the-art methods.

IJCAI Conference 2019 Conference Paper

Beyond Word Attention: Using Segment Attention in Neural Relation Extraction

  • Bowen Yu
  • Zhenyu Zhang
  • Tingwen Liu
  • Bin Wang
  • Sujian Li
  • Quangang Li

Relation extraction studies the issue of predicting semantic relations between pairs of entities in sentences. Attention mechanisms are often used in this task to alleviate the inner-sentence noise by performing soft selections of words independently. Based on the observation that information pertinent to relations is usually contained within segments (continuous words in a sentence), it is possible to make use of this phenomenon for better extraction. In this paper, we aim to incorporate such segment information into a neural relation extractor. Our approach views the attention mechanism as linear-chain conditional random fields over a set of latent variables whose edges encode the desired structure, and regards the attention weight as the marginal distribution of each word being selected as a part of the relational expression. Experimental results show that our method can attend to continuous relational expressions without explicit annotations, and achieves state-of-the-art performance on the large-scale TACRED dataset.

IJCAI Conference 2019 Conference Paper

Boundary Perception Guidance: A Scribble-Supervised Semantic Segmentation Approach

  • Bin Wang
  • Guojun Qi
  • Sheng Tang
  • Tianzhu Zhang
  • Yunchao Wei
  • Linghui Li
  • Yongdong Zhang

Semantic segmentation suffers from the fact that densely annotated masks are expensive to obtain. To tackle this problem, we aim at learning to segment by only leveraging scribbles, which are much easier to collect for supervision. To fully explore the limited pixel-level annotations from scribbles, we present a novel Boundary Perception Guidance (BPG) approach, which consists of two basic components, i.e., prediction refinement and boundary regression. Specifically, the prediction refinement progressively makes a better segmentation by adopting an iterative upsampling and a semantic feature enhancement strategy. In the boundary regression, we employ class-agnostic edge maps for supervision to effectively guide the segmentation network in localizing the boundaries between different semantic regions, leading to finer-grained representations of feature maps for semantic segmentation. The experimental results on PASCAL VOC 2012 demonstrate that the proposed BPG achieves mIoU of 73.2% without a fully connected Conditional Random Field (CRF) and 76.0% with CRF, setting a new state of the art in the literature.

IJCAI Conference 2019 Conference Paper

Finding Justifications by Approximating Core for Large-scale Ontologies

  • Mengyu Gao
  • Yuxin Ye
  • Dantong Ouyang
  • Bin Wang

Finding justifications for an entailment is one of the major missions in the field of ontology research. Recent advances on finding justifications w.r.t. light-weight description logics focused on encoding this problem into a propositional formula, and using SAT-based techniques to enumerate all MUSes (minimally unsatisfiable subformulas). With the emergence of large-scale real-world ontologies, it is necessary to bring more optimized techniques to finding justifications. In this paper, we propose a new strategy which introduces a local search (LS) technique to compute an approximating core before extracting an exact MUS. Although it is based on a heuristic and LS, the technique is complete in the sense that it always delivers a MUS for any unsatisfiable SAT instance. Our method finds justifications for large-scale ontologies more effectively.

IJCAI Conference 2019 Conference Paper

Low Shot Box Correction for Weakly Supervised Object Detection

  • Tianxiang Pan
  • Bin Wang
  • Guiguang Ding
  • Jungong Han
  • Junhai Yong

Weakly supervised object detection (WSOD) has been widely studied, but the accuracy of state-of-the-art methods remains far lower than that of strongly supervised methods. One major reason for this huge gap is the incomplete box detection problem, which arises because most previous WSOD models are structured on classification networks and therefore tend to recognize the most discriminative parts instead of complete bounding boxes. To solve this problem, we define a low-shot weakly supervised object detection task and propose a novel low-shot box correction network to address it. The proposed task enables training object detectors on a large data set, all of which has image-level annotations, but only a small portion or few shots have box annotations. Given the low-shot box annotations, we use a novel box correction network to transfer the incomplete boxes into complete ones. Extensive empirical evidence shows that our proposed method yields state-of-the-art detection accuracy under various settings on the PASCAL VOC benchmark.

IJCAI Conference 2019 Conference Paper

MAT-Net: Medial Axis Transform Network for 3D Object Recognition

  • Jianwei Hu
  • Bin Wang
  • Lihui Qian
  • Yiling Pan
  • Xiaohu Guo
  • Lingjie Liu
  • Wenping Wang

3D deep learning performance depends on object representation and local feature extraction. In this work, we present MAT-Net, a neural network which captures local and global features from the Medial Axis Transform (MAT). Different from the K-Nearest-Neighbor method, which extracts local features from a fixed number of neighbors, our MAT-Net exploits effective modules, Group-MAT and Edge-Net, to process the topological structure. Experimental results illustrate that MAT-Net demonstrates competitive or better performance on 3D shape recognition than state-of-the-art methods, and show that the MAT representation has excellent capacity in 3D deep learning, even in the case of low resolution.

IJCAI Conference 2019 Conference Paper

Neural Collective Entity Linking Based on Recurrent Random Walk Network Learning

  • Mengge Xue
  • Weiming Cai
  • Jinsong Su
  • Linfeng Song
  • Yubin Ge
  • Yubao Liu
  • Bin Wang

Benefiting from the excellent ability of neural networks to learn semantic representations, existing studies for entity linking (EL) have resorted to neural networks to exploit both the local mention-to-entity compatibility and the global interdependence between different EL decisions for target entity disambiguation. However, most neural collective EL methods depend entirely upon neural networks to automatically model the semantic dependencies between different EL decisions, which lack guidance from external knowledge. In this paper, we propose a novel end-to-end neural network with recurrent random-walk layers for collective EL, which introduces external knowledge to model the semantic interdependence between different EL decisions. Specifically, we first establish a model based on local context features, and then stack random-walk layers to reinforce the evidence for related EL decisions into high-probability decisions, where the semantic interdependence between candidate entities is mainly induced from an external knowledge base. Finally, a semantic regularizer that preserves the consistency of collective EL decisions is incorporated into the conventional objective function, so that the external knowledge base can be fully exploited in collective EL decisions. Experimental results and in-depth analysis on various datasets show that our model achieves better performance than other state-of-the-art models. Our code and data are released at https://github.com/DeepLearnXMU/RRWEL.

AAAI Conference 2019 Conference Paper

The Kelly Growth Optimal Portfolio with Ensemble Learning

  • Weiwei Shen
  • Bin Wang
  • Jian Pu
  • Jun Wang

As a competitive alternative to the Markowitz mean-variance portfolio, the Kelly growth optimal portfolio has drawn sufficient attention in investment science. While the growth optimal portfolio is theoretically guaranteed to dominate any other portfolio with probability 1 in the long run, it practically tends to be highly risky in the short term. Moreover, empirical analysis and performance enhancement studies under practical settings are surprisingly scarce. In particular, how to handle the challenging but realistic condition of insufficient training data has barely been investigated. To fill these voids, especially grappling with the difficulty posed by small samples, in this paper we propose a growth optimal portfolio strategy equipped with ensemble learning. We synergistically leverage the bootstrap aggregating algorithm and the random subspace method in portfolio construction to mitigate estimation error. We analyze the behavior and hyperparameter selection of the proposed strategy by simulation, and then corroborate its effectiveness by comparing its out-of-sample performance with those of 10 competing strategies on four datasets. Experimental results lucidly confirm that the new strategy has superiority in extensive evaluation criteria.
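
A rough sketch of the ensemble idea: bootstrap aggregating over time periods plus a random subspace over assets, averaged into one portfolio. The Dirichlet random search below stands in for a proper growth-rate optimizer, and every constant is an invented toy value, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(42)

def growth_rate(weights, returns):
    """Average log-growth of a fixed-mix portfolio; rows of `returns`
    are gross returns (price relatives) per period."""
    return np.mean(np.log(returns @ weights))

def kelly_weights(returns, n_candidates=2000):
    """Crude growth-optimal search: sample long-only portfolios from a
    Dirichlet prior and keep the one with the highest log-growth."""
    n_assets = returns.shape[1]
    cands = rng.dirichlet(np.ones(n_assets), size=n_candidates)
    scores = np.mean(np.log(cands @ returns.T), axis=1)
    return cands[np.argmax(scores)]

def ensemble_kelly(returns, n_bags=20, subspace_frac=0.6):
    """Bagging + random subspace: fit Kelly weights on bootstrap samples
    over random asset subsets, then average the resulting portfolios."""
    T, n_assets = returns.shape
    k = max(2, int(subspace_frac * n_assets))
    agg = np.zeros(n_assets)
    for _ in range(n_bags):
        rows = rng.integers(0, T, size=T)                    # bootstrap periods
        cols = rng.choice(n_assets, size=k, replace=False)   # random subspace
        w_sub = kelly_weights(returns[np.ix_(rows, cols)])
        w = np.zeros(n_assets)
        w[cols] = w_sub
        agg += w
    agg /= n_bags
    return agg / agg.sum()

# Toy usage on simulated gross returns.
returns = np.clip(rng.normal(1.01, 0.04, size=(100, 6)), 0.5, None)
w_ens = ensemble_kelly(returns)
g = growth_rate(w_ens, returns)
```

Averaging over bags and subspaces is what dampens the estimation error that makes the plain Kelly portfolio so risky on small samples.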

IJCAI Conference 2018 Conference Paper

An Adaptive Hierarchical Compositional Model for Phrase Embedding

  • Bing Li
  • Xiaochun Yang
  • Bin Wang
  • Wei Wang
  • Wei Cui
  • Xianchao Zhang

Phrase embedding aims at representing phrases in a vector space and it is important for the performance of many NLP tasks. Existing models only regard a phrase as either full-compositional or non-compositional, while ignoring the hybrid-compositionality that widely exists, especially in long phrases. This drawback prevents them from having a deeper insight into the semantic structure for long phrases and as a consequence, weakens the accuracy of the embeddings. In this paper, we present a novel method for jointly learning compositionality and phrase embedding by adaptively weighting different compositions using an implicit hierarchical structure. Our model has the ability of adaptively adjusting among different compositions without entailing too much model complexity and time cost. To the best of our knowledge, our work is the first effort that considers hybrid-compositionality in phrase embedding. The experimental evaluation demonstrates that our model outperforms state-of-the-art methods in both similarity tasks and analogy tasks.

IJCAI Conference 2018 Conference Paper

Implicit Non-linear Similarity Scoring for Recognizing Unseen Classes

  • Yuchen Guo
  • Guiguang Ding
  • Jungong Han
  • Sicheng Zhao
  • Bin Wang

Recognizing unseen classes is an important task for real-world applications, for two reasons: 1) it is common that some classes in reality have no labeled image exemplars for training; and 2) novel classes emerge rapidly. Recently, to address this task, many zero-shot learning (ZSL) approaches have been proposed where explicit linear scores, like the inner product score, are employed to measure the similarity between a class and an image. We argue that explicit linear scoring (ELS) seems too weak to capture complicated image-class correspondence. We propose a simple yet effective framework, called Implicit Non-linear Similarity Scoring (ICINESS). In particular, we train a scoring network which uses image and class features as input, fuses them by hidden layers, and outputs the similarity. Based on the universal approximation theorem, it can approximate the true similarity function between images and classes if a proper structure is used, in an implicit non-linear way, which is more flexible and powerful. With the ICINESS framework, we implement ZSL algorithms with shallow and deep networks, which yield consistently superior results.

AAAI Conference 2018 Conference Paper

Knowledge Graph Embedding With Iterative Guidance From Soft Rules

  • Shu Guo
  • Quan Wang
  • Lihong Wang
  • Bin Wang
  • Li Guo

Embedding knowledge graphs (KGs) into continuous vector spaces is a focus of current research. Combining such an embedding model with logic rules has recently attracted increasing attention. Most previous attempts made a one-time injection of logic rules, ignoring the interactive nature between embedding learning and logical inference. And they focused only on hard rules, which always hold with no exception and usually require extensive manual effort to create or validate. In this paper, we propose Rule-Guided Embedding (RUGE), a novel paradigm of KG embedding with iterative guidance from soft rules. RUGE enables an embedding model to learn simultaneously from 1) labeled triples that have been directly observed in a given KG, 2) unlabeled triples whose labels are going to be predicted iteratively, and 3) soft rules with various confidence levels extracted automatically from the KG. In the learning process, RUGE iteratively queries rules to obtain soft labels for unlabeled triples, and integrates such newly labeled triples to update the embedding model. Through this iterative procedure, knowledge embodied in logic rules may be better transferred into the learned embeddings. We evaluate RUGE in link prediction on Freebase and YAGO. Experimental results show that: 1) with rule knowledge injected iteratively, RUGE achieves significant and consistent improvements over state-of-the-art baselines; and 2) despite their uncertainties, automatically extracted soft rules are highly beneficial to KG embedding, even those with moderate confidence levels. The code and data used for this paper can be obtained from https://github.com/iieir-km/RUGE.

AAAI Conference 2018 Conference Paper

Splitting an LPMLN Program

  • Bin Wang
  • Zhizheng Zhang
  • Hongxiang Xu
  • Jun Shen

The technique called splitting sets has been proven useful in simplifying the investigation of Answer Set Programming (ASP). In this paper, we investigate the splitting set theorem for LPMLN, a new extension of ASP created by combining the ideas of ASP and Markov Logic Networks (MLN). Firstly, we extend the notion of splitting sets to LPMLN programs and present the splitting set theorem for LPMLN. Then, the use of the theorem for simplifying several LPMLN inference tasks is illustrated. After that, we give two parallel approaches for solving LPMLN programs using the theorem. The preliminary experimental results show that these approaches are alternative ways to improve an LPMLN solver.

IJCAI Conference 2018 Conference Paper

Where to Prune: Using LSTM to Guide End-to-end Pruning

  • Jing Zhong
  • Guiguang Ding
  • Yuchen Guo
  • Jungong Han
  • Bin Wang

Recent years have witnessed the great success of convolutional neural networks (CNNs) in many related fields. However, their huge model size and computation complexity bring difficulties when deploying CNNs in some scenarios, like embedded systems with low computation power. To address this issue, many works have been proposed to prune filters in CNNs to reduce computation. However, they mainly focus on seeking which filters are unimportant in a layer and then prune filters layer by layer or globally. In this paper, we argue that the pruning order is also very significant for model pruning. We propose a novel approach to figure out which layers should be pruned in each step. First, we utilize a long short-term memory (LSTM) to learn the hierarchical characteristics of a network and generate a pruning decision for each layer, which is the main difference from previous works. Next, a channel-based method is adopted to evaluate the importance of filters in a to-be-pruned layer, followed by an accelerated recovery step. Experimental results demonstrate that our approach is capable of reducing 70.1% of FLOPs for VGG and 47.5% for ResNet-56 with comparable accuracy. Also, the learning results seem to reveal the sensitivity of each network layer.

AAAI Conference 2017 Conference Paper

Efficiently Mining High Quality Phrases from Texts

  • Bing Li
  • Xiaochun Yang
  • Bin Wang
  • Wei Cui

Phrase mining is a key research problem for semantic analysis and text-based information retrieval. The existing approaches based on NLP, frequency, and statistics cannot extract high quality phrases, and their processing is also time consuming, making them unsuitable for dynamic on-line applications. In this paper, we propose an efficient high-quality phrase mining approach (EQPM). To the best of our knowledge, our work is the first effort that considers both intra-cohesion and inter-isolation in mining phrases, which is able to guarantee appropriateness. We also propose a strategy to eliminate order sensitiveness and ensure the completeness of phrases. We further design efficient algorithms to make the proposed model and strategy feasible. The empirical evaluations on four real data sets demonstrate that our approach achieved a considerable quality improvement and that the processing time was 2.3× to 29× faster than the state-of-the-art works.

AAAI Conference 2017 Conference Paper

Fully Convolutional Neural Networks with Full-Scale-Features for Semantic Segmentation

  • Tianxiang Pan
  • Bin Wang
  • Guiguang Ding
  • Jun-Hai Yong

In this work, we propose a novel method to involve full-scale features in fully convolutional neural networks (FCNs) for semantic segmentation. Current work on FCNs has brought great advances in the task of semantic segmentation, but the receptive field, which represents the region of the input volume connected to any output neuron, limits the information available for an output neuron's prediction. We investigate how to involve full-scale or full-image features in FCNs to enrich the receptive field. Specifically, the full-scale feature network (FFN) extends the fully-connected network and makes an end-to-end unified training structure. It has two appealing properties. First, the introduction of full-scale features is beneficial for prediction. We build a unified extracting network and explore several fusion functions for concatenating features. Extensive experiments have been carried out, showing that full-scale features yield a fair accuracy gain. Second, FFN is applicable to many variants of FCN and could be regarded as a general strategy to improve segmentation accuracy. Our proposed method is evaluated on PASCAL VOC 2012, and achieves a state-of-the-art result.

IJCAI Conference 2015 Conference Paper

Knowledge Base Completion Using Embeddings and Rules

  • Quan Wang
  • Bin Wang
  • Li Guo

Knowledge bases (KBs) are often greatly incomplete, necessitating a demand for KB completion. A promising approach is to embed KBs into latent spaces and make inferences by learning and operating on latent representations. Such embedding models, however, do not make use of any rules during inference and hence have limited accuracy. This paper proposes a novel approach which incorporates rules seamlessly into embedding models for KB completion. It formulates inference as an integer linear programming (ILP) problem, with the objective function generated from embedding models and the constraints translated from rules. Solving the ILP problem results in a number of facts which 1) are the most preferred by the embedding models, and 2) comply with all the rules. By incorporating rules, our approach can greatly reduce the solution space and significantly improve the inference accuracy of embedding models. We further provide a slacking technique to handle noise in KBs, by explicitly modeling the noise with slack variables. Experimental results on two publicly available data sets show that our approach significantly and consistently outperforms state-of-the-art embedding models in KB completion. Moreover, the slacking technique is effective in identifying erroneous facts and ambiguous entities, with a precision higher than 90%.
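
On a toy instance, the ILP formulation can be mimicked by brute force: maximize the total embedding score over fact subsets that satisfy every rule. All facts, scores, and rules below are invented for illustration; a real system would use an ILP solver rather than enumeration, and would add slack variables to tolerate noise:

```python
from itertools import chain, combinations

# Toy candidate facts with (invented) embedding-model scores.
facts = {
    ("alice", "born_in", "paris"):  0.9,
    ("alice", "born_in", "rome"):   0.6,
    ("alice", "nationality", "fr"): 0.3,
    ("alice", "nationality", "it"): 0.2,
}

def violates(selected):
    """Rule constraints, playing the role of the ILP's linear constraints."""
    # Functionality rule: at most one birthplace / nationality per person.
    for rel in ("born_in", "nationality"):
        if sum(1 for (_, r, _) in selected if r == rel) > 1:
            return True
    # Implication rule: born_in(alice, paris) -> nationality(alice, fr).
    if ("alice", "born_in", "paris") in selected and \
       ("alice", "nationality", "fr") not in selected:
        return True
    return False

def best_completion(facts):
    """Brute-force stand-in for the ILP: pick the feasible fact subset
    with the maximum total embedding score."""
    items = list(facts)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    feasible = (set(s) for s in subsets if not violates(set(s)))
    return max(feasible, key=lambda s: sum(facts[f] for f in s))

solution = best_completion(facts)
```

The rules prune the solution space exactly as described: the high-scoring `born_in paris` fact drags the implied nationality fact into the solution, while functionality blocks the contradictory alternatives.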

AAAI Conference 2014 Conference Paper

Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks

  • Yuyu Zhang
  • Hanjun Dai
  • Chang Xu
  • Jun Feng
  • Taifeng Wang
  • Jiang Bian
  • Bin Wang
  • Tie-Yan Liu

Click prediction is one of the fundamental problems in sponsored search. Most existing studies took advantage of machine learning approaches to predict ad clicks for each ad-view event independently. However, as observed in real-world sponsored search systems, a user's behaviors on ads depend strongly on how the user behaved in the past, especially in terms of what queries she submitted, what ads she clicked or ignored, and how long she spent on the landing pages of clicked ads, etc. Inspired by these observations, we introduce a novel framework based on Recurrent Neural Networks (RNN). Compared to traditional methods, this framework directly models the dependency on the user's sequential behaviors in the click prediction process through the recurrent structure in RNN. Large scale evaluations on the click-through logs from a commercial search engine demonstrate that our approach can significantly improve the click prediction accuracy, compared to sequence-independent approaches.
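
The sequential dependency the framework exploits can be sketched with a vanilla RNN over a user's event history. The feature dimensions and weights below are random toy values for illustration, not the paper's trained model (which is learned from click-through logs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: each step encodes one past event (query, ad, dwell-time
# features); all weights are randomly initialized for illustration.
D_IN, D_H = 6, 8
Wx = rng.normal(scale=0.3, size=(D_H, D_IN))
Wh = rng.normal(scale=0.3, size=(D_H, D_H))
b = np.zeros(D_H)
w_out = rng.normal(scale=0.3, size=D_H)

def click_probability(events):
    """Run a vanilla RNN over the user's event sequence and map the final
    hidden state to a click probability for the current ad view."""
    h = np.zeros(D_H)
    for x in events:                  # sequential dependency on history
        h = np.tanh(Wx @ x + Wh @ h + b)
    logit = w_out @ h
    return 1.0 / (1.0 + np.exp(-logit))

history = [rng.normal(size=D_IN) for _ in range(5)]
p = click_probability(history)
```

A sequence-independent baseline would score each ad view from its own features alone; here the hidden state carries the accumulated history forward, which is the point of the recurrent structure.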