Arrow Research search

Author name cluster

Shijie Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

15

EAAI Journal 2026 Journal Article

CRPointFeatureNet: A cross-resolution point cloud feature network for part machining feature classification

  • Xiaohua Zhang
  • Yuchao Zhou
  • Jingze Wang
  • Shijie Wang
  • Zongxin Lian

Machining features inherently possess multi-scale characteristics, and accurately classifying them is critical for downstream tasks such as mechanical design, manufacturing, and reverse engineering. Point cloud deep learning networks, leveraging their computational efficiency and resource-sensitive architecture design, demonstrate significant technical potential in feature recognition. However, the complexity and computational overhead of current mainstream point cloud deep learning models hinder their deployment on edge devices. Downsampling point cloud data to match the training data resolution not only increases computational costs but can also degrade performance in cross-resolution tasks. This study proposes a lightweight Cross-Resolution Point Cloud Feature Network (CRPointFeatureNet), which effectively extracts and integrates global and local features through three key modules: Point-wise Feature Enhancement, Scale-Aware Spatial Feature, and Attentive Fusion. The experiments were conducted on three datasets of varying resolutions, constructed from the FeatureNetDataset; each includes the same 24 classes of machining features. The results show that the network achieves performance comparable to other point cloud models on the validation set with only about 14% of PointNet's parameters and minimal training time, and it exhibits the best performance in cross-resolution classification testing, making it well suited to multi-scale machining feature classification. Furthermore, our model achieved an accuracy of 93.12% when evaluated on the ModelNet10 dataset.

IJCAI Conference 2025 Conference Paper

An Association-based Fusion Method for Speech Enhancement

  • Shijie Wang
  • Qian Guo
  • Lu Chen
  • Liang Du
  • Zikun Jin
  • Zhian Yuan
  • Xinyan Liang

Deep learning-based speech enhancement (SE) methods predominantly draw upon two architectural frameworks: generative adversarial networks and diffusion models. In the realm of SE, capturing the local and global relations between signal frames is crucial to the success of these methods. These frameworks typically employ a UNet architecture as their foundational backbone, integrating Long Short-Term Memory (LSTM) networks or attention mechanisms within the UNet to model both local and global signal relations. However, this coupled way of modeling relations may not fully harness the potential of these relations. In this paper, we propose an innovative Association-based Fusion Speech Enhancement method (AFSE), a decoupled method. AFSE first constructs a graph that encapsulates the association between each time window of the speech signal, and then models the global relations between frames by fusing the features of these time windows in a manner akin to graph neural networks. Furthermore, AFSE leverages a UNet with dilated convolutions to model the local relations, enabling the network to maintain a high-resolution representation while benefiting from a wider receptive field. Experimental results demonstrate that AFSE significantly improves performance on speech enhancement tasks, validating the effectiveness and superiority of our approach. The code is available at https://github.com/jie019/AFSE_IJCAI2025.
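The wider receptive field that dilated convolutions provide can be quantified with the standard formula for a stack of stride-1 layers; the layer configuration below is purely illustrative, not AFSE's actual architecture:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions:
    rf = 1 + sum((k - 1) * d) over all layers."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Three 3-tap layers with dilations 1, 2, 4 see a 15-sample window,
# versus only 7 samples for the same stack without dilation.
wide = receptive_field([3, 3, 3], [1, 2, 4])    # 15
narrow = receptive_field([3, 3, 3], [1, 1, 1])  # 7
```

Doubling the dilation per layer grows the window exponentially with depth while the parameter count stays fixed, which is what lets a compact UNet keep high resolution and still see long-range context.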

TIST Journal 2025 Journal Article

Graph Machine Learning in the Era of Large Language Models (LLMs)

  • Shijie Wang
  • Jiani Huang
  • Zhikai Chen
  • Yu Song
  • Wenzhuo Tang
  • Haitao Mao
  • Wenqi Fan
  • Hui Liu

Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecular discovery. With the advent of deep learning, Graph Neural Networks (GNNs) have emerged as a cornerstone in Graph Machine Learning (Graph ML), facilitating the representation and processing of graphs. Recently, LLMs have demonstrated unprecedented capabilities in language tasks and are widely adopted in a variety of applications, such as computer vision and recommender systems. This remarkable success has also attracted interest in applying LLMs to the graph domain. Increasing efforts have been made to explore the potential of LLMs in advancing Graph ML's generalization, transferability, and few-shot learning ability. Meanwhile, graphs, especially knowledge graphs, are rich in reliable factual knowledge, which can be utilized to enhance the reasoning capabilities of LLMs and potentially alleviate their limitations, such as hallucinations and the lack of explainability. Given the rapid progress of this research direction, a systematic review summarizing the latest advancements for Graph ML in the era of LLMs is necessary to provide an in-depth understanding to researchers and practitioners. Therefore, in this survey, we first review the recent developments in Graph ML. We then explore how LLMs can be utilized to enhance the quality of graph features, alleviate the reliance on labeled data, and address challenges such as graph heterophily and Out-of-Distribution (OOD) generalization. Afterward, we delve into how graphs can enhance LLMs, highlighting their abilities to enhance LLM pre-training and inference. Furthermore, we investigate various applications and discuss the potential future directions in this promising field.

IJCAI Conference 2025 Conference Paper

Tree-of-AdEditor: Heuristic Tree Reasoning for Automated Video Advertisement Editing with Large Language Model

  • Yuqi Zhang
  • Bin Guo
  • Nuo Li
  • Ying Zhang
  • Shijie Wang
  • Zhiwen Yu
  • Qing Li

Video advertising has become a popular marketing strategy on e-commerce platforms, requiring high-level semantic reasoning such as selling-point discovery and narrative organization. Previous rule-based methods struggle with these complex tasks, and learning-based approaches demand large datasets and high training costs. Recently, Large Language Models have opened incredible opportunities for advancing intelligent video advertisement editing. However, Input-Output (IO) prompting and Chain-of-Thought (CoT) prompting struggle to adapt to the nonlinear thinking hierarchy of video editing, where editors iteratively select shots or revert them to explore potential editing solutions. While Tree-of-Thought (ToT) offers a conceptual structure that mirrors this hierarchy, it falls short in aligning with effective video advertising strategies and lacks robust fact-checking mechanisms. To address these limitations, we propose a novel framework, Tree-of-AdEditor (ToAE), which constructs a reasoning tree to mimic human editors, and incorporates domain-specific theories and heuristic fact-checking to identify optimal editing solutions. Specifically, motivated by effective advertisement principles, we develop a "local-global" mechanism to guide the LLM in both shot-level and sequence-level decision-making. We introduce a visual incoherence pruning module to provide external heuristic fact-checking, ensuring visual attractiveness and reducing computation costs. Quantitative experiments and expert evaluation demonstrate the superiority of our method compared to baselines.

ICML Conference 2025 Conference Paper

Trusted Multi-View Classification with Expert Knowledge Constraints

  • Xinyan Liang
  • Shijie Wang
  • Yuhua Qian
  • Qian Guo 0005
  • Liang Du 0003
  • Bingbing Jiang 0001
  • Tingjin Luo
  • Feijiang Li

Multi-view classification (MVC) based on the Dempster-Shafer theory has gained significant recognition for its reliability in safety-critical applications. However, existing methods predominantly focus on providing confidence levels for decision outcomes without explaining the reasoning behind these decisions. Moreover, the reliance on first-order statistical magnitudes of belief masses often fails to adequately capture the intrinsic uncertainty within the evidence. To address these limitations, we propose a novel framework termed Trusted Multi-view Classification Constrained with Expert Knowledge (TMCEK). TMCEK integrates expert knowledge to enhance feature-level interpretability and introduces a distribution-aware subjective opinion mechanism to derive more reliable and realistic confidence estimates. The theoretical superiority of the proposed uncertainty measure over conventional approaches is rigorously established. Extensive experiments conducted on three multi-view datasets for sleep stage classification demonstrate that TMCEK achieves state-of-the-art performance while offering interpretability at both the feature and decision levels. These results position TMCEK as a robust and interpretable solution for MVC in safety-critical domains. The code is available at https://github.com/jie019/TMCEK_ICML2025.
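The Dempster-Shafer evidence fusion this line of work builds on can be sketched with Dempster's classical rule of combination; the class labels and masses below are illustrative and do not reproduce TMCEK's distribution-aware opinion mechanism:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule: combine two mass functions (dict: frozenset -> mass),
    discarding conflicting mass and renormalizing the rest."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # the two focal elements disagree
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources are incompatible")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Two views disagree over hypothetical sleep stages 'wake' vs 'rem'.
WAKE, REM = frozenset({"wake"}), frozenset({"rem"})
THETA = WAKE | REM  # full frame of discernment (total ignorance)
fused = dempster_combine({WAKE: 0.6, THETA: 0.4}, {REM: 0.7, THETA: 0.3})
```

The mass each view places on the whole frame THETA acts as explicit uncertainty, which is exactly the quantity that first-order belief magnitudes alone summarize too coarsely.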

AAAI Conference 2025 Conference Paper

Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

  • Hao Ma
  • Shijie Wang
  • Zhiqiang Pu
  • Siyao Zhao
  • Xiaolin Ai

Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. Existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. Moreover, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
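The "proven to preserve the optimal policy" property rests on classical potential-based reward shaping (Ng, Harada & Russell, 1999), where the bonus F(s, s') = γΦ(s') − Φ(s) telescopes along any trajectory; the sketch below replaces the VLM-derived potential with a hypothetical lookup table:

```python
GAMMA = 0.99

def shaping_bonus(phi, s, s_next, gamma=GAMMA):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)

# Hypothetical potential: how 'aligned' each state looks to some scorer.
phi = {0: 0.0, 1: 0.5, 2: 1.0}.get

# Discounted shaping terms telescope to gamma^T * phi(s_T) - phi(s_0),
# a quantity fixed by the endpoints alone, so the bonus shifts returns
# uniformly across policies and cannot change which policy is optimal.
traj = [0, 1, 2]
total = sum(GAMMA ** t * shaping_bonus(phi, traj[t], traj[t + 1])
            for t in range(len(traj) - 1))
```

Because only Φ enters the guarantee, the potential can be swapped freely at the top layer (as the skill selection module does) without endangering optimality.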

EAAI Journal 2024 Journal Article

A novel bi-stream network for image dehazing

  • Qiaoyu Ma
  • Shijie Wang
  • Guowei Yang
  • Chenglizhao Chen
  • Teng Yu

The existing learning-based image dehazing methods usually adopt an encoder–decoder architecture with convolutional neural networks to estimate latent haze-free images from hazy images. However, the limited receptive field of convolutional neural networks may not effectively capture structure-level information, leaving the model unable to capture the haze density. To solve this problem, this paper proposes a bi-decoder structure with a dense non-pooling encoder to enhance the structural features that are closely related to the haze density. Compared with conventional methods, the main advantage of our method is the integration of an additional coarse decoder in the encoder–decoder architecture, where a hybrid feature convolution (HFC) block is utilized to enlarge the receptive field and extract the structure of the image. Beyond this architectural difference, the inputs of the fine and coarse decoders are non-pooling and pooling features, respectively. Moreover, a multi-scale feature attention (MSFA) module is proposed to selectively enhance the haze-relevant feature outputs of the fine and coarse decoders. Experimental results on synthetic and real-world datasets demonstrate that the proposed method outperforms most state-of-the-art methods in terms of image quality and quantitative metrics. In particular, on the NH-HAZE dataset its PSNR exceeds other methods by more than 2.13 dB. Finally, this paper applies the dehazing technology to object detection. The code and data are available online at https://github.com/Qiaoyu-K/Bi-Decoder-Dehazing.

ICLR Conference 2024 Conference Paper

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

  • Qi Zhao
  • Shijie Wang
  • Ce Zhang 0010
  • Changcheng Fu
  • Minh Quan Do
  • Nakul Agarwal
  • Kwonjoon Lee
  • Chen Sun 0002

Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing what commonly happens after the current action (e.g. crack eggs)? What if the actor also shares the goal (e.g. make fried rice) with us? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the procedure needed to accomplish it. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge about possible next actions, and infer the goal given the observed part of a procedure, respectively. We propose AntGPT, which represents video observations as sequences of human actions, and uses this action representation for an LLM to infer goals and model temporal dynamics. AntGPT achieves state-of-the-art performance on Ego4D LTA v1 and v2, EPIC-Kitchens-55, and EGTEA GAZE+, thanks to LLMs' goal inference and temporal dynamics modeling capabilities. We further demonstrate that these capabilities can be effectively distilled into a compact neural network only 1.3% of the original LLM's size. Code and model will be released upon acceptance.

AAAI Conference 2023 Conference Paper

Fine-Grained Retrieval Prompt Tuning

  • Shijie Wang
  • Jianlong Chang
  • Zhihui Wang
  • Haojie Li
  • Wanli Ouyang
  • Qi Tian

Fine-grained object retrieval aims to learn discriminative representations to retrieve visually similar objects. However, existing top-performing works usually impose pairwise similarities on the semantic embedding spaces or design a localization sub-network to continually fine-tune the entire model in limited-data scenarios, thus converging to suboptimal solutions. In this paper, we develop Fine-grained Retrieval Prompt Tuning (FRPT), which steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompting and feature adaptation. Specifically, FRPT only needs to learn fewer parameters in the prompt and adaptation instead of fine-tuning the entire model, thus avoiding the convergence to suboptimal solutions caused by full fine-tuning. Technically, a discriminative perturbation prompt (DPP) is introduced and treated as a sample prompting process, which amplifies and even exaggerates some discriminative elements contributing to category prediction via a content-aware inhomogeneous sampling operation. In this way, DPP can bring the fine-grained retrieval task, aided by the perturbation prompts, close to the task solved during the original pre-training, thereby preserving the generalization and discrimination of representations extracted from input samples. Besides, a category-specific awareness head is proposed and regarded as feature adaptation, which removes the species discrepancies in features extracted by the pre-trained model using category-guided instance normalization. Thus, the optimized features include only the discrepancies among subcategories. Extensive experiments demonstrate that our FRPT with fewer learnable parameters achieves state-of-the-art performance on three widely used fine-grained datasets.

NeurIPS Conference 2023 Conference Paper

Goal-Conditioned Predictive Coding for Offline Reinforcement Learning

  • Zilai Zeng
  • Ce Zhang
  • Shijie Wang
  • Chen Sun

Recent work has demonstrated the effectiveness of formulating decision making as supervised learning on offline-collected trajectories. Powerful sequence models, such as GPT or BERT, are often employed to encode the trajectories. However, the benefits of performing sequence modeling on trajectory data remain unclear. In this work, we investigate whether sequence modeling has the ability to condense trajectories into useful representations that enhance policy learning. We adopt a two-stage framework that first leverages sequence models to encode trajectory-level representations, and then learns a goal-conditioned policy employing the encoded representations as its input. This formulation allows us to consider many existing supervised offline RL methods as specific instances of our framework. Within this framework, we introduce Goal-Conditioned Predictive Coding (GCPC), a sequence modeling objective that yields powerful trajectory representations and leads to performant policies. Through extensive empirical evaluations on AntMaze, FrankaKitchen and Locomotion environments, we observe that sequence modeling can have a significant impact on challenging decision making tasks. Furthermore, we demonstrate that GCPC learns a goal-conditioned latent representation encoding the future trajectory, which enables competitive performance on all three benchmarks.

NeurIPS Conference 2023 Conference Paper

Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval

  • Shijie Wang
  • Jianlong Chang
  • Haojie Li
  • Zhihui Wang
  • Wanli Ouyang
  • Qi Tian

Open-set fine-grained retrieval is an emerging and challenging task that allows retrieving unknown categories beyond the training set. The best solution for handling unknown categories is to represent them using a set of visual attributes learnt from known categories, as widely used in zero-shot learning. Though important, attribute modeling usually requires significant manual annotation and thus is labor-intensive. It is therefore worth investigating how to transform retrieval models trained with image-level supervision from category semantic extraction to attribute modeling. To this end, we propose a novel Visual Attribute Parameterization Network (VAPNet) to learn visual attributes from known categories and parameterize them into the retrieval model, without the involvement of any attribute annotations. In this way, VAPNet can use its parameters to parse a set of visual attributes from unknown categories and represent them precisely. Technically, VAPNet explicitly attains semantics with rich details by exploiting local image patches and distills the visual attributes from these discovered semantics. Additionally, it integrates the online refinement of these visual attributes into the training process to iteratively enhance their quality. Simultaneously, VAPNet treats these attributes as supervisory signals to tune the retrieval models, thereby achieving attribute parameterization. Extensive experiments on open-set fine-grained retrieval datasets validate the superior performance of our VAPNet over existing solutions.

AAAI Conference 2022 Conference Paper

Category-Specific Nuance Exploration Network for Fine-Grained Object Retrieval

  • Shijie Wang
  • Zhihui Wang
  • Haojie Li
  • Wanli Ouyang

Employing additional prior knowledge to model local features as a final fine-grained object representation has become a trend in fine-grained object retrieval (FGOR). A potential limitation of these methods is that they only focus on common parts across the dataset (e.g., head, body, or leg) by introducing additional prior knowledge, while the retrieval of a fine-grained object may rely on category-specific nuances that contribute to category prediction. To handle this limitation, we propose an end-to-end Category-specific Nuance Exploration Network (CNENet) that elaborately discovers category-specific nuances that contribute to category prediction, and semantically aligns these nuances grouped by subcategory without any additional prior knowledge, to directly emphasize the discrepancy among subcategories. Specifically, we design a Nuance Modelling Module that adaptively predicts a group of category-specific response (CARE) maps by implicitly digging into category-specific nuances, specifying the locations and scales of category-specific nuances. Upon this, two nuance regularizations are proposed: 1) a semantic discrete loss that forces each CARE map to attend to different spatial regions to capture diverse nuances; 2) a semantic alignment loss that constructs a consistent semantic correspondence for each CARE map of the same order with the same subcategory by guaranteeing that each instance and its transformed counterpart are spatially aligned. Moreover, we propose a Nuance Expansion Module, which exploits the contextual appearance information of discovered nuances and refines the prediction of the current nuance using its similar neighbors, further improving nuance consistency and completeness. Extensive experiments validate that our CNENet consistently yields the best performance under the same settings against the most competitive approaches on the CUB Birds, Stanford Cars, and FGVC Aircraft datasets.

AAAI Conference 2021 Conference Paper

Dynamic Position-aware Network for Fine-grained Image Recognition

  • Shijie Wang
  • Haojie Li
  • Zhihui Wang
  • Wanli Ouyang

Most weakly supervised fine-grained image recognition (WFGIR) approaches predominantly focus on learning discriminative details, which contain both visual variances and position clues. The position clues can be indirectly learnt by utilizing the context information of discriminative visual content. However, this causes the selected discriminative regions to contain some non-discriminative information introduced by the position clues. This analysis motivates us to introduce position clues directly into the visual content so as to focus only on the visual variances, achieving more precise discriminative region localization. Though important, position modelling usually requires significant pixel/region annotations and is therefore labor-intensive. To address this issue, we propose an end-to-end Dynamic Position-aware Network (DP-Net) to directly incorporate the position clues into visual content and dynamically align them without extra annotations, which eliminates the effect of position information on discriminative variances among subcategories. In particular, DP-Net consists of: 1) a Position Encoding Module, which learns a set of position-aware parts by directly adding learnable position information into the horizontal/vertical visual content of images; 2) a Position-vision Aligning Module, which dynamically aligns both visual content and learnable position information by performing graph convolution on the position-aware parts; 3) a Position-vision Reorganization Module, which projects the aligned position clues and visual content into Euclidean space to construct position-aware feature maps. Finally, the position-aware feature maps, which implicitly incorporate the aligned visual content and position clues, are used for more accurate discriminative region localization. Extensive experiments verify that DP-Net yields the best performance under the same settings as the most competitive approaches on the CUB Bird, Stanford-Cars, and FGVC Aircraft datasets.

AAAI Conference 2020 Conference Paper

Graph-Propagation Based Correlation Learning for Weakly Supervised Fine-Grained Image Classification

  • Zhuhui Wang
  • Shijie Wang
  • Haojie Li
  • Zhi Dou
  • Jianjun Li

The key to Weakly Supervised Fine-grained Image Classification (WFGIC) is how to pick out the discriminative regions and learn discriminative features from them. However, most recent WFGIC methods pick out the discriminative regions independently and utilize their features directly, neglecting the fact that regions' features are mutually semantically correlated and that region groups can be more discriminative. To address these issues, we propose an end-to-end Graph-propagation based Correlation Learning (GCL) model to fully mine and exploit the discriminative potential of region correlations for WFGIC. Specifically, in the discriminative region localization phase, a Criss-cross Graph Propagation (CGP) sub-network is proposed to learn region correlations, which establishes correlations between regions and then enhances each region by weighted aggregation of other regions in a criss-cross way. In this way, each region's representation encodes the global image-level context and the local spatial context simultaneously, guiding the network to implicitly discover more powerful discriminative region groups for WFGIC. In the discriminative feature representation phase, a Correlation Feature Strengthening (CFS) sub-network is proposed to explore the internal semantic correlation among discriminative patches' feature vectors, improving their discriminative power by iteratively enhancing informative elements while suppressing useless ones. Extensive experiments demonstrate the effectiveness of the proposed CGP and CFS sub-networks, and show that the GCL model achieves better performance in both accuracy and efficiency.

IROS Conference 2006 Conference Paper

Research on the Hierarchical Supervisory Control of Underwater Glider

  • Yu Zhang
  • Jiaping Tian
  • Donghai Su
  • Shijie Wang

An underwater glider is a buoyancy-propelled, fixed-wing vehicle whose attitude is controlled entirely by means of internal mass redistribution. In order to independently accomplish complex missions in unstructured and unknown oceanic environments, an intelligent control system is needed to provide the underwater glider with active autonomy. Based on the RW (Ramadge & Wonham) supervisory control theory of discrete event dynamic systems (DEDS), a three-level hierarchical supervisory control architecture for the underwater glider is presented. The DEDS formalism models of the underwater glider, in terms of finite state automata (FSA), are built, and the realization of the hierarchical supervisory control (HSC) system is described in detail. The simulation results show that the three-level supervisory control system can adapt to uncertain undersea environments and make reasonable plans.
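The paper's three-level HSC design is not reproduced here, but the core Ramadge-Wonham idea — a supervisor that disables controllable events which would drive the plant into a forbidden state, while uncontrollable events (e.g. faults) stay enabled — can be sketched; all state and event names below are hypothetical:

```python
# Plant automaton as a transition map: (state, event) -> next state.
PLANT = {
    ("glide", "ascend"): "ascending",
    ("glide", "dive"): "diving",
    ("glide", "fault"): "emergency",
    ("diving", "level"): "glide",
}
CONTROLLABLE = {"ascend", "dive", "level"}  # 'fault' is uncontrollable

def enabled_events(state, plant, controllable, forbidden):
    """RW supervision: enable every event defined at `state` except
    controllable ones whose target state is forbidden."""
    out = []
    for (s, e), target in plant.items():
        if s != state:
            continue
        if e in controllable and target in forbidden:
            continue  # the supervisor disables this event
        out.append(e)
    return sorted(out)

# With 'ascending' forbidden, the supervisor blocks 'ascend' but
# cannot block the uncontrollable 'fault' event.
allowed = enabled_events("glide", PLANT, CONTROLLABLE, {"ascending"})
```

In the hierarchical setting, each level would run such a supervisor over a coarser abstraction of the level below; this sketch shows only a single level.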