Arrow Research

Author name cluster

Zhen Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

42 papers
2 author rows

Possible papers (42)

JBHI Journal 2026 Journal Article

A Multi-Step Prediction Method Based on Small Sample Data Augmentation to Assess Wheat Flour Safety Risk

  • Wanbao Sheng
  • Huawei Jiang
  • Wenqiang Pi
  • Zhen Yang
  • Like Zhao

Food safety significantly impacts human health, so establishing an effective prediction method to assess food safety risk is crucial for food safety control. At present, when detection data are insufficient, it is necessary to employ data augmentation methods to generate large-scale detection data and to accurately capture long-term variation patterns through safety risk prediction models. However, existing data augmentation methods and safety prediction models face challenges such as vanishing gradients and difficulty in capturing long-term dependencies. Therefore, this paper proposes a Small sample Data Augmentation Multi-step Prediction Method (SDAMPM) to assess wheat flour safety risks. Firstly, we improved time-series generative adversarial networks based on external temporal convolution and the Wasserstein distance to expand wheat flour hazard-factor detection data. Secondly, we employed the expanded data to establish a dietary exposure evaluation system for wheat flour, serving as the desired output for multi-step prediction models. Finally, we constructed a stable Informer (Stainformer) multi-step prediction model by designing symmetric ProbSparse self-attention and a distilling layer based on dilated causal convolution. Experiments on the wheat flour dietary exposure evaluation system demonstrate that, compared to other methods, the distribution of the expanded data is closer to that of the original data. This approach effectively predicts long-term safety risks associated with wheat flour consumption and can provide assistance and technical support for decision-making by relevant departments, thereby reducing the occurrence of food safety incidents.
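
The Stainformer described above pairs symmetric ProbSparse self-attention with a distilling layer built on dilated causal convolution. As a rough illustration of the latter ingredient only, here is a minimal sketch of an Informer-style distilling layer using a dilated causal convolution; the class name and hyperparameters are our assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class CausalDistilLayer(nn.Module):
    """Hypothetical sketch of a distilling layer built from a dilated causal
    convolution: left-pad so each output sees only past positions, then
    halve the sequence length, as Informer-style distilling does."""

    def __init__(self, d_model: int, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        # Causal padding: (kernel_size - 1) * dilation zeros on the left.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, dilation=dilation)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> convolve over the time axis
        h = x.transpose(1, 2)
        h = nn.functional.pad(h, (self.pad, 0))   # pad only the past side
        h = self.act(self.conv(h))
        h = self.pool(h)                          # halve the sequence length
        return h.transpose(1, 2)

x = torch.randn(8, 96, 64)
print(CausalDistilLayer(64)(x).shape)  # torch.Size([8, 48, 64])
```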

AAAI Conference 2026 Conference Paper

CCAHCL: Multi-Level Hypergraph Contrastive Learning for Connected Component Awareness

  • Zhuo Li
  • Gengyu Lyu
  • Yuena Lin
  • Ziang Chen
  • Zhiyuan Ma
  • Zhen Yang
  • Zun Li

Hypergraph contrastive learning has emerged as a powerful unsupervised paradigm for hypergraph representation learning. Traditional hypergraph contrastive learning methods typically leverage a neighbor aggregation strategy to obtain entity (node and hyperedge) representations within each connected component, and then utilize contrastive losses (e.g., node- or hyperedge-level) to update the encoders. However, since all entities are typically weighted equally in their respective losses, large connected components with numerous entities tend to provide a dominant contribution to the whole learning process, which inevitably hinders the effective learning of entity representations within small connected components. To address this issue, we propose a novel Connected-Component-Aware Hypergraph Contrastive Learning method (CCAHCL). Different from previous methods that only construct node or hyperedge representations, our method additionally constructs the connected component representations, and accordingly designs a hierarchical contrastive loss to balance the model's focus on different scales of connected components. Specifically, we first use the traditional neighbor aggregation strategy to aggregate and update entity (node and hyperedge) representations. Then, these entity representations are further aggregated to generate the connected component representations, where entity features are incorporated into connected components and their structural information is propagated back to enrich their corresponding entities. Afterwards, we employ node-level and hyperedge-level losses to learn the enriched entity representations, and further propose a novel connected-component-level contrastive loss to balance the model's focus on all different connected components, naturally avoiding the learning bias on large connected components. Extensive experiments on various datasets demonstrate that our proposed model achieves superior performance against other state-of-the-art methods.
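
As a rough illustration of the connected-component-level contrast described above, here is a minimal sketch under assumed details (mean pooling per component, a standard InfoNCE form, and a temperature of 0.5); the function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def component_pool(node_emb, comp_id, num_comps):
    """Mean-pool node embeddings into connected-component embeddings."""
    sums = torch.zeros(num_comps, node_emb.size(1)).index_add_(0, comp_id, node_emb)
    counts = torch.zeros(num_comps).index_add_(0, comp_id, torch.ones(len(comp_id)))
    return sums / counts.clamp(min=1).unsqueeze(1)

def component_infonce(z1, z2, tau=0.5):
    """InfoNCE across two views' component embeddings: each component is its
    own positive and every other component is a negative, so a component
    contributes one loss term no matter how many nodes it holds."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

# toy example: 6 nodes in 3 connected components, two augmented views
comp_id = torch.tensor([0, 0, 1, 1, 2, 2])
v1, v2 = torch.randn(6, 16), torch.randn(6, 16)
print(component_infonce(component_pool(v1, comp_id, 3),
                        component_pool(v2, comp_id, 3)))
```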

AAAI Conference 2026 Conference Paper

Hypergraph-Based Multi-View Multi-Label Classification via Adaptive High-Order Semantic Fusion

  • Yi Shan
  • Liyang Gao
  • Yuena Lin
  • Zhen Yang
  • Gengyu Lyu
  • Honggui Han

In multi-view multi-label (MVML) classification, each sample is represented by multiple heterogeneous views and annotated with multiple labels. Existing methods typically exploit pairwise semantic relationships to mine intra-view correlations and align inter-view features for generating structural representations. However, these methods ignore the direct expression of high-order semantic similarities and alignments from a group perspective, which necessitates multi-step aggregation for subsequent feature fusion, leading to the inefficient and incomplete integration of key semantic information. To overcome this limitation, we propose a novel hypergraph-based MVML method with Adaptive High-Order Semantic Fusion (HyperAHSF), which leverages hypergraphs to adaptively model group-level semantic similarities within each view and group-level semantic alignments across different views, enabling more effective feature fusion. Specifically, we first construct view-specific hyperedges by selecting multiple groups of node representations exhibiting high semantic similarity, which captures the group-level semantic similarities within each view, forming view-specific hypergraphs. Furthermore, we establish cross-view hyperedges to connect the multi-view node representations of each sample, which characterizes the group-level semantic alignments across different views, accordingly forming a unified multi-view hypergraph. Afterwards, we employ hypergraph neural networks to efficiently aggregate view-specific information and consensus information from their corresponding hypergraphs via group-level message passing. During the passing process, we impose a label-driven contrastive loss on the consensus information to encourage these representations to cluster toward their corresponding class prototypes, enhancing their discriminability. Finally, the consensus information together with the view-specific information is jointly integrated for multi-label classification. Extensive experiments demonstrate that HyperAHSF outperforms other state-of-the-art methods.
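
For the view-specific hyperedge construction step, here is a minimal sketch of one plausible instantiation: each hyperedge groups a node with its most similar nodes under cosine similarity. The top-k selection rule and the function name are our assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def knn_hyperedges(x, k=4):
    """Build a view-specific incidence matrix H (nodes x hyperedges):
    hyperedge j groups node j with its k most similar nodes, so each
    hyperedge captures one group of semantically similar samples."""
    sim = F.normalize(x, dim=1) @ F.normalize(x, dim=1).t()
    idx = sim.topk(k + 1, dim=1).indices          # each node + k neighbors
    H = torch.zeros(x.size(0), x.size(0))
    H.scatter_(0, idx.t(), 1.0)                   # column j = hyperedge of node j
    return H

x = torch.randn(10, 32)   # one view's node representations
H = knn_hyperedges(x)
print(H.shape, H.sum(0))  # 10x10 incidence; each hyperedge has k+1 nodes
```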

AAAI Conference 2026 Conference Paper

MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning

  • Jinhao Chen
  • Zhen Yang
  • Jianxin Shi
  • Tianyu Wo
  • Jie Tang

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language answering tasks. Despite their strengths, these models often encounter challenges in complex reasoning tasks such as mathematical problem-solving. Previous works have focused on fine-tuning on specialized mathematical datasets. However, these datasets are typically distilled directly from teacher models, which capture only static reasoning patterns and leave substantial gaps compared to student models. This reliance on fixed teacher-derived datasets not only restricts the model's ability to adapt to novel or more intricate questions that extend beyond the confines of the training data, but also lacks the iterative depth needed for robust generalization. To overcome these limitations, we propose MathSE, a Mathematical Self-Evolving framework for MLLMs. In contrast to traditional one-shot fine-tuning paradigms, MathSE iteratively refines the model through cycles of inference, reflection, and reward-based feedback. Specifically, we leverage iterative fine-tuning by incorporating correct reasoning paths derived from previous-stage inference and integrating reflections from a specialized Outcome Reward Model (ORM). To verify the effectiveness of MathSE, we evaluate it on a suite of challenging benchmarks, demonstrating significant performance gains over backbone models. Notably, our experimental results on MathVL-test surpass the leading open-source multimodal mathematical reasoning model QVQ.
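
The self-evolving cycle of inference, reflection, and reward-based feedback can be summarized in a short loop. The sketch below is purely illustrative: `generate`, `judge`, and `finetune` are hypothetical stand-ins for the MLLM sampler, the Outcome Reward Model, and the SFT step, not the authors' API.

```python
def self_evolve(generate, judge, finetune, problems, rounds=3, k=4):
    """Hypothetical sketch of a MathSE-style loop: sample reasoning paths,
    keep the ones the ORM judges correct, fold ORM reflections back in for
    the rest, and fine-tune on data the previous-stage model produced."""
    for _ in range(rounds):
        sft_data = []
        for question, answer in problems:
            for path in (generate(question) for _ in range(k)):
                correct, reflection = judge(question, path, answer)
                # keep correct reasoning paths; otherwise train on the reflection
                sft_data.append((question, path if correct else reflection))
        finetune(sft_data)   # the next round's inference uses the updated model

# toy demo with trivial stand-ins
self_evolve(
    generate=lambda q: f"answer to {q}",
    judge=lambda q, p, a: (a in p, f"reflection on {p}"),
    finetune=lambda data: print(f"fine-tuning on {len(data)} samples"),
    problems=[("1+1", "2"), ("2+2", "4")],
)
```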

AAAI Conference 2026 Conference Paper

TransMamba: A Sequence-Level Hybrid Transformer-Mamba Language Model

  • Yixing Li
  • Ruobing Xie
  • Zhen Yang
  • Xingwu Sun
  • Shuaipeng Li
  • Weidong Han
  • Zhanhui Kang
  • Di Wang

Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. Some works construct layer-level hybrid structures that combine Transformer and Mamba layers, aiming to make full use of the advantages of both. This paper proposes TransMamba, a novel sequence-level hybrid framework that unifies Transformer and Mamba through shared parameter matrices (QKV and CBx), and can thus dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory Converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at the TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored to balance effectiveness and efficiency. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to single and hybrid baselines, and validated the deeper consistency between the Transformer and Mamba paradigms at the sequence level, offering a scalable solution for next-generation language modeling.

NeurIPS Conference 2025 Conference Paper

AF-UMC: An Alignment-Free Fusion Framework for Unaligned Multi-View Clustering

  • Bohang Sun
  • Yuena Lin
  • Tao Yang
  • Zhen Zhu
  • Zhen Yang
  • Gengyu Lyu

Unaligned Multi-view Clustering (UMC) aims to learn a discriminative cluster structure from unaligned multi-view data, where the features of samples are not completely aligned across multiple views. Most existing methods usually prioritize employing various alignment strategies to align sample representations across views and then conduct cross-view fusion on aligned representations for subsequent clustering. However, due to the heterogeneity of representations across different views, these alignment strategies often fail to achieve ideal view-alignment results, inevitably leading to unreliable alignment-based fusion. To address this issue, we propose an alignment-free consistency fusion framework named AF-UMC, which bypasses the traditional view-alignment operation and directly extracts consistent representations from each view to perform global cross-view consistency fusion. Specifically, we first construct a cross-view consistent basis space by a cross-view reconstruction loss and a designed Structural Clarity Regularization (SCR), where autoencoders extract consistent representations from each view through projecting view-specific data to the constructed basis space. Afterwards, these extracted representations are globally pulled together for further cross-view fusion according to a designed Instance Global Contrastive Fusion (IGCF). Compared with previous methods, AF-UMC directly extracts consistent representations from each view for global fusion instead of alignment for fusion, which significantly mitigates the degraded fusion performance caused by undesired view-alignment results while greatly reducing algorithm complexity and enhancing its efficiency. Extensive experiments on various datasets demonstrate that our AF-UMC exhibits superior performance against other state-of-the-art methods.

ECAI Conference 2025 Conference Paper

Attribute Guidance with Inherent Pseudo-Label for Occluded Person Re-Identification

  • Rui Zhi
  • Zhen Yang
  • Haiyang Zhang

Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pretrained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models’ inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.

NeurIPS Conference 2025 Conference Paper

CaliGCL: Calibrated Graph Contrastive Learning via Partitioned Similarity and Consistency Discrimination

  • Yuena Lin
  • Hao Wei
  • Hai-Chun Cai
  • Bohang Sun
  • Tao Yang
  • Zhen Yang
  • Gengyu Lyu

Graph contrastive learning (GCL) aims to learn self-supervised representations by distinguishing positive and negative sample pairs generated from multiple augmented graph views. Despite showing promising performance, GCL still suffers from two critical biases: (1) Similarity estimation bias arises when feature elements that support positive pair alignment are suppressed by conflicting components within the representation, causing truly positive pairs to appear less similar. (2) Semantic shift bias occurs when random augmentations alter the underlying semantics of samples, leading to incorrect positive or negative assignments and injecting noise into training. To address these issues, we propose CaliGCL, a GCL model for calibrating the biases by integrating an exponential partitioned similarity measure and a semantics-consistency discriminator. The exponential partitioned similarity computes the similarities among fine-grained partitions obtained through splitting representation vectors and uses exponential scaling to emphasize aligned (positive) partitions while reducing the influence of misaligned (negative) ones. The discriminator dynamically identifies whether augmented sample pairs maintain semantic consistency, enabling correction of misleading contrastive supervision signals. These components jointly reduce biases in similarity estimation and sample pairing, guiding the encoder to learn more robust and semantically meaningful representations. Extensive experiments on multiple benchmarks show that CaliGCL effectively mitigates both types of biases and achieves state-of-the-art performance.
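
As an illustration of the exponential partitioned similarity idea, a minimal sketch: representations are split into equal partitions, each partition pair is scored by cosine similarity, and a softmax-style exponential weighting emphasizes aligned partitions. The partition count, weighting form, and scale `alpha` are our assumptions.

```python
import torch
import torch.nn.functional as F

def partitioned_similarity(z1, z2, num_parts=4, alpha=2.0):
    """Hypothetical sketch of an exponential partitioned similarity: split
    each representation into equal partitions, score every partition pair
    by cosine similarity, then exponentially re-weight so aligned
    (high-similarity) partitions dominate misaligned ones."""
    p1 = z1.view(z1.size(0), num_parts, -1)          # (batch, parts, dim/parts)
    p2 = z2.view(z2.size(0), num_parts, -1)
    part_sim = F.cosine_similarity(p1, p2, dim=2)    # (batch, parts)
    weights = torch.softmax(alpha * part_sim, dim=1) # emphasize aligned parts
    return (weights * part_sim).sum(dim=1)           # (batch,)

z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
print(partitioned_similarity(z1, z2))
```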

NeurIPS Conference 2025 Conference Paper

Causality Meets the Table: Debiasing LLMs for Faithful TableQA via Front-Door Intervention

  • Zhen Yang
  • Ziwei Du
  • Minghan Zhang
  • Wei Du
  • Jie Chen
  • Fulan Qian
  • Shu Zhao

Table Question Answering (TableQA) combines natural language understanding and structured data reasoning, posing challenges in semantic interpretation and logical inference. Recent advances in Large Language Models (LLMs) have improved TableQA performance through Direct Prompting and Agent paradigms. However, these models often rely on spurious correlations, as they tend to overfit to token co-occurrence patterns in pretraining corpora, rather than perform genuine reasoning. To address this issue, we propose Causal Intervention TableQA (CIT), which is based on a structural causal graph and applies front-door adjustment to eliminate bias caused by token co-occurrence. CIT formalizes TableQA as a causal graph and identifies token co-occurrence patterns as confounders. By applying front-door adjustment, CIT guides question variant generation and reasoning to reduce confounding effects. Experiments on multiple benchmarks show that CIT achieves state-of-the-art performance, demonstrating its effectiveness in mitigating bias. Consistent gains across various LLMs further confirm its generalizability.
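
For reference, CIT's intervention rests on the classic front-door adjustment. With question Q, a mediator M (here, the generated question variants and reasoning), and answer A, and with variable names chosen by us for illustration:

```latex
P\bigl(A \mid \mathrm{do}(Q{=}q)\bigr)
  = \sum_{m} P(m \mid q)\,\sum_{q'} P\bigl(A \mid m, q'\bigr)\,P(q')
```

Summing over alternative questions q' detaches the answer from the particular surface form of q, which is what blocks the backdoor path through the token co-occurrence confounder.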

AAAI Conference 2025 Conference Paper

CFDM: Contrastive Fusion and Disambiguation for Multi-View Partial-Label Learning

  • Qiuru Hai
  • Yongjian Deng
  • Yuena Lin
  • Zheng Li
  • Zhen Yang
  • Gengyu Lyu

When dealing with multi-view data, the heterogeneity of data attributes across different views often leads to label ambiguity. To effectively address this challenge, this paper designs a Multi-View Partial-Label Learning (MVPLL) framework, where each training instance is described by multiple view features and associated with a set of candidate labels, among which only one is correct. The key to dealing with such a problem lies in how to effectively fuse multi-view information and accurately disambiguate these ambiguous labels. In this paper, we propose a novel approach named CFDM, which explores the consistency and complementarity of multi-view data by multi-view contrastive fusion and reduces label ambiguity by multi-class contrastive prototype disambiguation. Specifically, we first extract view-specific representations using multiple view-specific autoencoders, and then integrate multi-view information through both inter-view and intra-view contrastive fusion to enhance the distinctiveness of these representations. Afterwards, we utilize these distinctive representations to establish and update prototype vectors for each class within each view. Based on these, we apply contrastive prototype disambiguation to learn global class prototypes and accordingly reduce label ambiguity. In our model, multi-view contrastive fusion and multi-class contrastive prototype disambiguation are conducted mutually to enhance each other within a coherent framework, leading to better classification performance. Experimental results on multiple datasets have demonstrated that our proposed method is superior to other state-of-the-art methods.
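
As a small illustration of the prototype-disambiguation step, here is a sketch of one plausible confidence update: instances are scored against global class prototypes and the scores are renormalized over each instance's candidate set. The temperature and the exact update rule are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def disambiguate(z, prototypes, candidate_mask, tau=0.1):
    """Hypothetical sketch of prototype-based label disambiguation: score
    each instance against the global class prototypes, zero out all
    non-candidate classes, and renormalize to get label confidences."""
    sim = F.normalize(z, dim=1) @ F.normalize(prototypes, dim=1).t()
    scores = torch.exp(sim / tau) * candidate_mask   # keep candidate labels only
    return scores / scores.sum(dim=1, keepdim=True)

z = torch.randn(5, 32)            # fused instance representations
prototypes = torch.randn(4, 32)   # one global prototype per class
candidate_mask = torch.tensor([[1, 1, 0, 0],
                               [0, 1, 1, 0],
                               [1, 0, 0, 1],
                               [0, 0, 1, 1],
                               [1, 1, 1, 0]], dtype=torch.float)
print(disambiguate(z, prototypes, candidate_mask))  # rows sum to 1
```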

IJCAI Conference 2025 Conference Paper

Critical Node-aware Augmentation for Hypergraph Contrastive Learning

  • Zhuo Li
  • Yuena Lin
  • Yipeng Wang
  • Wenmao Liu
  • Mingliang Yu
  • Zhen Yang
  • Gengyu Lyu

Hypergraph contrastive learning enables effective representation learning for hypergraphs without requiring labels. However, existing methods typically rely on randomly deleting or replacing nodes during hypergraph augmentation, which may lead to the absence of critical nodes and further disrupt the higher-order structural relationships within augmented hypergraphs. To address this issue, we propose a Critical Node-aware hypergraph contrastive learning method, which is the first attempt to leverage hyperedge prediction to retain critical nodes and accordingly maintain the reliable higher-order structural relationships within augmented hypergraphs. Specifically, we first employ contrastive learning to align the augmented hypergraphs, and then generate hyperedge embeddings to characterize node representations and their structural correlations. During the hyperedge embedding encoding process, we introduce a hyperedge prediction discriminator to score these embeddings, which quantifies the nodes' contributions to identify the critical nodes and maintain the higher-order structural relationships within augmented hypergraphs. Compared with previous studies, our proposed method can effectively alleviate the erroneous deletion or replacement of critical nodes and steadily maintain the inherent structural relationships between the original hypergraph and the augmented hypergraphs, naturally guiding better hypergraph representations for downstream tasks. Extensive experiments on various tasks demonstrate that our method is significantly superior to state-of-the-art methods.

NeurIPS Conference 2025 Conference Paper

EPA: Boosting Event-based Video Frame Interpolation with Perceptually Aligned Learning

  • Yuhan Liu
  • LingHui Fu
  • Zhen Yang
  • Hao Chen
  • Youfu Li
  • Yongjian Deng

Event cameras, with their capacity to provide high temporal resolution information between frames, are increasingly utilized for video frame interpolation (VFI) in challenging scenarios characterized by high-speed motion and significant occlusion. However, prevalent issues of blur and distortion within the keyframes and ground truth data used for training and inference in these demanding conditions are frequently overlooked. This oversight impedes the perceptual realism and multi-scene generalization capabilities of existing event-based VFI (E-VFI) methods when generating interpolated frames. Motivated by the observation that semantic-perceptual discrepancies between degraded and pristine images are considerably smaller than their image-level differences, we introduce EPA. This novel E-VFI framework diverges from approaches reliant on direct image-level supervision by constructing multilevel, degradation-insensitive semantic perceptual supervisory signals to enhance the perceptual realism and multi-scene generalization of the model's predictions. Specifically, EPA operates in two phases: it first employs a DINO-based perceptual extractor, a customized style adapter, and a reconstruction generator to derive multi-layered, degradation-insensitive semantic-perceptual features ($\mathcal{S}$). Second, a novel Bidirectional Event-Guided Alignment (BEGA) module utilizes deformable convolutions to align perceptual features from keyframes to ground truth with inter-frame temporal guidance extracted from event signals. By decoupling the learning process from direct image-level supervision, EPA enhances model robustness against degraded keyframes and unreliable ground truth information. Extensive experiments demonstrate that this approach yields interpolated frames more consistent with human perceptual preferences. The code will be released upon acceptance.

AAAI Conference 2025 Conference Paper

ESEG: Event-Based Segmentation Boosted by Explicit Edge-Semantic Guidance

  • Yucheng Zhao
  • Gengyu Lyu
  • Ke Li
  • Zihao Wang
  • Hao Chen
  • Zhen Yang
  • Yongjian Deng

Event-based semantic segmentation (ESS) has attracted researchers' attention recently, as event cameras can solve problems such as under/over-exposure or motion blur that are difficult for RGB cameras to handle. However, event data are noisy and sparse, resulting in difficulties for the model to locate and extract reliable cues from their sparse representations, especially when performing pixel-level tasks. In this paper, we propose a novel framework ESEG to alleviate the dilemma. Given that event signals relate closely to moving edges, instead of proposing complex structures to expect them to recognize those reliable edge regions behind event signals on their own, we introduce the explicit edge-semantic supervision as a reference to let the ESS model globally optimize semantics, considering the high confidence of event data in edge regions. In addition, we propose a fusion module named Density-Aware Dynamic-Window Cross Attention Fusion (D²CAF), in which the density perception, cross-attention, and dynamic window masking mechanisms are jointly imposed to optimize edge-dense feature fusion, leveraging the characteristics of event cameras. Experimental results on DSEC and DDD17 datasets demonstrate the efficacy of the ESEG framework and its core designs.

AAAI Conference 2025 Conference Paper

Graph Consistency and Diversity Measurement for Federated Multi-View Clustering

  • Bohang Sun
  • Yongjian Deng
  • Yuena Lin
  • Qiuru Hai
  • Zhen Yang
  • Gengyu Lyu

Federated Multi-View Clustering (FMVC) aims to learn a global clustering model from heterogeneous data distributed across different devices, where each device only stores one view of all clustering samples. The key to dealing with such a problem lies in how to effectively fuse these heterogeneous samples while strictly preserving data privacy across multiple devices. In this paper, we propose a novel structural graph learning framework named MGCD, which leverages both the consistency and diversity of multi-view graph structure across a global view-fusion server and local view-specific clients to achieve the desired clustering while better preserving data privacy. Specifically, in each local client, we design a dual autoencoder to extract the latent consensuses and specificities of each view, where self-representation construction is introduced to generate the corresponding view-specific diversity graph. In the global server, the consistency implied in the uploaded diversity graphs is further distilled and then incorporated into the consistency graph for subsequent cross-view contrastive fusion. During the training process, the server generates a global consistency graph and distributes it to each client for assisting in diversity graph construction, while the clients extract view-specific information and upload it to the server for more reliable consistency graph generation. The "server-client" interaction is conducted in an iterative manner, where the consistency implied in each local client is gradually aggregated into the global consistency graph, and the final clustering results are obtained by spectral clustering on the desired global consistency graph. Extensive experiments on various datasets have demonstrated the effectiveness of our proposed method on clustering federated multi-view data.

AAAI Conference 2025 Conference Paper

Know Where You Are From: Event-Based Segmentation via Spatio-Temporal Propagation

  • Ke Li
  • Gengyu Lyu
  • Hao Chen
  • Bochen Xie
  • Zhen Yang
  • Youfu Li
  • Yongjian Deng

Event cameras have gained attention in segmentation due to their higher temporal resolution and dynamic range compared to traditional cameras. However, they struggle with issues like lack of color perception and triggering only at motion edges, making it hard to distinguish objects with similar contours or segment spatially continuous objects. Our work aims to address these often overlooked issues. Based on the assumption that various objects exhibit different motion patterns, we believe that embedding the historical motion states of objects into segmented scenes can effectively address these challenges. Inspired by this, we propose the ESS framework "Know Where You Are From" (KWYAF), which incorporates past motion cues through spatio-temporal propagation embedding. This framework features two core components: the Sequential Motion Encoding Module (SME) and the Event-Based Reliable Region Selection Mechanism (ER²SM). The SME constructs prior motion features through spatio-temporal correlation modeling to boost the final segmentation, while ER²SM adaptively identifies high-confidence regions, embedding motion cues more precisely through local window masks and reliable region selection. Extensive experiments demonstrate the effectiveness of our proposed framework, both quantitatively and qualitatively.

AAAI Conference 2025 Conference Paper

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

  • Mushui Liu
  • Yuhang Ma
  • Zhen Yang
  • Jun Dan
  • Yunlong Yu
  • Zeng Zhao
  • Zhipeng Hu
  • Bai Liu

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.
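
As an illustration of how a plug-and-play adapter can inject LLM representations into a text-to-image model, here is a minimal cross-attention sketch; the dimensions, zero-initialized gate, and class name are our assumptions rather than the paper's exact CAM design.

```python
import torch
import torch.nn as nn

class CrossAdapter(nn.Module):
    """Hypothetical sketch of a CAM-style adapter: the diffusion model's
    original text features attend into LLM features, and the result is
    blended back residually so the module stays plug-and-play."""

    def __init__(self, d_text=768, d_llm=4096, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_llm, d_text)          # map LLM dim -> text dim
        self.attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))      # starts as identity

    def forward(self, text_feats, llm_feats):
        llm = self.proj(llm_feats)
        fused, _ = self.attn(query=text_feats, key=llm, value=llm)
        return text_feats + self.gate.tanh() * fused  # gated residual injection

text = torch.randn(2, 77, 768)     # e.g. CLIP text encoder output
llm = torch.randn(2, 128, 4096)    # e.g. LLM hidden states
print(CrossAdapter()(text, llm).shape)  # (2, 77, 768)
```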

ICLR Conference 2025 Conference Paper

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

  • Haotian Zhang 0005
  • Mingfei Gao
  • Zhe Gan
  • Philipp Dufter
  • Nina Wenzel
  • Forrest Huang
  • Dhruti Shah
  • Xianzhi Du

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

AAAI Conference 2025 Conference Paper

MSV-PCT: Multi-Sparse-View Enhanced Transformer Framework for Salient Object Detection in Point Clouds

  • Zihao Wang
  • Yiming Huang
  • Gengyu Lyu
  • Yucheng Zhao
  • Ziyu Zhou
  • Bochen Xie
  • Zhen Yang
  • Yongjian Deng

Salient object detection (SOD) methods for 2D images have great significance in the field of human-computer interaction (HCI). However, as a common data format in HCI, 3D point cloud data remains under-explored for SOD research. Previous works commonly treat this task as point cloud segmentation, which perceives all points in the scene for prediction. However, these methods neglect that SOD is designed to simulate human visual perception, where humans can only see surfaces rather than occluded point clouds. Therefore, these methods may fail in such situations. This paper aims to solve this problem by approximately simulating the perception paradigm of humans towards 3D scenes. Thus, we propose a framework based on a 3D visual point cloud backbone and its multi-view projection, named MSV-PCT. Specifically, instead of relying solely on general point cloud learning frameworks, we additionally introduce multi-sparse-view learning branches to supplement the SOD perception. Furthermore, we propose a novel point cloud edge detection loss function to effectively address artifacts, enabling the accurate segmentation of the edges of salient objects from the background. Finally, to evaluate the generalization of point cloud SOD methods, we introduce a new approach to generate simulated PC-SOD datasets from RGBD-SOD data. Experiments on the simulated datasets show that MSV-PCT achieves better accuracy and robustness.

AAAI Conference 2025 Conference Paper

Multi-View Multi-Label Classification via View-Label Matching Selection

  • Hao Wei
  • Yongjian Deng
  • Qiuru Hai
  • Yuena Lin
  • Zhen Yang
  • Gengyu Lyu

In multi-view multi-label classification (MVML), each object is described by several heterogeneous views while annotated with multiple related labels. The key to learning from such complicated data lies in how to fuse cross-view features and explore multi-label correlations, and accordingly obtain correct assignments between each object and its corresponding labels. In this paper, we propose an advanced MVML method named VAMS, which treats each object as a bag of views and reformulates the task of MVML as a “view-label” matching selection problem. Specifically, we first construct an object graph and a label graph respectively. In the object graph, nodes represent the multi-view representation of an object, and each view node is connected to its K-nearest neighbors within its own view. In the label graph, nodes represent the semantic representation of a label. Then, we connect each view node with all labels to generate the unified “view-label” matching graph. Afterwards, a graph network block is introduced to aggregate and update all nodes and edges on the matching graph, further generating a structural representation that fuses multi-view heterogeneity and multi-label correlations for each view and label. Finally, we derive a prediction score for each view-label matching and select the optimal matching via optimizing a weighted cross-entropy loss. Extensive results on various datasets have verified that our proposed VAMS can achieve superior or comparable performance against state-of-the-art methods.

ICML Conference 2025 Conference Paper

Scaling Laws for Floating-Point Quantization Training

  • Xingwu Sun
  • Shuaipeng Li
  • Ruobing Xie
  • Weidong Han 0006
  • Kan Wu
  • Zhen Yang
  • Yixing Li
  • An Wang

Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization and pay less attention to the constituents of floating-point (FP) quantization, and thus cannot well fit the LLM losses in this scenario. In contrast, while FP quantization training is more commonly implemented in production, research on it has been relatively superficial. In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate unified scaling law for FP quantization, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit widths, which is available for future reference by hardware manufacturers; (2) We discover the formation of a critical data size in low-precision LLM training. Training data exceeding the critical data size will inversely degrade LLM performance; (3) The optimal FP quantization precision is directly proportional to the computational power; within a wide computational power range, we estimate that the best cost-performance precision lies between 4 and 8 bits.
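
To make the quantization targets concrete, here is a rough, illustrative simulation of symmetric FP quantization with e exponent bits and m mantissa bits (round the mantissa, clamp to the representable exponent range); it is a sketch of the general technique, not the paper's exact procedure.

```python
import torch

def fp_quantize(x, e_bits=4, m_bits=3):
    """Rough simulation of FP-ExMy quantization: snap each magnitude to a
    power-of-two bucket, round the mantissa to m_bits of precision, and
    clamp exponents to the representable range. Illustrative only."""
    sign = x.sign()
    mag = x.abs().clamp(min=1e-30)
    exp = torch.floor(torch.log2(mag))          # power-of-two bucket
    bias = 2 ** (e_bits - 1) - 1
    exp = exp.clamp(-bias, bias)                # representable exponents
    scale = 2.0 ** (exp - m_bits)               # mantissa step size
    return sign * torch.round(mag / scale) * scale

x = torch.randn(5)
print(x)
print(fp_quantize(x))   # same values, snapped to the FP4-ish grid
```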

AAAI Conference 2025 Conference Paper

Thought-Path Contrastive Learning via Premise-Oriented Data Augmentation for Logical Reading Comprehension

  • Chenxu Wang
  • Ping Jian
  • Zhen Yang

Logical reading comprehension is a challenging task that entails grasping the underlying semantics of text and applying reasoning to deduce the correct answer. Prior research has primarily focused on enhancing logical reasoning capabilities through Chain-of-Thought (CoT) or data augmentation. However, previous work constructing chain-of-thought rationales concentrates solely on analyzing correct options, neglecting the incorrect alternatives. Additionally, earlier efforts on data augmentation by altering contexts rely on rule-based methods, which result in generated contexts that lack diversity and coherence. To address these issues, we propose a Premise-Oriented Data Augmentation (PODA) framework. This framework can generate CoT rationales including analyses for both correct and incorrect options, while constructing diverse and high-quality counterfactual contexts from incorrect candidate options. We integrate summarizing premises and identifying premises for each option into the rationales. Subsequently, we employ multi-step prompts with the identified premises to construct counterfactual contexts. To help the model better differentiate the reasoning process associated with each option, we introduce a novel thought-path contrastive learning method that compares the reasoning paths of the original and counterfactual samples. Experimental results on three representative LLMs demonstrate that our method substantially improves the baselines across two challenging logical reasoning benchmarks (ReClor and LogiQA 2.0).

ICML Conference 2025 Conference Paper

Trustworthy Machine Learning through Data-Specific Indistinguishability

  • Hanshen Xiao
  • Zhen Yang
  • G. Edward Suh

This paper studies a range of AI/ML trust concepts, including memorization, data poisoning, and copyright, which can be modeled as constraints on the influence of data on a (trained) model, characterized by the outcome difference from a processing function (training algorithm). In this realm, we show that provable trust guarantees can be efficiently provided through a new framework termed Data-Specific Indistinguishability (DSI) to select trust-preserving randomization tightly aligning with targeted outcome differences, as a relaxation of the classic Input-Independent Indistinguishability (III). We establish both the theoretical and algorithmic foundations of DSI with the optimal multivariate Gaussian mechanism. We further show its applications to develop trustworthy deep learning with black-box optimizers. The experimental results on memorization mitigation, backdoor defense, and copyright protection show both the efficiency and effectiveness of the DSI noise mechanism.
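
As a toy illustration of the contrast with input-independent noise, the sketch below calibrates a simplified, isotropic Gaussian mechanism to the outcome differences actually observed for specific target points, rather than to a worst-case global sensitivity. The paper's optimal mechanism is multivariate, so treat this strictly as a conceptual sketch with hypothetical names.

```python
import numpy as np

def dsi_gaussian_release(train, dataset, targets, eps=1.0, delta=1e-5):
    """Isotropic toy version of a DSI-flavored Gaussian mechanism: noise is
    scaled to how far the trained model actually moves when specific target
    points are removed (a data-specific, not input-independent, quantity)."""
    theta = train(dataset)
    diffs = [np.linalg.norm(theta - train([x for x in dataset if x is not t]))
             for t in targets]
    sens = max(diffs)                                  # observed outcome difference
    sigma = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps  # Gaussian mechanism
    return theta + np.random.normal(0.0, sigma, size=theta.shape)

# toy "training algorithm": the mean of the data
data = list(np.random.randn(100, 4))
print(dsi_gaussian_release(lambda d: np.mean(d, axis=0), data, data[:5]))
```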

IJCAI Conference 2024 Conference Paper

Common-Individual Semantic Fusion for Multi-View Multi-Label Learning

  • Gengyu Lyu
  • Weiqi Kang
  • Haobo Wang
  • Zheng Li
  • Zhen Yang
  • Songhe Feng

In Multi-View Multi-Label Learning, each instance is described by several heterogeneous features and associated with multiple valid labels simultaneously. Existing methods mainly focus on leveraging feature-level view fusion to capture a common representation for multi-label classifier induction. In this paper, we take a new perspective and propose a new semantic-level fusion model named the Common-Individual Semantic Fusion Multi-View Multi-Label Learning Method (CISF). Different from previous feature-level fusion models, our proposed method directly focuses on semantic-level view fusion and simultaneously takes both the common semantics across different views and the individual semantics of each specific view into consideration. Specifically, we first assume that each view involves some common semantic labels while owning a few exclusive semantic labels. Then, the common and exclusive semantic labels are separately constrained to be consistent and diverse, to excavate the consistencies and complementarities among different views. Afterwards, we introduce low-rank and sparse constraints to highlight the label co-occurrence relationship of common semantics and the view-specific expression of individual semantics. We provide a theoretical guarantee for the strict convexity of our method under properly set parameters. Extensive experiments on various data sets have verified the superiority of our method.

TIST Journal 2024 Journal Article

Discovering Expert-Level Air Combat Knowledge via Deep Excitatory-Inhibitory Factorized Reinforcement Learning

  • Hai Yin Piao
  • Shengqi Yang
  • Hechang Chen
  • Junnan Li
  • Jin Yu
  • Xuanqi Peng
  • Xin Yang
  • Zhen Yang

Artificial Intelligence (AI) has achieved a wide range of successes in autonomous air combat decision-making recently. Previous research demonstrated that AI-enabled air combat approaches could even acquire beyond-human-level capabilities. However, there remains a lack of evidence regarding two major difficulties. First, the existing methods with fixed decision intervals are mostly devoted to solving what to act but pay little attention to when to act, which occasionally misses optimal decision opportunities. Second, the use of an expert-crafted finite maneuver library leads to a lack of tactics diversity, which is vulnerable to an opponent equipped with new tactics. In view of this, we propose a novel Deep Reinforcement Learning (DRL) and prior-knowledge hybrid algorithm for discovering autonomous air combat tactics, namely deep Excitatory-iNhibitory fACTorIzed maneuVEr (ENACTIVE) learning. The algorithm consists of two key modules, i.e., ENHANCE and FACTIVE. Specifically, ENHANCE learns to adjust the air combat decision-making intervals and appropriately seize key opportunities. FACTIVE factorizes maneuvers and then jointly optimizes them with significant gains in tactics diversity. Extensive experimental results reveal that the proposed method outperforms state-of-the-art algorithms with a 62% winning rate and further achieves a 2.85-fold increase in global tactic space coverage. It also demonstrates that a variety of discovered air combat tactics are comparable to human experts' knowledge.

IJCAI Conference 2024 Conference Paper

SDformer: Transformer with Spectral Filter and Dynamic Attention for Multivariate Time Series Long-term Forecasting

  • Ziyu Zhou
  • Gengyu Lyu
  • Yiming Huang
  • Zihao Wang
  • Ziyu Jia
  • Zhen Yang

Transformer has gained widespread adoption in modeling time series due to the exceptional ability of its self-attention mechanism in capturing long-range dependencies. However, when processing time series data with numerous variates, the vanilla self-attention mechanism tends to distribute attention weights evenly and smoothly, causing row-homogenization in attention maps and further hampering time series forecasting. To tackle this issue, we propose an advanced Transformer architecture entitled SDformer, which designs two novel modules, Spectral-Filter-Transform (SFT) and Dynamic-Directional-Attention (DDA), and integrates them into the encoder of Transformer to achieve more intensive attention allocation. Specifically, the SFT module utilizes the Fast Fourier Transform to select the most prominent frequencies, along with a Hamming Window to smooth and denoise the filtered series data; The DDA module applies a specialized kernel function to the query and key vectors projected from the denoised data, concentrating this innovative attention mechanism more effectively on the most informative variates to obtain a sharper attention distribution. These two modules jointly enable attention weights to be more salient among numerous variates, which in turn enhances the attention's ability to capture multivariate correlations, improving the performance in forecasting. Extensive experiments on public datasets demonstrate its superior performance over other state-of-the-art models. Code is available at https://github.com/zhouziyu02/SDformer.
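
As a rough illustration of the SFT step, a minimal sketch: window the series, keep only the top-k strongest frequency bins, and invert the FFT. The ordering of windowing and filtering, and the value of k, are our assumptions rather than the paper's exact design.

```python
import torch

def spectral_filter(x, top_k=8):
    """Hypothetical sketch of an SFT-style filter: taper each series with a
    Hamming window (to reduce spectral leakage), keep only its top-k
    strongest frequency bins, and transform back to the time domain."""
    seq_len = x.size(1)
    windowed = x * torch.hamming_window(seq_len).unsqueeze(-1)
    spec = torch.fft.rfft(windowed, dim=1)           # (batch, freq, variates)
    mag = spec.abs()
    # zero out everything except the top-k strongest frequency bins
    thresh = mag.topk(top_k, dim=1).values[:, -1:, :]
    spec = spec * (mag >= thresh).to(spec.dtype)
    return torch.fft.irfft(spec, n=seq_len, dim=1)

x = torch.randn(4, 96, 7)   # (batch, time, variates)
print(spectral_filter(x).shape)  # torch.Size([4, 96, 7])
```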

AAAI Conference 2024 Conference Paper

TriSampler: A Better Negative Sampling Principle for Dense Retrieval

  • Zhen Yang
  • Zhou Shao
  • Yuxiao Dong
  • Jie Tang

Negative sampling stands as a pivotal technique in dense retrieval, essential for training effective retrieval models and significantly impacting retrieval performance. While existing negative sampling methods have made commendable progress by leveraging hard negatives, a comprehensive guiding principle for constructing negative candidates and designing negative sampling distributions is still lacking. To bridge this gap, we embark on a theoretical analysis of negative sampling in dense retrieval. This exploration culminates in the unveiling of the quasi-triangular principle, a novel framework that elucidates the triangular-like interplay between query, positive document, and negative document. Fueled by this guiding principle, we introduce TriSampler, a straightforward yet highly effective negative sampling method. The key point of TriSampler lies in its ability to selectively sample more informative negatives within a prescribed constrained region. Experimental evaluations show that TriSampler consistently attains superior retrieval performance across a diverse set of representative retrieval models.
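
As an illustration of sampling within a prescribed constrained region, here is a minimal sketch: candidates are kept only when their query similarity falls in a band below the positive document's, then sampled with harder negatives favored. The band thresholds and the sampling distribution are assumptions, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def region_constrained_negatives(q, pos, docs, n_neg=8, low=0.1, high=0.9):
    """Hypothetical sketch in the spirit of TriSampler: restrict candidates
    to a similarity band below the positive document (informative, yet
    unlikely to be false negatives), then sample within that region with
    harder (higher-similarity) negatives favored."""
    sim = docs @ q                                   # query-candidate scores
    pos_sim = q @ pos
    region = (sim > low * pos_sim) & (sim < high * pos_sim)
    cand = region.nonzero(as_tuple=True)[0]
    probs = torch.softmax(sim[cand], dim=0)          # harder => more likely
    return cand[torch.multinomial(probs, min(n_neg, len(cand)))]

q = F.normalize(torch.randn(64), dim=0)
pos = F.normalize(q + 0.1 * torch.randn(64), dim=0)  # a relevant document
docs = F.normalize(torch.randn(1000, 64), dim=1)     # candidate corpus
print(region_constrained_negatives(q, pos, docs))    # indices of sampled negatives
```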

TIST Journal 2023 Journal Article

Prior Knowledge Constrained Adaptive Graph Framework for Partial Label Learning

  • Gengyu Lyu
  • Songhe Feng
  • Shaokai Wang
  • Zhen Yang

Partial label learning (PLL) aims to learn a robust multi-class classifier from ambiguous data, where each instance is given several candidate labels, among which only one is real. Most existing methods usually cope with such a problem by utilizing a feature similarity graph to conduct label disambiguation. However, these methods construct the feature graph by employing only original features, while the influence of latent outliers and the contribution of the label space are regrettably ignored. To tackle these issues, in this article, we propose a Prior KnOwledge ConsTrained Adaptive Graph FramEwork (POTAGE) for partial label learning, which utilizes an adaptive graph fused with label information to accurately describe the instance relationships and guide the desired model training. Compared with the feature-induced fixed graph, the adaptive graph is deemed to be more robust and accurate in revealing the intrinsic manifold structure within the data, and the embedded label information is expected to effectively alleviate label ambiguities and enlarge the gap in label confidences between two instances from different classes. Extensive experiments demonstrate that POTAGE achieves state-of-the-art performance.

AAAI Conference 2022 Conference Paper

Laneformer: Object-Aware Row-Column Transformers for Lane Detection

  • Jianhua Han
  • Xiajun Deng
  • Xinyue Cai
  • Zhen Yang
  • Hang Xu
  • Chunjing Xu
  • Xiaodan Liang

We present Laneformer, a conceptually simple yet powerful transformer-based architecture tailored for lane detection, a long-standing research topic for visual perception in autonomous driving. The dominant paradigms rely on purely CNN-based architectures, which often fail to incorporate relations of long-range lane points and global contexts induced by surrounding objects (e.g., pedestrians, vehicles). Inspired by recent advances of the transformer encoder-decoder architecture in various vision tasks, we move forward to design a new end-to-end Laneformer architecture that reshapes conventional transformers to better capture the shape and semantic characteristics of lanes, with minimal overhead in latency. First, coupled with deformable pixel-wise self-attention in the encoder, Laneformer presents two new row and column self-attention operations to efficiently mine point context along the lane shapes. Second, motivated by the fact that surrounding objects affect the decision of predicting lane segments, Laneformer further includes the detected object instances as extra inputs of multi-head attention blocks in the encoder and decoder to facilitate lane point detection by sensing semantic contexts. Specifically, the bounding box locations of objects are added into the Key module to provide interaction with each pixel and query, while the ROI-aligned features are inserted into the Value module. Extensive experiments demonstrate that our Laneformer achieves state-of-the-art performance on the CULane benchmark with a 77.1% F1 score. We hope our simple and effective Laneformer will serve as a strong baseline for future research on self-attention models for lane detection.

NeurIPS Conference 2022 Conference Paper

Semi-supervised Semantic Segmentation with Prototype-based Consistency Regularization

  • Haiming Xu
  • Lingqiao Liu
  • Qiuchen Bian
  • Zhen Yang

Semi-supervised semantic segmentation requires the model to effectively propagate the label information from limited annotated images to unlabeled ones. A challenge for such a per-pixel prediction task is the large intra-class variation, i.e., regions belonging to the same class may exhibit a very different appearance even in the same picture. This diversity will make the label propagation hard from pixels to pixels. To address this problem, we propose a novel approach to regularize the distribution of within-class features to ease label propagation difficulty. Specifically, our approach encourages the consistency between the prediction from a linear predictor and the output from a prototype-based predictor, which implicitly encourages features from the same pseudo-class to be close to at least one within-class prototype while staying far from the other between-class prototypes. By further incorporating CutMix operations and a carefully-designed prototype maintenance strategy, we create a semi-supervised semantic segmentation algorithm that demonstrates superior performance over the state-of-the-art methods from extensive experimental evaluation on both Pascal VOC and Cityscapes benchmarks.
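
As a small illustration of the consistency objective, a minimal sketch with one prototype per class (the paper maintains multiple prototypes with a dedicated strategy and also uses CutMix); the temperature and KL form are our assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_consistency_loss(feats, linear_head, prototypes, tau=0.1):
    """Minimal sketch of prototype-based consistency: make the linear head's
    prediction on unlabeled features agree with a nearest-prototype
    prediction, which pulls same-pseudo-class features toward a
    within-class prototype and away from between-class ones."""
    linear_logits = linear_head(feats)                          # (N, C)
    proto_logits = (F.normalize(feats, dim=1)
                    @ F.normalize(prototypes, dim=1).t()) / tau # (N, C)
    target = proto_logits.softmax(dim=1).detach()               # prototype view
    return F.kl_div(linear_logits.log_softmax(dim=1), target,
                    reduction="batchmean")

feats = torch.randn(128, 256)          # unlabeled pixel features
head = torch.nn.Linear(256, 21)        # e.g. 21 Pascal VOC classes
prototypes = torch.randn(21, 256)      # one prototype per class (simplified)
print(prototype_consistency_loss(feats, head, prototypes))
```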

AAAI Conference 2022 System Paper

Silence or Outbreak – a Real-Time Emergent Topic Identification System (RealTIS) for Social Media

  • Ning Lu
  • Zhen Yang
  • Jian Huang
  • Yaxi Wu
  • Hesong Wang

This paper presents RealTIS, a Real-time emergent Topic Identification System for user-generated content on the web via social networking services such as Twitter, Weibo, and Facebook. Without user intervention, our proposed RealTIS system can efficiently collect the necessary social media posts, construct a quality topic summarization from the vast sea of data, and then automatically identify whether an emerging topic will break out or just fade into silence. RealTIS uses a time-sliding window to compute statistics about the variation of the basic structures (motifs) of the propagation network for a specific topic. These statistics are then used to predict unusual shifts in correlations, issue early warnings, and detect outbreaks. Besides, this work also illustrates the mechanism by which our proposed system produces early warnings.

NeurIPS Conference 2021 Conference Paper

Learning Transferable Features for Point Cloud Detection via 3D Contrastive Co-training

  • Zeng Yihan
  • Chunwei Wang
  • Yunbo Wang
  • Hang Xu
  • Chaoqiang Ye
  • Zhen Yang
  • Chao Ma

Most existing point cloud detection models require large-scale, densely annotated datasets. They typically underperform in domain adaptation settings, due to geometry shifts caused by different physical environments or LiDAR sensor configurations. Therefore, it is challenging but valuable to learn transferable features between a labeled source domain and a novel target domain, without any access to target labels. To tackle this problem, we introduce the framework of 3D Contrastive Co-training (3D-CoCo) with two technical contributions. First, 3D-CoCo is inspired by our observation that the bird's-eye-view (BEV) features are more transferable than low-level geometry features. We thus propose a new co-training architecture that includes separate 3D encoders with domain-specific parameters, as well as a BEV transformation module for learning domain-invariant features. Second, 3D-CoCo extends the approach of contrastive instance alignment to point cloud detection, whose performance was largely hindered by the mismatch between the fictitious distribution of BEV features, induced by pseudo-labels, and the true distribution. The mismatch is greatly reduced by 3D-CoCo with transformed point clouds, which are carefully designed by considering specific geometry priors. We construct new domain adaptation benchmarks using three large-scale 3D datasets. Experimental results show that our proposed 3D-CoCo effectively closes the domain gap and outperforms the state-of-the-art methods by large margins.

AAAI Conference 2019 Conference Paper

Adapting Translation Models for Transcript Disfluency Detection

  • Qianqian Dong
  • Feng Wang
  • Zhen Yang
  • Wei Chen
  • Shuang Xu
  • Bo Xu

Transcript disfluency detection (TDD) is an important component of real-time speech translation systems and has attracted increasing interest in recent years. This paper presents our study on adapting neural machine translation (NMT) models for TDD. We propose a general training framework for rapidly adapting NMT models to the TDD task. In this framework, the main structure of the model is implemented similarly to the NMT model. Additionally, several extended modules and training techniques that are independent of the NMT model are proposed to improve performance, such as constrained decoding, denoising autoencoder initialization, and a TDD-specific training objective. With the proposed training framework, we achieve significant improvements. However, decoding is too slow to be practical. To build a feasible and production-ready solution for TDD, we propose a fast non-autoregressive TDD model following the recently emerged non-autoregressive NMT models. Although we do not assume a specific NMT architecture, we build our TDD model on the basis of the Transformer, which is the state-of-the-art NMT model. We conduct extensive experiments on the publicly available Switchboard dataset and an in-house Chinese dataset. Experimental results show that the proposed model significantly outperforms previous state-of-the-art models.