EAAI Journal 2026 Journal Article
Breaking multilayer perceptron limitations for traffic flow forecasting with structured patch learning
- Wei Sun
- Gong Wang
- Junbo Gao
- Chunyu Wang
- Zihao Zhang
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
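As a rough illustration of one ingredient of Stage II, the sketch below shows how a multi-scenario separation loss could push the mean text embeddings of different scenarios apart; the tensor shapes and the exact loss form are assumptions, not the paper's code.

```python
# Toy sketch: increase divergence between inter-scenario text representations
# by minimizing the pairwise cosine similarity of per-scenario prototypes.
import torch
import torch.nn.functional as F

def scenario_separation_loss(text_embeds: torch.Tensor, scenario_ids: torch.Tensor) -> torch.Tensor:
    """text_embeds: (N, D) learned text embeddings; scenario_ids: (N,) integer labels (>= 2 scenarios)."""
    protos = []
    for s in scenario_ids.unique():
        protos.append(F.normalize(text_embeds[scenario_ids == s].mean(dim=0), dim=0))
    protos = torch.stack(protos)                     # (S, D) one prototype per scenario
    sim = protos @ protos.t()                        # (S, S) pairwise cosine similarities
    mask = ~torch.eye(len(protos), dtype=torch.bool, device=sim.device)
    return sim[mask].mean()                          # minimizing this pushes scenarios apart

# Hypothetical usage: total_loss = id_contrastive_loss + lambda_sep * scenario_separation_loss(t, s)
```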
AAAI Conference 2026 Conference Paper
CLIP is a seminal multimodal model that maps images and text into a shared representation space by contrastive learning on billions of image–caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long, complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring almost the same training cost as regular CLIP fine-tuning. Our method first “embedding-izes” the LLM for the CLIP setting, then couples it to the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image–caption pairs. With this strategy we achieve large performance gains, without large-scale retraining, over state-of-the-art CLIP variants such as EVA02 and SigLIP-2. The LLM-enhanced CLIP delivers consistent improvements across a wide spectrum of downstream tasks, including linear-probe classification, zero-shot image–text retrieval with both short and long captions (in English and other languages), zero-shot/supervised image segmentation, object detection, and use as a tokenizer for multimodal large-model benchmarks.
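A minimal sketch of the coupling step as described, assuming precomputed features: a small MLP adaptor maps frozen LLM caption embeddings into the CLIP vision space and is trained with the standard symmetric contrastive loss. Dimensions and the adaptor architecture are illustrative assumptions.

```python
# Sketch only: lightweight adaptor from frozen LLM caption features to the
# CLIP embedding space, trained with a symmetric InfoNCE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionAdaptor(nn.Module):
    def __init__(self, llm_dim=4096, clip_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(llm_dim, hidden), nn.GELU(), nn.Linear(hidden, clip_dim))

    def forward(self, llm_feats):                    # (B, llm_dim) pooled caption features from the frozen LLM
        return F.normalize(self.net(llm_feats), dim=-1)

def clip_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """img_feats: (B, D) from the frozen CLIP vision encoder; txt_feats: (B, D) from the adaptor."""
    img_feats = F.normalize(img_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```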
NeurIPS Conference 2025 Conference Paper
Visible-infrared person re-identification (VI-ReID) aims to match visible and infrared images of the same individual. Supervised VI-ReID (SVI-ReID) methods have achieved promising performance under the guidance of manually annotated identity labels. However, the substantial annotation cost severely limits their scalability in real-world applications. As a result, unsupervised VI-ReID (UVI-ReID) methods have attracted increasing attention. These methods typically rely on pseudo-labels generated by clustering and matching algorithms to replace manual annotations. Nevertheless, the quality of pseudo-labels is often difficult to guarantee, and low-quality pseudo-labels can significantly hinder model performance improvements. To address these challenges, we explore the use of attribute arrays extracted by a large vision-language model (LVLM) to enhance VI-ReID, and propose a novel LVLM-driven attribute-aware modeling (LVLM-AAM) approach. Specifically, we first design an attribute-aware reliable labeling strategy, which refines intra-modality clustering results based on image-level attributes and improves inter-modality matching by grouping clusters according to cluster-level attributes. Next, we develop an explicit-implicit attribute fusion module, which integrates explicit and implicit attributes to obtain more fine-grained identity-related text features. Finally, we introduce an attribute-aware contrastive learning module, which jointly leverages static and dynamic text features to promote modality-invariant feature learning. Extensive experiments conducted on VI-ReID datasets validate the effectiveness of the proposed LVLM-AAM and its individual components. LVLM-AAM not only significantly outperforms existing unsupervised methods but also surpasses several supervised methods.
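A toy sketch of the attribute-aware refinement idea (not the paper's algorithm): any cluster whose members disagree on a stable image-level attribute predicted by the LVLM is split, so each refined pseudo-label is attribute-consistent.

```python
# Illustrative sketch: split pseudo-label clusters that mix different values
# of a stable attribute (e.g., a gender attribute extracted by an LVLM).
import numpy as np

def refine_clusters_by_attribute(labels: np.ndarray, attribute: np.ndarray) -> np.ndarray:
    """labels: (N,) cluster ids from e.g. DBSCAN; attribute: (N,) discrete attribute value per image."""
    refined = labels.copy()
    next_id = labels.max() + 1
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        values = np.unique(attribute[idx])
        if len(values) <= 1:
            continue                                 # attribute-consistent cluster: keep as-is
        for v in values[1:]:                         # give each extra attribute value its own cluster id
            refined[idx[attribute[idx] == v]] = next_id
            next_id += 1
    return refined
```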
JBHI Journal 2025 Journal Article
Circular RNA (circRNA) is a class of noncoding RNA that is highly conserved and exhibits exceptional stability. Due to its function as a microRNA sponge, circRNA has gained significant attention as an essential biomarker and potential drug target in the pathogenesis of several cancers. Although many circRNAs have been identified to play a role in cancer resistance, traditional methods are time-consuming and expensive. In this context, computational methods offer a promising way to facilitate the discovery process. However, most existing prediction models focus on the association between circRNAs and drug resistance without considering the corresponding disease-related information in the circRNA-drug resistance association. Incorporating disease-related information into the prediction of circRNA-drug resistance associations could potentially improve the efficiency and speed of discovering and developing circRNA-targeting drugs. We propose a computational framework, named GraphCDD, for predicting the association between circRNA and drug resistance. Our model utilizes data from three sources, namely circRNA, disease, and drug, to construct three similarity networks that represent the features of circRNAs, diseases, and drugs, respectively. We utilize a multimodal graph neural network to acquire efficient representations of circRNAs, diseases, and drugs by integrating various types of information, and establish a predictive model. The experimental results validate the effectiveness of our model and provide a promising method for predicting potential associations between circRNA and drug resistance.
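A schematic sketch of the two basic building blocks implied by the description, under assumptions about the feature inputs: a cosine similarity network per entity type and one step of normalized graph propagation, the operation a multimodal GNN would stack.

```python
# Sketch only (not GraphCDD itself): build a top-k cosine similarity network
# and run one GCN-style propagation step over it.
import numpy as np

def similarity_network(features: np.ndarray, k: int = 10) -> np.ndarray:
    """features: (N, D) one row per circRNA / disease / drug; keep the top-k neighbours per node."""
    norm = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T
    np.fill_diagonal(sim, 0.0)
    keep = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    rows = np.repeat(np.arange(len(sim)), k)
    adj[rows, keep.ravel()] = sim[rows, keep.ravel()]
    return np.maximum(adj, adj.T)                    # symmetrize the sparsified network

def propagate(adj: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """One propagation step: D^-1/2 (A + I) D^-1/2 X."""
    a_hat = adj + np.eye(len(adj))
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return (a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]) @ feats
```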
NeurIPS Conference 2025 Conference Paper
Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks; during this phase, correct reasoning outputs are retained for rejection sampling to refine the model. (3) Finally, incorrectly predicted samples are used for Group Relative Policy Optimization (GRPO)-based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments confirm that incorporating long CoT reasoning significantly enhances the accuracy of reward signals. Notably, after mastering CoT reasoning, the model exhibits implicit reasoning capabilities, allowing it to surpass existing baselines even without explicit reasoning traces.
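For reference, a minimal sketch of the group-relative advantage at the heart of GRPO-style fine-tuning; the shapes and normalization details here are assumptions. These advantages then weight the policy-gradient objective, typically alongside a KL penalty toward a reference model.

```python
# Sketch: each sampled reasoning trace is scored relative to the mean and std
# of the other traces sampled for the same prompt (its "group").
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G, K) rewards for K sampled traces per prompt, over G prompts."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)            # positive => better than the group average
```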
JBHI Journal 2024 Journal Article
Accurate prediction of small molecule modulators targeting protein-protein interactions (PPIMs) remains a significant challenge in drug discovery. Existing machine learning-based models rely on manual feature engineering, which is tedious and task-specific. Recently, deep learning models based on graph neural networks have made remarkable progress in molecular representation learning. However, many graph-based approaches ignore molecular hierarchical structure modeling guided by domain knowledge. In chemistry, the functional groups of a molecule determine its interaction with specific targets. Therefore, we propose a hierarchical graph neural network framework (called HiGPPIM) for predicting PPIMs by integrating atom-level and functional group-level features of molecules. HiGPPIM constructs atom-level and functional group-level graphs based on chemical knowledge and learns graph representations using graph attention networks. Furthermore, a hypergraph attention network is designed in HiGPPIM to aggregate and transform two-level graph information. We evaluate the performance of HiGPPIM on eight PPI families and two prediction tasks, namely PPIM identification and potency prediction. Experimental results demonstrate that HiGPPIM achieves state-of-the-art performance on both tasks and that using functional group information to guide PPIM prediction is effective.
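A toy sketch of the hierarchical readout idea, with all names and shapes assumed for illustration: atom features are pooled into functional-group features through a membership matrix, the groups are attended over, and the two levels are fused into one molecule embedding.

```python
# Sketch only (not HiGPPIM): two-level readout combining atom-level and
# functional-group-level information for one molecule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelReadout(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)               # attention scores over functional groups
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, atom_feats, membership):
        """atom_feats: (A, dim); membership: (G, A) binary float matrix, group g contains atom a."""
        group_feats = (membership @ atom_feats) / membership.sum(dim=1, keepdim=True).clamp(min=1)
        attn = F.softmax(self.score(group_feats), dim=0)       # (G, 1) attention over groups
        group_readout = (attn * group_feats).sum(dim=0)        # (dim,)
        atom_readout = atom_feats.mean(dim=0)                  # (dim,)
        return self.out(torch.cat([atom_readout, group_readout], dim=-1))
```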
NeurIPS Conference 2024 Conference Paper
We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling. Existing radiance representations either require an implicit feature decoder, which significantly degrades the modeling power of the representation, or are spatially unstructured, making them difficult to integrate with mainstream 3D diffusion methods. We derive GaussianCube by first using a novel densification-constrained Gaussian fitting algorithm, which yields high-accuracy fitting using a fixed number of free Gaussians, and then rearranging these Gaussians into a predefined voxel grid via Optimal Transport. Since GaussianCube is a structured grid representation, it allows us to use a standard 3D U-Net as our backbone in diffusion modeling without elaborate designs. More importantly, the high-accuracy fitting of the Gaussians allows us to achieve a high-quality representation with one to two orders of magnitude fewer parameters than previous structured representations of comparable quality. The compactness of GaussianCube greatly eases the difficulty of 3D generative modeling. Extensive experiments conducted on unconditional and class-conditioned object generation, digital avatar creation, and text-to-3D synthesis all show that our model achieves state-of-the-art generation results both qualitatively and quantitatively, underscoring the potential of GaussianCube as a highly accurate and versatile radiance representation for 3D generative modeling.
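The organization step can be illustrated with a balanced assignment between Gaussian centers and voxel centers; this is a stand-in for the paper's Optimal Transport solver, and the grid size and coordinate conventions are assumptions.

```python
# Sketch: assign a fixed number of fitted Gaussians to voxel centers by solving
# a balanced assignment problem on pairwise squared distances.
import numpy as np
from scipy.optimize import linear_sum_assignment

def arrange_into_grid(means: np.ndarray, grid_res: int = 8) -> np.ndarray:
    """means: (N, 3) Gaussian centers in [-1, 1]^3 with N == grid_res**3."""
    axis = (np.arange(grid_res) + 0.5) / grid_res * 2.0 - 1.0
    centers = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    cost = ((means[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, N) transport cost
    gaussian_idx, voxel_idx = linear_sum_assignment(cost)              # one Gaussian per voxel
    order = np.empty(len(means), dtype=int)
    order[voxel_idx] = gaussian_idx
    return order                                     # order[v] = index of the Gaussian stored at voxel v
```

Exact assignment is cubic in the number of Gaussians, so larger grids would call for an approximate Optimal Transport solver instead.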
JBHI Journal 2024 Journal Article
Circular RNAs (circRNAs) exist in vivo and are a class of noncoding RNA molecules. They have a single-stranded, closed, annular structure. Many studies have shown that circRNAs and diseases are linked. Therefore, it is critical to build a reliable and accurate predictor to find circRNA-disease associations. In this paper, we present a meta-learning model named MAMLCDA to identify circRNA-disease associations, which is based on model-agnostic meta-learning (MAML) combined with CNN classification. Specifically, similarities between diseases and circRNAs are extracted and integrated to characterize their relationships, and k-means is used to cluster the majority samples and select a certain number of samples from each cluster so as to obtain the same number of negative samples as positive samples. To further reduce the feature dimension and save computation time, we apply probabilistic principal component analysis (PPCA) to compact the integrated circRNA and disease similarity network feature vectors. The feature vectors are then converted into images, casting prediction as a 2-way 1-shot image classification problem that is fed into the model with MAML as the meta-learner and a CNN as the base-learner. Comparison results of five-fold cross-validation on two benchmark datasets illustrate that MAMLCDA outperforms several state-of-the-art approaches, with best accuracies of 95.33% and 98%. Therefore, MAMLCDA can help to understand the pathogenesis of complex diseases at the circRNA level.
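A schematic sketch of the sampling and compaction stage under stated assumptions: k-means clusters the majority (negative) class, an equal number of negatives is drawn across clusters, and plain PCA stands in here for the probabilistic PCA used in the paper.

```python
# Sketch only: balance negatives via k-means undersampling, then compress features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def balance_and_compress(neg_feats, pos_feats, n_clusters=10, n_components=64, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(neg_feats)
    per_cluster = max(1, len(pos_feats) // n_clusters)
    picked = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        picked.append(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    neg_sample = neg_feats[np.concatenate(picked)[: len(pos_feats)]]   # roughly as many negatives as positives
    feats = np.vstack([pos_feats, neg_sample])
    return PCA(n_components=n_components).fit_transform(feats)         # PCA as a stand-in for PPCA
```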
NeurIPS Conference 2022 Conference Paper
One of the most overlooked challenges in dance generation is that auto-regressive frameworks are prone to freezing motions due to noise accumulation. In this paper, we present two modules that can be plugged into existing models to enable them to generate non-freezing, high-fidelity dances. Since high-dimensional motion data are easily swamped by noise, we propose to learn a low-dimensional manifold representation with an auto-encoder equipped with a bank of latent codes, which can be used to reduce the noise in the predicted motions, thus preventing freezing. We further extend the bank to provide explicit priors about future motions to disambiguate motion prediction, which helps the predictors generate motions with larger magnitude and higher fidelity than possible before. Extensive experiments on AIST++, a public large-scale 3D dance motion benchmark, demonstrate that our method notably outperforms the baselines in terms of quality, diversity, and time length.
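A toy sketch of the denoising idea, with the architecture and dimensions assumed: a predicted motion is encoded, snapped to its nearest entry in a learned bank of latent codes, and decoded, which filters accumulated noise. End-to-end training would additionally need a straight-through estimator for the hard lookup.

```python
# Sketch only: latent-code-bank denoising of a predicted motion frame.
import torch
import torch.nn as nn

class LatentBankDenoiser(nn.Module):
    def __init__(self, motion_dim=72, latent_dim=256, bank_size=512):   # placeholder dimensions
        super().__init__()
        self.encoder = nn.Linear(motion_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, motion_dim)
        self.bank = nn.Parameter(torch.randn(bank_size, latent_dim))    # learned bank of latent codes

    def forward(self, motion):                       # motion: (B, motion_dim) predicted pose features
        z = self.encoder(motion)
        dist = torch.cdist(z, self.bank)             # (B, bank_size) distances to every code
        nearest = self.bank[dist.argmin(dim=1)]      # snap each latent to its closest bank code
        return self.decoder(nearest)                 # decoded, denoised motion
```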
NeurIPS Conference 2021 Conference Paper
Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1&V2, Diving48, and FineGym.
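A heavily simplified 1-D sketch in the spirit of dynamic relational transforms, not the paper's RSA: for every time step, a kernel is generated from the local temporal context and then used to aggregate that same context.

```python
# Sketch: per-position dynamic kernel generated from a local window of features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLocalKernel(nn.Module):
    def __init__(self, dim=64, window=5):
        super().__init__()
        self.window = window
        self.kernel_gen = nn.Linear(window * dim, window)   # local context -> per-position kernel

    def forward(self, x):                            # x: (B, T, dim) frame features
        pad = self.window // 2
        ctx = F.pad(x, (0, 0, pad, pad)).unfold(1, self.window, 1)    # (B, T, dim, window)
        ctx = ctx.permute(0, 1, 3, 2)                                  # (B, T, window, dim)
        kernel = self.kernel_gen(ctx.flatten(2)).softmax(dim=-1)       # (B, T, window) dynamic kernel
        return torch.einsum("btw,btwd->btd", kernel, ctx)              # aggregate the same context
```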
AAAI Conference 2019 Conference Paper
Estimating 3D human poses from 2D joint positions is an ill-posed problem, and is further complicated by the fact that the estimated 2D joints usually have errors to which most 3D pose estimators are sensitive. In this work, we present an approach to refine inaccurate 3D pose estimates. The core idea of the approach is to learn a number of bases that tightly approximate the low-dimensional pose manifold, where a 3D pose is represented by a convex combination of the bases. First, the representation requires that, globally, the refined poses stay close to the pose manifold, thus avoiding illegitimate poses. Second, the designed bases guarantee that the distances among the body joints of a pose remain within reasonable ranges. Experiments on benchmark datasets show that our approach obtains more legitimate poses than the baselines. In particular, the limb lengths are closer to the ground truth.
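A sketch of the refinement idea under assumptions (the paper's solver may differ): a noisy 3D pose is expressed as a convex combination of learned basis poses by solving a small simplex-constrained least-squares problem.

```python
# Sketch: project a noisy pose onto the convex hull of learned basis poses.
import numpy as np
from scipy.optimize import minimize

def refine_pose(noisy_pose: np.ndarray, bases: np.ndarray) -> np.ndarray:
    """noisy_pose: (3*J,) flattened joint coordinates; bases: (K, 3*J) learned basis poses."""
    k = len(bases)
    objective = lambda w: np.sum((bases.T @ w - noisy_pose) ** 2)      # reconstruction error
    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)    # weights sum to one
    res = minimize(objective, x0=np.full(k, 1.0 / k), bounds=[(0.0, 1.0)] * k,
                   constraints=constraints, method="SLSQP")
    return bases.T @ res.x                           # refined pose stays on the basis simplex
```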
AAAI Conference 2017 Conference Paper
We address the task of action recognition from a sequence of 3D human poses. This is a challenging task, firstly because poses of the same class can have large intra-class variations caused either by inaccurate 3D pose estimation or by varying performing styles. Moreover, different actions, e.g., walking vs. jogging, may share similar poses, which makes the representation insufficiently discriminative to differentiate the actions. To solve these problems, we propose a novel representation for 3D poses based on a mixture of Discriminative Activated Simplices (DAS). Each DAS consists of a few bases and represents pose data by their convex combinations. The discriminative power of DAS is first realized by learning discriminative bases across classes with a block-diagonal constraint enforced on the basis coefficient matrix. Secondly, DAS provides a tight characterization of the pose manifolds, thus reducing the chance of generating overlapping DAS between similar classes. We validate the model on benchmark datasets and observe consistent performance improvements.
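Illustration only: the block-diagonal idea can be written as a penalty that suppresses coefficients linking samples of one class to bases assigned to other classes; the paper's exact formulation may differ.

```python
# Sketch: penalty on coefficients outside the class-aligned diagonal blocks.
import numpy as np

def block_diagonal_penalty(coeffs: np.ndarray, sample_labels: np.ndarray, basis_labels: np.ndarray) -> float:
    """coeffs: (K, N) basis coefficients; sample_labels: (N,); basis_labels: (K,) class of each basis."""
    off_block = basis_labels[:, None] != sample_labels[None, :]        # True where basis and sample classes differ
    return float(np.abs(coeffs[off_block]).sum())                      # add to the reconstruction loss
```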
AAAI Conference 2016 Conference Paper