Arrow Research search

Author name cluster

Ji Zhang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers
2 author rows

Possible papers (32)

AAAI Conference 2026 Conference Paper

Hierarchical Attention Network with Correction for Cross-Domain User Association

  • Wenlong Liu
  • Ze Wang
  • Chenlong Wu
  • Yude Bai
  • Ji Zhang

Despite the rich spatiotemporal patterns contained in trajectory data from multiple Location-Based Social Network (LBSN) platforms, heterogeneous formats, semantic inconsistencies, and unequal user scales across platforms create substantial barriers to reliable identity mapping. Furthermore, GPS drift and sparse sampling result in degraded data quality and distribution imbalance, which render existing trajectory representation methods inadequate for capturing high-order dependencies and dynamic spatiotemporal evolution patterns in heterogeneous multi-relational graphs. To this end, we propose HANCUA (Hierarchical Attention Network with Correction for User Association), a novel framework that employs a dual-stage correction mechanism to enhance cross-domain trajectory analysis. The approach constructs hierarchical multi-relational graphs comprising location, trajectory, and correction layers to capture fine-grained mobility patterns, behavioral associations, and inter-platform distribution differences. We design relation-aware multi-head graph attention networks to model complex interactions among heterogeneous node types, which enables comprehensive spatial relationship modeling. A spatiotemporal semantic collaborative learning module integrates temporal information with mobility patterns through interaction-aware attention mechanisms, while an ensemble correction decision module incorporates ensemble learning principles to systematically correct user association biases and address distribution imbalance problems. Extensive experiments on two real-world LBSN cross-domain datasets reveal that HANCUA significantly outperforms state-of-the-art methods in user identity linking accuracy.

AAAI Conference 2026 Conference Paper

ProFuser: Progressive Fusion of Large Language Models

  • Tianyuan Shi
  • Fanqi Wan
  • Canbin Huang
  • Xiaojun Quan
  • Chenliang Li
  • Ming Yan
  • Ji Zhang
  • Minhua Huang

While fusing the capacities and advantages of various large language models offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select advantageous models during training. Existing fusion methods primarily focus on the training mode, which uses cross entropy on ground truth in a teacher-forcing setup to measure a model's advantage; this may provide limited insight into model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser's effectiveness, we fused three models, namely Vicuna-7B-v1.5, Llama-2-7B-Chat, and MPT-7B-8K-Chat, and demonstrated improved performance in knowledge, reasoning, and safety compared to baseline methods.

ICML Conference 2025 Conference Paper

EGPlace: An Efficient Macro Placement Method via Evolutionary Search with Greedy Repositioning Guided Mutation

  • Ji Deng
  • Zhao Li
  • Ji Zhang
  • Jun Gao

Macro placement, which involves optimizing the positions of modules, is a critical phase in modern integrated circuit design and significantly influences chip performance. The growing complexity of integrated circuits demands increasingly sophisticated placement solutions. Existing approaches have evolved along two primary paths (i.e., constructive and adjustment methods), but they face significant practical limitations that affect real-world chip design. Recent hybrid frameworks such as WireMask-EA have attempted to combine these strategies, but significant technical barriers still remain, including the computational overhead from separated layout adjustment and reconstruction that often requires complete layout rebuilding, the inefficient exploration of design spaces due to random mutation operations, and the computational complexity of mask-based construction methods that limits scalability. To overcome these limitations, we introduce EGPlace, a novel evolutionary optimization framework that combines guided mutation strategies with efficient layout reconstruction. EGPlace introduces two key innovations: a greedy repositioning-guided mutation operator that systematically identifies and optimizes critical layout regions, and an efficient mask computation algorithm that accelerates layout evaluation. Our extensive evaluation using ISPD2005 and Ariane RISC-V CPU benchmarks demonstrates that EGPlace reduces wirelength by 10.8% and 9.3% compared to WireMask-EA and the state-of-the-art reinforcement learning-based constructive method EfficientPlace, respectively, while achieving speedups of 7.8× and 2.8× over these methods.
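
To make the search loop concrete, here is a minimal toy sketch of evolutionary placement with a greedy repositioning-style mutation, written against a one-dimensional slot model with a simple wirelength proxy. Everything here (the cost model, module names, and parameters) is invented for illustration; this is not the authors' EGPlace implementation, which works on 2-D layouts with mask-based evaluation.

```python
import random

def wirelength(order, nets):
    """Toy HPWL-style cost: span of each net over 1-D slot positions."""
    pos = {m: i for i, m in enumerate(order)}
    return sum(max(pos[m] for m in net) - min(pos[m] for m in net)
               for net in nets)

def guided_mutation(order, nets):
    """Greedy repositioning: try moving each module to every slot and
    keep the single move that most reduces the cost."""
    best, best_cost = list(order), wirelength(order, nets)
    for m in order:
        rest = [x for x in order if x != m]
        for i in range(len(order)):
            cand = rest[:i] + [m] + rest[i:]
            cost = wirelength(cand, nets)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best

def evolve(modules, nets, pop_size=8, generations=20, seed=0):
    rng = random.Random(seed)
    pop = [rng.sample(modules, len(modules)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: wirelength(o, nets))
        elites = pop[: pop_size // 2]          # keep the best half
        pop = elites + [guided_mutation(o, nets) for o in elites]
    return min(pop, key=lambda o: wirelength(o, nets))

modules = list("ABCDEF")
nets = [("A", "B", "C"), ("C", "D"), ("B", "E", "F")]
print(evolve(modules, nets))
```

The guided mutation replaces the random perturbation that a plain evolutionary search would use, which is the intuition behind the reported efficiency gains.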

IJCAI Conference 2025 Conference Paper

Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion

  • Wenying He
  • Jieling Huang
  • Junhua Gu
  • Ji Zhang
  • Yude Bai

Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments demonstrate that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in terms of imputation accuracy. The source code is publicly available at https://github.com/joyHJL/CoFILL.
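
For intuition on how a diffusion model can impute without error-prone prior estimates, the sketch below shows generic masking-based conditioning in a DDPM-style reverse process: observed entries are pinned to suitably noised ground truth at every step while missing entries are denoised from pure noise. This is an illustrative scheme only, not CoFILL's dual-stream architecture; `denoise_fn` stands in for a trained noise-prediction network.

```python
import numpy as np

def impute_reverse_diffusion(x_obs, mask, denoise_fn, betas, seed=0):
    """Masked reverse diffusion for imputation (illustrative sketch).
    x_obs: array holding ground truth where mask == 1.
    denoise_fn(x, t): trained network predicting the noise at step t."""
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(x_obs.shape)        # start from pure noise
    for t in reversed(range(len(betas))):
        eps_hat = denoise_fn(x, t)              # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat)
        mean /= np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise    # sample at noise level t-1
        # condition on observations: pin them to their noised true values
        if t > 0:
            eps = rng.standard_normal(x_obs.shape)
            x_obs_t = (np.sqrt(alpha_bar[t - 1]) * x_obs
                       + np.sqrt(1.0 - alpha_bar[t - 1]) * eps)
        else:
            x_obs_t = x_obs
        x = mask * x_obs_t + (1.0 - mask) * x
    return x
```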

ICML Conference 2025 Conference Paper

Score as Action: Fine Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

  • Hanyang Zhao
  • Haoxian Chen 0002
  • Ji Zhang
  • David D. Yao
  • Wenpin Tang

Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with an input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby connecting to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the design space of value networks by leveraging the structural properties of diffusion models. We validate the advantages of our method by experiments on downstream tasks of fine-tuning a large-scale Text2Image model, Stable Diffusion v1.5.

IROS Conference 2025 Conference Paper

SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models

  • Nader Zantout
  • Haochen Zhang
  • Pujith Kachana
  • Jinkai Qiu
  • Guofei Chen
  • Ji Zhang
  • Wenshan Wang

Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, the large number of fine-grained objects, and the complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and generalize zero-shot to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run in real time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation in previously unseen real-world environments. All source code for the system pipeline is publicly released.

NeurIPS Conference 2025 Conference Paper

VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought

  • Chaoya Jiang
  • Yongrui Heng
  • Wei Ye
  • Haiyang Xu
  • Ming Yan
  • Ji Zhang
  • Fei Huang
  • Shikun Zhang

Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R³ (Visual Language Model with Region Recognition, Reasoning, and Refinement), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R³ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.

NeurIPS Conference 2025 Conference Paper

WritingBench: A Comprehensive Benchmark for Generative Writing

  • Yuning Wu
  • Jiahao Mei
  • Ming Yan
  • Chenliang Li
  • Shaopeng Lai
  • Yuran Ren
  • Zijia Wang
  • Ji Zhang

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or are limited to narrow writing tasks, failing to capture the diverse requirements of high-quality written content across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations of style, format, and length. The framework's validity is further demonstrated by its data curation capability, which enables a 7B-parameter model to outperform GPT-4o in writing. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.

IJCAI Conference 2024 Conference Paper

Breaking Barriers of System Heterogeneity: Straggler-Tolerant Multimodal Federated Learning via Knowledge Distillation

  • Jinqian Chen
  • Haoyu Tang
  • Junhao Cheng
  • Ming Yan
  • Ji Zhang
  • Mingzhu Xu
  • Yupeng Hu
  • Liqiang Nie

Internet of Things (IoT) devices possess valuable yet private multimodal data, calling for a decentralized machine learning scheme. Though several multimodal federated learning (MFL) methods have been proposed, most of them overlook the system heterogeneity across IoT devices, limiting their adaptability to real-world applications. Aiming at this, we conduct theoretical analysis and exploration experiments on straggler impacts and find that stragglers caused by system heterogeneity are fatal to MFL, resulting in catastrophic time overhead. Motivated by this, we propose a novel Multimodal Federated Learning with Accelerated Knowledge Distillation (MFL-AKD) framework, which is the first attempt to integrate knowledge distillation to combat stragglers in complex multimodal federated scenarios. Concretely, given the pretrained large-scale vision-language models deployed in the central server, we apply a fast knowledge transfer mechanism to conduct early training of local models with part of the local data. The early-trained model is then enhanced through distillation from the pretrained large model and further trained on the remaining data. Extensive experiments on two datasets for video moment retrieval and two datasets for image-text retrieval demonstrate that our method achieves superior results with high straggler robustness.
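
The distillation step at the core of this pipeline can be pictured with a standard knowledge-distillation objective. The sketch below is the generic Hinton-style loss, not the paper's exact MFL-AKD formulation: the local (student) model is pulled toward the frozen teacher's temperature-softened outputs while still fitting local labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Generic KD loss: temperature-softened KL toward the teacher,
    blended with ordinary cross-entropy on the local labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)          # rescale gradients back
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```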

IJCAI Conference 2024 Conference Paper

From Skepticism to Acceptance: Simulating the Attitude Dynamics Toward Fake News

  • Yuhan Liu
  • Xiuying Chen
  • Xiaoqing Zhang
  • Xing Gao
  • Ji Zhang
  • Rui Yan

In the digital era, the rapid propagation of fake news and rumors via social networks brings notable societal challenges and impacts public opinion regulation. Traditional fake news modeling typically forecasts the general popularity trends of different groups or numerically represents opinion shifts. However, these methods often oversimplify real-world complexities and overlook the rich semantic information of news text. The advent of large language models (LLMs) provides the possibility of modeling subtle dynamics of opinion. Consequently, in this work, we introduce a Fake news Propagation Simulation framework (FPS) based on LLMs, which studies the trends and control of fake news propagation in detail. Specifically, each agent in the simulation represents an individual with a distinct personality. They are equipped with both short-term and long-term memory, as well as a reflective mechanism to mimic human-like thinking. Every day, they engage in random opinion exchanges, reflect on their thinking, and update their opinions. Our simulation results uncover patterns in fake news propagation related to topic relevance and individual traits, aligning with real-world observations. Additionally, we evaluate various intervention strategies and demonstrate that early and appropriately frequent interventions strike a balance between governance cost and effectiveness, offering valuable insights for practical applications. Our study underscores the significant utility and potential of LLMs in combating fake news.

NeurIPS Conference 2024 Conference Paper

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

  • Chaoya Jiang
  • Hongrui Jia
  • Haiyang Xu
  • Wei Ye
  • Mengfan Dong
  • Ming Yan
  • Ji Zhang
  • Fei Huang

This paper presents MaVEn, an innovative Multi-granularity Visual Encoding framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning. Current MLLMs primarily focus on single-image visual understanding, limiting their ability to interpret and integrate information across multiple images. MaVEn addresses this limitation by combining discrete visual symbol sequences, which abstract coarse-grained semantic concepts, with traditional continuous representation sequences that model fine-grained features. This dual approach bridges the semantic gap between visual and textual data, thereby improving the model's ability to process and interpret information from multiple images effectively. Additionally, we design a dynamic reduction mechanism for long-sequence continuous features to enhance multi-image processing efficiency. Experimental results demonstrate that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.

NeurIPS Conference 2024 Conference Paper

Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

  • Junyang Wang
  • Haiyang Xu
  • Haitao Jia
  • Xi Zhang
  • Ming Yan
  • Weizhou Shen
  • Ji Zhang
  • Fei Huang

Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are difficult to solve effectively under the single-agent architecture of existing work. This is due to the overly long token sequences and the interleaved text-image data format, which limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent condenses lengthy, interleaved image-text operation histories and screen summaries into a pure-text task progress, which is then passed on to the decision agent. This reduction in context length makes it easier for the decision agent to navigate the task progress. To retain focus content, we design a memory unit that the decision agent updates with task progress. Additionally, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.

AAAI Conference 2024 Conference Paper

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

  • Chaoya Jiang
  • Wei Ye
  • Haiyang Xu
  • Qinghao Ye
  • Ming Yan
  • Ji Zhang
  • Shikun Zhang

Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic modalities. Due to noise in web-harvested text-image pairs, however, scaling up training data volume in SMCL presents considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix achieves comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader VLP model adoption in practical scenarios. Our code is available at https://github.com/chaoyajiang/TiMiX/tree/main.
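
For reference, the mix-based augmentation that TiMix builds on has the following standard form. This is plain mixup over an image batch; TiMix's text-aware weighting of the mixed regions is not reproduced here.

```python
import torch

def mixup_images(images, alpha=0.4):
    """Plain mixup: convex combination of a batch with a shuffled copy.
    Returns the mixed batch, the permutation, and the mixing weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, perm, lam
```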

IJCAI Conference 2023 Conference Paper

ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds

  • Xiangze Jia
  • Hui Zhou
  • Xinge Zhu
  • Yandong Guo
  • Ji Zhang
  • Yuexin Ma

In this paper, we propose a novel self-supervised motion estimator for LiDAR-based autonomous driving via BEV representation. Different from the usually adopted self-supervised strategies for data-level structure consistency, we predict scene motion via feature-level consistency between pillars in consecutive frames, which can eliminate the effect caused by noise points and view-changing point clouds in dynamic scenes. Specifically, we propose a Soft Discriminative Loss that provides the network with more pseudo-supervised signals to learn discriminative and robust features in a contrastive learning manner. We also propose a Gated Multi-Frame Fusion block that automatically learns valid compensation between point cloud frames to enhance feature extraction. Finally, pillar association is proposed to predict pillar correspondence probabilities based on feature distance, from which scene motion is further predicted. Extensive experiments show the effectiveness and superiority of our ContrastMotion on both scene flow and motion prediction tasks.

IS Journal 2023 Journal Article

Irregularly Sampled Multivariate Time Series Classification: A Graph Learning Approach

  • Zhen Wang
  • Ting Jiang
  • Zenghui Xu
  • Ji Zhang
  • Jianliang Gao

To date, graph-based learning methods have proven effective for modeling spatial and structural dependencies. However, when applied to irregularly sampled multivariate time series (IS-MTS), they encounter three major challenges due to the complex data characteristics of IS-MTS: 1) variable time intervals between observations; 2) asynchronous time points across dimensions; and 3) a lack of prior knowledge of connectivity structure for message propagation. To fill these gaps, we propose a multivariate temporal graph network to coherently capture structural interactions, learn temporal dependencies, and handle the challenging characteristics of IS-MTS data. Specifically, we first build a multivariate interaction module to handle frequent missing values and extract the graph structure relation automatically. Second, we design a novel adjacent graph propagation mechanism to aggregate neighbor information from multistep snapshots. Third, we construct a masked temporal-aware attention module to explicitly consider the timestamp context and interval irregularity. Based on an extensive experimental evaluation, we demonstrate the superior performance of the proposed method.

ICLR Conference 2023 Conference Paper

Self-Supervised Category-Level Articulated Object Pose Estimation with Part-Level SE(3) Equivariance

  • Xueyi Liu
  • Ji Zhang
  • Ruizhen Hu
  • Haibin Huang
  • He Wang 0010
  • Li Yi 0001

Category-level articulated object pose estimation aims to estimate a hierarchy of articulation-aware object poses of an unseen articulated object from a known category. To reduce the heavy annotations needed for supervised learning methods, we present a novel self-supervised strategy that solves this problem without any human labels. Our key idea is to factorize canonical shapes and articulated object poses from input articulated shapes through part-level equivariant shape analysis. Specifically, we first introduce the concept of part-level SE(3) equivariance and devise a network to learn features with such a property. Then, through a carefully designed fine-grained pose-shape disentanglement strategy, we expect canonical spaces that support pose estimation to be induced automatically. Thus, we can further predict articulated object poses as per-part rigid transformations describing how parts transform from their canonical part spaces to the camera space. Extensive experiments demonstrate the effectiveness of our method on both complete and partial point clouds from synthetic and real articulated object datasets.

IJCAI Conference 2022 Conference Paper

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning

  • Qianglong Chen
  • Feng-Lin Li
  • Guohai Xu
  • Ming Yan
  • Ji Zhang
  • Yin Zhang

Although pre-trained language models (PLMs) have achieved state-of-the-art performance on various natural language processing (NLP) tasks, they are shown to be lacking in knowledge when dealing with knowledge-driven tasks. Despite the many efforts made to inject knowledge into PLMs, this problem remains open. To address the challenge, we propose DictBERT, a novel approach that enhances PLMs with dictionary knowledge, which is easier to acquire than a knowledge graph (KG). During pre-training, we present two novel pre-training tasks to inject dictionary knowledge into PLMs via contrastive learning: dictionary entry prediction and entry description discrimination. In fine-tuning, we use the pre-trained DictBERT as a plugin knowledge base (KB) to retrieve implicit knowledge for identified entries in an input sequence, and infuse the retrieved knowledge into the input to enhance its representation via a novel extra-hop attention mechanism. We evaluate our approach on a variety of knowledge-driven and language understanding tasks, including NER, relation extraction, CommonsenseQA, OpenBookQA and GLUE. Experimental results demonstrate that our model can significantly improve typical PLMs: it gains substantial improvements of 0.5%, 2.9%, 9.0%, 7.1% and 3.3% on BERT-large respectively, and is also effective on RoBERTa-large.

AAAI Conference 2022 Conference Paper

Logit Perturbation

  • Mengyang Li
  • Fengguang Su
  • Ou Wu
  • Ji Zhang

Features, logits, and labels are the three primary kinds of data a sample yields as it passes through a deep neural network. Feature perturbation and label perturbation have received increasing attention in recent years. They have been proven to be useful in various deep learning approaches. For example, (adversarial) feature perturbation can improve the robustness or even generalization capability of learned models. However, few studies have explicitly explored the perturbation of logit vectors. This work discusses several existing methods related to logit perturbation. Based on a unified viewpoint between positive/negative data augmentation and the loss variations incurred by logit perturbation, a new method is proposed to explicitly learn to perturb logits. A comparative analysis is conducted for the perturbations used in our and existing methods. Extensive experiments on benchmark image classification data sets and their long-tail versions indicate the competitive performance of our learning method. In addition, existing methods can be further improved by utilizing our method.
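
To make the idea concrete, here is a minimal sketch of one common flavor of logit perturbation: an adversarial-style, loss-increasing perturbation computed on a detached copy of the logits. The paper's method learns the perturbation instead of taking a fixed gradient step, so treat this only as an illustration of the general mechanism.

```python
import torch
import torch.nn.functional as F

def logit_perturbation_loss(logits, labels, eps=0.1):
    """Perturb logits along the loss-ascent direction, then train
    against the perturbed (harder) logits; eps sets the step size."""
    z = logits.detach().clone().requires_grad_(True)
    F.cross_entropy(z, labels).backward()        # gradient w.r.t. logits
    delta = eps * z.grad / (z.grad.norm(dim=-1, keepdim=True) + 1e-12)
    return F.cross_entropy(logits + delta, labels)
```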

IJCAI Conference 2021 Conference Paper

AdaVQA: Overcoming Language Priors with Adapted Margin Cosine Loss

  • Yangyang Guo
  • Liqiang Nie
  • Zhiyong Cheng
  • Feng Ji
  • Ji Zhang
  • Alberto Del Bimbo

A number of studies point out that current Visual Question Answering (VQA) models are severely affected by the language prior problem, which refers to blindly making predictions based on language shortcuts. Some efforts have been devoted to overcoming this issue with delicate models. However, there is no research addressing it from the view of answer feature space learning, despite the fact that existing VQA methods all cast VQA as a classification task. Inspired by this, in this work, we attempt to tackle the language prior problem from the viewpoint of feature space learning. An adapted margin cosine loss is designed to properly discriminate between the frequent and the sparse answer feature spaces under each question type. In this way, the limited patterns within the language modality can be largely reduced to eliminate the language priors. We apply this loss function to several baseline models and evaluate its effectiveness on two VQA-CP benchmarks. Experimental results demonstrate that our proposed adapted margin cosine loss can enhance the baseline models with an absolute performance gain of 15% on average, strongly verifying the potential of tackling the language prior problem in VQA from the angle of answer feature space learning.
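
The base form of the loss being adapted can be sketched as follows: a generic CosFace-style large-margin cosine loss over normalized features and class (answer) vectors. AdaVQA additionally adapts the margin per question type and answer frequency, which this sketch does not show.

```python
import torch
import torch.nn.functional as F

def margin_cosine_loss(features, class_weight, labels, s=30.0, m=0.35):
    """Large-margin cosine loss: cosine similarity between normalized
    features and class vectors, with a margin m subtracted from the
    target class before scaling by s."""
    cos = F.normalize(features) @ F.normalize(class_weight).t()
    onehot = F.one_hot(labels, num_classes=cos.size(1)).float()
    return F.cross_entropy(s * (cos - m * onehot), labels)
```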

IJCAI Conference 2021 Conference Paper

MDNN: A Multimodal Deep Neural Network for Predicting Drug-Drug Interaction Events

  • Tengfei Lyu
  • Jianliang Gao
  • Ling Tian
  • Zhao Li
  • Peng Zhang
  • Ji Zhang

The interaction of multiple drugs can lead to serious events, which cause injuries and huge medical costs. Accurate prediction of drug-drug interaction (DDI) events can help clinicians make effective decisions and establish appropriate therapy programs. Recently, many AI-based techniques have been proposed for predicting DDI-associated events. However, most existing methods pay less attention to the potential correlations between DDI events and other multimodal data such as targets and enzymes. To address this problem, we propose a Multimodal Deep Neural Network (MDNN) for DDI event prediction. In MDNN, we design a two-pathway framework, including a drug knowledge graph (DKG) based pathway and a heterogeneous feature (HF) based pathway, to obtain drug multimodal representations. Finally, a multimodal fusion neural layer is designed to explore the complementarity among the drug multimodal representations. We conduct extensive experiments on a real-world dataset. The results show that MDNN can accurately predict DDI events and outperform the state-of-the-art models.

TIST Journal 2021 Journal Article

TARA-Net: A Fusion Network for Detecting Takeaway Rider Accidents

  • Yifan He
  • Zhao Li
  • Lei Fu
  • Anhui Wang
  • Peng Zhang
  • Shuigeng Zhou
  • Ji Zhang
  • Ting Yu

In the emerging business of food delivery, rider traffic accidents raise financial costs and add to the social traffic burden. Although there has been much effort on traffic accident forecasting using temporal-spatial prediction models, none of the existing work studies the problem of detecting takeaway rider accidents from food delivery trajectory data. In this article, we aim to detect whether a takeaway rider has an accident in a certain time period based on trajectories of food delivery and riders' contextual information. The food delivery data has a heterogeneous information structure and carries contextual information such as weather and delivery history, and trajectory data are collected as a spatial-temporal sequence. We propose a TakeAway Rider Accident detection fusion network, TARA-Net, to jointly model these heterogeneous and spatial-temporal sequence data. We utilize a residual network to extract basic contextual information features and take advantage of a transformer encoder to capture trajectory features. These embedding features are concatenated into a pyramidal feed-forward neural network. We jointly train the above three components to combine the benefits of spatial-temporal trajectory data and sparse basic contextual data for early detection of traffic accidents. Furthermore, although traffic accidents rarely happen in food delivery, we propose a sampling mechanism to alleviate the imbalance of samples when training the model. We evaluate the model on the transportation mode classification dataset Geolife and a real-world Ele.me dataset with over 3 million riders. The experimental results show that the proposed model is superior to the state-of-the-art.

AAAI Conference 2021 Conference Paper

Testing Independence Between Linear Combinations for Causal Discovery

  • Hao Zhang
  • Kun Zhang
  • Shuigeng Zhou
  • Jihong Guan
  • Ji Zhang

Recently, regression-based conditional independence (CI) tests have been employed to solve the problem of causal discovery. These methods provide an alternative way to test for CI by transforming CI into independence between residuals. Generally, it is nontrivial to check for independence when these residuals are linearly uncorrelated. With the ability to represent high-order moments, kernel-based methods are usually used to achieve this goal, but at a cost of considerable time. In this paper, we investigate the independence between two linear combinations under a linear non-Gaussian structural equation model (SEM). We show that, generally, the 1st to 4th moments of the two linear combinations contain enough information to infer whether or not they are independent. The proposed method provides a simpler but more effective way to measure CIs, requiring only the 1st to 4th moments of the input variables. When applied to causal discovery, the proposed method outperforms kernel-based methods in terms of both speed and accuracy, as validated by extensive experiments.
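
A toy version of the idea checks whether low-order joint moments of two standardized samples factorize the way independence would require. This is illustrative only; it is not the paper's exact statistic or its accompanying theory.

```python
import numpy as np

def cross_moment_statistic(u, v):
    """Sum of squared deviations of joint moments (up to order 4)
    from the products expected under independence; u, v are 1-D
    samples, standardized before the moments are taken."""
    u = (u - u.mean()) / u.std()
    v = (v - v.mean()) / v.std()
    deviations = [
        np.mean(u * v),                    # E[uv] (means are zero)
        np.mean(u**2 * v),                 # E[u^2 v] - E[u^2]E[v]
        np.mean(u * v**2),                 # E[u v^2] - E[u]E[v^2]
        np.mean(u**2 * v**2) - np.mean(u**2) * np.mean(v**2),
    ]
    return float(np.sum(np.square(deviations)))

# Dependent inputs should score higher than independent ones:
rng = np.random.default_rng(0)
a, b = rng.standard_normal(10_000), rng.standard_normal(10_000)
print(cross_moment_statistic(a, b))        # near 0 (independent)
print(cross_moment_statistic(a, a * b))    # clearly larger (dependent)
```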

AAAI Conference 2019 Conference Paper

A Deep Cascade Model for Multi-Document Reading Comprehension

  • Ming Yan
  • Jiangnan Xia
  • Chen Wu
  • Bin Bi
  • Zhongzhou Zhao
  • Ji Zhang
  • Luo Si
  • Rui Wang

A fundamental trade-off between effectiveness and efficiency needs to be balanced when designing an online question answering system. Effectiveness comes from sophisticated functions such as extractive machine reading comprehension (MRC), while efficiency is obtained from improvements in preliminary retrieval components such as candidate document selection and paragraph ranking. Given the complexity of the real-world multi-document MRC scenario, it is difficult to jointly optimize both in an end-to-end system. To address this problem, we develop a novel deep cascade learning model, which progressively evolves from the document-level and paragraph-level ranking of candidate texts to more precise answer extraction with machine reading comprehension. Specifically, irrelevant documents and paragraphs are first filtered out with simple functions for efficiency consideration. Then we jointly train three modules on the remaining texts for better tracking the answer: document extraction, paragraph extraction and answer extraction. Experimental results show that the proposed method outperforms the previous state-of-the-art methods on two large-scale multi-document benchmark datasets, i.e., TriviaQA and DuReader. In addition, our online system can stably serve typical scenarios with millions of daily requests in less than 50 ms.

AAAI Conference 2019 Conference Paper

Large-Scale Visual Relationship Understanding

  • Ji Zhang
  • Yannis Kalantidis
  • Marcus Rohrbach
  • Manohar Paluri
  • Ahmed Elgammal
  • Mohamed Elhoseiny

Large-scale visual understanding is challenging, as it requires a model to handle the widely spread and imbalanced distribution of ⟨subject, relation, object⟩ triples. In real-world scenarios with large numbers of objects and relations, some are seen very commonly while others are barely seen. We develop a new relationship detection model that embeds objects and relations into two vector spaces where both discriminative capability and semantic affinity are preserved. We learn a visual and a semantic module that map features from the two modalities into a shared space, where matched pairs of features have to discriminate against those unmatched, but also maintain close distances to semantically similar ones. Benefiting from that, our model can achieve superior performance even when the visual entity categories scale up to more than 80,000, with an extremely skewed class distribution. We demonstrate the efficacy of our model on a large and imbalanced benchmark based on Visual Genome that comprises 53,000+ objects and 29,000+ relations, a scale at which no previous work has been evaluated. We show the superiority of our model over competitive baselines on the original Visual Genome dataset with 80,000+ categories. We also show state-of-the-art performance on the VRD dataset and the scene graph dataset, a subset of Visual Genome with 200 categories.
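
The shared-space objective described here can be pictured with a standard bidirectional hard-negative triplet loss over a batch of matched visual/semantic pairs. This is a generic VSE-style sketch under the assumption of pre-normalized embeddings, not the paper's exact formulation.

```python
import torch

def joint_embedding_loss(vis, sem, margin=0.2):
    """Bidirectional hinge loss with hardest in-batch negatives.
    vis, sem: (B, D) L2-normalized embeddings; row i of each side
    forms a matched (visual, semantic) pair."""
    sim = vis @ sem.t()                   # all pairwise similarities
    pos = sim.diag()                      # matched-pair similarities
    cost_v2s = (margin + sim - pos.unsqueeze(1)).clamp(min=0)
    cost_s2v = (margin + sim - pos.unsqueeze(0)).clamp(min=0)
    cost_v2s.fill_diagonal_(0)            # exclude the positives
    cost_s2v.fill_diagonal_(0)
    return (cost_v2s.max(dim=1).values.mean()
            + cost_s2v.max(dim=0).values.mean())
```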

AAAI Conference 2010 Conference Paper

Error Aware Monocular Visual Odometry using Vertical Line Pairs for Small Robots in Urban Areas

  • Ji Zhang
  • Dezhen Song

We report a new error-aware monocular visual odometry method that only uses vertical lines, such as vertical edges of buildings and poles in urban areas, as landmarks. Since vertical lines are easy to extract, insensitive to lighting conditions/shadows, and sensitive to robot movements on the ground plane, they are robust features compared with regular point features or line features. We derive a recursive visual odometry method based on vertical line pairs. We analyze how errors are propagated and introduced in the continuous odometry process by deriving the closed-form representation of the covariance matrix. We formulate the minimum variance ego-motion estimation problem and present a method that outputs weights for different vertical line pairs. The resulting visual odometry method is tested in physical experiments and compared with two existing methods that are based on point features and line features, respectively. The experimental results show that our method outperforms its two counterparts in robustness, accuracy, and speed. The relative errors of our method are less than 2% in experiments.
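
The minimum-variance weighting step can be illustrated in isolation. Under the simplifying assumption of independent scalar per-pair estimates (the paper derives full closed-form covariances), the optimal weights are proportional to inverse variance:

```python
import numpy as np

def fuse_min_variance(estimates, variances):
    """Inverse-variance weighting of independent scalar estimates;
    returns the fused estimate and its (reduced) variance."""
    var = np.asarray(variances, dtype=float)
    weights = (1.0 / var) / np.sum(1.0 / var)
    fused = float(weights @ np.asarray(estimates, dtype=float))
    fused_var = 1.0 / np.sum(1.0 / var)
    return fused, fused_var

# Hypothetical heading-change estimates (radians) from three line pairs
print(fuse_min_variance([0.031, 0.028, 0.040], [1e-4, 4e-4, 9e-4]))
```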

ICRA Conference 2009 Conference Paper

Effect of energy feedbacks on Virtual Slope Walking: I. Complementary Energy Feedback

  • Ji Zhang
  • Mingguo Zhao
  • Hao Dong 0001

This paper presents our study of the effect of complementary energy feedback on Virtual Slope Walking, our new biped gait generation method inspired by passive dynamic walking. The energy feedback strength is defined and the walking is modeled as a step-to-step function. The eigenvalues of the function's Jacobian matrix are calculated together with the basin of attraction. From the analysis, we find that complementary energy feedback is effective for fast gaits but weak for slow ones. By making use of complementary energy feedback in walking experiments, our robot achieves a speed change from 1.5 leg/s to 4.1 leg/s.
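
The stability analysis mentioned here, eigenvalues of the step-to-step map's Jacobian at a fixed point, can be reproduced numerically for any such map. A finite-difference sketch, assuming the map and its fixed point are given:

```python
import numpy as np

def return_map_eigenvalues(step_map, x_star, h=1e-6):
    """Finite-difference Jacobian of a step-to-step map at the fixed
    point x_star; eigenvalue magnitudes < 1 indicate a stable gait."""
    x_star = np.asarray(x_star, dtype=float)
    fx = step_map(x_star)
    J = np.zeros((len(x_star), len(x_star)))
    for i in range(len(x_star)):
        xp = x_star.copy()
        xp[i] += h                          # perturb one state variable
        J[:, i] = (step_map(xp) - fx) / h
    return np.linalg.eigvals(J)
```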

IROS Conference 2009 Conference Paper

On the error analysis of vertical line pair-based monocular visual odometry in urban area

  • Ji Zhang
  • Dezhen Song

When a robot travels in an urban area, Global Positioning System (GPS) signals might be obstructed by buildings, so visual odometry is an alternative. We notice that the vertical edges of tall buildings and the poles of street lights form a very stable set of features that can be easily extracted. Thus, we develop a monocular vision-based odometry system that utilizes the vertical edges in the scene to estimate the robot's ego-motion. Since it only takes a single vertical line pair to estimate the robot's ego-motion on the road plane, we model the ego-motion estimation process and analyze how the choice of vertical line pair impacts the accuracy of the estimation. The resulting closed-form error model can assist in choosing an appropriate pair of vertical lines to reduce the error in computation. We have implemented the proposed method and validated the error analysis results in physical experiments.

IROS Conference 2006 Conference Paper

Gait Planning Of Quadruped Robot Based On Third-Order Spline Interpolation

  • Hao Dong 0001
  • Mingguo Zhao
  • Ji Zhang
  • Zongying Shi
  • Naiyao Zhang

This paper presents a brief description of the gait planning of the quadruped robot Aibo ERS-7, a standard platform in the RoboCup 4-legged league. We adopt a spline-shaped locus to reduce the dimension of the parameter optimization space and solve the problem of the significant bias between the planned locus and the real one. The results show that the spline-shaped locus is effective in finding the optimized locus shape in a short time. Finally, the robot achieves a gait faster than any previously known learned gait for Aibo.
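
Third-order (cubic) spline interpolation of a foot locus can be sketched with a handful of via-points. The numbers below are invented for illustration and are not Aibo's actual gait parameters.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical swing-phase via-points: normalized phase -> (x, z) in mm
phase_pts = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
x_pts = np.array([-30.0, -15.0, 0.0, 15.0, 30.0])   # forward travel
z_pts = np.array([0.0, 12.0, 16.0, 12.0, 0.0])      # foot clearance

fx, fz = CubicSpline(phase_pts, x_pts), CubicSpline(phase_pts, z_pts)
phase = np.linspace(0.0, 1.0, 50)
locus = np.stack([fx(phase), fz(phase)], axis=1)    # sampled foot path
print(locus[:3])
```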