Arrow Research search

Author name cluster

Hao Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

45 papers
2 author rows

Possible papers

45

AAAI Conference 2026 Conference Paper

Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

  • Hao Li
  • Haotian Chen
  • Ruoyuan Gong
  • Juanjuan Wang
  • Hao Jiang

Redistricting plays a central role in shaping how votes are translated into political power. While existing computational methods primarily aim to generate large ensembles of legally valid districting plans, they often neglect the strategic dynamics involved in the selection process. This oversight creates opportunities for partisan actors to cherry-pick maps that, while technically compliant, are politically advantageous. Simply satisfying formal constraints does not ensure fairness when the selection process itself can be manipulated. We propose Agentmandering, a framework that reimagines redistricting as a turn-based negotiation between two agents representing opposing political interests. Drawing inspiration from game-theoretic ideas, particularly the Choose-and-Freeze protocol, our method embeds strategic interaction into the redistricting process via large language model (LLM) agents. Agents alternate between selecting and freezing districts from a small set of candidate maps, gradually partitioning the state through constrained and interpretable choices. Evaluation on post-2020 U.S. Census data across all states shows that Agentmandering significantly reduces partisan bias and unfairness, while achieving 2 to 3 orders of magnitude lower variance than standard baselines. These results demonstrate both fairness and stability, especially in swing-state scenarios.
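The turn-based Choose-and-Freeze idea described in the abstract can be illustrated with a toy sketch (the lean scores, turn order, and greedy selection rule here are illustrative assumptions, not the paper's implementation, which uses LLM agents over candidate maps):

```python
def choose_and_freeze(districts):
    """Toy Choose-and-Freeze negotiation.

    `districts` maps district id -> partisan lean in [-1, 1]
    (negative favours party A, positive favours party B).
    Parties alternate turns; the active party freezes the
    still-open district most favourable to itself.
    """
    open_ids = list(districts)
    frozen = []                      # (district_id, frozen_by) pairs
    party = "A"
    while open_ids:
        # Party A prefers the most negative lean, party B the most positive.
        key = (lambda d: districts[d]) if party == "A" else (lambda d: -districts[d])
        pick = min(open_ids, key=key)
        open_ids.remove(pick)
        frozen.append((pick, party))
        party = "B" if party == "A" else "A"
    return frozen
```

Because each side locks in its best remaining district in turn, neither party can unilaterally cherry-pick the whole map, which is the fairness intuition the framework builds on.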

AAAI Conference 2026 Conference Paper

CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

  • Guanghao Zhang
  • Tao Zhong
  • Yan Xia
  • Mushui Liu
  • Zhelun Yu
  • Haoyuan Li
  • Wanggui He
  • Dong She

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. Humans, by contrast, typically perform two complementary cognitive operations when engaging in sophisticated multi-image analysis: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) The construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model’s reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model.
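The test-time memory idea in the abstract can be sketched as a small store of per-step visual tokens with similarity-based retrieval (the dot-product similarity, capacity, and eviction policy are illustrative assumptions, not the paper's module):

```python
class ReasoningMemory:
    """Toy test-time memory: store the critical visual-region tokens
    produced at each reasoning step and retrieve the most similar
    ones for the next step."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = []                  # (tag, vector) pairs

    def write(self, tag, vec):
        """Append a token; evict the oldest entries beyond capacity."""
        self.slots.append((tag, vec))
        self.slots = self.slots[-self.capacity:]

    def read(self, query, k=2):
        """Return the tags of the k stored vectors most similar to
        the query, by (unnormalised) dot product."""
        def sim(v):
            return sum(a * b for a, b in zip(query, v))
        ranked = sorted(self.slots, key=lambda s: -sim(s[1]))
        return [tag for tag, _ in ranked[:k]]
```

Fixed capacity is what keeps such a module parameter-efficient at inference time: memory grows with reasoning steps, not with model size.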

AAAI Conference 2026 Conference Paper

FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation

  • Peng Zhang
  • Wanggui He
  • Mushui Liu
  • Wenyi Xiao
  • Siyu Zou
  • Yuan Li
  • Xingjian Wang
  • Guanghao Zhang

Recent unified models have demonstrated that the reasoning capacity of Multimodal Large Language Models (MLLMs) can be leveraged to facilitate diffusion-based image generation with impressive flexibility and performance. However, approaches that rely heavily on MLLMs for high-level semantic encoding often struggle with fine-grained visual tasks like image editing and virtual try-on. To address this gap, we propose FUSE, a unified framework excelling at both high-level vision–language understanding and fine-grained generation. First, we introduce a Semantic-to-Detail Connector that pre-aligns fine-grained visual features with the MLLM's semantic space. This design counteracts the low-level information loss inherent in MLLM encodings, creating a unified representation that steers the diffusion process with both global semantics and rich local details. Second, to further enhance semantic awareness and detail preservation, we introduce Adaptive-GRPO, a post-training objective that dynamically balances semantic coherence against pixel-level fidelity. The integration of these two innovations allows FUSE to generate images that are both semantically faithful and visually fine-grained. Comprehensive experiments on text-to-image and instruction-guided editing benchmarks show that FUSE significantly outperforms existing unified baselines, achieving 0.89 on Geneval, 0.65 on WISE, and 3.88 on ImageEdit.

AAAI Conference 2026 Conference Paper

LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation

  • Hao Jiang
  • Guoquan Wang
  • Donglin Zhou
  • Sheng Yu
  • Yang Zeng
  • Wencong Zeng
  • Kun Gai
  • Guorui Zhou

Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses the pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM’s geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
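The primary/residual split in the hierarchical tokenization can be illustrated with a toy two-stage tokenizer (grid snapping and nearest-codebook lookup are illustrative stand-ins; the cell size and codebook are assumptions, not the paper's learned quantizers):

```python
def geo_primary_token(lat, lon, cell=0.01):
    """Toy primary geographic token: snap coordinates to a grid cell,
    a stand-in for the discrete spatial attributes in the abstract."""
    return (round(lat / cell), round(lon / cell))

def residual_token(vec, codebook):
    """Toy residual token: index of the nearest codebook vector,
    a stand-in for refining tokens with the aligned LLM's
    geographic representation vectors."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(vec, codebook[i]))
```

An item's full semantic ID would then concatenate the coarse spatial token with one or more residual tokens, so that nearby items share prefixes while fine distinctions live in the residuals.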

AAAI Conference 2026 Conference Paper

MAU-GPT: Enhancing Multi-type Industrial Anomaly Understanding via Anomaly-aware and Generalist Experts Adaptation

  • Zhuonan Wang
  • Zhenxuan Fan
  • Siwen Tan
  • Yu Zhong
  • Yuqian Yuan
  • Haoyuan Li
  • Hao Jiang
  • Wenqiao Zhang

As industrial manufacturing scales, automating fine-grained product image analysis has become critical for quality control. However, existing approaches are hindered by limited dataset coverage and poor model generalization across diverse and complex anomaly patterns. To address these challenges, we introduce MAU-Set, a comprehensive dataset for Multi-type industrial Anomaly Understanding. It spans multiple industrial domains and features a hierarchical task structure, ranging from binary classification to complex reasoning. Alongside this dataset, we establish a rigorous evaluation protocol to facilitate fair and comprehensive model assessment. Building upon this foundation, we further present MAU-GPT, a domain-adapted multimodal large model specifically designed for industrial anomaly understanding. It incorporates a novel AMoE-LoRA mechanism that unifies anomaly-aware and generalist experts adaptation, enhancing both understanding and reasoning across diverse defect classes. Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains, demonstrating strong potential for scalable and automated industrial inspection.

AAAI Conference 2026 Conference Paper

Optimizing Ride-Pooling Operations with Extended Pickup and Drop-Off Flexibility

  • Hao Jiang
  • Yixin Xu
  • Pradeep Varakantham

The core of efficient on-demand ride-pooling lies in solving the Ride-Pool Matching Problem (RMP), which involves assigning multiple customer requests to single vehicles under various service constraints (e.g., pickup windows, detour allowances, and vehicle occupancy). A significant missed opportunity in most current RMP approaches is the assumption that passengers must be picked up and dropped off exactly at their requested locations. Allowing passengers to walk even short distances to meet vehicles could unlock substantial improvements in ride-pooling operations. Addressing this limitation of existing RMP solutions, this paper introduces a novel matching method that strategically incorporates flexible pickup and drop-off locations. Our approach simultaneously determines the optimal assignment of vehicles to requests (one vehicle to potentially multiple requests and each request to at most one vehicle), identifies advantageous meeting points for passengers, and plans efficient vehicle routes. This comprehensive optimization respects all service constraints and considers the long-term implications of routing decisions. To achieve this integrated solution, we first employ a tree-based approach to enumerate all feasible pairings between passengers and vehicles. Subsequently, we calculate an optimal route for each of these feasible matches. Finally, we evaluate the quality of all possible assignments and select the most advantageous matching for implementation. In our experimental evaluation on city-scale taxi datasets, we demonstrate that our method improves the number of served requests by up to 13% and reduces the average vehicle travel distance by up to 21%.
By serving more passengers with less driving distance, our approach achieves greater efficiency in a more sustainable manner — using fewer resources to deliver better service and creating a win-win outcome for all stakeholders, including customers, drivers, the aggregator, and the environment.
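The enumerate-then-assign structure of the pipeline can be sketched in miniature (the paper solves the final assignment optimally; the greedy selection, the cost function, and the brute-force group enumeration below are simplifying assumptions):

```python
from itertools import combinations

def feasible_groups(requests, capacity):
    """Enumerate all request groups up to vehicle capacity -- a
    brute-force stand-in for the paper's tree-based feasibility search."""
    for size in range(1, capacity + 1):
        yield from combinations(requests, size)

def greedy_assign(vehicles, requests, capacity, cost):
    """Assign each vehicle at most one group and each request to at
    most one vehicle, greedily by lowest routing cost (a toy stand-in
    for the exact assignment step)."""
    candidates = []
    for v in vehicles:
        for group in feasible_groups(requests, capacity):
            candidates.append((cost(v, group), v, group))
    candidates.sort(key=lambda t: (t[0], t[1], t[2]))
    used_v, used_r, plan = set(), set(), {}
    for c, v, group in candidates:
        if v in used_v or used_r & set(group):
            continue
        used_v.add(v)
        used_r.update(group)
        plan[v] = group
    return plan
```

Unlike the paper's joint optimization, this greedy toy can leave requests unserved when a cheap singleton match blocks a better pooled one, which is exactly why the exact assignment step matters.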

AAAI Conference 2026 Conference Paper

Themis: Automated Constraint-Aware Test Synthesis Framework for Code Reinforcement Learning

  • Shengyu Ye
  • Qi Liu
  • Hao Jiang
  • Zheng Zhang
  • Heng Yu
  • Zhenya Huang

Reinforcement learning (RL) has shown promise for enhancing code generation capabilities in large language models (LLMs), yet its effectiveness critically depends on high-quality test suites for reliable reward signals. Current approaches suffer from inadequate test case quantity and quality, leading to false positives (incorrect solutions passing verification) and slow positives (valid but suboptimal implementations), which corrupt RL training dynamics. We address these challenges through three key contributions: (1) We systematically analyze how low-quality test suites degrade Code RL performance via reward misalignment; (2) We propose Themis, an automated framework that transforms test case generation into code synthesis—first extracting problem constraints via template-guided parsing, then generating executable test generators through LLM-powered code synthesis, and finally validating tests through constraint-aware filtering; (3) We develop an error-guided test case reduction method that preserves error detection efficacy while reducing test set cardinality, thereby enhancing reinforcement learning training efficiency. Evaluated on programming competition datasets, Themis achieves a 95 percent error detection rate, outperforming original test suites in most cases. When integrated into RL pipelines, models trained with Themis-generated tests demonstrate consistent 3-5 percent improvements across HumanEval, MBPP, and LiveCodeBench compared to the baseline, matching performance levels achieved with manually curated test suites. Our constraint-aware test synthesis framework ensures full automation while preserving semantic validity—critical for scaling RL training to complex code generation tasks. The framework's modular design also enables seamless integration with existing code data synthesis frameworks.
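The generate-then-filter shape of the pipeline can be sketched with a hand-written constraint spec (the spec format, field names, and random generator below are illustrative assumptions; in Themis the generator itself is synthesized by an LLM from parsed constraints):

```python
import random

def make_generator(constraints, seed=0):
    """Turn a toy constraint spec into an executable test-input
    generator, in the spirit of template-guided synthesis.

    Example spec: {"n": (1, 100), "values": ("n", -10, 10)} meaning
    an int n in [1, 100], then a list of n ints in [-10, 10].
    """
    rng = random.Random(seed)
    def gen():
        n_lo, n_hi = constraints["n"]
        n = rng.randint(n_lo, n_hi)
        _, v_lo, v_hi = constraints["values"]
        return {"n": n, "values": [rng.randint(v_lo, v_hi) for _ in range(n)]}
    return gen

def satisfies(case, constraints):
    """Constraint-aware filter: reject any generated case that
    violates the extracted constraints."""
    n_lo, n_hi = constraints["n"]
    _, v_lo, v_hi = constraints["values"]
    return (n_lo <= case["n"] <= n_hi
            and len(case["values"]) == case["n"]
            and all(v_lo <= v <= v_hi for v in case["values"]))
```

The filter is what keeps synthesized tests semantically valid: any case outside the problem's stated bounds is discarded before it can pollute the reward signal.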

AAAI Conference 2025 Conference Paper

All-in-One: Transferring Vision Foundation Models into Stereo Matching

  • Jingyi Zhou
  • Haoyu Zhang
  • Jiakang Yuan
  • Peng Ye
  • Tao Chen
  • Hao Jiang
  • Meiya Chen
  • Yangyang Zhang

As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we proposed a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on the mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves start-of-the-art performance on multiple datasets and ranks 1st on the Middlebury dataset and outperforms all the published work on the ETH3D benchmark.

AAAI Conference 2025 Conference Paper

D^2-DPM: Dual Denoising for Quantized Diffusion Probabilistic Models

  • Qian Zeng
  • Jie Song
  • Han Zheng
  • Hao Jiang
  • Mingli Song

Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
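The mean/variance-deviation view in the abstract can be illustrated by re-standardising a quantized noise estimate against full-precision statistics (this moment-matching sketch is our reading of the decomposition, not the paper's sampler-level correction):

```python
import statistics

def correct_quantization_noise(eps_q, eps_fp_mean, eps_fp_std):
    """Illustrative sketch: shift and rescale a quantized noise
    estimate so its empirical mean and std match full-precision
    statistics, mirroring the mean-deviation (drift) and
    variance-deviation (diffusion) split described in the abstract."""
    mu = statistics.fmean(eps_q)
    sd = statistics.pstdev(eps_q)
    return [eps_fp_mean + (e - mu) * (eps_fp_std / sd) for e in eps_q]
```

Removing the mean shift keeps the sampling trajectory's drift on course, and rescaling the spread keeps the diffusion coefficient from being magnified — the two effects the dual denoising mechanism targets at each time step.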

AAAI Conference 2025 Conference Paper

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

  • Wenyi Xiao
  • Ziwei Huang
  • Leilei Gan
  • Wanggui He
  • Haoyuan Li
  • Zhelun Yu
  • Fangxun Shu
  • Hao Jiang

The rapidly developing Large Vision Language Models (LVLMs) still face the hallucination phenomenon, where the generated responses do not align with the given contexts, significantly restricting their usage. Most previous work detects and mitigates hallucination at the coarse-grained level or requires expensive annotation (e.g., labeling by human experts or proprietary models). To address these issues, we propose detecting and mitigating hallucinations in LVLMs via fine-grained AI feedback. The basic idea is that we use proprietary models to generate a small sentence-level hallucination annotation dataset, with which we train a detection model that performs sentence-level hallucination detection. Then, we propose a detect-then-rewrite pipeline to automatically construct a preference dataset for hallucination mitigation training. Furthermore, we propose differentiating the severity of hallucinations and introduce Hallucination Severity-Aware Direct Preference Optimization (HSA-DPO), which prioritizes the mitigation of critical hallucination in LVLMs by incorporating the severity of hallucinations into preference learning. Extensive experiments on hallucination detection and mitigation benchmarks demonstrate that our method sets a new state-of-the-art in hallucination detection on MHaluBench, surpassing GPT-4V and Gemini, and reduces the hallucination rate by 36.1% on AMBER and 76.3% on Object HalBench compared to the base model.
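A severity-aware preference objective can be sketched by weighting the standard DPO log-sigmoid term with the rejected response's hallucination severity (this is our reading of the abstract; the weighting scheme and beta value are assumptions, not the authors' exact formulation):

```python
import math

def hsa_dpo_loss(logp_w, logp_l, ref_w, ref_l, severity, beta=0.1):
    """Sketch of a severity-aware DPO objective: the usual DPO margin
    between chosen (w) and rejected (l) responses under policy vs.
    reference log-probs, with the rejected response's hallucination
    severity in [0, 1] upweighting the pair."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    weight = 1.0 + severity          # critical hallucinations count more
    return -weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

Pairs containing severe hallucinations thus contribute a larger gradient, which is the sense in which mitigation of critical hallucinations is "prioritized" in preference learning.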

AAAI Conference 2025 Conference Paper

Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering

  • Hao Jiang
  • Yang Jin
  • Zhicheng Sun
  • Kun Xu
  • Liwei Chen
  • Yang Song
  • Kun Gai
  • Yadong Mu

Video question answering plays a vital role in computer vision, and recent advances in large language models have further propelled the development of this field. However, existing video question answering techniques often face limitations in grasping fine-grained video content in spatial dimensions. This limitation mainly stems from the fixed, low-resolution input of video frames. While some approaches using high-resolution inputs partially alleviate this problem, they introduce excessive computational burdens by encoding the entire high-resolution image. In this work, we propose a granularity-adaptive spatial evidence tokenization model for video question answering. Our method introduces multi-granular visual tokenization in the spatial dimension to produce video tokens at various granularities based on the question. It highlights spatially activated patches at low resolutions through a granularity weighting module and then adaptively encodes these activated patches at high resolution for detail supplementation. To mitigate the computational overhead associated with high-resolution frame encoding, a masking and acceleration module is developed for efficient visual tokenization. Moreover, a granularity compression module is designed to dynamically select and compress visual tokens of varying granularities based on questions. We conduct extensive experiments on 11 mainstream video question answering datasets and the experimental results demonstrate the effectiveness of our proposed method.
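The patch-selection step can be sketched as a top-k filter over question-conditioned patch weights (the weights and budget are inputs here; how the paper's granularity weighting module produces them is not modeled):

```python
def select_activated_patches(weights, budget):
    """Score low-resolution patches for question relevance and keep
    only the top-`budget` indices for re-encoding at high resolution,
    in the spirit of the granularity weighting module."""
    ranked = sorted(range(len(weights)), key=lambda i: -weights[i])
    return sorted(ranked[:budget])
```

Only the returned patch indices would be re-encoded at high resolution, which is where the computational savings over full high-resolution encoding come from.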

NeurIPS Conference 2025 Conference Paper

HAIF-GS: Hierarchical and Induced Flow-Guided Gaussian Splatting for Dynamic Scene

  • Jianing Chen
  • Zehao Li
  • Yujun Cai
  • Hao Jiang
  • Chengxuan Qian
  • Juyuan Kang
  • Shuqin Gao
  • Honglong Zhao

Reconstructing dynamic 3D scenes from monocular videos remains a fundamental challenge in 3D vision. While 3D Gaussian Splatting (3DGS) achieves real-time rendering in static settings, extending it to dynamic scenes is challenging due to the difficulty of learning structured and temporally consistent motion representations. This challenge often manifests as three limitations in existing methods: redundant Gaussian updates, insufficient motion supervision, and weak modeling of complex non-rigid deformations. These issues collectively hinder coherent and efficient dynamic reconstruction. To address these limitations, we propose HAIF-GS, a unified framework that enables structured and consistent dynamic modeling through sparse anchor-driven deformation. It first identifies motion-relevant regions via an Anchor Filter to suppress redundant updates in static areas. A self-supervised Induced Flow-Guided Deformation module induces anchor motion using multi-frame feature aggregation, eliminating the need for explicit flow labels. To further handle fine-grained deformations, a Hierarchical Anchor Propagation mechanism increases anchor resolution based on motion complexity and propagates multi-level transformations. Extensive experiments on synthetic and real-world benchmarks validate that HAIF-GS significantly outperforms prior dynamic 3DGS methods in rendering quality, temporal coherence, and reconstruction efficiency.
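The Anchor Filter idea — skipping deformation updates for static regions — can be sketched as a motion-magnitude threshold over anchor motion estimates (the threshold value and per-anchor motion vectors are illustrative assumptions):

```python
def anchor_filter(anchor_motion, tau=0.05):
    """Keep only anchors whose estimated motion magnitude exceeds
    `tau`, so anchors in static regions receive no deformation
    updates (a sketch of the filtering idea, not the paper's module)."""
    def mag(v):
        return sum(x * x for x in v) ** 0.5
    return [i for i, v in enumerate(anchor_motion) if mag(v) > tau]
```

Suppressing updates for the (typically large) static portion of a scene is what curbs the redundant Gaussian updates the abstract identifies.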

ICRA Conference 2025 Conference Paper

Learning to Predict the Future from Monocular Vision for Efficient Human-Aware Navigation

  • Yushuang Huang
  • Hao Jiang
  • Zihan Liu
  • Wanli Ouyang
  • Zhaoqi Wang

Human-aware navigation (HAN) aims to build autonomous agents that robustly and naturally navigate in human-centered environments. Due to the complex and dynamic nature of this task, existing approaches typically rely on sophisticated pipelines that separately process perception and decision-making to solve it. In this work, we propose an Obstruction Distance Vector based End-to-End Model (ODVEEM), using monocular vision for navigation around humans. The Obstruction Distance Vector (ODV) is an intermediate representation in our model, leveraged to describe the Obstruction Distance to the first future collision in all possible directions in the horizontal field of view. As ODV cannot be calculated directly in the real world, we design a neural network for ODV estimation, formulating it as a classification problem with auxiliary proxy tasks, which play a key role in effectively predicting the implicit future motion of nearby humans. Taking advantage of ODV, ODVEEM supervised by human behavioral heuristics is employed to guide the agent to reach a goal efficiently and avoid potential collisions. Several challenging experiments show our method's substantial improvement over a number of baseline methods, attaining solid performance with zero-shot transfer to unseen simulated and real-world environments.

AAAI Conference 2025 Conference Paper

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

  • Wanggui He
  • Siming Fu
  • Mushui Liu
  • Xierui Wang
  • Wenyi Xiao
  • Fangxun Shu
  • Yi Wang
  • Lei Zhang

Auto-regressive models have made significant progress in the realm of text-to-image synthesis, yet devising an appropriate model architecture and training strategy to achieve a satisfactory level remains an important avenue of exploration. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information—freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.

AAAI Conference 2025 Conference Paper

Political Actor Agent: Simulating Legislative System for Roll Call Votes Prediction with Large Language Models

  • Hao Li
  • Ruoyuan Gong
  • Hao Jiang

Predicting roll call votes through modeling political actors has emerged as a focus in quantitative political science and computer science. Widely used embedding-based methods generate vectors for legislators from diverse data sets to predict legislative behaviors. However, these methods often contend with challenges such as the need for manually predefined features, reliance on extensive training data, and a lack of interpretability. Achieving more interpretable predictions under flexible conditions remains an unresolved issue. This paper introduces the Political Actor Agent (PAA), a novel agent-based framework that utilizes Large Language Models to overcome these limitations. By employing role-playing architectures and simulating the legislative system, PAA provides a scalable and interpretable paradigm for predicting roll-call votes. Our approach not only enhances the accuracy of predictions but also offers multi-view, human-understandable decision reasoning, providing new insights into political actor behaviors. We conducted comprehensive experiments using voting records from the 117th-118th U.S. House of Representatives, validating the superior performance and interpretability of PAA. This study demonstrates not only PAA's effectiveness but also its potential in political science research.

ICLR Conference 2025 Conference Paper

Pyramidal Flow Matching for Efficient Video Generative Modeling

  • Yang Jin
  • Zhicheng Sun 0001
  • Ningyuan Li 0002
  • Kun Xu 0005
  • Hao Jiang
  • Nan Zhuang
  • Quzhe Huang
  • Yang Song 0008

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full-resolution latents. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.

AAAI Conference 2025 Conference Paper

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

  • Qihan Huang
  • Siming Fu
  • Jinlong Liu
  • Hao Jiang
  • Yipeng Yu
  • Jie Song

Personalized text-to-image generation methods can generate customized images based on the reference images, which have garnered wide research interest. Recent methods propose a finetuning-free approach with a decoupled cross-attention mechanism to generate personalized images requiring no test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism encounters the object confusion problem and fails to map each reference image to its corresponding object, thereby seriously limiting its scope of application. To address the object confusion problem, in this work we investigate the relevance of different positions of the latent image features to the target object in the diffusion model, and accordingly propose a weighted-merge method to merge multiple reference image features into the corresponding objects. Next, we integrate this weighted-merge method into existing pre-trained models and continue to train the model on a multi-object dataset constructed from the open-sourced SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score to estimate the image quality for the selection of high-quality training samples. Furthermore, our weighted-merge training framework can be employed on single-object generation when a single object has multiple reference images. The experiments verify that our method achieves superior performance to state-of-the-art methods on the Concept101 and DreamBooth datasets for multi-object personalized image generation, and remarkably improves the performance on single-object personalized image generation.
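The weighted-merge idea can be illustrated on scalar "features": each latent position blends the reference features in proportion to its relevance to each reference object (scalars stand in for feature vectors, and the relevance weights are given rather than derived from cross-attention as in the paper):

```python
def weighted_merge(ref_feats, relevance):
    """For each latent position, blend the reference-image features
    by that position's normalised relevance to each reference object
    (a toy paraphrase of the weighted-merge method)."""
    merged = []
    for weights in relevance:            # one weight vector per latent position
        z = sum(weights)
        merged.append(sum(w / z * f for w, f in zip(weights, ref_feats)))
    return merged
```

Positions strongly tied to one object receive essentially only that object's reference features, which is how the merge avoids mapping a reference image onto the wrong object.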

AAAI Conference 2025 Conference Paper

VERSE: Verification-based Self-Play for Code Instructions

  • Hao Jiang
  • Qi Liu
  • Rui Li
  • Yuze Zhao
  • Yixiao Ma
  • Shengyu Ye
  • Junyu Lu
  • Yu Su

Instruction-tuned Code Large Language Models (Code LLMs) have excelled in diverse code-related tasks, such as program synthesis, automatic program repair, and code explanation. To collect training datasets for instruction-tuning, a popular method involves having models autonomously generate instructions and corresponding responses. However, the direct generation of responses does not ensure functional correctness, a crucial requirement for generating responses to code instructions. To overcome this, we present Verification-Based Self-Play (VERSE), aiming to enhance model proficiency in generating correct responses. VERSE establishes a robust verification framework that covers various code instructions. Employing VERSE, Code LLMs engage in self-play to generate instructions and corresponding verifications. They evaluate execution results and self-consistency as verification outcomes, using them as scores to rank generated data for self-training. Experiments show that VERSE improves multiple base Code LLMs (average 7.6%) across various languages and tasks on many benchmarks, affirming its effectiveness.
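The verification-as-score idea can be sketched as majority-vote self-consistency over executed outputs, optionally gated by an execution check (this scoring rule is our reading of the abstract, not the authors' exact ranking function):

```python
from collections import Counter

def verse_score(outputs, expected=None):
    """Score a candidate response by execution agreement: the
    fraction of sampled runs agreeing with the majority output
    (self-consistency), zeroed out if the majority output fails an
    executable verification against `expected`."""
    counts = Counter(outputs)
    top, freq = counts.most_common(1)[0]
    consistency = freq / len(outputs)
    if expected is not None and top != expected:
        return 0.0                       # failed execution check
    return consistency
```

Generated instruction-response pairs would then be ranked by this score before self-training, so only functionally consistent responses feed back into the model.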

IJCAI Conference 2024 Conference Paper

An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding

  • Renqi Chen
  • Wenwei Han
  • Haohao Zhang
  • Haoyang Su
  • Zhefan Wang
  • Xiaolei Liu
  • Hao Jiang
  • Wanli Ouyang

Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently revolve around employing statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers by deep learning. However, as crop datasets are commonly long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of the attention mechanism for this task, we propose a simple yet effective Transformer-based framework that enables end-to-end training of the whole sequence. Via experiments on rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, Transformer can achieve overall superior performance against seminal methods on GS tasks.
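The two "simple tricks" the abstract names are easy to sketch directly (the masking rate, mask token, and overlapping-window k-mer scheme are common defaults we assume here, not values stated in the abstract):

```python
import random

def kmer_tokenize(seq, k=3):
    """Split a genotype/DNA-like string into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def random_mask(tokens, p=0.15, seed=0, mask="[MASK]"):
    """BERT-style random masking over k-mer tokens: each token is
    independently replaced by the mask symbol with probability p."""
    rng = random.Random(seed)
    return [mask if rng.random() < p else t for t in tokens]
```

k-mer tokenization shortens the effective sequence while preserving local marker context, and random masking acts as regularisation on small-sample crop datasets.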

IROS Conference 2024 Conference Paper

Calibration-Free Vision-Assisted Container Loading of RTG Cranes

  • Jianbing Yang
  • Yuanzhe Wang
  • Hao Jiang
  • Bin Zhao
  • Yiming Li
  • Danwei Wang

Vision-assisted container loading of Rubber Tyred Gantry (RTG) cranes faces two primary challenges. Firstly, the uncertainty inherent in Convolutional Neural Network (CNN) based detection hinders its direct application in the safety-critical operation of such heavy-duty machinery. Secondly, sensor calibration introduces additional complexities and errors into the system. However, existing studies have not adequately addressed these challenges. Motivated by this gap, this paper proposes an integrated approach for target detection and alignment control in container loading of RTG cranes. To ensure reliable target marker identification, a heuristic post-processing algorithm is developed as a complement to CNN-based foreground segmentation, thereby ensuring safety during the container handling process. On this basis, a pixel-based control scheme is designed to align the container with the target markers, which eliminates the need for offline or online sensor calibrations. The proposed approach has been successfully implemented on a real RTG crane manufactured by Shanghai Zhenhua Heavy Industries Co., Ltd. (ZPMC) and validated at the Port of Ningbo, China. Experimental results demonstrate the superiority of the proposed approach over current manual operations in port industries, highlighting its potential for crane automation.

NeurIPS Conference 2024 Conference Paper

LG-CAV: Train Any Concept Activation Vector with Language Guidance

  • Qihan Huang
  • Jie Song
  • Mengqi Xue
  • Haofei Zhang
  • Bingde Hu
  • Huiqiong Wang
  • Hao Jiang
  • Xingen Wang

Concept activation vector (CAV) has attracted broad research interest in explainable AI, by elegantly attributing model predictions to specific concepts. However, the training of CAV often necessitates a large number of high-quality images, which are expensive to curate and thus limited to a predefined set of concepts. To address this issue, we propose Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within certain pre-trained vision-language models (e.g., CLIP). This method allows training any CAV without labeled data, by utilizing the corresponding concept descriptions as guidance. To bridge the gap between the vision-language model and the target model, we calculate the activation values of concept descriptions on a common pool of images (probe images) with the vision-language model and utilize them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose the activation sample reweighting (ASR), serving as a model correction technique, to improve the performance of the target model in return. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly superior quality to previous CAV methods given any concept, and our model correction method achieves state-of-the-art performance compared to existing concept-based methods. Our code is available at https://github.com/hqhQAQ/LG-CAV.

NeurIPS Conference 2024 Conference Paper

RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance

  • Zhicheng Sun
  • Zhenhao Yang
  • Yang Jin
  • Haozhe Chi
  • Kun Xu
  • Liwei Chen
  • Hao Jiang
  • Yang Song

Customizing diffusion models to generate identity-preserving images from user-provided reference images is an intriguing new problem. The prevalent approaches typically require training on extensive domain-specific images to achieve identity preservation, which lacks flexibility across different use cases. To address this issue, we exploit classifier guidance, a training-free technique that steers diffusion models using an existing classifier, for personalized image generation. Our study shows that based on a recent rectified flow framework, the major limitation of vanilla classifier guidance in requiring a special classifier can be resolved with a simple fixed-point solution, allowing flexible personalization with off-the-shelf image discriminators. Moreover, its solving procedure proves to be stable when anchored to a reference flow trajectory, with a convergence guarantee. The derived method is implemented on rectified flow with different off-the-shelf image discriminators, delivering advantageous personalization results for human faces, live subjects, and certain objects. Code is available at https://github.com/feifeiobama/RectifID.

AAAI Conference 2024 Conference Paper

Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively

  • Hao Jiang
  • Tien Mai
  • Pradeep Varakantham
  • Huy Hoang

Constrained Reinforcement Learning employs trajectory-based cost constraints (such as expected cost, Value at Risk, or Conditional VaR cost) to compute safe policies. The challenge lies in handling these constraints effectively while optimizing expected reward. Existing methods convert such trajectory-based constraints into local cost constraints, but they rely on cost estimates, leading to either aggressive or conservative solutions with regard to cost. We propose an unconstrained formulation that employs reward penalties over states augmented with costs to compute safe policies. Unlike standard primal-dual methods, our approach penalizes only infeasible trajectories through state augmentation. This ensures that increasing the penalty parameter always guarantees a feasible policy, a feature lacking in primal-dual methods. Our approach exhibits strong empirical performance and theoretical properties, offering a fresh paradigm for solving complex Constrained RL problems, including rich constraints like expected cost, Value at Risk, and Conditional Value at Risk. Our experimental results demonstrate superior performance compared to leading approaches across various constraint types on multiple benchmark problems.

AAAI Conference 2024 Conference Paper

SD-MVS: Segmentation-Driven Deformation Multi-View Stereo with Spherical Refinement and EM Optimization

  • Zhenlong Yuan
  • Jiakai Cao
  • Zhaoxin Li
  • Hao Jiang
  • Zhaoqi Wang

In this paper, we introduce Segmentation-Driven Deformation Multi-View Stereo (SD-MVS), a method that can effectively tackle challenges in 3D reconstruction of textureless areas. We are the first to adopt the Segment Anything Model (SAM) to distinguish semantic instances in scenes and further leverage these constraints for pixelwise patch deformation on both matching cost and propagation. Concurrently, we propose a unique refinement strategy that combines spherical coordinates and gradient descent on normals and pixelwise search interval on depths, significantly improving the completeness of the reconstructed 3D model. Furthermore, we adopt the Expectation-Maximization (EM) algorithm to alternately optimize the aggregate matching cost and hyperparameters, effectively mitigating the problem of parameters being excessively dependent on empirical tuning. Evaluations on the ETH3D high-resolution multi-view stereo benchmark and the Tanks and Temples dataset demonstrate that our method can achieve state-of-the-art results with less time consumption.

AAAI Conference 2024 Conference Paper

Transferable Video Moment Localization by Moment-Guided Query Prompting

  • Hao Jiang
  • Yang Yizhang
  • Yadong Mu

Video moment localization stands as a crucial task within the realm of computer vision, entailing the identification of temporal moments in untrimmed videos that bear semantic relevance to the supplied natural language queries. This work delves into a relatively unexplored facet of the task: the transferability of video moment localization models. This concern is addressed by evaluating moment localization models within a cross-domain transfer setting. In this setup, we curate multiple datasets distinguished by substantial domain gaps. The model undergoes training on one of these datasets, while validation and testing are executed using the remaining datasets. To confront the challenges inherent in this scenario, we draw inspiration from the recently introduced large-scale pre-trained vision-language models. Our focus is on exploring how the strategic utilization of these resources can bolster the capabilities of a model designed for video moment localization. Nevertheless, the distribution of language queries in video moment localization usually diverges from the text used by pre-trained models, exhibiting distinctions in aspects such as length, content, expression, and more. To mitigate the gap, this work proposes a Moment-Guided Query Prompting (MGQP) method for video moment localization. Our key idea is to generate multiple distinct and complementary prompt primitives through stratification of the original queries. Our approach comprises a prompt primitive constructor, a multimodal prompt refiner, and a holistic prompt incorporator. We carry out extensive experiments on Charades-STA, TACoS, DiDeMo, and YouCookII datasets, and investigate the efficacy of the proposed method using various pre-trained models, such as CLIP, ActionCLIP, CLIP4Clip, and VideoCLIP. The experimental results demonstrate the effectiveness of our proposed method.

ICML Conference 2024 Conference Paper

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

  • Yang Jin
  • Zhicheng Sun 0001
  • Kun Xu 0005
  • Li-Wei Chen
  • Hao Jiang
  • Quzhe Huang
  • Chengru Song
  • Yuliang Liu

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is both capable of comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.

NeurIPS Conference 2023 Conference Paper

FairLISA: Fair User Modeling with Limited Sensitive Attributes Information

  • Zheng Zhang
  • Qi Liu
  • Hao Jiang
  • Fei Wang
  • Yan Zhuang
  • Le Wu
  • Weibo Gao
  • Enhong Chen

User modeling techniques profile users' latent characteristics (e.g., preference) from their observed behaviors, and play a crucial role in decision-making. Unfortunately, traditional user models may unconsciously capture biases related to sensitive attributes (e.g., gender) from behavior data, even when this sensitive information is not explicitly provided. This can lead to unfairness and discrimination against certain groups based on these sensitive attributes. Recent studies have proposed improving fairness by explicitly decorrelating user modeling results and sensitive attributes. However, most existing approaches assume that sensitive attribute labels are fully available in the training set, which is unrealistic due to collection limitations like privacy concerns, and hence suffer from limited performance. In this paper, we focus on a practical situation with limited sensitive data and propose a novel FairLISA framework, which can efficiently utilize data with known and unknown sensitive attributes to facilitate fair model training. We first propose a novel theoretical perspective to build the relationship between data with both known and unknown sensitive attributes and the fairness objective. Then, based on this, we provide a general adversarial framework to effectively leverage the whole user data for fair user modeling. We conduct experiments on representative user modeling tasks including recommender system and cognitive diagnosis. The results demonstrate that our FairLISA can effectively improve fairness while retaining high accuracy in scenarios with different ratios of missing sensitive attributes.

AAAI Conference 2023 Conference Paper

Future Aware Pricing and Matching for Sustainable On-Demand Ride Pooling

  • Xianjie Zhang
  • Pradeep Varakantham
  • Hao Jiang

The popularity of on-demand ride pooling is owing to the benefits offered to customers (lower prices), taxi drivers (higher revenue), environment (lower carbon footprint due to fewer vehicles) and aggregation companies like Uber (higher revenue). To achieve these benefits, two key interlinked challenges have to be solved effectively: (a) pricing -- setting prices for customer requests for taxis; and (b) matching -- assignment of customers (that accepted the prices) to taxis/cars. Traditionally, both these challenges have been studied individually and using myopic approaches (considering only current requests), without considering the impact of current matching on addressing future requests. In this paper, we develop a novel framework that handles the pricing and matching problems together, while also considering the future impact of the pricing and matching decisions. In our experimental results on a real-world taxi dataset, we demonstrate that our framework can significantly improve revenue (up to 17% and on average 6.4%) in a sustainable manner by reducing the number of vehicles (up to 14% and on average 10.6%) required to obtain a given fixed revenue and the overall distance travelled by vehicles (up to 11.1% and on average 3.7%). That is to say, we are able to provide an ideal win-win scenario for all stakeholders (customers, drivers, aggregator, environment) involved by obtaining higher revenue for customers, drivers, and the aggregator (ride pooling company), while being good for the environment (due to fewer vehicles on the road and less fuel consumed).

JAIR Journal 2023 Journal Article

Generating Random SAT Instances: Multiple Solutions could be Predefined and Deeply Hidden

  • Dongdong Zhao
  • Lei Liao
  • Wenjian Luo
  • Jianwen Xiang
  • Hao Jiang
  • Xiaoyi Hu

The generation of SAT instances is an important issue in computer science, and it is useful for researchers to verify the effectiveness of SAT solvers. Addressing this issue could inspire researchers to propose new search strategies. SAT problems exist in various real-world applications, some of which have more than one solution. However, although several algorithms for generating random SAT instances have been proposed, few can be used to generate hard instances that have multiple predefined solutions. In this paper, we propose the KHidden-M algorithm to generate SAT instances with multiple predefined solutions that could be hard to solve by the local search strategy when the number of predefined solutions is small enough and the Hamming distance between them is not less than half of the solution length. Specifically, first, we generate a SAT instance that is satisfied by all of the predefined solutions. Next, if the generated SAT instance does not satisfy the hardness condition, then a strategy will be conducted to adjust clauses through multiple iterations to improve the hardness of the whole instance. We propose three strategies to generate the SAT instance in the first part. The first strategy is called the random strategy, which randomly generates clauses that are satisfied by all of the predefined solutions. The other two strategies are called the estimating strategy and greedy strategy, and using them, we attempt to generate an instance that directly satisfies or is closer to the hardness condition for the local search strategy. We employ two SAT solvers (i.e., WalkSAT and Kissat) to investigate the hardness of the SAT instances generated by our algorithm in the experiments. The experimental results show the effectiveness of the random, estimating and greedy strategies. Compared to the state-of-the-art algorithm for generating SAT instances with predefined solutions, namely, M-hidden, our algorithm could be more effective in generating hard SAT instances.
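The "random strategy" described above, generating random clauses and keeping only those satisfied by every predefined solution, can be sketched as below. This is a minimal illustration using DIMACS-style signed-integer literals, not the authors' KHidden-M implementation (which adds hardness-adjustment iterations on top).

```python
import random

def random_clause(n_vars, k=3):
    """A random k-clause over variables 1..n_vars, each literal negated with prob. 1/2."""
    chosen = random.sample(range(1, n_vars + 1), k)
    return [v if random.random() < 0.5 else -v for v in chosen]

def satisfies(clause, assignment):
    """assignment maps variable index -> bool; a clause needs one satisfied literal."""
    return any((lit > 0) == assignment[abs(lit)] for lit in clause)

def generate_instance(n_vars, n_clauses, solutions, k=3):
    """Random strategy: keep only random clauses satisfied by ALL predefined solutions."""
    clauses = []
    while len(clauses) < n_clauses:
        c = random_clause(n_vars, k)
        if all(satisfies(c, s) for s in solutions):
            clauses.append(c)
    return clauses
```

By construction every predefined solution satisfies the resulting formula; the paper's estimating and greedy strategies replace the blind rejection sampling with hardness-aware clause construction.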

ECAI Conference 2023 Conference Paper

On Sustainable Ride Pooling Through Conditional Expected Value Decomposition

  • Avinandan Bose
  • Hao Jiang
  • Pradeep Varakantham
  • Zichang Ge

Centralized Multi-Agent Reinforcement Learning (MARL) presents itself as an ideal framework for aggregation companies (e.g., Uber, Lyft, Deliveroo) that have to take a sequential set of centralized decisions on assigning individual agents (typically resources like taxis, food delivery personnel) to customer requests online in the presence of demand uncertainty. However, centralized learning is especially challenging in such very large scale environments, with thousands of agents/resources and hundreds of thousands of requests coming in each day. In this paper, we provide a novel value decomposition mechanism that is able to tackle the scale and provide high quality (matching) decisions at each time step. We show that our value decomposition approach, Conditional Expectation based Value Decomposition (CEVD), is more sustainable (requires 9.9% fewer vehicles to serve an equal number of requests) and more efficient (serves 9.76% more requests, while traveling 13.32% less distance) than the current best approach over two different city scale (New York and Chicago) benchmarks for ride pooling using taxis.

NeurIPS Conference 2022 Conference Paper

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

  • Yizhao Gao
  • Nanyi Fei
  • Haoyu Lu
  • Zhiwu Lu
  • Hao Jiang
  • Yijie Li
  • Zhao Cao

Video-language models suffer from forgetting old/learned knowledge when trained with streaming data. In this work, we thus propose a continual video-language modeling (CVLM) setting, where models are supposed to be sequentially trained on five widely-used video-text datasets with different data distributions. Although most existing continual learning methods have achieved great success by exploiting extra information (e.g., memory data of past tasks) or dynamically extended networks, they cause enormous resource consumption when transferred to our CVLM setting. To overcome the challenges (i.e., catastrophic forgetting and heavy resource consumption) in CVLM, we propose a novel cross-modal MoCo-based model with bidirectional momentum update (BMU), termed BMU-MoCo. Concretely, our BMU-MoCo has two core designs: (1) Different from the conventional MoCo, we apply the momentum update not only to momentum encoders but also to encoders (i.e., bidirectional) at each training step, which enables the model to review the learned knowledge retained in the momentum encoders. (2) To further enhance our BMU-MoCo by utilizing earlier knowledge, we additionally maintain a pair of global momentum encoders (only initialized at the very beginning) with the same BMU strategy. Extensive results show that our BMU-MoCo remarkably outperforms recent competitors w.r.t. video-text retrieval performance and forgetting rate, even without using any extra data or dynamic networks.

NeurIPS Conference 2022 Conference Paper

Conditional Diffusion Process for Inverse Halftoning

  • Hao Jiang
  • Yadong Mu

Inverse halftoning is a technique used to recover realistic images from ancient prints (e.g., photographs, newspapers, books). The rise of deep learning has led to the gradual incorporation of neural network designs into inverse halftoning methods. Most existing inverse halftoning approaches adopt the U-net architecture, which uses an encoder to encode halftone prints, followed by a decoder for image reconstruction. However, the mainstream supervised learning paradigm with element-wise regression commonly adopted in U-net based methods has poor generalization ability in practical applications. Specifically, when there is a large gap between the dithering patterns of the training and test halftones, the reconstructed continuous-tone images have obvious artifacts. This is an important issue in practical applications, since the algorithms for generating halftones are ever-evolving. Even for the same algorithm, different parameter choices will result in different halftone dithering patterns. In this paper, we propose the first generative halftoning method in the literature, which regards the black pixels in halftones as physically moving particles, and makes the randomly distributed particles move under certain guidance through a reverse diffusion process, so as to obtain desired halftone patterns. In particular, we propose a Conditional Diffusion model for image Halftoning (CDH), which consists of a halftone dithering process and an inverse halftoning process. By changing the initial state of the diffusion model, our method can generate visually plausible halftones with different dithering patterns under the condition of image gray level and Laplacian prior. To avoid introducing redundant patterns and undesired artifacts, we propose a meta-halftone guided network to incorporate blue noise guidance in the diffusion process. In this way, halftone images subject to more diverse distributions are fed into the inverse halftoning model, which helps the model to learn a more robust mapping from halftone distributions to continuous-tone distributions, thereby improving the generalization ability to unseen samples. Quantitative and qualitative experimental results demonstrate that the proposed method achieves state-of-the-art results.

IJCAI Conference 2022 Conference Paper

Hyperbolic Knowledge Transfer with Class Hierarchy for Few-Shot Learning

  • Baoquan Zhang
  • Hao Jiang
  • Shanshan Feng
  • Xutao Li
  • Yunming Ye
  • Rui Ye

Few-shot learning (FSL) aims to recognize a novel class with very few instances, which is a challenging task since it suffers from a data scarcity issue. One way to effectively alleviate this issue is introducing explicit knowledge summarized from human past experiences to achieve knowledge transfer for FSL. Based on this idea, in this paper, we introduce the explicit knowledge of class hierarchy (i.e., the hierarchy relations between classes) as FSL priors and propose a novel hyperbolic knowledge transfer framework for FSL, namely, HyperKT. Our insight is that, in hyperbolic space, the hierarchy relation between classes can be well preserved by resorting to the exponential growth characteristics of hyperbolic volume, so that better knowledge transfer can be achieved for FSL. Specifically, we first regard the class hierarchy as a tree-like structure. Then, 1) a hyperbolic representation learning module and a hyperbolic prototype inference module are employed to encode/infer each image and class prototype in hyperbolic space, respectively; and 2) a novel hierarchical classification and relation reconstruction loss are carefully designed to learn the class hierarchy. Finally, the novel class prediction is performed in a nearest-prototype manner. Extensive experiments on three datasets show our method achieves superior performance over state-of-the-art methods, especially on 1-shot tasks.
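Hyperbolic embeddings of the kind sketched above hinge on the hyperbolic distance function. A minimal sketch of the Poincaré-ball distance commonly used in such methods (the paper's exact model may differ):

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between points u, v strictly inside the unit Poincare ball:
    d(u, v) = arccosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))).
    Distances blow up near the boundary, which is what lets a tree-like class
    hierarchy embed with low distortion."""
    sq_norm = lambda x: sum(t * t for t in x)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    return math.acosh(1 + 2 * diff / ((1 - sq_norm(u)) * (1 - sq_norm(v))))
```

A nearest-prototype classifier, as in the abstract's final step, would simply assign a query embedding to the class prototype minimizing this distance.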

IJCAI Conference 2019 Conference Paper

Heavy-ball Algorithms Always Escape Saddle Points

  • Tao Sun
  • Dongsheng Li
  • Zhe Quan
  • Hao Jiang
  • Shengguo Li
  • Yong Dou

Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been shown that many first-order methods always avoid saddle points with random starting points. In this paper, we answer a question: can nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Directly applying the existing proof technique to the heavy-ball algorithms is difficult because each iteration of the heavy-ball algorithm involves both the current and previous points; it is impossible to formulate the algorithm as an iteration of the form x_{k+1} = g(x_k) under some mapping g. To this end, we design a new mapping on a new space. With some transformations, the heavy-ball algorithm can be interpreted as iterations of this mapping. Theoretically, we prove that heavy-ball gradient descent enjoys a larger stepsize than gradient descent for escaping saddle points. The heavy-ball proximal point algorithm is also considered, and we prove that it can always escape saddle points as well.
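The two-point recursion and its reformulation as a single mapping on an augmented space, the key device described above, can be sketched as follows. The step size and momentum values are illustrative defaults, not the paper's.

```python
def heavy_ball(grad, x0, alpha=0.1, beta=0.5, steps=200):
    """Heavy-ball iteration: x_{k+1} = x_k - alpha*grad(x_k) + beta*(x_k - x_{k-1})."""
    x_prev, x = list(x0), list(x0)
    for _ in range(steps):
        x_next = [xi - alpha * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, grad(x), x_prev)]
        x_prev, x = x, x_next
    return x

def g(state, grad, alpha=0.1, beta=0.5):
    """The same recursion viewed as a single mapping on the augmented state
    z_k = (x_k, x_{k-1}), so that z_{k+1} = g(z_k), the reformulation that
    makes stable-manifold-style arguments applicable."""
    x, x_prev = state
    x_next = [xi - alpha * gi + beta * (xi - pi)
              for xi, gi, pi in zip(x, grad(x), x_prev)]
    return (x_next, x)
```

On a strongly convex quadratic such as f(x) = ||x||^2 the iterates contract to the minimizer; the paper's analysis concerns what happens near saddle points of nonconvex objectives.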

AAAI Conference 2019 Conference Paper

Non-Ergodic Convergence Analysis of Heavy-Ball Algorithms

  • Tao Sun
  • Penghang Yin
  • Dongsheng Li
  • Chun Huang
  • Lei Guan
  • Hao Jiang

In this paper, we revisit the convergence of the Heavy-ball method, and present improved convergence complexity results in the convex setting. We provide the first non-ergodic O(1/k) rate result for the Heavy-ball algorithm with constant step size for coercive objective functions. For objective functions satisfying a relaxed strongly convex condition, the linear convergence is established under weaker assumptions on the step size and inertial parameter than made in the existing literature. We extend our results to the multi-block version of the algorithm with both the cyclic and stochastic update rules. In addition, our results can also be extended to decentralized optimization, where the ergodic analysis is not applicable.

TAAS Journal 2013 Journal Article

Fast, Accurate Event Classification on Resource-Lean Embedded Sensors

  • Hao Jiang
  • Jason O. Hallstrom

Due to the limited computational and energy resources available on existing wireless sensor platforms, achieving high-precision classification of high-level events in-network is a challenge. In this article, we present in-network implementations of a Bayesian classifier and a condensed kd-tree classifier for identifying events of interest on resource-lean embedded sensors. The first approach uses preprocessed sensor readings to derive a multidimensional Bayesian classifier used to classify sensor data in real time. The second introduces an innovative condensed kd-tree to represent preprocessed sensor data and uses a fast nearest-neighbor search to determine the likelihood of class membership for incoming samples. Both classifiers consume limited resources and provide high-precision classification. To evaluate each approach, two case studies are considered, in the contexts of human movement and vehicle navigation, respectively. The classification accuracy is above 85% for both classifiers across the two case studies.
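The second classifier above reduces event classification to a nearest-neighbor search over condensed prototypes. A brute-force sketch of that decision rule (the paper's contribution is a condensed kd-tree that makes this search cheap on embedded hardware; this exhaustive version only illustrates the classification step):

```python
import math

def nearest_neighbor_classify(prototypes, query):
    """prototypes: list of (feature_vector, label) pairs condensed from
    preprocessed sensor readings. The query sample receives the label of
    its nearest prototype under Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda p: dist(p[0], query))[1]
```

A kd-tree replaces the linear scan in `min(...)` with a logarithmic-depth traversal, which is what makes the approach viable on resource-lean sensor nodes.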

IJCAI Conference 2011 Conference Paper

Mining User Dwell Time for Personalized Web Search Re-Ranking

  • Songhua Xu
  • Hao Jiang
  • Francis Chi-Moon Lau

We propose a personalized re-ranking algorithm through mining user dwell times derived from a user's previously online reading or browsing activities. We acquire document level user dwell times via a customized web browser, from which we then infer concept word level user dwell times in order to understand a user's personal interest. According to the estimated concept word level user dwell times, our algorithm can estimate a user's potential dwell time over a new document, based on which personalized webpage re-ranking can be carried out. We compare the rankings produced by our algorithm with rankings generated by popular commercial search engines and a recently proposed personalized ranking algorithm. The results clearly show the superiority of our method.
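The re-ranking idea above, predicting a personal dwell time for a new document from concept-level dwell times and sorting by it, can be sketched roughly as below. This is an illustrative simplification (mean of known concept dwell times), not the authors' estimation model.

```python
def predict_dwell(doc_words, concept_dwell):
    """Predicted personal dwell time for a document: the mean of the user's
    known concept-level dwell times over the words appearing in it.
    concept_dwell maps concept word -> estimated dwell time in seconds."""
    times = [concept_dwell[w] for w in doc_words if w in concept_dwell]
    return sum(times) / len(times) if times else 0.0

def rerank(docs, concept_dwell):
    """docs maps doc_id -> list of words; rank by predicted dwell time, descending."""
    return sorted(docs, key=lambda d: predict_dwell(docs[d], concept_dwell), reverse=True)
```

Documents predicted to hold the user's attention longer are promoted, which is the personalization signal the abstract compares against commercial search-engine rankings.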

AAAI Conference 2008 Conference Paper

A User-Oriented Webpage Ranking Algorithm Based on User Attention Time

  • Songhua Xu
  • Hao Jiang

We propose a new webpage ranking algorithm which is personalized. Our idea is to rely on the attention time spent on a document by the user as the essential clue for producing the user-oriented webpage ranking. The prediction of the attention time of a new webpage is based on the attention time of other previously browsed pages by this user. To acquire the attention time of the latter webpages, we developed a browser plugin which is able to record the time a user spends reading a certain webpage and then automatically send that data to a server. Once the user attention time is acquired, we calibrate it to account for potential repetitive occurrences of the webpage before using it in the prediction process. After the user’s attention times of a collection of documents are known, our algorithm can predict the user’s attention time of a new document through document content similarity analysis, which is applied to both texts and images. We evaluate the webpage ranking results from our algorithm by comparing them with the ones produced by Google’s Pagerank algorithm.

ICRA Conference 2006 Conference Paper

Image Extraction by Wide Angle Foveated Lens for Overt-attention

  • Sota Shimizu
  • Hao Jiang
  • Joel W. Burdick

This paper defines wide angle foveated (WAF) imaging. The proposed model combines a Cartesian coordinate system, a log-polar coordinate system, and a unique camera model composed of planar projection and spherical projection for all-purpose use of a single imaging device. The central field-of-view (FOV) is given translation invariance, and the intermediate FOV rotation and scale invariance, for pattern recognition. Further, the peripheral FOV is more useful for controlling the camera's view direction, because its image height is linear in the incident angle to the camera model's optical center point. Thus, this imaging model improves usability especially when the camera is dynamically moved, that is, during overt attention. Moreover, simulation results of image extraction show the advantages of the proposed model, in view of the magnification factor of the central FOV, the accuracy of scale invariance, and its flexibility in describing other WAF vision sensors.

ICRA Conference 2005 Conference Paper

Machine Vision System to Induct Binocular Wide-Angle Foveated Information into Both the Human and Computers - Feature Generation Algorithm based on DFT for Binocular Fixation -

  • Sota Shimizu
  • Shinsuke Shimojo
  • Hao Jiang
  • Joel W. Burdick

This paper introduces a machine vision system suitable for cooperative work between humans and computers. The system provides images captured by a stereo camera head not only to the processor but also to the user's sight as binocular wide-angle foveated (WAF) information, making it applicable to Virtual Reality (VR) systems such as tele-existence or expert training. The stereo camera head captures the required input images, foveated by special wide-angle optics under camera view direction control, and a 3D head-mounted display (HMD) presents fused 3D images to the user. Moreover, an analog video signal processing device, inspired by the structure of the human visual system, realizes a unique way to provide WAF information to plural processors and the user. This vision system is therefore also expected to be applicable to human brain and vision research, because its design concept is to mimic the human visual system. Further, an algorithm that generates features using the Discrete Fourier Transform (DFT) for binocular fixation is proposed, in order to provide well-fused 3D images to the 3D HMD. This paper examines the influence of applying this algorithm to space-variant images such as WAF images, based on experimental results.