Author name cluster

Hao Luo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

1 author row

IJCAI Conference 2025 Conference Paper

Guiding LLM-based Smart Contract Generation with Finite State Machine

Hao Luo
Yuhao Lin
Xiao Yan
Xintong Hu
Yuxiang Wang
Qiming Zeng
Hao Wang
Jiawei Jiang

Smart contracts are self-executing code built on blockchain technology with a wide range of application scenarios, but traditional generation relies on manual coding and expert auditing, which has a high barrier to entry and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation with respect to effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machines (FSMs) and LLMs, which improves the quality of the generated code by abstracting user requirements into an FSM, guiding LLMs to generate smart contracts from it, and iteratively optimizing the code with feedback from compilation and security checks. Experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by up to 48% and reduces the average vulnerability risk score by approximately 68%.

NeurIPS Conference 2025 Conference Paper

OpenMMEgo: Enhancing Egocentric Understanding for LMMs with Open Weights and Data

Hao Luo
Zihao Yue
Wanpeng Zhang
Yicheng Feng
Sipeng Zheng
Deheng Ye
Zongqing Lu

Recent advances in large multimodal models have significantly advanced video comprehension, yet their performance remains limited in first-person scenarios. The interactive nature of egocentric videos is critical for applications like embodied intelligence, but introduces complex visual contexts that conventional models struggle to capture. To bridge this gap, we introduce OpenMMEgo with innovations across three dimensions: data, model, and training strategy. To provide rich spatiotemporal visual knowledge, we curate a large-scale, high-quality dataset named OME10M, comprising over 8.2M egocentric video QA pairs synthesized from the Ego4D series. We also establish OMEBench, a comprehensive benchmark for rigorous egocentric understanding assessment. To mitigate the frequent viewpoint shifts inherent in egocentric videos, we implement semantic-aware visual token compression. Further, a curriculum learning strategy is added to foster stable learning across various data complexities. OpenMMEgo consistently improves the performance of LMMs on egocentric benchmarks without sacrificing general video understanding performance. Notably, Qwen2.5-VL tuned with OpenMMEgo substantially outperforms other models of the same size in egocentric video understanding. The data, weights and training code will be released at https://github.com/BeingBeyond/OpenMMEgo.

NeurIPS Conference 2025 Conference Paper

PlayerOne: Egocentric World Simulator

Yuanpeng Tu
Hao Luo
Xi Chen
Xiang Bai
Fan Wang
Hengshuang Zhao

We introduce PlayerOne, the first egocentric realistic world simulator, facilitating immersive and unrestricted exploration within vividly dynamic environments. Given an egocentric scene image from the user, PlayerOne can accurately construct the corresponding world and generate egocentric videos that are strictly aligned with the real-scene human motion of the user captured by an exocentric camera. PlayerOne is trained in a coarse-to-fine pipeline that first performs pretraining on large-scale egocentric text-video pairs for coarse-level egocentric understanding, followed by finetuning on synchronous motion-video data extracted from egocentric-exocentric video datasets with our automatic construction pipeline. Besides, considering the varying importance of different components, we design a part-disentangled motion injection scheme, enabling precise control of part-level movements. In addition, we devise a joint reconstruction framework that progressively models both the 4D scene and video frames, ensuring scene consistency in the long-form video generation. Experimental results demonstrate its great generalization ability in precise control of varying human movements and world-consistent modeling of diverse scenarios. It marks the first endeavor into egocentric real-world simulation and can pave the way for the community to delve into fresh frontiers of world modeling and its diverse applications.

AAAI Conference 2024 Conference Paper

BVT-IMA: Binary Vision Transformer with Information-Modified Attention

Zhenyu Wang
Hao Luo
Xuemei Xie
Fan Wang
Guangming Shi

As a compression method that can significantly reduce computation and memory costs, model binarization has been extensively studied in convolutional neural networks. However, the recently popular vision transformer models pose new challenges to this technique, as the binarized models suffer from serious performance drops. In this paper, we observe an attention shift in the binary multi-head self-attention module, which can influence the information fusion between tokens and thus hurt model performance. From the perspective of information theory, we find a correlation between attention scores and the information quantity, further indicating that one reason for this phenomenon may be the loss of information quantity induced by the constant moduli of binarized tokens. Finally, we reveal the information quantity hidden in the attention maps of binary vision transformers and propose a simple approach that modifies the attention values with look-up information tables so as to improve model performance. Extensive experiments on CIFAR-100/TinyImageNet/ImageNet-1k demonstrate the effectiveness of the proposed information-modified attention on binary vision transformers.
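
The "constant moduli of binarized tokens" mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain sign binarization to {-1, +1}, which the abstract does not specify; the function names `binarize` and `l2_norm` are hypothetical and not taken from the paper. Under that assumption, every binarized d-dimensional token has the same L2 norm, sqrt(d), regardless of the magnitudes in the original token.

```python
import math
from typing import List


def binarize(token: List[float]) -> List[float]:
    # Assumed binarizer: sign function mapping each entry to +1 or -1
    # (zero is mapped to +1). The paper's actual binarizer may differ.
    return [1.0 if x >= 0 else -1.0 for x in token]


def l2_norm(vector: List[float]) -> float:
    # Euclidean length (modulus) of a vector.
    return math.sqrt(sum(x * x for x in vector))


# Two tokens with very different magnitudes before binarization.
small_token = [0.1, -0.3, 0.25, 0.0]
large_token = [100.0, -0.01, 7.0, -8.0]

# After binarization both have modulus sqrt(4) = 2.0, so the magnitude
# information carried by the full-precision tokens is no longer available.
small_norm = l2_norm(binarize(small_token))
large_norm = l2_norm(binarize(large_token))
```

This only demonstrates the norm-collapsing effect; it says nothing about how the paper's look-up information tables compensate for it.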

AAAI Conference 2024 Conference Paper

CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

Yabing Wang
Fan Wang
Jianfeng Dong
Hao Luo

Cross-lingual cross-modal retrieval, which aims to align vision and target language (V-T) without using any annotated V-T data pairs, has garnered increasing attention recently. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multi-lingual and multi-modal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, along with the noise present in target language translations, poses significant challenges in effectively aligning their representations. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multi-lingual pre-trained models (e.g., mBERT) and the benefits of the same modality structure, i.e., a smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval.

AAAI Conference 2024 Conference Paper

Enhancing Hyperspectral Images via Diffusion Model and Group-Autoencoder Super-resolution Network

Zhaoyang Wang
Dongyang Li
Mingyang Zhang
Hao Luo
Maoguo Gong

Existing hyperspectral image (HSI) super-resolution (SR) methods struggle to effectively capture the complex spectral-spatial relationships and low-level details, while diffusion models represent a promising generative model known for their exceptional performance in modeling complex relations and learning high and low-level visual features. The direct application of diffusion models to HSI SR is hampered by challenges such as difficulties in model convergence and protracted inference time. In this work, we introduce a novel Group-Autoencoder (GAE) framework that synergistically combines with the diffusion model to construct a highly effective HSI SR model (DMGASR). Our proposed GAE framework encodes high-dimensional HSI data into low-dimensional latent space where the diffusion model works, thereby alleviating the difficulty of training the diffusion model while maintaining band correlation and considerably reducing inference time. Experimental results on both natural and remote sensing hyperspectral datasets demonstrate that the proposed method is superior to other state-of-the-art methods both visually and metrically.

TMLR Journal 2023 Journal Article

A Survey on Transformers in Reinforcement Learning

Wenzhe Li
Hao Luo
Zichuan Lin
Chongjie Zhang
Zongqing Lu
Deheng Ye

The Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge in the use of Transformers has appeared in the domain of reinforcement learning (RL), but it faces unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. In this paper, we seek to systematically review motivations and progress on using Transformers in RL, provide a taxonomy of existing works, discuss each sub-field, and summarize future prospects.

EAAI Journal 2023 Journal Article

Counterfactual-based minority oversampling for imbalanced classification

Shu Wang
Hao Luo
Shanshan Huang
Qingsong Li
Li Liu
Guoxin Su
Ming Liu

A key challenge of oversampling in imbalanced classification is that the generation of new minority samples often neglects the majority classes, resulting in most new minority samples spreading across the whole minority space. In view of this, we present a new oversampling framework based on counterfactual theory. Our framework introduces a counterfactual objective by leveraging the rich inherent information of majority classes and explicitly perturbing majority samples to generate new samples in the territory of the minority space. It can be analytically shown that the new minority samples satisfy the minimum inversion. Therefore, most of them are located near the decision boundary. Empirical evaluation on six benchmark datasets shows that our approach clearly outperforms the state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Frequency Domain Disentanglement for Arbitrary Neural Style Transfer

Dongyang Li
Hao Luo
Pichao Wang
Zhibin Wang
Shang Liu
Fan Wang

Arbitrary neural style transfer has been a popular research topic due to its rich application scenarios. Effective disentanglement of content and style is the critical factor for synthesizing an image with arbitrary style. Existing methods focus on disentangling feature representations of content and style in the spatial domain, where the content and style components are innately entangled and difficult to disentangle cleanly. Therefore, these methods often suffer from low-quality results because of the sub-optimal disentanglement. To address this challenge, this paper proposes the frequency mixer (FreMixer) module, which disentangles and re-entangles the frequency spectrum of content and style components in the frequency domain. Since content and style components have different frequency-domain characteristics (frequency bands and frequency patterns), the FreMixer can disentangle these two components well. Based on the FreMixer module, we design a novel Frequency Domain Disentanglement (FDD) framework for arbitrary neural style transfer. Qualitative and quantitative experiments verify that the proposed method can render better stylized results compared to the state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Good Helper Is around You: Attention-Driven Masked Image Modeling

Zhengqi Liu
Jie Gui
Hao Luo

Masked image modeling (MIM) has shown huge potential in self-supervised learning in the past year. Benefiting from the universal backbone vision transformer, MIM learns self-supervised visual representations by masking a subset of image patches while attempting to recover the missing pixels. Most previous works mask patches of the image randomly, which underutilizes the semantic information that is beneficial to visual representation learning. In addition, due to the large size of the backbone, most previous works have to spend much time on pre-training. In this paper, we propose the Attention-driven Masking and Throwing Strategy (AMT), which addresses both problems. We first leverage the self-attention mechanism to obtain the semantic information of the image automatically during the training process without using any supervised methods. The masking strategy can be guided by that information to mask areas selectively, which is helpful for representation learning. Moreover, a redundant patch throwing strategy is proposed, which makes learning more efficient. As a plug-and-play module for masked image modeling, AMT improves the linear probing accuracy of MAE by 2.9% ~ 5.9% on CIFAR-10/100, STL-10, Tiny ImageNet, and ImageNet-1K, and improves the fine-tuning accuracy of MAE and SimMIM. Moreover, this design also achieves superior performance on downstream detection and segmentation tasks.

AIIM Journal 2023 Journal Article

SDA-Net: Self-distillation driven deformable attentive aggregation network for thyroid nodule identification in ultrasound images

Minglei Li
Hang Zhou
Xiang Li
Pengfei Yan
Yuchen Jiang
Hao Luo
Xianli Zhou
Shen Yin

Early detection and accurate identification of thyroid nodules are major challenges in controlling and treating thyroid cancer, and can be difficult even for expert physicians. Currently, many computer-aided diagnosis (CAD) systems have been developed to assist this clinical process. However, most of these systems cannot adequately capture geometrically diverse thyroid nodule representations from ultrasound images with subtle and varied characteristic differences, resulting in suboptimal diagnosis and a lack of clinical interpretability, which may affect their credibility in the clinic. In this context, a novel end-to-end network equipped with a deformable attention network and a distillation-driven interaction aggregation module (DIAM) is developed for thyroid nodule identification. The deformable attention network learns to identify discriminative features of nodules under the guidance of the deformable attention module (DAM) and an online class activation mapping (CAM) mechanism, and suggests the location of diagnostic features to provide interpretable predictions. DIAM is designed to take advantage of the complementarities of adjacent layers, thus enhancing the representation capabilities of aggregated features; driven by an efficient self-distillation mechanism, the identification process is complemented with more multi-scale semantic information to calibrate the diagnosis results. Experimental results on a large dataset with varying nodule appearances show that the proposed network can achieve competitive performance in nodule diagnosis and provide interpretability suitable for clinical needs.

AIIM Journal 2022 Journal Article

Lesion-attention pyramid network for diabetic retinopathy grading

Xiang Li
Yuchen Jiang
Jiusi Zhang
Minglei Li
Hao Luo
Shen Yin

As one of the most common diabetic complications, diabetic retinopathy (DR) can cause retinal damage, vision loss and even blindness. Automated DR grading technology has important clinical significance, as it can help ophthalmologists achieve rapid and early diagnosis. With the popularity of deep learning, DR grading based on convolutional neural networks (CNNs) has become the mainstream method. Unfortunately, although the CNN-based method can achieve satisfactory diagnostic accuracy, it lacks significant clinical information. In this paper, a lesion-attention pyramid network (LAPN) is presented. The pyramid network integrates subnetworks with different resolutions to obtain multi-scale features. In order to take the lesion regions in the high-resolution image as the diagnostic evidence, the low-resolution network calculates the lesion activation map (using a weakly-supervised localization method) and guides the high-resolution network to concentrate on the lesion regions. Furthermore, a lesion attention module (LAM) is designed to capture the complementary relationship between the high-resolution features and the low-resolution features, and to fuse the lesion activation map. Experimental results show that the proposed scheme outperforms other existing approaches, and that the proposed method can provide a lesion activation map with lesion consistency as additional evidence for clinical diagnosis.

AAAI Conference 2022 Conference Paper

Scaled ReLU Matters for Training Vision Transformers

Pichao Wang
Xue Wang
Hao Luo
Jingkai Zhou
Zhipeng Zhou
Fan Wang
Hao Li
Rong Jin

Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as the learning rate, optimizer and warmup epochs. The reasons for this training difficulty are empirically analysed in the paper Early Convolutions Help Transformers See Better, whose authors conjecture that the issue lies with the patchify-stem of ViT models. In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not help stabilize training; rather, the scaled ReLU operation in the convolutional stem (conv-stem) matters. We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stability, but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding few parameters and FLOPs. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute for CNNs.

NeurIPS Conference 2022 Conference Paper

VTC-LFC: Vision Transformer Compression with Low-Frequency Components

Zhenyu Wang
Hao Luo
Pichao Wang
Feng Ding
Fan Wang
Hao Li

Although Vision transformers (ViTs) have recently dominated many vision tasks, deploying ViT models on resource-limited devices remains a challenging problem. To address such a challenge, several methods have been proposed to compress ViTs. Most of them borrow experience in convolutional neural networks (CNNs) and mainly focus on the spatial domain. However, the compression only in the spatial domain suffers from a dramatic performance drop without fine-tuning and is not robust to noise, as the noise in the spatial domain can easily confuse the pruning criteria, leading to some parameters/channels being pruned incorrectly. Inspired by recent findings that self-attention is a low-pass filter and low-frequency signals/components are more informative to ViTs, this paper proposes compressing ViTs with low-frequency components. Two metrics named low-frequency sensitivity (LFS) and low-frequency energy (LFE) are proposed for better channel pruning and token pruning. Additionally, a bottom-up cascade pruning scheme is applied to compress different dimensions jointly. Extensive experiments demonstrate that the proposed method could save 40% ~ 60% of the FLOPs in ViTs, thus significantly increasing the throughput on practical devices with less than 1% performance drop on ImageNet-1K.

TIST Journal 2019 Journal Article

Crowdsourcing Mechanism for Trust Evaluation in CPCS Based on Intelligent Mobile Edge Computing

Tian Wang
Hao Luo
Xi Zheng
Mande Xie

Both academia and industry have directed tremendous interest toward the combination of Cyber Physical Systems and Cloud Computing, which enables a new breed of applications and services. However, due to the relatively long distance between the remote cloud and end nodes, Cloud Computing cannot provide effective and direct management for end nodes, which leads to security vulnerabilities. In this article, we first propose a novel trust evaluation mechanism using crowdsourcing and Intelligent Mobile Edge Computing. Mobile edge users with relatively strong computation and storage ability are exploited to provide direct management for end nodes. Through close access to end nodes, mobile edge users can obtain various information about the end nodes and determine whether a node is trustworthy. Then, two incentive mechanisms, i.e., the Trustworthy Incentive and Quality-Aware Trustworthy Incentive Mechanisms, are proposed for motivating mobile edge users to conduct trust evaluation. The first one aims to motivate edge users to upload their real information about their capability and costs. The purpose of the second one is to motivate edge users to make a trustworthy effort to conduct tasks and report results. Detailed theoretical analysis demonstrates the validity of the Quality-Aware Trustworthy Incentive Mechanism in terms of data trustfulness, effort trustfulness, and quality trustfulness, respectively. Extensive experiments are carried out to validate the proposed trust evaluation and incentive mechanisms. The results corroborate that the proposed mechanisms can efficiently stimulate mobile edge users to perform evaluation tasks and improve the accuracy of trust evaluation.

AAAI Conference 2019 Conference Paper

Detect or Track: Towards Cost-Effective Video Object Detection/Tracking

Hao Luo
Wenxuan Xie
Xinggang Wang
Wenjun Zeng

State-of-the-art object detectors and trackers are developing fast. Trackers are in general more efficient than detectors but bear the risk of drifting. A question is hence raised – how to improve the accuracy of video object detection/tracking by utilizing the existing detectors and trackers within a given time budget? A baseline is frame skipping – detecting on every N-th frame and tracking on the frames in between. This baseline, however, is suboptimal since the detection frequency should depend on the tracking quality. To this end, we propose a scheduler network, which decides whether to detect or track at a certain frame, as a generalization of Siamese trackers. Although lightweight and simple in structure, the scheduler network is more effective than the frame skipping baselines and flow-based approaches, as validated on the ImageNet VID dataset in video object detection/tracking.
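
The frame-skipping baseline described in this abstract can be sketched as follows. This is a minimal illustration of the baseline only, not the proposed scheduler network: the function name `frame_skipping` and the `detect`/`track` callables are hypothetical stand-ins, and the toy detector and tracker exist purely to make the schedule visible.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def frame_skipping(
    frames: Sequence[Frame],
    detect: Callable[[Frame], Result],
    track: Callable[[Frame, Result], Result],
    n: int,
) -> Tuple[List[Result], List[str]]:
    """Run the detector on every n-th frame and the tracker in between."""
    if n < 1:
        raise ValueError("n must be at least 1")
    results: List[Result] = []
    schedule: List[str] = []
    previous = None  # result from the preceding frame, fed to the tracker
    for index, frame in enumerate(frames):
        if index % n == 0:
            # Expensive but drift-free: re-detect from scratch.
            previous = detect(frame)
            schedule.append("detect")
        else:
            # Cheap but may drift: propagate the previous result.
            previous = track(frame, previous)
            schedule.append("track")
        results.append(previous)
    return results, schedule


# Toy stand-ins (assumptions for illustration, not real models).
def toy_detect(frame: int) -> int:
    return frame * 10


def toy_track(frame: int, previous: int) -> int:
    return previous + 1


demo_results, demo_schedule = frame_skipping(range(5), toy_detect, toy_track, n=3)
# demo_schedule == ["detect", "track", "track", "detect", "track"]
# demo_results  == [0, 1, 2, 30, 31]
```

The fixed `index % n` rule is what the abstract calls suboptimal: the decision ignores how well tracking is going, whereas a learned scheduler would replace that condition with a per-frame prediction.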
