Arrow Research search

Author name cluster

Ke Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

73 papers
2 author rows

Possible papers

73

AAAI Conference 2026 Conference Paper

Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations

  • Jinwei Chi
  • Ke Wang
  • Yu Chen
  • Xuanye Lin
  • Qiang Xu

Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs’ activations on the cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in the evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.
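The probing setup the abstract describes can be sketched as fitting a linear probe on hidden-state features. The snippet below is a minimal illustration only: the random "activations", the labels, and the dimensions are stand-in assumptions, not the paper's actual extraction pipeline.

```python
# Illustrative sketch: fit a linear probe on (stand-in) intermediate-layer
# activations to discriminate essay quality. Real LLM feature extraction
# is replaced here with synthetic vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_essays, hidden_dim = 200, 64

# Stand-in for mean-pooled layer-k activations of each essay.
activations = rng.normal(size=(n_essays, hidden_dim))
# Binary quality labels correlated with one direction in activation space.
direction = rng.normal(size=hidden_dim)
labels = (activations @ direction + rng.normal(scale=0.5, size=n_essays) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

If the activations carry a discriminative direction, a simple linear probe of this form recovers it with high accuracy, which is the kind of signal the paper measures.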

AAAI Conference 2026 Conference Paper

From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

  • Weikang Shi
  • Houxing Ren
  • Junting Pan
  • Aojun Zhou
  • Ke Wang
  • Zimu Lu
  • Yunqiao Yang
  • Yuxuan Hu

Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

AAAI Conference 2026 Conference Paper

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes

  • Fudong Ge
  • Jin Gao
  • Hanshi Wang
  • Yiwei Zhang
  • Ke Wang
  • Weiming Hu
  • Zhipeng Zhang

This paper tackles the challenging task of achieving storage-efficient yet high-fidelity motion representation in large-scale dynamic 3D Gaussian Splatting. Our motivation stems from the fact that existing urban-scale methods, which rely on massive and unstructured individual Gaussians for scene modeling, face a critical scalability bottleneck. Inspired by recent advances in 3DGS-based compression beyond autonomous driving, we address this challenge by leveraging the compression capability of anchor-driven methods. However, this is non-trivial as our exploratory experiments reveal that the direct application of this paradigm to dynamic, large-scale urban scenes results in performance degradation. We attribute this phenomenon to the hierarchical anchor design that severely loses dynamic information. To this end, we propose Hierarchical Dynamic Gaussian Splatting (HDGS), a novel framework designed to adapt the anchor-based Gaussian paradigm to 4D urban environments. We first establish a local support network to reinforce inter-anchor consistency, mitigating geometric and appearance fractures caused by supervision attenuation in deep hierarchies. Then, we handle heterogeneous object motion via coarse-to-fine decomposition, where high-level anchors model coarse dynamics and low-level anchors refine them with residual deformations. Third, we introduce a hybrid supervision scheme that fuses global geometric constraints and local pixel-level cues to alleviate geometrically inconsistent reconstruction under sparse LiDAR. Extensive experiments show that HDGS reduces storage by 69.0% while maintaining or even improving rendering fidelity compared to state-of-the-art methods.

AAAI Conference 2026 Conference Paper

Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

  • Zimao Lu
  • Hui Xu
  • Bing Liu
  • Ke Wang

Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities—objects that appear in a generated caption but are absent from the input—and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC.

AAAI Conference 2026 Conference Paper

Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling

  • Adam Hazimeh
  • Ke Wang
  • Mark Collier
  • Gilles Baechler
  • Efi Kokiopoulou
  • Pascal Frossard

Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts the attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069, and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.

IJCAI Conference 2025 Conference Paper

Accelerating Adversarial Training on Under-Utilized GPU

  • Zhuoxin Zhan
  • Ke Wang
  • Pulei Xiong

Deep neural networks are vulnerable to adversarial attacks, and adversarial training has been proposed to defend against such attacks by adaptively generating attacks, i.e., adversarial examples, during training. However, adversarial training is significantly slower than traditional training due to the search for worst-case attacks for each minibatch. To speed up adversarial training, existing work has considered a subset of a minibatch for generating attacks and reduced the steps in the search for attacks. We propose a novel adversarial training acceleration method, called AttackRider, that explores under-utilized GPU hardware to reduce the number of calls to attack generation without increasing the time of each call. We characterize the extent of GPU under-utilization for a given GPU and model size, hence the potential for speedup, and present the application scenarios where this opportunity exists. The results on various machine learning tasks and datasets show that AttackRider can speed up state-of-the-art adversarial training algorithms with comparable robust accuracy. The source code of AttackRider is available at https://github.com/zxzhan/AttackRider.

ECAI Conference 2025 Conference Paper

ASMA-Tune: Unlocking LLMs' Assembly Code Comprehension via Structural-Semantic Instruction Tuning

  • Xinyi Wang
  • Jiashui Wang
  • Jinbo Su
  • Ke Wang
  • Peng Chen
  • Yanming Liu
  • Long Liu
  • Xiang Li

Assembly code analysis and comprehension play critical roles in applications like reverse engineering, yet they face substantial challenges due to low information density and a lack of explicit syntactic structures. While traditional masked language modeling (MLM) approaches do not explicitly focus on natural language interaction, emerging decoder-focused large language models (LLMs) demonstrate partial success in binary analysis yet remain underexplored for holistic comprehension. We present Assembly Augmented Tuning (ASMA-Tune), an end-to-end structural-semantic instruction tuning framework that synergizes encoder architecture with decoder-based LLMs through a projector module, where the assembly encoder extracts hardware-level structural features, the projector bridges representations with the semantic space, and the instruction-tuned LLM preserves natural language capabilities. Experimental results demonstrate three key advantages: (1) State-of-the-art performance in assembly comprehension with +39.7% Recall@1 and +17.8% MRR improvements over GPT-4-Turbo, (2) Consistent enhancements across base models (24.6–107.4% Recall@1 and 15.2–106.3% MRR on Qwen2.5-Coder, Deepseek-Coder and CodeLlama variants), and (3) Superior instruction-following capabilities (41.5%–118% improvements) with controlled code generation degradation (–8.9% to –35% across architectures).

AAAI Conference 2025 Conference Paper

Beyond Spatial Domain: Cross-domain Promoted Fourier Convolution Helps Single Image Dehazing

  • Xiaozhe Zhang
  • Haidong Ding
  • Fengying Xie
  • Linpeng Pan
  • Yue Zi
  • Ke Wang
  • Haopeng Zhang

Vanilla convolution and window-based self-attention have shown significant success in image dehazing. However, they are constrained by limited receptive fields and ignore frequency gaps between dehazed and clear images. The former hampers the modeling of global dependencies, while the latter impedes the learning of high-frequency features, leading to suboptimal performance. In this paper, we propose the Joint Spatial and Fourier Convolutional Network (JSFC-Net), which leverages Fourier transformation to simultaneously address the two aforementioned problems with low computational overhead. We introduce the Frequency-Spatial Promoted and Physical Learning Block, which extracts high-level features from the spatial domain and frequency domain in parallel. We design a simple yet effective solution that uses spatial features to promote and modulate frequency features in a multi-scale manner, achieving refinement of frequency features and addressing the robustness issue caused by global sensitivity. Additionally, we present the Receptive Field Selection Module to facilitate improved fusion of spatial and frequency domain features. Finally, we introduce a frequency loss to further narrow frequency gaps. Comprehensive experiments on multiple datasets demonstrate that JSFC-Net is significantly superior to SOTA dehazing methods.

NeurIPS Conference 2025 Conference Paper

COME: Adding Scene-Centric Forecasting Control to Occupancy World Model

  • Yining Shi
  • Kun Jiang
  • Qiang Meng
  • Ke Wang
  • Jiabao Wang
  • Wenchao Sun
  • Tuopu Wen
  • MengMeng Yang

World models are critical for autonomous driving to simulate environmental dynamics and generate synthetic data. Existing methods struggle to disentangle ego-vehicle motion (perspective shifts) from scene evolution (agent interactions), leading to suboptimal predictions. Instead, we propose to separate environmental changes from ego-motion by leveraging scene-centric coordinate systems. In this paper, we introduce COME: a framework that integrates scene-centric forecasting Control into the Occupancy world ModEl. Specifically, COME first generates ego-irrelevant, spatially consistent future features through a scene-centric prediction branch, which are then converted into a scene condition using a tailored ControlNet. These condition features are subsequently injected into the occupancy world model, enabling more accurate and controllable future occupancy predictions. Experimental results on the nuScenes-Occ3D dataset show that COME achieves consistent and significant improvements over state-of-the-art (SOTA) methods across diverse configurations, including different input sources (ground-truth, camera-based, fusion-based occupancy) and prediction horizons (3s and 8s). For example, under the same settings, COME achieves 26.3% better mIoU than DOME and 23.7% better mIoU than UniScene. These results highlight the efficacy of disentangled representation learning in enhancing spatio-temporal prediction fidelity for world models. Code is available at https://github.com/synsin0/COME.

ICLR Conference 2025 Conference Paper

DEPfold: RNA Secondary Structure Prediction as Dependency Parsing

  • Ke Wang
  • Shay B. Cohen

RNA secondary structure prediction is critical for understanding RNA function but remains challenging due to complex structural elements like pseudoknots and limited training data. We introduce DEPfold, a novel deep learning approach that re-frames RNA secondary structure prediction as a dependency parsing problem. DEPfold presents three key innovations: (1) a biologically motivated transformation of RNA structures into labeled dependency trees, (2) a biaffine attention mechanism for joint prediction of base pairings and their types, and (3) an optimal tree decoding algorithm that enforces valid RNA structural constraints. Unlike traditional energy-based methods, DEPfold learns directly from annotated data and leverages pretrained language models to predict RNA structure. We evaluate DEPfold on both within-family and cross-family RNA datasets, demonstrating significant performance improvements over existing methods. DEPfold shows strong performance in cross-family generalization when trained on data augmented by traditional energy-based models, outperforming existing methods on the bpRNAnew dataset. This demonstrates DEPfold’s ability to effectively learn structural information beyond what traditional methods capture. Our approach bridges natural language processing (NLP) with RNA biology, providing a computationally efficient and adaptable tool for advancing RNA structure prediction and analysis.
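The core reframing in the abstract, turning base pairings into dependency arcs, can be illustrated with a toy conversion from dot-bracket notation. The arc scheme and label below are hypothetical simplifications; DEPfold's actual labeled-tree transformation is the paper's own.

```python
# Illustrative sketch (not DEPfold's exact scheme): convert an RNA
# secondary structure in dot-bracket notation into dependency-style
# (head, label) arcs, one arc per base pair.
def dotbracket_to_arcs(structure):
    stack, arcs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()          # matching opening base
            arcs[i] = (j, "pair")    # closing base depends on its partner
    return arcs

print(dotbracket_to_arcs("((..))"))  # {4: (1, 'pair'), 5: (0, 'pair')}
```

Once pairings are expressed as head-dependent arcs like this, standard dependency-parsing machinery (biaffine scoring, tree decoding) applies directly, which is the bridge to NLP the abstract describes.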

NeurIPS Conference 2025 Conference Paper

Each Complexity Deserves a Pruning Policy

  • Hanshi Wang
  • Yuhao Xu
  • Zekun Xu
  • Jin Gao
  • Yufan Liu
  • Weiming Hu
  • Ke Wang
  • Zhipeng Zhang

The established redundancy in visual tokens within large vision–language models (LVLMs) allows pruning to effectively reduce their substantial computational demands. Empirical evidence from previous works indicates that visual tokens in later decoder stages receive less attention than in shallow layers. Accordingly, previous methods typically employ heuristic layer-specific pruning strategies in which, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model’s holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in LVLMs. This observation strongly suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, and then projects this signal onto a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, is shown to correspond effectively with the specific complexity of different tasks, and can easily guarantee adherence to pre-defined computational constraints. We evaluate AutoPrune not only on standard vision-language tasks but also on Vision-Language-Action (VLA) models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8%, while still retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop (CVPR'2025), demonstrating its effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
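The budget-constrained logistic retention curve described above can be sketched numerically. The complexity-to-midpoint mapping, the steepness, and the budget-rescaling step below are illustrative assumptions, not the paper's actual parameterization.

```python
# Illustrative sketch: map a sample-complexity score to a per-layer
# visual-token retention schedule via a logistic curve, then rescale the
# curve so its mean meets a global token budget.
import numpy as np

def retention_schedule(complexity, n_layers=32, budget=0.25, steepness=8.0):
    """Retention fraction per decoder layer.

    complexity in [0, 1]: higher complexity delays pruning (later midpoint).
    budget: target mean retention across all layers.
    """
    depth = np.linspace(0.0, 1.0, n_layers)
    midpoint = 0.2 + 0.6 * complexity          # harder inputs prune later
    curve = 1.0 / (1.0 + np.exp(steepness * (depth - midpoint)))
    curve *= budget / curve.mean()             # enforce the mean budget
    return np.clip(curve, 0.0, 1.0)

easy = retention_schedule(0.1)
hard = retention_schedule(0.9)
print(easy.mean(), hard.mean())  # both meet the 0.25 budget
```

The key property this captures is that different complexities yield differently shaped schedules (early versus late pruning) while every schedule satisfies the same pre-defined compute constraint.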

ICLR Conference 2025 Conference Paper

LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging

  • Ke Wang
  • Nikolaos Dimitriadis
  • Alessandro Favero
  • Guillermo Ortiz-Jiménez
  • François Fleuret
  • Pascal Frossard

Fine-tuning pre-trained models has become the standard approach to endow them with specialized knowledge, but it poses fundamental challenges. In particular, (i) fine-tuning often leads to catastrophic forgetting, where improvements on a target domain degrade generalization on other tasks, and (ii) merging fine-tuned checkpoints from disparate tasks can lead to significant performance loss. To address these challenges, we introduce LiNeS, Layer-increasing Network Scaling, a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance. LiNeS scales parameter updates linearly based on their layer depth within the network, maintaining shallow layers close to their pre-trained values to preserve general features while allowing deeper layers to retain task-specific representations. In multi-task model merging scenarios, layer-wise scaling of merged parameters reduces negative task interference. LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing. It mitigates forgetting, enhances out-of-distribution generalization, integrates seamlessly with existing multi-task model merging baselines, improving their performance across benchmarks and model sizes, and can boost generalization when merging LLM policies aligned with different rewards via RLHF. Our method is simple to implement, computationally efficient, and complementary to many existing techniques. Our source code is available at github.com/wang-kee/LiNeS.
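The depth-dependent scaling described in the abstract can be sketched as blending per-layer fine-tuning residuals with linearly increasing coefficients. The coefficient form and the `alpha` floor below are illustrative assumptions in the spirit of LiNeS, not the paper's exact formula, and the checkpoints are stubbed with toy tensors.

```python
# Illustrative sketch: scale each layer's fine-tuning update (task vector)
# by a coefficient that grows linearly with depth, so shallow layers stay
# near the pre-trained weights and deep layers keep task-specific changes.
import numpy as np

def lines_scale(pretrained, finetuned, alpha=0.2):
    """Blend per-layer updates with depth-increasing coefficients."""
    n = len(pretrained)
    edited = []
    for layer_idx, (w0, w1) in enumerate(zip(pretrained, finetuned)):
        coeff = alpha + (1.0 - alpha) * layer_idx / (n - 1)
        edited.append(w0 + coeff * (w1 - w0))  # w0 + coeff * task vector
    return edited

# Toy 4-layer "checkpoints": pre-trained at 0, fine-tuned at 1.
pre = [np.zeros(4) for _ in range(4)]
fin = [np.ones(4) for _ in range(4)]
out = lines_scale(pre, fin, alpha=0.2)
print([w[0] for w in out])  # depth-increasing: shallow layers stay near 0
```

The same per-layer scaling applies unchanged to a merged multi-task parameter delta, which is how the post-training edit carries over to model merging.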

NeurIPS Conference 2025 Conference Paper

MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

  • Ke Wang
  • Yiming Qin
  • Nikolaos Dimitriadis
  • Alessandro Favero
  • Pascal Frossard

Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably—without retraining or forgetting previous information—remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through data-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks across LLaMA-3 and Mistral demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
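The sparse-masking idea described above can be sketched with a toy residual memory: a data-dependent top-k mask routes each edit to a small subset of memory slots, and retrieval compares a query's mask against the stored ones. The vector memory, top-k selection, and overlap-based lookup are illustrative simplifications, not MEMOIR's actual architecture.

```python
# Toy sketch: data-dependent top-k masks confine each edit to a distinct
# subset of a residual memory; mask overlap identifies relevant edits.
import numpy as np

def topk_mask(activation, k):
    """Boolean mask selecting the k largest-magnitude coordinates."""
    mask = np.zeros_like(activation, dtype=bool)
    mask[np.argsort(np.abs(activation))[-k:]] = True
    return mask

rng = np.random.default_rng(1)
dim, k = 32, 4
memory = np.zeros(dim)
stored = []

for _ in range(3):                       # three sequential "edits"
    act = rng.normal(size=dim)
    m = topk_mask(act, k)
    memory[m] += 0.1                     # each edit touches only k slots
    stored.append(m)

query = rng.normal(size=dim)
qm = topk_mask(query, k)
overlaps = [int((qm & m).sum()) for m in stored]
print(overlaps)  # overlap with each stored edit; the max picks the match
```

Because each edit writes to at most k slots, interference between edits is bounded by mask overlap, which is the property the abstract's locality and minimal-forgetting claims rest on.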

JBHI Journal 2025 Journal Article

Multi-dimensional Feature-Guided Cross-Population Human Activity Recognition and Prediction

  • Renbo Liu
  • Yangfei Zhao
  • Pei Lv
  • Ke Wang
  • Weifeng Zhang
  • Zhaoyang Ge
  • Mingliang Xu

With the rapid development of wearable devices and intelligent sensing technologies, the demand for human behavior recognition in rehabilitation medicine and human-machine collaboration has been increasing. To address the issue of high variability in gait features caused by individual differences in cross-population gait analysis, and to tackle the insufficient generalization ability of models due to the coupling of pathological features with normal gait, we propose a multi-dimensional spatiotemporal feature-guided SG-LSTM framework, based on a dual-branch architecture comprising symmetric LSTM (S-LSTM) and grouped LSTM (G-LSTM) networks, for cross-population lower-limb activity recognition and prediction. On the one hand, the S-LSTM module with a symmetric input structure is used to explicitly model the spatiotemporal symmetry of lower-limb joints in normal gait. On the other hand, the G-LSTM module with a joint functional grouping strategy and local motion decoupling is employed to explicitly model the abnormal motion coupling of lower-limb joints in pathological gait. Furthermore, a dynamically weighted multi-task loss function is designed to jointly optimize gait trajectory prediction and classification tasks, allowing the framework to simultaneously produce both outputs and enhance the adaptability of the model. Extensive experiments on our self-constructed gait dataset as well as the HuGaDB and WearGait-PD datasets demonstrate that the proposed method not only outperforms several existing approaches in cross-population human behavior prediction and gait recognition, but also holds potential clinical application value, achieving state-of-the-art (SOTA) performance.

NeurIPS Conference 2025 Conference Paper

Online Segment Any 3D Thing as Instance Tracking

  • Hanshi Wang
  • Cai Zijian
  • Jin Gao
  • Yiwei Zhang
  • Weiming Hu
  • Ke Wang
  • Zhipeng Zhang

Online, real-time, and fine-grained 3D segmentation constitutes a fundamental capability for embodied intelligent agents to perceive and comprehend their operational environments. Recent advancements employ predefined object queries to aggregate semantic information from Vision Foundation Models (VFMs) outputs that are lifted into 3D point clouds, facilitating spatial information propagation through inter-query interactions. Nevertheless, perception, whether human or robotic, is an inherently dynamic process, rendering temporal understanding a critical yet overlooked dimension within these prevailing query-based pipelines. This deficiency in temporal reasoning can exacerbate issues such as the over-segmentation commonly produced by VFMs, necessitating more handcrafted post-processing. Therefore, to further unlock the temporal environmental perception capabilities of embodied agents, our work reconceptualizes online 3D segmentation as an instance tracking problem (AutoSeg3D). Our core strategy involves utilizing object queries for temporal information propagation, where long-term instance association promotes the coherence of features and object identities, while short-term instance update enriches instant observations. Given that viewpoint variations in embodied robotics often lead to partial object visibility across frames, this mechanism aids the model in developing a holistic object understanding beyond incomplete instantaneous views. Furthermore, we introduce spatial consistency learning to mitigate the fragmentation problem inherent in VFMs, yielding more comprehensive instance information for enhancing the efficacy of both long-term and short-term temporal learning. The temporal information exchange and consistency learning facilitated by these sparse object queries not only enhance spatial comprehension but also circumvent the computational burden associated with dense temporal point cloud interactions. 
Our method establishes a new state-of-the-art, surpassing ESAM by 2.8 AP on ScanNet200 and delivering consistent gains on ScanNet, SceneNN, and 3RScan datasets, corroborating that identity-aware temporal reasoning is a crucial, previously underemphasized component for robust 3D segmentation in real-time embodied intelligence. Code is at https://github.com/AutoLab-SAI-SJTU/AutoSeg3D.

JBHI Journal 2025 Journal Article

SeqNovo: De Novo Peptide Sequencing Prediction in IoMT via Seq2Seq

  • Ke Wang
  • Mingjia Zhu
  • Wadii Boulila
  • Maha Driss
  • Thippa Reddy Gadekallu
  • Chien-Ming Chen
  • Lei Wang
  • Saru Kumari

In the Internet of Medical Things (IoMT), de novo peptide sequencing prediction is one of the most important techniques for the fields of disease prediction, diagnosis, and treatment. Recently, deep-learning-based peptide sequencing prediction has been a new trend. However, most popular deep learning models for peptide sequencing prediction suffer from poor interpretability and poor ability to capture long-range dependencies. To solve these issues, we propose a model named SeqNovo, which has the encoding-decoding structure of sequence to sequence (Seq2Seq), the highly nonlinear properties of the multilayer perceptron (MLP), and the ability of the attention mechanism to capture long-range dependencies. SeqNovo uses an MLP to improve feature extraction and utilizes the attention mechanism to discover key information. A series of experiments have been conducted to show that SeqNovo is superior to the Seq2Seq benchmark model, DeepNovo. SeqNovo improves both the accuracy and interpretability of the predictions, which is expected to support more related research.

IROS Conference 2025 Conference Paper

SplatPose: Geometry-Aware 6-DoF Pose Estimation from Single RGB Image via 3D Gaussian Splatting

  • Linqi Yang
  • Xiongwei Zhao
  • Qihao Sun
  • Ke Wang
  • Ao Chen
  • Peng Kang

6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.

TMLR Journal 2025 Journal Article

Step-Controlled DPO: Leveraging Stepwise Errors for Enhancing Mathematical Reasoning of Language Models

  • Zimu Lu
  • Aojun Zhou
  • Ke Wang
  • Houxing Ren
  • Weikang Shi
  • Yunqiao Yang
  • Junting Pan
  • Mingjie Zhan

Direct Preference Optimization (DPO) has proven effective at improving the performance of large language models (LLMs) on downstream tasks such as reasoning and alignment. In this work, we propose Step-Controlled DPO (SCDPO), a method for automatically providing stepwise error supervision by creating negative samples of mathematical reasoning rationales that start making errors at a specified step. By applying these samples in DPO training, SCDPO can better align the model to avoid reasoning errors and output accurate reasoning steps. Qualitative analysis of the credit assignment of SCDPO and DPO demonstrates the effectiveness of SCDPO at identifying errors in mathematical solutions. We then apply SCDPO to an InternLM2-20B model, resulting in a 20B model that achieves competitive scores of 88.5% on GSM8K and 58.1% on MATH, rivaling all other open-source LLMs, showing the great potential of our method. The code, models and data are released to inspire future work.

ICRA Conference 2025 Conference Paper

The Devil is in the Quality: Exploring Informative Samples for Semi-Supervised Monocular 3D Object Detection

  • Zhipeng Zhang
  • Zhenyu Li 0007
  • Hanshi Wang
  • Yuan He
  • Ke Wang
  • Heng Fan 0001

This paper tackles the challenging problem of semi-supervised monocular 3D object detection with a general framework. Specifically, having observed that the bottleneck of this task lies in lacking reliable and informative samples from unlabeled data for detector learning, we introduce a novel, simple yet effective ‘Augment and Criticize’ pipeline that mines abundant informative samples for robust detection. In the ‘Augment’ stage, we present the Augmentation-based Prediction aGgregation (APG), which applies automatically learned transformations to unlabeled images and aggregates detections from various augmented views as pseudo labels. Since not all the pseudo labels from APG are beneficially informative, the subsequent ‘Criticize’ phase is introduced. Particularly, we present the Critical Retraining Strategy (CRS) that, unlike simply filtering pseudo labels using a fixed threshold, employs a learnable network to evaluate the contribution of unlabeled images at different training timestamps. This way, the noisy samples detrimental to model evolution can be effectively suppressed. In order to validate ‘Augment-Criticize’, we apply it to MonoDLE [1] and MonoFlex [2], and the two new detectors, dubbed 3DSeMo DLE and 3DSeMo FLEX, achieve state-of-the-art results with consistent improvements, evidencing its effectiveness and generality.

AAAI Conference 2025 Conference Paper

Vision Transformers Beat WideResNets on Small Scale Datasets Adversarial Robustness

  • Juntao Wu
  • Ziyu Song
  • Xiaoyu Zhang
  • Shujun Xie
  • Longxin Lin
  • Ke Wang

For an extensive period, Vision Transformers (ViTs) have been deemed unsuitable for attaining robust performance on small-scale datasets, with WideResNet models maintaining dominance in this domain. While WideResNet models have persistently set the state-of-the-art (SOTA) benchmarks for robust accuracy on datasets such as CIFAR-10 and CIFAR-100, this paper challenges the prevailing belief that only WideResNet can excel in this context. We pose the critical question of whether ViTs can surpass the robust accuracy of WideResNet models. Our results provide a resounding affirmative answer. By employing ViT, enhanced with data generated by a diffusion model for adversarial training, we demonstrate that ViTs can indeed outshine WideResNet in terms of robust accuracy. Specifically, under the ℓ∞-norm threat model with ε = 8/255, our approach achieves robust accuracies of 74.97% on CIFAR-10 and 44.07% on CIFAR-100, representing improvements of +3.9% and +1.4%, respectively, over the previous SOTA models. Notably, our ViT-B/2 model, with 3 times fewer parameters, surpasses the previously best-performing WRN-70-16. Our achievement opens a new avenue, suggesting that future models employing ViTs or other novel efficient architectures could eventually replace the long-dominant WRN models.

IROS Conference 2025 Conference Paper

VLIN-RL: A Unified Vision-Language Interpreter and Reinforcement Learning Motion Planner Framework for Robot Dynamic Tasks

  • Zewu Jiang
  • Junnan Zhang
  • Ke Wang
  • Chenyi Si

Recently, with the development of Large Language Models (LLMs), Embodied AI represented by Vision-Language-Action Models (VLAs) has played a significant role in realizing natural language interaction between humans and robots. Current VLA models can process and understand visual information and language instructions, guiding robots to complete interactive tasks with the environment based on human language instructions. However, when tackling real-time and dynamic tasks, VLAs show poor robustness and limited real-time planning and adjustment ability against changes in target objects, instructions, and environments. To address these limitations, we propose VLIN-RL, a unified framework that consists of a Vision-Language Interpreter (VLIN), with strong vision-language understanding and high-level task planning abilities, and a reinforcement learning (RL)-based motion planner with enhanced flexibility and broader applicability. If the environmental state changes during task execution, the RL planning module in VLIN-RL directly makes dynamic adjustments at the subtask level based on visual feedback to achieve the task goals, without time-consuming reprocessing by VLIN. Experiments demonstrate that our model can complete multi-robot manipulation tasks more efficiently and stably. Finally, our work is verified through pick-grasp tasks and experiments on real manipulators. The test video is available at https://github.com/jzwsoulferryman/VLIN-RL.git.

NeurIPS Conference 2025 Conference Paper

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

  • Zimu Lu
  • Yunqiao Yang
  • Houxing Ren
  • Haotian Hou
  • Han Xiao
  • Ke Wang
  • Weikang Shi
  • Aojun Zhou

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks (Bolt.diy, OpenHands, and Aider) using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of the training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model. We release our data-generation, training, and testing code, along with both the datasets and model weights, at https://github.com/mnluzimu/WebGen-Bench.

AAAI Conference 2024 Conference Paper

Learning from Failure: Improving Meeting Summarization without Good Samples

  • Ke Wang
  • Xiutian Zhao
  • Wei Peng

Existing methods for aligning language models with various human needs rely heavily on high-quality, task-specific data. However, industrial deployment of task-specific language models often encounters challenges in the availability of appropriate training samples. Taking meeting summarization as an instance, public datasets are scarce, and private corpora are also hard to obtain due to privacy issues or resource-demanding annotation. To improve meeting summarization in the absence of positively-rated (i.e., ``good'') samples, we propose Score Tuning, a cold-start tuning framework that leverages bad samples of distinguishable degrees to incrementally enhance the performance of summary generation without an initial presence of good samples. Our method utilizes asynchronous, numerical human feedback that measures the quality of generated summaries. Formulating data into triplets of (transcript, summary, score), our approach instructs a pre-trained model to learn the association between summary quality and human-rated scores, and hence to generate better summaries corresponding to higher scores. The experiment results show that our method is effective in improving meeting summarization on both English and Chinese corpora while requiring less annotated data and fewer training resources than existing alignment methods. Additionally, we preliminarily explore the transferability of our approach to machine translation tasks and demonstrate its potential for future development and use in other domains.

ICML Conference 2024 Conference Paper

Localizing Task Information for Improved Model Merging and Compression

  • Ke Wang
  • Nikolaos Dimitriadis
  • Guillermo Ortiz-Jiménez
  • François Fleuret
  • Pascal Frossard

Model merging and task arithmetic have emerged as promising scalable approaches to merge multiple single-task checkpoints into one multi-task model, but their applicability is reduced by significant performance loss. Previous works have linked these drops to interference in the weight space and erasure of important task-specific features. Instead, in this work we show that the information required to solve each task is still preserved after merging, as different tasks mostly use non-overlapping sets of weights. We propose TALL-masks, a method to identify these task supports given a collection of task vectors, and show that one can retrieve $>$99% of the single-task accuracy by applying our masks to the multi-task vector, effectively compressing the individual checkpoints. We study the statistics of intersections among the constructed masks and reveal the existence of selfish and catastrophic weights, i.e., parameters that are important exclusively to one task, and parameters that are irrelevant to all tasks but detrimental to multi-task fusion, respectively. For this reason, we propose Consensus Merging, an algorithm that eliminates such weights and improves the general performance of existing model merging approaches. Our experiments on vision and NLP benchmarks with up to 20 tasks show that Consensus Merging consistently improves existing approaches. Furthermore, our proposed compression scheme reduces storage from 57GB to 8.2GB while retaining 99.7% of the original performance.
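The mask-then-reconstruct idea above can be sketched in a few lines of NumPy. The selection rule below (keep a weight when the task's own update dominates its disagreement with the merged vector) is only an illustrative stand-in for the paper's calibrated criterion:

```python
import numpy as np

def tall_masks(task_vectors, lam=0.5):
    """Build one binary mask per task over the merged multi-task vector.
    Toy criterion: keep weights where |task update| >= lam * |disagreement|."""
    mtv = np.sum(task_vectors, axis=0)  # multi-task ("merged") vector
    masks = [np.abs(tv) >= lam * np.abs(mtv - tv) for tv in task_vectors]
    return mtv, masks

# When tasks touch disjoint sets of weights, masking the merged vector
# recovers each single-task checkpoint exactly.
base = np.zeros(2)
tvs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
mtv, masks = tall_masks(tvs)
recovered = [base + m * mtv for m in masks]
```

Storing only the merged vector plus one bitmask per task is what yields the compression the abstract reports.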

NeurIPS Conference 2024 Conference Paper

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

  • Ke Wang
  • Junting Pan
  • Weikang Shi
  • Zimu Lu
  • Houxing Ren
  • Aojun Zhou
  • Mingjie Zhan
  • Hongsheng Li

Recent advancements in Large Multimodal Models (LMMs) have shown promising results in mathematical reasoning within visual contexts, with models exceeding human-level performance on existing benchmarks such as MathVista. However, we observe significant limitations in the diversity of questions and breadth of subjects covered by these benchmarks. To address this issue, we present the MATH-Vision (MATH-V) dataset, a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions. Spanning 16 distinct mathematical disciplines and graded across 5 levels of difficulty, our dataset provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of LMMs. Through extensive experimentation, we unveil a notable performance gap between current LMMs and human performance on MATH-V, underscoring the imperative for further advancements in LMMs. Moreover, our detailed categorization allows for a thorough error analysis of LMMs, offering valuable insights to guide future research and development. The dataset is released at MathLLMs/MathVision

NeurIPS Conference 2024 Conference Paper

OPUS: Occupancy Prediction Using a Sparse Set

  • Jiabao Wang
  • Zhaojiang Liu
  • Qiang Meng
  • Liujiang Yan
  • Ke Wang
  • Jie Yang
  • Wei Liu
  • Qibin Hou

Occupancy prediction, aiming at predicting the occupancy status within a voxelized 3D environment, is quickly gaining momentum within the autonomous driving community. Mainstream occupancy prediction works first discretize the 3D environment into voxels, then perform classification on such dense grids. However, inspection of sample data reveals that the vast majority of voxels are unoccupied. Performing classification on these empty voxels leads to suboptimal allocation of computation, and reducing such empty voxels necessitates complex algorithm designs. To this end, we present a novel perspective on the occupancy prediction task: formulating it as a streamlined set prediction paradigm without the need for explicit space modeling or complex sparsification procedures. Our proposed framework, called OPUS, utilizes a transformer encoder-decoder architecture to simultaneously predict occupied locations and classes using a set of learnable queries. First, we employ the Chamfer distance loss to scale the set-to-set comparison problem to unprecedented magnitudes, making end-to-end training of such a model a reality. Subsequently, semantic classes are adaptively assigned using nearest-neighbor search based on the learned locations. In addition, OPUS incorporates a suite of non-trivial strategies to enhance model performance, including coarse-to-fine learning, consistent point sampling, and adaptive re-weighting. Finally, compared with current state-of-the-art methods, our lightest model achieves superior RayIoU on the Occ3D-nuScenes dataset at nearly 2× the FPS, while our heaviest model surpasses the previous best results by 6.1 RayIoU.
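The set-to-set loss named above is the symmetric Chamfer distance. A minimal NumPy version, omitting OPUS's class assignment and re-weighting, might look like:

```python
import numpy as np

def chamfer(pred, gt):
    """Symmetric Chamfer distance between two 3D point sets of shapes
    (N, 3) and (M, 3): mean nearest-neighbour distance in both directions."""
    # Pairwise distances via broadcasting: (N, 1, 3) - (1, M, 3) -> (N, M)
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Unlike Hungarian matching, this loss needs no one-to-one assignment, which is what lets the set comparison scale to very large query sets.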

ICML Conference 2024 Conference Paper

Pi-DUAL: Using privileged information to distinguish clean from noisy labels

  • Ke Wang
  • Guillermo Ortiz-Jiménez
  • Rodolphe Jenatton
  • Mark Collier
  • Efi Kokiopoulou
  • Pascal Frossard

Label noise is a pervasive problem in deep learning that often compromises the generalization performance of trained models. Recently, leveraging privileged information (PI) – information available only during training but not at test time – has emerged as an effective approach to mitigate this issue. Yet, existing PI-based methods have failed to consistently outperform their no-PI counterparts in terms of preventing overfitting to label noise. To address this deficiency, we introduce Pi-DUAL, an architecture designed to harness PI to distinguish clean from wrong labels. Pi-DUAL decomposes the output logits into a prediction term, based on conventional input features, and a noise-fitting term influenced solely by PI. A gating mechanism steered by PI adaptively shifts focus between these terms, allowing the model to implicitly separate the learning paths of clean and wrong labels. Empirically, Pi-DUAL achieves significant performance improvements on key PI benchmarks (e.g., +6.8% on ImageNet-PI), establishing a new state-of-the-art test set accuracy. Additionally, Pi-DUAL is a potent method for identifying noisy samples post-training, outperforming other strong methods at this task. Overall, Pi-DUAL is a simple, scalable and practical approach for mitigating the effects of label noise in a variety of real-world scenarios with PI.

AAAI Conference 2024 Conference Paper

Review-Enhanced Hierarchical Contrastive Learning for Recommendation

  • Ke Wang
  • Yanmin Zhu
  • Tianzi Zang
  • Chunyang Wang
  • Mengyuan Jing

Designed to establish potential relations and distill high-order representations, graph-based recommendation systems continue to show promising results by jointly modeling ratings and reviews. However, existing studies capture only simple review relations, failing to (1) fully explore hidden connections between users (or items), (2) filter out redundant information derived from reviews, and (3) model the behavioral association between rating and review interactions. To address these challenges, we propose a review-enhanced hierarchical contrastive learning framework, ReHCL. First, ReHCL constructs topic and semantic graphs to fully mine review relations from different views. Moreover, cross-view graph contrastive learning is used to enhance node representations and extract useful review knowledge. Meanwhile, we design neighbor-based positive sampling to capture the graph-structured similarity between the topic and semantic views, further performing efficient contrast and reducing redundant noise. Next, we propose cross-modal contrastive learning to match rating and review representations by exploring the association between ratings and reviews. Lastly, these two contrastive learning modes form a hierarchical contrastive learning task, which is applied to enhance the final recommendation task. Extensive experiments verify the superiority of ReHCL over state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Disentangled Representation for Causal Mediation Analysis

  • Ziqi Xu
  • Debo Cheng
  • Jiuyong Li
  • Jixue Liu
  • Lin Liu
  • Ke Wang

Estimating direct and indirect causal effects from observational data is crucial to understanding the causal mechanisms and predicting the behaviour under different interventions. Causal mediation analysis is a method that is often used to reveal direct and indirect effects. Deep learning shows promise in mediation analysis, but the current methods only assume latent confounders that affect treatment, mediator and outcome simultaneously, and fail to identify different types of latent confounders (e.g., confounders that only affect the mediator or outcome). Furthermore, current methods are based on the sequential ignorability assumption, which is not feasible for dealing with multiple types of latent confounders. This work aims to circumvent the sequential ignorability assumption and applies the piecemeal deconfounding assumption as an alternative. We propose the Disentangled Mediation Analysis Variational AutoEncoder (DMAVAE), which disentangles the representations of latent confounders into three types to accurately estimate the natural direct effect, natural indirect effect and total effect. Experimental results show that the proposed method outperforms existing methods and has strong generalisation ability. We further apply the method to a real-world dataset to show its potential application.

JBHI Journal 2023 Journal Article

Dual-Channel Neural Network for Atrial Fibrillation Detection From a Single Lead ECG Wave

  • Bo Fang
  • Junxin Chen
  • Yu Liu
  • Wei Wang
  • Ke Wang
  • Amit Kumar Singh
  • Zhihan Lv

With the dramatic progress of wearable devices, continuous collection of a single-lead ECG wave can be implemented in a comfortable fashion. Data mining on single-lead ECG waves is therefore attracting increasing attention, and atrial fibrillation (AF) detection is a hot topic. In this paper, we propose a dual-channel neural network for AF detection from a single-lead ECG wave. Two primary phases are included: a data preprocessing part followed by a dual-channel neural network. A two-stage denoising procedure is developed for data preprocessing, so as to tackle the high noise and disturbance that generally reside in ECG waves collected by wearable devices. Then the time-frequency spectrum and Poincaré plot of the denoised ECG signal are fed into the dual-channel neural network for feature extraction and AF detection. On the 2017 PhysioNet/CinC Challenge database, the F1 scores were 0.83, 0.90, and 0.75 for AF rhythm, normal rhythm, and other rhythms, respectively. The results validate the effectiveness of the proposed method for AF detection from a single-lead ECG wave and indicate its performance advantages over some state-of-the-art counterparts.

ICRA Conference 2023 Conference Paper

On Human Grasping and Manipulation in Kitchens: Automated Annotation, Insights, and Metrics for Effective Data Collection

  • Nathan Elangovan
  • Ricardo V. Godoy
  • Felipe Sanches
  • Ke Wang
  • Tom White
  • Patrick Jarvis
  • Minas Liarokapis

The advancement in robotic grasping and manipulation has elicited an increased research interest in the development of household robots capable of performing a plethora of complex tasks. These advancements require the shift of robotics research from a laboratory setting to dynamic and unstructured home environments. In this work, we focus on a comprehensive data collection and analysis of key attributes involved in the selection of grasping and manipulation strategies for the successful execution of kitchen tasks. We created an unprecedented dataset comprising over 7 hours of high-definition videos, analyzed to classify more than 10,000 kitchen activities, each annotated with 24 attributes. Machine learning techniques were employed to partially automate the annotation process by extracting grasp types, hand, and object information from the videos. The annotated dataset was analyzed using clustering algorithms to identify underlying patterns. This study also identifies key attributes and specific data that require focus during data collection based on inter-subject variability. The insights from this study can be used to improve the speed, quality, and effectiveness of data collection. They also help identify the strategies employed by humans for the execution of kitchen tasks and transfer the necessary skills to a robotic end-effector, enabling it to complete the tasks autonomously or collaborate with humans.

NeurIPS Conference 2023 Conference Paper

ResoNet: Noise-Trained Physics-Informed MRI Off-Resonance Correction

  • Alfredo De Goyeneche Macaya
  • Shreya Ramachandran
  • Ke Wang
  • Ekin Karasan
  • Joseph Y. Cheng
  • Stella X. Yu
  • Michael Lustig

Magnetic Resonance Imaging (MRI) is a powerful medical imaging modality that offers diagnostic information without harmful ionizing radiation. Unlike optical imaging, MRI sequentially samples the spatial Fourier domain (k-space) of the image. Measurements are collected in multiple shots, or readouts, and in each shot, data along a smooth trajectory is sampled. Conventional MRI data acquisition relies on sampling k-space row-by-row in short intervals, which is slow and inefficient. More efficient, non-Cartesian sampling trajectories (e.g., spirals) use longer data readout intervals, but are more susceptible to magnetic field inhomogeneities, leading to off-resonance artifacts. Spiral trajectories cause off-resonance blurring in the image, and the mathematics of this blurring resembles that of optical blurring, where magnetic field variation corresponds to depth and readout duration to aperture size. Off-resonance blurring is a system issue with a physics-based, accurate forward model. We present a physics-informed deep learning framework for off-resonance correction in MRI, which is trained exclusively on synthetic, noise-like data with representative marginal statistics. Our approach allows for fat/water separation and is compatible with parallel imaging acceleration. Through end-to-end training using synthetic randomized data (i.e., noise-like images, coil sensitivities, field maps), we train the network to reverse off-resonance effects across diverse anatomies and contrasts without retraining. We demonstrate the effectiveness of our approach through results on phantom and in-vivo data. This work has the potential to facilitate the clinical adoption of non-Cartesian sampling trajectories, enabling efficient, rapid, and motion-robust MRI scans. Code is publicly available at: https://github.com/mikgroup/ResoNet.

IROS Conference 2023 Conference Paper

Scalable, Intuitive Human to Robot Skill Transfer with Wearable Human Machine Interfaces: On Complex, Dexterous Tasks

  • Felipe Sanches
  • Geng Gao
  • Nathan Elangovan
  • Ricardo V. Godoy
  • Jayden Chapman
  • Ke Wang
  • Patrick Jarvis
  • Minas Liarokapis

The advent of collaborative industrial and household robotics has blurred the demarcation between the human and robot workspace. For robots to function efficiently alongside humans, new research must be conducted in dynamic environments as opposed to the traditional well-structured laboratory. In this work, we propose an efficient skill transfer methodology comprising intuitive interfaces, efficient optical tracking systems, and compliant control of robotic arm-hand systems. The lightweight wearable interfaces, mounted with robotic grippers and hands, allow the execution of dexterous activities in dynamic environments without restricting human dexterity. The fiducial and reflective markers mounted on the interfaces facilitate the extraction of positional and rotational information, allowing efficient trajectory tracking. As the tasks are performed using the mounted grippers and hands, gripper state information can be transferred directly. The hardware-agnostic nature and efficiency of the proposed interfaces and skill transfer methodology are demonstrated through the execution of complex tasks that require increased dexterity, such as writing and drawing.

AAAI Conference 2022 Conference Paper

Incorporating Item Frequency for Differentially Private Set Union

  • Ricardo Silva Carvalho
  • Ke Wang
  • Lovedeep Singh Gondara

We study the problem of releasing the set union of users’ items subject to differential privacy. Previous approaches consider only the set of items for each user as the input. We propose incorporating the item frequency, which is typically available in set union problems, to boost the utility of private mechanisms. However, using the global item frequency over all users would largely increase privacy loss. We propose to use the local item frequency of each user to approximate the global item frequency without incurring additional privacy loss. Local item frequency allows us to design greedy set union mechanisms that are differentially private, which is impossible for previous greedy proposals. Moreover, while all previous works have to use uniform sampling to limit the number of items each user would contribute to, our construction eliminates the sampling step completely and allows our mechanisms to consider all of the users’ items. Finally, we propose to transfer the knowledge of the global item frequency from a public dataset into our mechanism, which further boosts utility even when the public and private datasets are from different domains. We evaluate the proposed methods on multiple real-life datasets.

ICRA Conference 2022 Conference Paper

On Wearable, Lightweight, Low-Cost Human Machine Interfaces for the Intuitive Collection of Robot Grasping and Manipulation Data

  • Che-Ming Chang
  • Jayden Chapman
  • Ke Wang
  • Patrick Jarvis
  • Minas Liarokapis

Robot grasping and manipulation allow robots to interact with their environments and execute a plethora of complex tasks that require increased dexterity (e.g., open a door, push buttons, collect and transpose objects, etc.). Collecting data of such activities is of paramount importance as it allows roboticists to create new methods and models that will facilitate the execution of sophisticated tasks. In this paper, we propose new wearable, lightweight, low-cost human machine interfaces that improve the efficiency of the data collection process for both robotic grasping and manipulation by offering intuitive and simplified control of the employed robotic grippers and hands. In particular, two different types of interfaces are proposed: i) a handle-based forearm stabilized interface that uses a waist-linkage system to provide weight support for bulky and heavy robotic end-effectors and ii) a palm-mounted interface that can accommodate smaller and lightweight grippers and hands, offering more agility in the control and positioning of these devices. Both interfaces are equipped with appropriate sliders, joysticks, and buttons that facilitate the control of the multiple degrees of freedom of the employed end-effectors and appropriate cameras that allow for object detection, identification, and object pose estimation.

NeurIPS Conference 2022 Conference Paper

Robust Learning against Relational Adversaries

  • Yizhen Wang
  • Mohannad Alhanahnah
  • Xiaozhu Meng
  • Ke Wang
  • Mihai Christodorescu
  • Somesh Jha

Test-time adversarial attacks have posed serious challenges to the robustness of machine-learning models, and in many settings the adversarial perturbation need not be bounded by small $\ell_p$-norms. Motivated by attacks in program analysis and security tasks, we investigate $\textit{relational adversaries}$, a broad class of attackers who create adversarial examples in a reflexive-transitive closure of a logical relation. We analyze the conditions for robustness against relational adversaries and investigate different levels of robustness-accuracy trade-off due to various patterns in a relation. Inspired by the insights, we propose $\textit{normalize-and-predict}$, a learning framework that leverages input normalization to achieve provable robustness. The framework solves the pain points of adversarial training against relational adversaries and can be combined with adversarial training for the benefits of both approaches. Guided by our theoretical findings, we apply our framework to source code authorship attribution and malware detection. Results of both tasks show our learning framework significantly improves the robustness of models against relational adversaries. In the process, it outperforms adversarial training, the most noteworthy defense mechanism, by a wide margin.
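A minimal sketch of the normalize-and-predict idea, with a hypothetical whitespace-rewriting relation standing in for the paper's program transformations: every input in an equivalence class is mapped to one canonical representative before classification, so the model's prediction is invariant under the relation by construction.

```python
def normalize(program: str) -> str:
    """Map all programs that differ only in whitespace (a toy relational
    adversary) to a single canonical representative."""
    return " ".join(program.split())

def predict(model, program: str):
    # Classify the canonical form, so every adversarial variant in the
    # equivalence class receives the same prediction.
    return model(normalize(program))

# Any classifier run behind normalize() is invariant to this relation:
toy_model = lambda s: len(s) % 2
same = predict(toy_model, "int  main( ){}") == predict(toy_model, "int\tmain( ){}")
```

The design trade-off the paper analyzes is that normalization collapses inputs the relation connects, which buys provable robustness at the cost of any accuracy that depended on distinguishing them.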

NeurIPS Conference 2021 Conference Paper

Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation

  • Ke Wang
  • Vidya Muthukumar
  • Christos Thrampoulidis

The growing literature on "benign overfitting" in overparameterized models has been mostly restricted to regression or binary classification settings; however, most success stories of modern machine learning have been recorded in multiclass settings. Motivated by this discrepancy, we study benign overfitting in multiclass linear classification. Specifically, we consider the following popular training algorithms on separable data: (i) empirical risk minimization (ERM) with cross-entropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with least-squares loss, which converges to the min-norm interpolating (MNI) solution; and, (iii) the one-vs-all SVM classifier. Our first key finding is that under a simple sufficient condition, all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. When the data is generated from Gaussian mixtures or a multinomial logistic model, this condition holds under high enough effective overparameterization. Second, we derive novel error bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply.
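The min-norm interpolating (MNI) least-squares solution referenced above has a closed form in the overparameterized regime. A quick numerical check (illustrative, with binary labels rather than the paper's multiclass setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # d > n: overparameterized
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Min-norm solution of X w = y:  w = X^T (X X^T)^{-1} y
w = X.T @ np.linalg.solve(X @ X.T, y)
```

Because the system is underdetermined, infinitely many weight vectors fit the labels; this is the one with smallest Euclidean norm, and it interpolates every training point exactly.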

AAAI Conference 2021 Conference Paper

Bridging the Domain Gap: Improve Informal Language Translation via Counterfactual Domain Adaptation

  • Ke Wang
  • Guandan Chen
  • Zhongqiang Huang
  • Xiaojun Wan
  • Fei Huang

Despite the near-human performances already achieved on formal texts such as news articles, neural machine translation still has difficulty dealing with "user-generated" texts that exhibit diverse linguistic phenomena but lack large-scale, high-quality parallel corpora. To address this problem, we propose a counterfactual domain adaptation method to better leverage both large-scale source-domain data (formal texts) and small-scale target-domain data (informal texts). Specifically, by considering effective counterfactual conditions (the concatenations of source-domain texts and the target-domain tag), we construct counterfactual representations to fill the sparse latent space of the target domain caused by the small amount of data, that is, bridging the gap between the source-domain data and the target-domain data. Experiments on English-to-Chinese and Chinese-to-English translation tasks show that our method outperforms the base model trained only on the informal corpus by a large margin, and consistently surpasses different baseline methods by +1.12 to +4.34 BLEU points on different datasets. Furthermore, we show that our method achieves competitive performance on cross-domain language translation on four language pairs.

IJCAI Conference 2020 Conference Paper

Automatic Emergency Diagnosis with Knowledge-Based Tree Decoding

  • Ke Wang
  • Xuyan Chen
  • Ning Chen
  • Ting Chen

Automatic diagnosis based on clinical notes is critical especially in the emergency department, where a fast and professional result is vital in assuring proper and timely treatment. Previous works formalize this task as plain text classification and fail to utilize the medically significant tree structure of the International Classification of Diseases (ICD) coding system. Moreover, external medical knowledge has rarely been used before; we explore it by extracting relevant materials from Wikipedia or Baidupedia. In this paper, we propose a knowledge-based tree decoding model (K-BTD), whose inference procedure is a top-down decoding process from the root node to leaf nodes. The stepwise inference procedure enables the model to provide support for the decision at each step, which visualizes the diagnosis procedure and adds to the interpretability of the final predictions. Experiments on real-world data from the emergency department of a large-scale hospital indicate that the proposed model outperforms all baselines in both micro-F1 and macro-F1, and reduces the semantic distance dramatically.

ICLR Conference 2020 Conference Paper

SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference

  • Lasse Espeholt
  • Raphaël Marinier
  • Piotr Stanczyk
  • Ke Wang
  • Marcin Michalski

We present a modern scalable reinforcement learning agent called SEED (Scalable, Efficient Deep-RL). By effectively utilizing modern accelerators, we show that it is not only possible to train on millions of frames per second but also to lower the cost of experiments compared to current methods. We achieve this with a simple architecture that features centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2 (Q-learning), and is evaluated on Atari-57, DeepMind Lab and Google Research Football. We improve the state of the art on Football and are able to reach state of the art on Atari-57 twice as fast in wall-time. For the scenarios we consider, a 40% to 80% cost reduction for running experiments is achieved. The implementation along with experiments is open-sourced so results can be reproduced and novel ideas tried out.
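
The centralized-inference idea — actors send raw observations to a central server, which runs one batched forward pass and sends actions back — can be pictured with a toy example. This is a minimal sketch, not the open-sourced SEED implementation; the linear "policy" and all names are hypothetical:

```python
import numpy as np

def central_inference(policy_weights, observations):
    """SEED-style sketch: the server batches observations from many
    actors into one forward pass instead of each actor running its
    own copy of the policy."""
    logits = observations @ policy_weights   # one batched matmul on the accelerator
    return logits.argmax(axis=1)             # greedy action per actor

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                  # toy linear policy: 4 obs dims, 3 actions
obs_batch = rng.normal(size=(8, 4))          # observations collected from 8 actors
actions = central_inference(W, obs_batch)    # one action per actor
```

The batching is what lets an accelerator amortize inference cost across actors; the real system adds an optimized communication layer around this loop.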

NeurIPS Conference 2019 Conference Paper

Controllable Unsupervised Text Attribute Transfer via Editing Entangled Latent Representation

  • Ke Wang
  • Hang Hua
  • Xiaojun Wan

Unsupervised text attribute transfer automatically transforms a text to alter a specific attribute (e.g., sentiment) without using any parallel data, while simultaneously preserving its attribute-independent content. The dominant approaches try to model the content-independent attribute separately, e.g., learning different attributes' representations or using multiple attribute-specific decoders. However, this may lead to inflexibility from the perspective of controlling the degree of transfer or transferring over multiple aspects at the same time. To address the above problems, we propose a more flexible unsupervised text attribute transfer framework which replaces the process of modeling the attribute with minimal editing of latent representations based on an attribute classifier. Specifically, we first propose a Transformer-based autoencoder to learn an entangled latent representation for a discrete text, then we transform the attribute transfer task into an optimization problem and propose the Fast-Gradient-Iterative-Modification algorithm to edit the latent representation until it conforms to the target attribute. Extensive experimental results demonstrate that our model achieves very competitive performance on three public data sets. Furthermore, we also show that our model can not only control the degree of transfer freely but also transfer over multiple aspects at the same time.
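
The gradient-iterative editing step can be illustrated with a toy latent editor: ascend the gradient of an attribute classifier's target probability with respect to the latent vector until it crosses a confidence threshold. A minimal sketch under strong simplifications — a hypothetical logistic classifier over the latent space, not the paper's Transformer autoencoder:

```python
import numpy as np

def edit_latent(z, w, b, target=1, step=0.5, max_iters=100, threshold=0.9):
    """FGIM-style sketch: nudge latent vector z along the gradient of
    log P(target attribute | z) under a logistic classifier (w, b)
    until the target probability exceeds `threshold`."""
    z = z.astype(float).copy()
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))     # P(attribute = 1 | z)
        prob = p if target == 1 else 1.0 - p
        if prob >= threshold:
            break
        # gradient of log P(target | z) for the logistic model
        sign = 1.0 if target == 1 else -1.0
        z += step * sign * (1.0 - prob) * w
    return z

rng = np.random.default_rng(0)
w = rng.normal(size=8)          # hypothetical attribute-classifier weights
b = 0.0
z0 = -w                          # start clearly on the negative side
z1 = edit_latent(z0, w, b, target=1)
p1 = 1.0 / (1.0 + np.exp(-(z1 @ w + b)))
```

Because the edit is iterative, stopping earlier (a lower threshold) yields a weaker transfer — which is the mechanism behind the controllable degree of transfer mentioned above.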

AAAI Conference 2019 Conference Paper

Cost-Sensitive Learning to Rank

  • Ryan McBride
  • Ke Wang
  • Zhouyang Ren
  • Wenyuan Li

We formulate the Cost-Sensitive Learning to Rank problem of learning to prioritize limited resources to mitigate the most costly outcomes. We develop improved ranking models to solve this problem, as verified by experiments in diverse domains such as forest fire prevention, crime prevention, and preventing storm caused outages in electrical networks.

IJCAI Conference 2019 Conference Paper

Discrete Binary Coding based Label Distribution Learning

  • Ke Wang
  • Xin Geng

Label Distribution Learning (LDL) is a general learning paradigm in machine learning, which includes both single-label learning (SLL) and multi-label learning (MLL) as its special cases. Recently, many LDL algorithms have been proposed to handle different application tasks such as facial age estimation, head pose estimation, and visual sentiment distribution prediction. However, the training time complexity of most existing LDL algorithms is too high, which makes them inapplicable to large-scale LDL. In this paper, we propose a novel LDL method to address this issue, termed Discrete Binary Coding based Label Distribution Learning (DBC-LDL). Specifically, we design an efficient discrete coding framework to learn binary codes for instances. Furthermore, both the pair-wise semantic similarities and the original label distributions are integrated into this framework to learn highly discriminative binary codes. In addition, a fast approximate nearest neighbor (ANN) search strategy is utilized to predict label distributions for testing instances. Experimental results on five real-world datasets demonstrate its superior performance over several state-of-the-art LDL methods at a lower time cost.

NeurIPS Conference 2019 Conference Paper

Exact Gaussian Processes on a Million Data Points

  • Ke Wang
  • Geoff Pleiss
  • Jacob Gardner
  • Stephen Tyree
  • Kilian Weinberger
  • Andrew Gordon Wilson

Gaussian processes (GPs) are flexible non-parametric models, with a capacity that grows with the available data. However, computational constraints with standard inference procedures have limited exact GPs to problems with fewer than about ten thousand training points, necessitating approximations for larger datasets. In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points, a task previously thought to be impossible with current computing hardware. Moreover, our approach is generally applicable, without constraints to grid data or specific kernel classes. Enabled by this scalability, we perform the first-ever comparison of exact GPs against scalable GP approximations on datasets with $10^4 \! -\! 10^6$ data points, showing dramatic performance improvements.
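
The core trick — accessing the kernel matrix only through matrix-vector products — can be sketched with a plain conjugate-gradient solver for the GP linear system K x = y. This toy version still forms the kernel inside the matvec for simplicity; in the paper's setting that product is partitioned and distributed across GPUs. Function names and the tiny RBF setup are illustrative only:

```python
import numpy as np

def rbf_matvec(X, v, lengthscale=1.0, noise=0.1):
    """Toy matvec with the noise-added RBF kernel. For illustration it
    forms K densely; the distributed version would compute K @ v in
    partitioned blocks without materializing K."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq / lengthscale**2) + noise * np.eye(len(X))
    return K @ v

def conjugate_gradients(matvec, y, tol=1e-8, max_iters=200):
    """Solve K x = y using only matrix-vector products with K."""
    x = np.zeros_like(y)
    r = y - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Kp = matvec(p)
        alpha = rs / (p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
x = conjugate_gradients(lambda v: rbf_matvec(X, v), y)
residual = np.linalg.norm(rbf_matvec(X, x) - y)
```

Since CG touches K only through `matvec`, the memory bottleneck moves from storing an n × n matrix to streaming its blocks — the property that lets exact GPs scale past a million points.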

IJCAI Conference 2018 Conference Paper

Binary Coding based Label Distribution Learning

  • Ke Wang
  • Xin Geng

Label Distribution Learning (LDL) is a novel learning paradigm in machine learning, which assumes that an instance is labeled by a distribution over all labels, rather than labeled by a logic label or some logic labels. Thus, LDL can model the description degree of all possible labels to an instance. Although many LDL methods have been put forward to deal with different application tasks, most existing methods suffer from the scalability issue. In this paper, a scalable LDL framework named Binary Coding based Label Distribution Learning (BC-LDL) is proposed for large-scale LDL. The proposed framework includes two parts, i.e., binary coding and label distribution generation. In the binary coding part, the learning objective is to generate the optimal binary codes for the instances. We integrate the label distribution information of the instances into a binary coding procedure, leading to high-quality binary codes. In the label distribution generation part, given an instance, the k nearest training instances in the Hamming space are searched and the mean of the label distributions of all the neighboring instances is calculated as the predicted label distribution. Experiments on five benchmark datasets validate the superiority of BC-LDL over several state-of-the-art LDL methods.
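
The prediction step described above — averaging the label distributions of the k Hamming-nearest training codes — is simple enough to sketch directly (toy data; the function name is hypothetical):

```python
import numpy as np

def predict_distribution(query_code, train_codes, train_dists, k=3):
    """BC-LDL-style prediction sketch: find the k training binary codes
    nearest to the query in Hamming distance and return the mean of
    their label distributions."""
    hamming = (train_codes != query_code).sum(axis=1)   # per-row Hamming distance
    nearest = np.argsort(hamming, kind="stable")[:k]
    return train_dists[nearest].mean(axis=0)

train_codes = np.array([[0, 0, 1, 1],
                        [0, 1, 1, 1],
                        [1, 1, 0, 0],
                        [1, 0, 0, 0]])
train_dists = np.array([[0.7, 0.2, 0.1],    # one distribution per instance
                        [0.6, 0.3, 0.1],
                        [0.1, 0.2, 0.7],
                        [0.2, 0.2, 0.6]])
pred = predict_distribution(np.array([0, 0, 1, 1]), train_codes, train_dists, k=2)
# averaging valid distributions keeps the prediction a valid distribution
```

Because averaging probability vectors preserves normalization, no extra renormalization step is needed at prediction time.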

IJCAI Conference 2018 Conference Paper

SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks

  • Ke Wang
  • Xiaojun Wan

Generating texts of different sentiment labels is getting more and more attention in the area of natural language generation. Recently, Generative Adversarial Net (GAN) has shown promising results in text generation. However, the texts generated by GAN usually suffer from the problems of poor quality, lack of diversity and mode collapse. In this paper, we propose a novel framework - SentiGAN, which has multiple generators and one multi-class discriminator, to address the above problems. In our framework, multiple generators are trained simultaneously, aiming at generating texts of different sentiment labels without supervision. We propose a penalty based objective in the generators to force each of them to generate diversified examples of a specific sentiment label. Moreover, the use of multiple generators and one multi-class discriminator can make each generator focus on generating its own examples of a specific sentiment label accurately. Experimental results on four datasets demonstrate that our model consistently outperforms several state-of-the-art text generation methods in the sentiment accuracy and quality of generated texts.

IROS Conference 2017 Conference Paper

An underwater electrosensor for identifying objects of similar volume and aspect ratio using convolutional neural network

  • Ke Wang
  • Khac Duc Do 0001
  • Lei Cui 0005

Underwater electrosense is bio-inspired by weakly electric fishes that use an electric field to see objects in the water. Current studies on engineering electrosense focus on designing sophisticated sensors and algorithms for emulating biological functions including localization and identification. This work aimed to develop a planar sensor equipped with a dense electrode array that is capable of providing accurate and dense data for identifying objects of similar volume and aspect ratio, which has been a challenge in underwater sensing. After the sensor design and implementation are presented, a convolutional neural network (CNN), which is widely used in digital image recognition, is trained using both simulation and experimental data. In the simulation, the overall success rate on identifying the sphere, cube, and rod is 92.6% with a 28 × 28 electrode array. In preliminary experimental tests, a sensor with a 16 × 16 electrode array achieved an overall success rate of 90.4% on identifying a sphere and a rod.

IROS Conference 2016 Conference Paper

A discrete dipole approximation approach to underwater active electrosense problems

  • Ke Wang
  • Lei Cui 0005
  • Khac Duc Do 0001

Weakly electric fish use a self-established electric field to sense underwater environments that may be cluttered and turbid. Previous works on building artificial counterparts are limited to the simplest cases, as no analytical solutions exist under complex boundary conditions. Universal numerical approaches like the Finite Element Method (FEM) and Boundary Element Method (BEM) suffer from a lengthy meshing process and heavy computational burden. In this paper, discrete dipole approximation (DDA), which is widely used in light scattering and absorption problems, is for the first time applied to underwater electrosense. This approach is lightweight, flexible, and computationally efficient compared with FEM. It was simulated in electric fields excited by parallel-plate electrodes and spherical electrodes of a simplified robotic model. A constrained unscented Kalman filter (CUKF) was further utilized to localize the position and identify the size of an invading cube. Comparison with FEM indicates that the differences for a cuboidal object in two orthogonal positions were 7.10% and 10.46% respectively, and the difference in size was 11.82%. These results were achieved at a cost of less than 1% of the computational effort of the FEM. The simulation results proved the proposed approach effective and laid a solid foundation for real-time underwater active electrosense in more general environments.

IROS Conference 2016 Conference Paper

An underwater electrosensory membrane bio-inspired by weakly electric fish

  • Ke Wang
  • Lei Cui 0005
  • Khac Duc Do 0001

An artificial sensory system is promising for navigating cluttered and turbid underwater environments by achieving functions similar to the biological electrosense of weakly electric fish. In this paper we designed an electrosensory membrane that can operate in a full 3-dimensional mode. Algorithms for object localization were also designed and tested based on numerical methods of electric field forward simulation. We combined a statistical learning method, training a multi-layer neural network, with a probabilistic approach, applying a constrained unscented Kalman filter (CUKF). This exploits the merits of fast estimation and a precise signal matching process. Experimental results showed that detection and localization with the reported sensor were quick and accurate, with errors of around 10 mm using one-step neural network mapping and about 5 mm at close range using the CUKF. This work demonstrated the effectiveness of the proposed electrosensory membrane and algorithms.

IJCAI Conference 2016 Conference Paper

Dimensionally Guided Synthesis of Mathematical Word Problems

  • Ke Wang
  • Zhendong Su

Mathematical Word Problems (MWPs) are important for training students' literacy and numeracy skills. Traditionally MWPs have been manually designed; an effective automated MWP generator can significantly benefit education and research. The goal of this work is to efficiently synthesize MWPs that are authentic (i.e., similar to manually written problems), diverse (i.e., covering a wide range of mathematical tasks), and configurable (i.e., varying difficulty levels and solution characteristics). This is challenging because a generated problem needs to exhibit both a well-founded mathematical structure and an easily understood natural language story. Our key insight is to leverage the important role that dimensional units play in MWPs, both textually and symbolically. We first synthesize a dimensionally consistent equation and then compose the natural language story via a bottom-up traversal of the equation tree. We have realized our technique and extensively evaluated its efficiency and effectiveness. Results show that the system can generate hundreds of valid problems per second with varying levels of difficulty. More importantly, we show, via a user study with 30 students from a local middle school, that the generated problems are statistically indistinguishable from actual textbook problems for practice and examination.
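
The two-stage pipeline — first fix a dimensionally consistent equation, then render the story — can be pictured with a single hard-coded template (km/h × h → km). This is an illustration only; the real system synthesizes and traverses an equation tree rather than filling one template:

```python
def synthesize_mwp(speed_kmh, hours):
    """Toy dimensionally guided synthesis: the equation is built from
    quantities whose units compose consistently (km/h * h -> km), and
    the story is rendered from the same quantities."""
    distance_km = speed_kmh * hours   # units: km/h * h = km
    story = (f"A train travels at {speed_kmh} km/h for {hours} hours. "
             f"How far does it go?")
    return story, distance_km

story, answer = synthesize_mwp(60, 3)   # answer carries the unit km
```

The point of the unit discipline is that any story rendered from a unit-consistent equation is guaranteed to have a mathematically well-founded solution.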

AAAI Conference 2015 Conference Paper

Are Features Equally Representative? A Feature-Centric Recommendation

  • Chenyi Zhang
  • Ke Wang
  • Ee-Peng Lim
  • Qinneng Xu
  • Jianling Sun
  • Hongkun Yu

Typically a user prefers an item (e.g., a movie) because she likes certain features of the item (e.g., director, genre, producer). This observation motivates us to consider a feature-centric approach to item recommendation: instead of directly predicting the rating on items, we predict the rating on the features of items, and use such ratings to derive the rating on an item. This approach offers several advantages over the traditional item-centric approach: it incorporates more information about why a user chooses an item, it generalizes better due to the denser feature rating data, and it explains the prediction of item ratings through the predicted feature ratings. Another contribution is turning a principled item-centric solution into a feature-centric solution, instead of inventing a new algorithm that is feature-centric. This approach maximally leverages previous research. We demonstrate this approach by turning the traditional item-centric latent factor model into a feature-centric solution and demonstrate its superiority over item-centric approaches.

IJCAI Conference 2015 Conference Paper

Automated Geometry Theorem Proving for Human-Readable Proofs

  • Ke Wang
  • Zhendong Su

Geometry reasoning and proof form a major and challenging component in the K-12 mathematics curriculum. Although several computerized systems exist that help students learn and practice general geometry concepts, they do not target geometry proof problems, which are more advanced and difficult. Powerful geometry theorem provers also exist; however, they typically employ advanced algebraic methods and generate complex, difficult-to-understand proofs, and thus do not meet general K-12 students’ educational needs. This paper tackles these weaknesses of prior systems by introducing a geometry proof system, iGeoTutor, capable of generating human-readable elementary proofs, i.e., proofs using standard Euclidean axioms. We have gathered 77 problems in total from various sources, including ones unsolvable by other systems and from math competitions. iGeoTutor solves all but two problems in under two minutes each and, more importantly, demonstrates a much more effective and intelligent proof search than prior systems. We have also conducted a pilot study with 12 high school students, and the results show that iGeoTutor provides a clear benefit in helping students learn geometry proofs. We are in active discussions with Khan Academy and local high schools for possible adoption of iGeoTutor in real learning environments. Video demo: https://www.youtube.com/watch?v=KL0dUb6hKxU

IJCAI Conference 2015 Conference Paper

Automatic Generation of Raven's Progressive Matrices

  • Ke Wang
  • Zhendong Su

Raven’s Progressive Matrices (RPMs) are a popular family of general intelligence tests, and provide a non-verbal measure of a test subject’s reasoning abilities. Traditionally RPMs have been manually designed. To make them readily available for both practice and examination, we tackle the problem of automatically synthesizing RPMs. Our goal is to efficiently generate a large number of RPMs that are authentic (i.e., similar to manually written problems), interesting (i.e., diverse in terms of difficulty), and well-formed (i.e., unambiguous). The main technical challenges are: How to formalize RPMs to accommodate their seemingly enormous diversity, and how to define and enforce their validity? To this end, we (1) introduce an abstract representation of RPMs using first-order logic, and (2) restrict instantiations to only valid RPMs. We have realized our approach and evaluated its efficiency and effectiveness. We show that our system can generate hundreds of valid problems per second with varying levels of difficulty. More importantly, we show, via a user study with 24 participants, that the generated problems are statistically indistinguishable from actual problems. This work is an exciting instance of how logic and reasoning may aid general learning.

AAAI Conference 2011 Conference Paper

CCRank: Parallel Learning to Rank with Cooperative Coevolution

  • Shuaiqiang Wang
  • Byron Gao
  • Ke Wang
  • Hady Lauw

We propose CCRank, the first parallel algorithm for learning to rank, targeting simultaneous improvement in learning accuracy and efficiency. CCRank is based on cooperative coevolution (CC), a divide-and-conquer framework that has demonstrated high promise in function optimization for problems with large search space and complex structures. Moreover, CC naturally allows parallelization of sub-solutions to the decomposed subproblems, which can substantially boost learning efficiency. With CCRank, we investigate parallel CC in the context of learning to rank. Extensive experiments on benchmarks in comparison with the state-of-the-art algorithms show that CCRank gains in both accuracy and efficiency.

AAAI Conference 2006 Conference Paper

Classification Spanning Private Databases

  • Ke Wang
  • Rong She

In this paper, we study the classification problem involving information spanning multiple private databases. The privacy challenges lie in the facts that data cannot be collected in one place and the classifier itself may disclose private information. We present a novel solution that builds the same decision tree classifier as if data are collected in a central place, but preserves the privacy of participating sites.

IJCAI Conference 1997 Conference Paper

Minimum Splits Based Discretization for Continuous Features

  • Ke Wang
  • Han Chong Goh

Discretization refers to splitting the range of continuous values into intervals so as to provide useful information about classes. This is usually done by minimizing a goodness measure, subject to constraints such as the maximal number of intervals, the minimal number of examples per interval, or some stopping criterion for splitting. We take a different approach by searching for minimum splits that minimize the number of intervals with respect to a threshold of impurity (i.e., badness). We propose a "total entropy" motivated selection of the "best" split from minimum splits, without requiring additional constraints. Experiments show that the proposed method produces better decision trees.
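
One way to picture impurity-thresholded interval formation is a greedy sweep over the sorted values that closes the current interval whenever adding the next example would push its entropy past the threshold. This is an illustrative sketch only, not the paper's minimum-split search, which globally minimizes the number of intervals:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def greedy_split(values, labels, impurity_threshold=0.5):
    """Greedy sketch: sweep values in sorted order; start a new interval
    when the next example would raise the current interval's entropy
    above `impurity_threshold`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    intervals, current = [], []
    for i in order:
        if current and entropy([labels[j] for j in current] + [labels[i]]) > impurity_threshold:
            intervals.append([values[j] for j in current])
            current = []
        current.append(i)
    intervals.append([values[j] for j in current])
    return intervals

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
splits = greedy_split(values, labels)
```

On this toy data the sweep recovers the two pure intervals; the paper's contribution is choosing among all splits meeting the impurity threshold the one judged best by total entropy.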