Arrow Research search

Author name cluster

Bo Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

65 papers
2 author rows

Possible papers

65

AAAI Conference 2026 Conference Paper

Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models

  • Bo Wang
  • Junzhuo Li
  • Hong Chen
  • Yuanlin Chu
  • Yuxuan Fan
  • Xuming Hu

Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training—and how this process differs from dense architectures—remains unknown. To address this issue, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes log-probability increase across neurons. We present a time-resolved comparison of knowledge acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M (~ 5.0T tokens) and 600K (~ 2.5T tokens) training steps, respectively. Our experiments uncover three patterns: (1) Low-entropy backbone. The top approximately 1% of MoE neurons capture over 45% of positive updates, forming a high-utility core, which is absent in the dense baseline. (2) Early consolidation. The MoE model locks into a stable importance profile within 50% for the dense model, showing that sparsity fosters distributed—rather than brittle—knowledge storage. These patterns collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.

TIST Journal 2026 Journal Article

GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse

  • Hongzhan Lin
  • Ziyang Luo
  • Bo Wang
  • Ruichao Yang
  • Jing Ma

The exponential growth of social media has profoundly transformed how information is created, disseminated, and absorbed, exceeding any precedent in the digital age. Regrettably, this explosion has also spawned a significant increase in the online abuse of memes. Evaluating the negative impact of memes is notably challenging, owing to their often subtle and implicit meanings, which are not directly conveyed through the overt text and image. In light of this, Large Multimodal Models (LMMs) have emerged as a focal point of interest due to their remarkable capabilities in handling diverse multimodal tasks. In response to this development, our article aims to thoroughly examine the capacity of various LMMs (e.g., GPT-4V, LLaVA, and Qwen-VL) to discern and respond to the nuanced aspects of social abuse manifested in memes. We introduce the comprehensive meme benchmark, GOAT-Bench, comprising over 6K varied memes encapsulating themes, such as implicit hate speech, sexism, and cyberbullying. Utilizing GOAT-Bench, we delve into the ability of LMMs to accurately assess hatefulness, misogyny, offensiveness, sarcasm, and harmful content. Our extensive experiments across a range of LMMs reveal that current models still exhibit a deficiency in safety awareness, showing insensitivity to various forms of implicit abuse. We posit that this shortfall represents a critical impediment to the realization of safe artificial intelligence. The GOAT-Bench and accompanying resources are publicly accessible at https://goatlmm.github.io/, contributing to ongoing research in this vital field.

JBHI Journal 2026 Journal Article

Medical Knowledge-Driven Contrastive Learning for Similar Patient Retrieval

  • Fanqing Meng
  • Chong Feng
  • Ge Shi
  • Xia Liu
  • Bo Wang
  • Kaiyuan Zhang
  • Yan Zhuang

Similar patient retrieval is a fundamental task in medical informatics, aiming to identify patients with similar clinical characteristics to assist in diagnosis and treatment plan recommendation. While traditional methods relying on lexical features or medical ontologies often fail to capture implicit semantic relationships, recent advancements in dense retrieval methods powered by deep learning have shown promise yet face challenges in adapting to specific tasks such as similar patient retrieval. To address these limitations, we propose a medical knowledge-driven contrastive learning approach to enhance the representation capacity of general-purpose embedding models for medical text. Specifically, our approach introduces a novel negative sampling strategy leveraging International Classification of Diseases (ICD) codes to identify hard negatives. However, due to data imbalance issues, this method struggles to adequately mine negative examples. To overcome this limitation, we develop an external knowledge-based negative sampling method that incorporates both statistical and ambiguous knowledge, thereby enhancing the model's ability to differentiate between fine-grained medical conditions and complex clinical scenarios. We then integrate these methods into a contrastive learning framework to train more robust patient representations. Extensive experiments on real-world medical datasets show that our proposed method achieves significant improvements over existing state-of-the-art baseline models.

AAAI Conference 2026 Conference Paper

QueryAligner: Customizing User Query to Match LLMs Preferences for Better Intent Recognition

  • Yunlong Ma
  • Bo Wang
  • Yihong Tang
  • Zifei Yu
  • Chenyun Xue
  • Gaoke Zhang
  • Yuexian Hou

The interpretative efficacy of large language models (LLMs) fundamentally hinges on the intricate alignment between user inputs and model-specific linguistic priors. Existing methodologies predominantly employ static input optimization strategies, failing to account for the empirically observed divergence in linguistic preference spaces across distinct LLM architectures, including variations in syntactic parsing heuristics, semantic grounding mechanisms, and knowledge retrieval pathways. We propose QueryAligner, an adaptive rewriting system implementing dynamic model-aware input transformation through architecture-specific preference modeling. Our framework introduces two pivotal innovations: 1) A dual-phase optimization engine integrating supervised learning on reverse-engineered cross-architectural training data with reinforcement learning driven by multi-objective reward signals, ensuring simultaneous preservation of semantic integrity and maximization of target model compatibility; 2) An architecture-informed rewriting protocol that automatically discovers latent alignment patterns encoded within distinct LLMs' parametric configurations. Experimental results demonstrate that our method achieves superior performance compared to conventional input optimization techniques.

AAAI Conference 2026 Short Paper

Style-First Authorship Verification for Academic Integrity in the Generative AI Era (Student Abstract)

  • Jun Jang
  • Thai Le
  • Bo Wang

With the rise of generative artificial intelligence (GenAI), academic dishonesty in classrooms has skyrocketed, yet the existing solutions for detecting such dishonesty often fall short. Standard "AI detectors" merely analyze one text at a time, failing to account for students' previous writings, which risks erroneous predictions. Meanwhile, existing token-based authorship verification (AV) models fail to analyze the nuances in writing styles that truly distinguish authorship. To fill this existing gap, we propose a novel AV framework that combines token-level stylometric features (e.g., POS tag patterns) with handcrafted stylistic features (e.g., sentence structure variation) to construct a comprehensive feature set. Using both benchmark corpora and real-world high school student essays, we trained multiple machine learning classifiers using the proposed feature set. Our initial experiments show that our approach outperforms the standard token-only baselines by over 25%, while offering interpretable, style-based insights. These preliminary results highlight the importance of nuanced stylistic features and suggest that a holistic AV system can provide educators with more reliable and transparent detection tools. Looking ahead, we plan to extend this work with large language models and multi-agent approaches to further enhance robustness and adaptability.

AAAI Conference 2026 Conference Paper

TGCA-LLM: Time-Aware Graph-Text Contrastive Alignment for Enhancing LLMs in Temporal Knowledge Graph Completion

  • Zexuan Wan
  • Bo Wang
  • Kuofei Fang
  • Bin Wu

Temporal Knowledge Graph Completion (TKGC) aims to infer missing facts by modeling historical events and latent temporal dependencies in Temporal Knowledge Graphs (TKGs). Recently, TKGC methods that integrate graph embeddings into Large Language Models (LLMs) have shown great promise by leveraging the structural information of TKGs together with the powerful reasoning capabilities of LLMs. However, these embedding-based methods are limited by suboptimal graph representations due to noise and long-tail issues in real-world scenarios, and insufficient cross-modal alignment between graph and language, hindering LLMs' ability to fully capture the temporal and structural information of TKGs. To address these issues, we propose TGCA-LLM, a novel embedding-based framework for TKGC. Specifically, TGCA-LLM first employs time-aware contrastive learning to align fact texts with graph structures in the temporal dimension, generating robust graph embeddings and establishing initial cross-modal alignment. Then, through a two-stage tuning process, it enables LLMs to gradually acquire structural and temporal knowledge from graph embeddings while enhancing their cross-modal reasoning capabilities in TKGC. Extensive experiments on three widely used real-world benchmarks demonstrate that TGCA-LLM outperforms state-of-the-art (SOTA) baselines by at least 8.7% MRR, highlighting its effectiveness.
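For readers unfamiliar with the metric quoted above, mean reciprocal rank (MRR) averages, over all queries, the reciprocal of the rank at which the correct entity appears. The following is a minimal sketch of the standard definition; the function name and the example ranks are illustrative assumptions, not data or code from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Return the mean reciprocal rank.

    ranks: 1-based rank of the correct answer for each query
           (hypothetical input format chosen for this sketch).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Made-up ranks for three queries: correct answer ranked 1st, 2nd, and 4th.
example_ranks = [1, 2, 4]
print(mean_reciprocal_rank(example_ranks))
```

An MRR of 1.0 means the correct entity is always ranked first; lower values mean it tends to appear further down the candidate list.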

AAAI Conference 2026 Conference Paper

Time-Frequency Token Advantage Clipping for Training Efficient Large Reasoning Model

  • Rong Bao
  • Bo Wang
  • Xiao Wang
  • Hongyu Li
  • Rui Zheng
  • Leszek Rutkowski
  • Qi Zhang
  • Liang Ding

Long Chain-of-Thought (CoT) reasoning enhances large reasoning models' performance but suffers from severe inefficiencies, as models often overthink simple problems or underthink complex ones. Current sequence-level optimizations, like length penalties, are too coarse-grained to distinguish core logic from verbose language, precluding the necessary token-level control for efficient reasoning CoT. To overcome these limitations, we introduce Time-Frequency token Advantage Clipping (TFAC), a novel training framework designed to build efficient large reasoning models via token-level interventions. Specifically, TFAC functions along two dimensions: 1) The Frequency Dimension: It discourages inefficient loops and encourages deeper exploration by dynamically reducing the advantage scores of high-entropy tokens that are repeatedly generated within a single reasoning path. 2) The Time Dimension: It reduces excessive overthinking of the system by establishing a historical baseline for the occurrence count of each critical token in previously successful trajectories, and clipping the advantages of tokens that exceed this baseline during training. Crucially, to preserve the model's exploratory capabilities on novel problems, this suppression mechanism is automatically disabled when no historical record of success is available. Experiments conducted on the Deepseek-Distill-32B and Qwen3-8B models show that TFAC outperforms leading baseline methods, improving performance by 2.3 and 3.1 percentage points, respectively, while simultaneously reducing inference costs by 35% and 28% in scenarios where correct answers are generated. These results validate the significant efficacy of TFAC in training large reasoning models that are both powerful and highly efficient.
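The two dimensions described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name, the entropy threshold, the geometric decay factor, the choice to clip to zero, and the choice to touch only positive advantages are all assumptions made for this example.

```python
from collections import Counter


def clip_token_advantages(tokens, advantages, entropies,
                          baseline_counts=None,
                          entropy_threshold=1.0, decay=0.5):
    """Hypothetical token-level advantage adjustment.

    tokens, advantages, entropies: parallel per-token lists for one path.
    baseline_counts: dict of token -> occurrence count in past successful
        trajectories, or None when no successful history exists.
    """
    seen = Counter()
    adjusted = []
    for tok, adv, ent in zip(tokens, advantages, entropies):
        seen[tok] += 1
        new_adv = adv
        # Frequency dimension (assumed form): shrink the positive advantage
        # of a high-entropy token each time it repeats within the same path.
        if adv > 0 and ent > entropy_threshold and seen[tok] > 1:
            new_adv = adv * decay ** (seen[tok] - 1)
        # Time dimension (assumed form): once a token occurs more often than
        # its historical baseline, clip its advantage to at most zero.
        # Skipped entirely when there is no record of past success.
        if baseline_counts is not None and tok in baseline_counts:
            if seen[tok] > baseline_counts[tok]:
                new_adv = min(new_adv, 0.0)
        adjusted.append(new_adv)
    return adjusted


tokens = ["wait", "so", "wait", "wait"]
advantages = [1.0, 1.0, 1.0, 1.0]
entropies = [2.0, 0.1, 2.0, 2.0]
print(clip_token_advantages(tokens, advantages, entropies))
print(clip_token_advantages(tokens, advantages, entropies,
                            baseline_counts={"wait": 2}))
```

Passing `baseline_counts=None` mirrors the abstract's statement that the suppression mechanism is disabled when no historical record of success is available.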

NeurIPS Conference 2025 Conference Paper

3DOT: Texture Transfer for 3DGS Objects from a Single Reference Image

  • Xiao Cao
  • Beibei Lin
  • Bo Wang
  • Zhiyong Huang
  • Robby Tan

Image-based 3D texture transfer from a single 2D reference image enables practical customization of 3D object appearances with minimal manual effort. Adapted 2D editing and text-driven 3D editing approaches can serve this purpose. However, 2D editing typically involves frame-by-frame manipulation, often resulting in inconsistencies across views, while text-driven 3D editing struggles to preserve texture characteristics from reference images. To tackle these challenges, we introduce 3DOT, a 3D Gaussian Splatting Object Texture Transfer method based on a single reference image, integrating: 1) progressive generation, 2) view-consistency gradient guidance, and 3) prompt-tuned gradient guidance. To ensure view consistency, progressive generation starts by transferring texture from the reference image and gradually propagates it to adjacent views. View-consistency gradient guidance further reinforces coherence by conditioning the generation model on feature differences between consistent and inconsistent outputs. To preserve texture characteristics, prompt-tuning-based gradient guidance learns a token that describes differences between original and reference textures, guiding the transfer for faithful texture preservation across views. Overall, 3DOT combines these strategies to achieve effective texture transfer while maintaining structural coherence across viewpoints. Extensive qualitative and quantitative evaluations confirm that our three components enable convincing and effective 2D-to-3D texture transfer. Our project page is available here: https://massyzs.github.io/3DOT_web/.

EAAI Journal 2025 Journal Article

A collaborative surface target detection and localization method for an unmanned surface vehicle swarm

  • Bo Wang
  • Chenyu Mao
  • Kaixin Wei
  • Xueyi Wu
  • Ye Li

A single unmanned surface vehicle (USV) designed for marine missions suffers from limited payload, low efficiency and weak intelligence, while a swarm of USVs shows significant advantages in mission flexibility, diverse payload and task efficiency. One of the key issues for a USV swarm is how to achieve highly efficient collaborative perception. To address this issue, a method framework of collaborative surface target detection and localization based on multiple sensors is designed for a swarm of 4 USVs. First, perception systems are constructed, a joint calibration method for different sensors is proposed, and a lightweight target detection method improved with an attention mechanism and lightweight adaptive spatial feature fusion is designed. Second, a specialized fusion method using sensor principles based on an extended Kalman filter (EKF) is proposed for a single USV to obtain a target state model. Third, the obtained target models from different USVs are registered with fuzzy matching and integrated into the complete model in a geographic coordinate system. The proposed method is applied to the collaborative perception system on our developed 4-USV swarm and verified in a real marine environment and in simulation. Experimental results show that our proposed method framework significantly improves the accuracy, efficiency, and reliability of the target detection and localization. The proposed LAF-YOLOv8-s reduces the model size by 5.1M, while the mean average precision (mAP) reaches 68.7%, which is significantly superior to other methods. The average collaborative localization error is reduced by 2.9 m. The dataset is available at https://github.com/maochenyu1/WSLight.

EAAI Journal 2025 Journal Article

A self-prompt based dual-domain network for nighttime flare removal

  • Kejing Qi
  • Bo Wang
  • Yun Liu

Existing nighttime flare removal work regards flare as a single degradation factor in the spatial domain. However, flare in complex scenes consists of multiple flare types, and it is difficult to distinguish them from the background, leading to distorted results and incomplete perception. In this paper, we propose a self-prompt based dual-domain network named SPDDNet for nighttime flare removal, which encodes the data distributions of different flare types to generate prompt features and facilitates the interaction of the prompt features with the decoder to guide the network in flare removal. In addition, we introduce the Fast Fourier Transform and parallel attention into a traditional convolutional neural network to extract global frequency features and location-dependent local information, so that the flare region is perceived accurately. Finally, to adequately integrate spatial details, contextual information, prompt and image features, we propose a feature fusion module that generates a set of learned dynamic weights to adaptively guide the information fusion across channels. Extensive experiments on real-world and synthetic datasets strongly demonstrate the effectiveness of our proposed SPDDNet and its superior performance compared to state-of-the-art methods. Moreover, as an essential pre-processing step, the potential advantages of our method for other computer vision applications, including object detection and semantic segmentation, are demonstrated.

EAAI Journal 2025 Journal Article

An integrated exergy efficiency and machine learning method for optimizing organic solid waste gasification process

  • Wenni Chen
  • Xianan Xiang
  • Sha Liu
  • Jun Guo
  • Tao Li
  • Xuehua Zhou
  • Deyong Peng
  • Zhiya Deng

Organic solid waste (OSW) gasification is a critical pathway toward sustainable energy utilization. This study develops an integrated prediction model by combining exergy efficiency-based analytic hierarchy process-fuzzy comprehensive evaluation (AHP-FCE) with machine learning techniques. The model aims to select the optimal gasifier type and operational parameters based on OSW characteristics and processing capacities. Exergy efficiency derived from experimental data is used to construct AHP-FCE scores, which are then predicted using eight machine learning algorithms. Gradient boosting decision tree (GBDT) achieves the best performance. The prediction model is applied to three practical cases. For a project with an annual processing capacity of 2000 tons of refuse-derived fuel (RDF), the model consistently recommends the downdraft fixed-bed gasifier (DBG). In a corn straw gasification project processing 11,000 tons per year, the bubbling fluidized-bed gasifier (BBG) is identified as the optimal choice. For a bamboo chip gasification project with an annual capacity of 150,000 tons, the model suggests using the circulating fluidized-bed gasifier (CFBG) for reduction objectives and the dual fluidized-bed gasifier (DFBG) for hydrogen production goals. Additionally, the model shows significant potential. It can also be applied to optimize other complex systems that require balancing multiple influencing factors.

NeurIPS Conference 2025 Conference Paper

BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

  • Adibvafa Fallahpour
  • Andrew Magnuson
  • Purav Gupta
  • Shihao Ma
  • Jack Naimer
  • Arnav Shah
  • Haonan Duan
  • Omar Ibrahim

Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason.

TMLR Journal 2025 Journal Article

Commander-GPT: Dividing and Routing for Multimodal Sarcasm Detection

  • Yazhou Zhang
  • Chunwang Zou
  • Bo Wang
  • Jing Qin
  • Prayag Tiwari

Multimodal sarcasm understanding is a high-order cognitive task. Although large language models (LLMs) have shown impressive performance on many downstream NLP tasks, growing evidence suggests that they struggle with sarcasm understanding. In this paper, we propose Commander-GPT, a modular decision routing framework inspired by military command theory. Rather than relying on a single LLM's capability, Commander-GPT orchestrates a team of specialized LLM agents, each of which is selectively assigned to a focused sub-task such as context modeling or sentiment analysis. Their outputs are then routed back to the commander, which integrates the information and performs the final sarcasm judgment. To coordinate these agents, we introduce three types of centralized commanders: (1) a trained lightweight encoder-based commander (e.g., multi-modal BERT); (2) four small autoregressive language models, serving as moderately capable commanders (e.g., DeepSeek-VL); (3) two large LLM-based commanders (Gemini Pro and GPT-4o) that perform task routing, output aggregation, and sarcasm decision-making in a zero-shot fashion. We evaluate Commander-GPT on the MMSD and MMSD 2.0 benchmarks, comparing five prompting strategies. Experimental results show that our framework achieves 4.4% and 8.5% improvement in F1 score over state-of-the-art (SoTA) baselines on average, demonstrating its effectiveness.

NeurIPS Conference 2025 Conference Paper

Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

  • Xingyu Chen
  • Shihao Ma
  • Runsheng Lin
  • Jiecong Lin
  • Bo Wang

Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.

AAAI Conference 2025 Conference Paper

D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching

  • Jingyu Liu
  • Minquan Wang
  • Ye Ma
  • Bo Wang
  • Aozhu Chen
  • Quan Chen
  • Peng Jiang
  • Xirong Li

Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to such moments, or video decoration with SFX (VDSFX), is crucial for an engaging user experience. Previous work adds SFX to videos by video-to-SFX matching at a holistic level, lacking the ability to add SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment-to-SFX matching untouched. By contrast, we propose in this paper D&M, a unified method that accomplishes key moment detection and moment-to-SFX matching simultaneously. Moreover, for the new VDSFX task we build a large-scale dataset SFX-Moment from an E-commerce video creation platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show the superior performance of the proposed method over the baselines.

NeurIPS Conference 2025 Conference Paper

EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes

  • Xiaoshan Wu
  • Yifei Yu
  • Xiaoyang Lyu
  • Yihua Huang
  • Bo Wang
  • Baoheng Zhang
  • Zhongrui Wang
  • Xiaojuan Qi

Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose EAG3R, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.

AAAI Conference 2025 Conference Paper

FakeDiffer: Distributional Disparity Learning on Differentiated Reconstruction for Face Forgery Detection

  • Bo Wang
  • Zhao Zhang
  • Suiyi Zhao
  • Xianming Ye
  • Haijun Zhang
  • Meng Wang

Existing face forgery detection methods achieve promising performance when training and testing forgery data are from identical manipulation types, while they fail to generalize well to unseen samples. In this paper, we experimentally investigate and find that the poor generalization of these methods mainly arises from their overfitting on the known fake patterns. Excessively focused on seen fakes, those detectors fail to effectively learn image-intrinsic information and the distributional disparity between real and fake images. To address this issue, we redefine fake learning as real-fake distributional disparity learning. We propose a novel deepfake detection framework that learns distributional disparity based on the differentiated reconstruction of real and fake images for improved generalization. Specifically, distributional disparity learning on differentiated reconstruction of the real and fake images forces the model to learn image-invariant intrinsic representations. The reconstruction of real and fake images forces the decoders to learn the distribution of real and fake images, respectively. Moreover, to avoid the influence of specialization to the known fake patterns, we further propose information interaction learning on the encoded intrinsic information and the pixel disparity between the input image and its reconstruction, in order to distinguish even unknown face forgeries. Extensive experiments on large-scale benchmark datasets demonstrate the effectiveness of addressing the overfitting issue of the classification network, and verify the superior performance of our method.

TMLR Journal 2025 Journal Article

FraGNNet: A Deep Probabilistic Model for Tandem Mass Spectrum Prediction

  • Adamo Young
  • Fei Wang
  • David Wishart
  • Bo Wang
  • Russell Greiner
  • Hannes Rost

Compound identification from tandem mass spectrometry (MS/MS) data is a critical step in the analysis of complex mixtures. Typical solutions for the MS/MS spectrum to compound (MS2C) problem involve comparing the unknown spectrum against a library of known spectrum-molecule pairs, an approach that is limited by incomplete library coverage. Compound to MS/MS spectrum (C2MS) models can improve retrieval rates by augmenting real libraries with predicted MS/MS spectra. Unfortunately, many existing C2MS models suffer from problems with mass accuracy, generalization, or interpretability. We develop a new probabilistic method for C2MS prediction, FraGNNet, that can efficiently and accurately simulate MS/MS spectra with high mass accuracy. Our approach formulates the C2MS problem as learning a distribution over molecule fragments. FraGNNet achieves state-of-the-art performance in terms of prediction error and surpasses existing C2MS models as a tool for retrieval-based MS2C.

NeurIPS Conference 2025 Conference Paper

Generative Pre-trained Autoregressive Diffusion Transformer

  • Yuan Zhang
  • Jiacheng Jiang
  • Guoqing Ma
  • Zhiying Lu
  • Bo Wang
  • Haoyang Huang
  • Jianlong Yuan
  • Nan Duan

In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.

NeurIPS Conference 2025 Conference Paper

GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior

  • Penghao Wu
  • Shengnan Ma
  • Bo Wang
  • Jiaheng Yu
  • Lewei Lu
  • Ziwei Liu

Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models through dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables the emergence of self-reflection behavior with fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we build a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.

NeurIPS Conference 2025 Conference Paper

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections

  • Bo Wang
  • Qinyuan Cheng
  • Runyu Peng
  • Rong Bao
  • Peiji Li
  • Qipeng Guo
  • Linyang Li
  • Zhiyuan Zeng

Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to 25% relative gain and a 6% absolute win rate increase in instruction-following tasks). Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.

EAAI Journal 2025 Journal Article

Multi-To-Binary: A generalizable deepfake detection approach with multi-classification guidance

  • Fei Wang
  • Bo Wang
  • Botao Jing
  • Wei Wang
  • Fei Wei
  • Junxin Chen

Visual content forgery techniques, such as Deepfake, have rapidly advanced in recent years. Due to the potential misuse of these techniques for malicious purposes, there is increasing attention to the corresponding detection methods. Most existing methods focus on specific forgery patterns, making it difficult to detect forgeries with unknown or evolving patterns. In this work, we propose a novel forgery detection framework designed to extract comprehensive features utilizing multiple classification models. More specifically, our proposed framework consists of both binary-classification and multi-classification models working collaboratively, enhanced by innovative fusion and freezing mechanisms to improve accuracy and efficiency. We conducted extensive experiments to evaluate the performance of our approach. The results demonstrate that our approach outperforms state-of-the-art techniques in terms of generalization to new forgery patterns and robustness against various types of forgeries. This demonstrates promising effectiveness for real-world applications where forgeries can be diverse and sophisticated. Our code is available at https://github.com/Phoebe-cap/M2B-main.

NeurIPS Conference 2025 Conference Paper

PANTHER: Generative Pretraining Beyond Language for Sequential User Behavior Modeling

  • Guilin Li
  • Yun Zhang
  • Xiuyuan Chen
  • CHENGQI LI
  • Bo Wang
  • Linghe Kong
  • Wenjia Wang
  • Weiran Huang

Large language models (LLMs) have shown that generative pretraining can distill vast world knowledge into compact token representations. While LLMs encapsulate extensive world knowledge, they remain limited in modeling the behavioral knowledge contained within user interaction histories. User behavior forms a distinct modality, where each action, defined by multi-dimensional attributes such as time, context, and transaction type, constitutes a behavioral token. Modeling these high-cardinality, sparse, and irregular sequences is challenging, and discriminative models often falter under limited supervision. To bridge this gap, we extend generative pretraining to user behavior, learning transferable representations from unlabeled behavioral data analogous to how LLMs learn from text. We present PANTHER, a hybrid generative–discriminative framework that unifies user behavior pretraining and downstream adaptation, enabling large-scale sequential user representation learning and real-time inference. PANTHER introduces: (1) Structured Tokenization to compress multi-dimensional transaction attributes into an interpretable vocabulary; (2) Sequence Pattern Recognition Module (SPRM) for modeling periodic transaction motifs; (3) a Unified User-Profile Embedding that fuses static demographics with dynamic transaction histories, enabling both personalized predictions and population-level knowledge transfer; and (4) Real-time scalability enabled by offline caching of pre-trained embeddings for millisecond-level inference. Fully deployed and operational online at WeChat Pay, PANTHER delivers a 25.6% boost in next-transaction prediction HitRate@1 and a 38.6% relative improvement in fraud detection recall over baselines. Cross-domain evaluations on public benchmarks (CCT, MBD, MovieLens-1M, Yelp) show strong generalization, achieving up to 21% HitRate@1 gains over transformer baselines, establishing PANTHER as a scalable, high-performance framework for industrial user sequential behavior modeling.

IROS Conference 2025 Conference Paper

PI-WAN: A Physics-Informed Wind-Adaptive Network for Quadrotor Dynamics Prediction in Unknown Environments

  • Mengyun Wang
  • Bo Wang
  • Yifeng Niu
  • Chang Wang

Accurate dynamics modeling is essential for quadrotors to achieve precise trajectory tracking in various applications. Traditional physical knowledge-driven modeling methods face substantial limitations in unknown environments characterized by variable payloads, wind disturbances, and external perturbations. On the other hand, data-driven modeling methods suffer from poor generalization when handling out-of-distribution (OoD) data, restricting their effectiveness in unknown scenarios. To address these challenges, we introduce the Physics-Informed Wind-Adaptive Network (PI-WAN), which combines knowledge-driven and data-driven modeling methods by embedding physical constraints directly into the training process for robust quadrotor dynamics learning. Specifically, PI-WAN employs a Temporal Convolutional Network (TCN) architecture that efficiently captures temporal dependencies from historical flight data, while a physics-informed loss function applies physical principles to improve model generalization and robustness across previously unseen conditions. By incorporating real-time prediction results into a model predictive control (MPC) framework, we achieve improvements in closed-loop tracking performance. Comprehensive simulations and real-world flight experiments demonstrate that our approach outperforms baseline methods in terms of prediction accuracy, tracking precision, and robustness to unknown environments.

EAAI Journal 2025 Journal Article

Predicting potential microbe-disease associations based on heterogeneous graph attention network and deep sparse autoencoder

  • Bo Wang
  • Wenlong Zhao
  • Xiaoxin Du
  • Jianfei Zhang
  • Chunyu Zhang
  • Liping Wang
  • Yang He

Identifying potential associations between microbes and diseases is crucial for explaining disease pathogenesis and designing targeted therapeutic strategies. Basic biological experiments for microbe-disease association (MDA) prediction are costly, time-consuming, and labor-intensive, whereas computational methods can effectively complement traditional biological experiments. We propose a computational framework called graph attention convolutional deep sparse autoencoder microbe-disease association (GCDSAEMDA) to predict unknown MDAs. First, we calculate the semantic similarity and Gaussian interaction profile (GIP) similarity of diseases, as well as the functional similarity and GIP similarity of microbes, and integrate these similarity matrices to construct a heterogeneous graph. Next, a multi-head dynamic graph attention mechanism is employed to extract low-order features of microbe and disease nodes in the heterogeneous graph, while multiple convolutional neural networks with different kernels aggregate and concatenate these low-order features to form new high-order representations. Third, we apply a cosine distance-based k-means clustering to select reliable negative samples and use a deep sparse autoencoder to extract high-order features of microbe-disease pairs. Finally, an ensemble Light Gradient Boosting Machine (LightGBM) algorithm is used to predict potential MDAs. GCDSAEMDA was compared to four state-of-the-art MDA models on the Human Microbe-Disease Association Database (HMDAD) and Disbiome databases and validated through five-fold cross-validation on diseases, microbes, and microbe-disease pairs. Results indicate that GCDSAEMDA outperforms the other four models in MDA prediction. Additionally, case studies demonstrate the robust predictive capability of GCDSAEMDA. The source code and datasets for GCDSAEMDA are available at https://github.com/chenyunmolu/GCDSAEMDA.

IROS Conference 2025 Conference Paper

Quality-Driven Adaptive Control Framework for Robotic Ultrasound Imaging of Vascular Anatomies

  • Bo Wang
  • Junling Fu
  • Giancarlo Ferrigno
  • Elena De Momi

This paper proposes a quality-driven adaptive control framework for robotic scanning of vascular anatomies to facilitate the acquisition of high-quality ultrasound (US) images. Specifically, a novel probability-based US image quality evaluation metric for vascular anatomies is introduced, leveraging an image segmentation network to establish a mapping between the controlled variables of the robot (e.g., pose and force) and US image quality. Furthermore, an adaptive US probe control strategy driven by US image quality is developed to optimize real-time image acquisition, with its stability rigorously proven. To assess the effectiveness of the proposed framework, two experiments were conducted on a human tissue-mimicking phantom, encompassing both static and dynamic scenarios. The experimental results demonstrate that the proposed framework ensures stable contact force and significantly enhances US image quality for robot-assisted vascular anatomy imaging, even in the presence of external disturbances.

YNIMG Journal 2025 Journal Article

Sleep indicators and staging: A functional near-infrared spectroscopy study in healthy young adults

  • Yong Cao
  • Xingwei An
  • Wenxiao Zhong
  • Jin Jiang
  • Hongzuo Chu
  • Xuejun Jiao
  • Xiaoping Chen
  • Yufeng Ke

Functional near-infrared spectroscopy (fNIRS)-based sleep staging has attracted considerable interest due to its portability and limited interference with sleep. However, few studies have systematically examined sleep indicators or formulated sleep staging models based on fNIRS features labelled by polysomnography (PSG). This study aimed to address these shortcomings and promote the application of fNIRS in sleep monitoring. Thirty-seven volunteers participated in our experiment, with 6-channel prefrontal fNIRS data and standard PSG data collected simultaneously. Sleep indicators were extracted from time-domain, frequency-domain, and entropy perspectives. Sleep staging was developed based on these indicators using human-scored PSG as reference. Our findings indicated deeper sleep was correlated with a decrease in amplitude of time-domain features, while entropy features showed a contrasting trend. The fNIRS-based sleep staging achieved a Cohen's kappa (κ) of 0.76 ± 0.12, 0.72 ± 0.09, and 0.71 ± 0.07, with accuracies of 94.2 ± 2.4%, 87.8 ± 3.2%, and 82.2 ± 4.1%, for 2-class (Wake/Sleep), 3-class (Wake/NREM/REM), and 4-class (Wake/N1+N2/N3/REM) classifications, respectively. Sleep statistics derived from fNIRS closely aligned with those from PSG, with differences in sleep onset latency, wake after sleep onset, total wake/sleep time within 5 min and sleep efficiency below 3%. The substantial agreement in both detailed (epoch-by-epoch) and comprehensive (total) sleep statistics with PSG suggests fNIRS is a reliable tool for long-term sleep monitoring in everyday settings.

EAAI Journal 2025 Journal Article

Study on the deformation mode domain of shrink energy-absorbing structures based on curve feature classification method

  • Jiaxing He
  • Ping Xu
  • Jie Xing
  • Shuguang Yao
  • Bo Wang
  • Xin Zheng

Shrink energy-absorbing structures play a key role in engineering applications by absorbing impact energy and ensuring passenger safety. However, inappropriate structural parameters and contact conditions can lead to buckling instability or folding collapse, which reduces the energy absorption efficiency. To address this, a deformation mode classification method based on the curve feature was proposed. A Long Short-Term Memory (LSTM) network was used to predict the crushing force curve, followed by feature extraction and mode classification to establish the mapping relationships from design parameters to deformation modes. The deformation mode domain was then constructed using the classification model for data expansion, and its boundaries were precisely defined using surface fitting techniques. The critical cone angles of the shrink deformation mode at different friction coefficients were obtained by two-dimensional analysis. In addition, a structural design strategy was also proposed to maximize the specific energy absorption (SEA) of the structure under the shrink deformation mode. The results show that the classification method can effectively predict the deformation modes with 97% accuracy. Further analysis of the deformation mode domain reveals that the critical cone angle of the shrink deformation mode decreases with the increase of the friction coefficient. Overall, this study predicts the deformation modes of shrink energy-absorbing structures and analyzes the variation of the critical cone angle, providing important guidance for structural optimization and improving energy absorption efficiency.

JBHI Journal 2025 Journal Article

Swallow-PPG: Photoplethysmography Templates for Comprehensive Temporal Analysis of Swallowing Anatomical Actions

  • Ying Zhang
  • Junjie Li
  • Ping Wang
  • Huaiyu Zhu
  • Bo Wang
  • Wei Luo
  • Yun Pan

In clinical practice, Videofluoroscopic Swallowing Study (VFSS) is commonly used to monitor the activity of anatomical structures during swallowing. However, it is limited by ionizing radiation exposure, adverse effects of barium contrast agents, and the high cost of specialized equipment. In this study, we propose a framework for analyzing swallowing behaviors in photoplethysmography (PPG) waveforms, which includes generalizing the manifestation of swallowing in PPG (i.e., swallowing template generation) and conducting comprehensive temporal analysis of swallowing anatomical actions (TASAA). For swallowing template generation, we cluster and average the samples to obtain template waveforms, followed by conducting shape-based mapping and averaging on 28 time indicators to derive template unified time indicators (TUTIs). For comprehensive TASAA, we leverage template waveforms and TUTIs to estimate time indicators based on the mapping relationship between samples and their respective templates. We evaluate the proposed framework on 357 swallowing PPG samples from 41 elderly subjects. The average relative error across all time indicators is 0.123, and 6 indicators notably excel with errors below 0.1. The proposed template-based swallowing analysis framework is expected to become a low-cost and non-ionizing alternative to VFSS for comprehensive TASAA.

YNIMG Journal 2025 Journal Article

The association among individual gray matter volume of frontal-limbic circuitry, fatigue susceptibility, and comorbid neuropsychiatric symptoms following COVID-19

  • Xuan Niu
  • Wenrui Bao
  • Zhaoyao Luo
  • Pang Du
  • Heping Zhou
  • Haiyang Liu
  • Baoqi Wang
  • Huawen Zhang

BACKGROUND: Fatigue is often accompanied by comorbid sleep disturbance and psychiatric distress following the COVID-19 infection. However, identifying individuals at risk for developing post-COVID fatigue remains challenging. This study aimed to identify the neurobiological markers underlying fatigue susceptibility and further investigate their effect on COVID-19-related neuropsychiatric symptoms. METHODS: Individuals following a mild SARS-CoV-2 infection (COV+) underwent neuropsychiatric measurements (n = 335) and MRI scans (n = 271) within 1 month (baseline), and 191 (70.5 %) of the individuals were followed up 3 months after infection. Sixty-seven healthy controls (COV-) completed the same recruitment protocol. RESULTS: Whole-brain voxel-wise analysis showed that gray matter volume (GMV) during the acute phase did not differ between the COV+ and COV- groups. GMV in the right dorsolateral prefrontal cortex (DLPFC) and left dorsal anterior cingulate cortex (dACC) were associated with fatigue severity only in the COV+ group at baseline, which were assigned to the frontal system and limbic system, respectively. Furthermore, fatigue mediated the associations between volume differences in fatigue susceptibility and COVID-related sleep, post-traumatic stress disorder, anxiety and depression. Crucially, the initial GMV in the right DLPFC can predict fatigue symptoms 3 months after infection. CONCLUSIONS: We provide novel evidence on the neuroanatomical basis of fatigue vulnerability and emphasize that acute fatigue is an important link between early GMV in the frontal-limbic regions and comorbid neuropsychiatric symptoms at baseline and 3 months after infection. Our findings highlight the role of the frontal-limbic system in predisposing individuals to develop post-COVID fatigue.

AIIM Journal 2025 Journal Article

VAE-GANMDA: A microbe-drug association prediction model integrating variational autoencoders and generative adversarial networks

  • Bo Wang
  • Yang He
  • Xiaoxin Du
  • Lei Zhu
  • Junqi Wang
  • Tongxuan Wang

Traditional biological experimental methods typically require weeks or even months of experimentation, and the cost of each experiment can reach hundreds or even thousands of dollars, which is quite expensive and time-consuming. To address this, a model called VAE-GANMDA, which integrates variational autoencoders (VAE) and generative adversarial networks (GAN) for predicting microbe-drug associations, has been proposed. Firstly, a heterogeneous network of microbes and drugs is established to enrich the association information. Secondly, by fusing VAE and GAN, the model learns the manifold distribution of data through association features, obtaining nonlinear manifold features. Furthermore, the VAE generation module is improved by integrating the Convolutional Block Attention Module (CBAM) and Gaussian kernel function, enhancing the smooth perception of manifold features, thus endowing VAE with stronger feature extraction capabilities. Then, the singular value decomposition (SVD) technique is employed to extract linear features of the data. Finally, by combining linear and nonlinear features, the k-means++ algorithm is used to select balanced and high-quality negative samples for training the MLP classifier. Through performance evaluation, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) of VAE-GANMDA reach 0.9724 and 0.9635 respectively, outperforming classical machine learning methods and the majority of deep learning methods. Case studies demonstrate that VAE-GANMDA accurately predicts candidate drugs related to SARS-CoV-2 and candidate microbes related to ciprofloxacin.

EAAI Journal 2024 Journal Article

An adaptive lightweight small object detection method for incremental few-shot scenarios of unmanned surface vehicles

  • Bo Wang
  • Peng Jiang
  • Zhuoyan Liu
  • Yueming Li
  • Jian Cao
  • Ye Li

Real-time and accurate detection of sea surface objects has become an important research topic for unmanned surface vehicles (USVs). During the execution of tasks, USVs sometimes need to upgrade their model to detect new categories, and the initial data is often limited. This requires quick adaptation to new categories under few-shot scenarios. We propose a lightweight neural network for detecting small sea surface objects named Shuffle-High-Resolution-Net (SHRDet), which integrates the enhanced Shuffle Block based on High-Resolution-Net (HRNet), a lightweight feature fusion module, and the Focal Efficient Intersection over Union loss. Based on SHRDet, a fast adaptation method named SHRDet-N for incremental few-shot categories is proposed. It generates category enhancement features through a cross-attention mechanism, and introduces elastic weight consolidation and feature distance to solve catastrophic forgetting when learning incremental few-shot categories. The algorithms have been applied to an intelligent USV platform for various surface missions, such as security patrol, ocean investigation, and marine engineering. The experimental results on public datasets indicate that SHRDet achieves 80.7% mean Average Precision (mAP) on the Water Surface Object Detection Dataset (WSODD) with only 0.69 M parameters and a lower computational cost, and SHRDet is significantly superior to state-of-the-art methods in terms of model size and accuracy. Moreover, SHRDet-N effectively solves the learning problem of incremental few-shot categories. When the sample number of a new category is set to 20, the mAP of the base categories is 82.5% and that of the new category is 63.8%, which is 2.8% and 4.2% higher than that of state-of-the-art models like Sylph.

NeurIPS Conference 2024 Conference Paper

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

  • Jiahao Ying
  • Yixin Cao
  • Yushi Bai
  • Qianru Sun
  • Bo Wang
  • Wei Tang
  • Zhaojun Ding
  • Yizhe Yang

Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper, we propose to automate dataset updating and provide a systematic analysis of its effectiveness in dealing with benchmark leakage, difficulty control, and stability. Thus, once a current benchmark has been mastered or leaked, we can update it for timely and reliable evaluation. There are two updating strategies: 1) a mimicking strategy to generate similar samples based on original data, preserving stylistic and contextual essence, and 2) an extending strategy that further expands existing samples at varying cognitive levels by adapting Bloom’s taxonomy of educational objectives. Extensive experiments on updated MMLU and BIG-Bench demonstrate the stability of the proposed strategies and find that the mimicking strategy can effectively alleviate issues of overestimation from benchmark leakage. In cases where the efficient mimicking strategy fails, our extending strategy still shows promising results. Additionally, by controlling the difficulty, we can better discern the models’ performance and enable fine-grained analysis; neither too difficult nor too easy an exam can fairly judge students’ learning status. To the best of our knowledge, we are the first to automate updating benchmarks for reliable and timely evaluation. Our demo leaderboard can be found at https://yingjiahao14.github.io/Automating-DatasetUpdates/.

IJCAI Conference 2024 Conference Paper

Boosting Single Positive Multi-label Classification with Generalized Robust Loss

  • Yanxi Chen
  • Chunxiao Li
  • Xinyang Dai
  • Jinhuan Li
  • Weiyu Sun
  • Yiming Wang
  • Renyuan Zhang
  • Tinghe Zhang

Multi-label learning (MLL) requires comprehensive multi-semantic annotations that are hard to fully obtain, thus often resulting in missing-label scenarios. In this paper, we investigate Single Positive Multi-label Learning (SPML), where each image is associated with merely one positive label. Existing SPML methods only focus on designing losses using mechanisms such as hard pseudo-labeling and robust losses, mostly leading to unacceptable false negatives. To address this issue, we first propose a generalized loss framework based on expected risk minimization to provide soft pseudo labels, and point out that the former losses can be seamlessly converted into our framework. In particular, we design a novel robust loss based on our framework, which enjoys flexible coordination between false positives and false negatives, and can additionally deal with the imbalance between positive and negative samples. Extensive experiments show that our approach can significantly improve SPML performance and outperform the vast majority of state-of-the-art methods on all four benchmarks. Our code is available at https://github.com/yan4xi1/GRLoss.

IROS Conference 2024 Conference Paper

C³P-VoxelMap: Compact, Cumulative and Coalescible Probabilistic Voxel Mapping

  • Xu Yang
  • Wenhao Li
  • Qijie Ge
  • Lulu Suo
  • Weijie Tang
  • Zhengyu Wei
  • Longxiang Huang
  • Bo Wang

This work presents a compact, cumulative, and coalescible probabilistic voxel mapping method to enhance performance, accuracy, and memory efficiency in LiDAR odometry. Probabilistic voxel mapping requires storing past point clouds and re-iterating them to update the uncertainty at every iteration, which consumes large memory space and CPU cycles. To solve this problem, we propose a two-fold strategy. First, we introduce a compact point-free representation for probabilistic voxels and derive a cumulative update of the planar uncertainty without caching original point clouds. Our voxel structure only keeps track of a predetermined set of statistics for points that lie inside it. This method reduces the runtime complexity from O(MN) to O(N) and the space complexity from O(N) to O(1) where M is the number of iterations and N is the number of points. Second, to further minimize memory usage and enhance mapping accuracy, we provide a strategy to dynamically merge voxels associated with the same physical planes by taking advantage of the geometric features in the real world. Rather than constantly scanning for these coalescible voxels at every iteration, our merging strategy accumulates voxels in a locality-sensitive hash and triggers merging lazily. On-demand merging reduces memory footprint with minimal computational overhead and improves localization accuracy thanks to cross-voxel denoising. Experiments exhibit 20% higher accuracy, 20% faster performance, and 70% lower memory consumption than the state-of-the-art.
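
As an illustration of the point-free, cumulative idea described above, the following Python sketch keeps only a fixed set of running sums per voxel and recovers the mean and covariance from them, so no point cloud is cached. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `VoxelStats` and its methods are hypothetical, and a plain population covariance stands in for the paper's planar uncertainty statistics.

```python
class VoxelStats:
    """Hypothetical fixed-size running statistics for points inside one voxel.

    Storage is constant regardless of how many points are added:
    a count, a sum of coordinates, and a sum of outer products.
    """

    def __init__(self):
        self.count = 0
        self.sum = [0.0, 0.0, 0.0]
        self.outer = [[0.0] * 3 for _ in range(3)]

    def add(self, point):
        # Constant-time update per point; the point itself is not stored.
        self.count += 1
        for i in range(3):
            self.sum[i] += point[i]
            for j in range(3):
                self.outer[i][j] += point[i] * point[j]

    def merge(self, other):
        # Because every statistic is a sum, two voxels coalesce by addition.
        self.count += other.count
        for i in range(3):
            self.sum[i] += other.sum[i]
            for j in range(3):
                self.outer[i][j] += other.outer[i][j]

    def mean(self):
        return [s / self.count for s in self.sum]

    def covariance(self):
        # Population covariance: E[p p^T] - mean mean^T.
        m = self.mean()
        return [
            [self.outer[i][j] / self.count - m[i] * m[j] for j in range(3)]
            for i in range(3)
        ]


# Four coplanar points (z = 0) split across two voxels, then merged.
a, b = VoxelStats(), VoxelStats()
for p in [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
    a.add(p)
for p in [(0.0, 2.0, 0.0), (2.0, 2.0, 0.0)]:
    b.add(p)
a.merge(b)
print(a.mean())        # centroid of the merged voxel
print(a.covariance())  # zero variance along z for coplanar points
```

Summing sufficient statistics is what makes both the per-point update and the voxel merge independent of the number of points already seen; a plane fit would then be derived from the covariance, which this sketch leaves out.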

NeurIPS Conference 2024 Conference Paper

EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection

  • Qinqian Lei
  • Bo Wang
  • Robby T. Tan

Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that rely on aligning visual encoders with large Vision-Language Models (VLMs) to tap into the extensive knowledge of VLMs require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment. However, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen class labels. To address these challenges, we introduce a novel prompt learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain seen-class labels alone, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes instead of unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs utilized to highlight the differences between unseen and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters compared to existing methods. Code is available at https://github.com/ChelsieLei/EZ-HOI.

AAAI Conference 2024 Conference Paper

Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI

  • Qinqian Lei
  • Bo Wang
  • Robby T. Tan

Detecting human-object interactions (HOI) in a few-shot setting remains a challenge. Existing meta-learning methods struggle to extract representative features for classification due to the limited data, while existing few-shot HOI models rely on HOI text labels for classification. Moreover, some query images may display visual similarity to those outside their class, such as similar backgrounds between different HOI classes. This makes learning more challenging, especially with limited samples. Bongard-HOI epitomizes this HOI few-shot problem, making it the benchmark we focus on in this paper. In our proposed method, we introduce novel label-uncertain query augmentation techniques to enhance the diversity of the query inputs, aiming to distinguish the positive HOI class from the negative ones. As these augmented inputs may or may not have the same class label as the original inputs, their class label is unknown. Those belonging to a different class become hard samples due to their visual similarity to the original ones. Additionally, we introduce a novel pseudo-label generation technique that enables a mean teacher model to learn from the augmented label-uncertain inputs. We propose to augment the negative support set for the student model to enrich the semantic information, fostering diversity that challenges and enhances the student’s learning. Experimental results demonstrate that our method sets a new state-of-the-art (SOTA) performance by achieving 68.74% accuracy on the Bongard-HOI benchmark, a significant improvement over the existing SOTA of 66.59%. In our evaluation on HICO-FS, a more general few-shot recognition dataset, our method achieves 73.27% accuracy, outperforming the previous SOTA of 71.20% in the 5-way 5-shot task.

NeurIPS Conference 2024 Conference Paper

GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

  • Zehui Li
  • Vallijah Subasri
  • Guy-Bart Stan
  • Yiren Zhao
  • Bo Wang

Genetic variants (GVs) are defined as differences in the DNA sequences among individuals and play a crucial role in diagnosing and treating genetic diseases. The rapid decrease in next generation sequencing cost, analogous to Moore’s Law, has led to an exponential increase in the availability of patient-level GV data. This growth poses a challenge for clinicians who must efficiently prioritize patient-specific GVs and integrate them with existing genomic databases to inform patient management. To address the interpretation of GVs, genomic foundation models (GFMs) have emerged. However, these models lack standardized performance assessments, leading to considerable variability in model evaluations. This poses the question: How effectively do deep learning methods classify unknown GVs and align them with clinically-verified GVs? We argue that representation learning, which transforms raw data into meaningful feature spaces, is an effective approach for addressing both indexing and classification challenges. We introduce a large-scale Genetic Variant dataset, named GV-Rep, featuring variable-length contexts and detailed annotations, designed for deep learning models to learn GV representations across various traits, diseases, tissue types, and experimental contexts. Our contributions are three-fold: (i) Construction of a comprehensive dataset with 7 million records, each labeled with characteristics of the corresponding variants, alongside additional data from 17,548 gene knockout tests across 1,107 cell types, 1,808 variant combinations, and 156 unique clinically-verified GVs from real-world patients. (ii) Analysis of the structure and properties of the dataset. (iii) Experimentation of the dataset with pre-trained genomic foundation models (GFMs). The results highlight a significant disparity between the current capabilities of GFMs and the accurate representation of GVs. We hope this dataset will advance genomic deep learning to bridge this gap.

NeurIPS Conference 2024 Conference Paper

MassSpecGym: A benchmark for the discovery and identification of molecules

  • Roman Bushuiev
  • Anton Bushuiev
  • Niek F. de Jonge
  • Adamo Young
  • Fleming Kretschmer
  • Raman Samusevich
  • Janne Heirman
  • Fei Wang

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community. MassSpecGym is publicly available at https://github.com/pluskal-lab/MassSpecGym.

AAAI Conference 2024 Conference Paper

MultiSum: A Multi-Facet Approach for Extractive Social Summarization Utilizing Semantic and Sociological Relationships

  • Tanglong Zhao
  • Ruifang He
  • Jing Xu
  • Bo Wang

Social summarization aims to provide summaries for a large number of social texts (called posts) about a single topic. To extract a summary, both the representation of posts and the summary selection method are crucial. Previous methods introduce social relations to enhance post embeddings and mitigate the sparse representation caused by brief and informal expression. However, they ignore that there are multiple relations between posts. Besides, existing graph-based centrality calculation approaches tend to select posts from one aspect. This leads to facet bias, especially when there are multiple viewpoints. In this paper, we propose a model named MultiSum to improve social summarization. Specifically, 1) we use graph convolutional networks to fuse text content with social and semantic relations to improve post representation; 2) the similarity between the summary and all aspects is incorporated into the centrality score during the selection phase, encouraging the model to pay attention to different facets. Experimental results on English and Chinese corpora support the effectiveness of this model. Furthermore, external evaluations by human experts and large language models demonstrate the validity of MultiSum in facet coverage and redundancy reduction.

IJCAI Conference 2024 Conference Paper

OSIC: A New One-Stage Image Captioner Coined

  • Bo Wang
  • Zhao Zhang
  • Mingbo Zhao
  • Xiaojie Jin
  • Mingliang Xu
  • Meng Wang

Mainstream image captioning models are usually two-stage captioners, i.e., they encode the region features with a pre-trained detector and then feed them into a language model to generate the captions. However, such a two-stage procedure leads to a task-based information gap that decreases performance, because the region features from the detection task are suboptimal representations and cannot provide all the necessary information for subsequent caption generation. Besides, the region features are usually taken from the last layer of the detectors, which loses the local details of images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms the images into descriptive sentences in one stage to eliminate the information gap. Specifically, to obtain rich features, multi-level features are captured by Swin Transformer and then fed into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module to non-locally model the feature interactions. As a result, OSIC can directly obtain rich semantic information to improve the captioner. Extensive comparisons on the benchmark MS-COCO, Flickr8K and Flickr30K datasets verified the superior performance of our method.

EAAI Journal 2024 Journal Article

Spatial-frequency feature fusion based deepfake detection through knowledge distillation

  • Bo Wang
  • Xiaohan Wu
  • Fei Wang
  • Yushu Zhang
  • Fei Wei
  • Zengren Song

While the misuse of Deepfake technology is drawing growing concern in the literature of information security, related forgery detection has become a significant challenge in practical applications. Most state-of-the-art detection methods achieve satisfactory results on raw images, but their performance drops significantly on processed images (e.g., compression). In this work, we propose a novel Deepfake detection method that integrates spatial and frequency domain information within a knowledge distillation framework for efficient forgery detection. Our method consists of two steps: (1) spatial-frequency fusion, and (2) multi-knowledge distillation. We first extract frequency-domain and spatial-domain features, then fuse them and utilize them in attention-based guidance to improve the classification results. Note that the spatial-frequency fusion serves as the basis for both the teacher and student models with spatial-frequency features and logits transferred as knowledge. We conducted comprehensive experiments on several benchmark datasets which successfully demonstrate the excellent generalization performance of our method on compressed images while outperforming state-of-the-art techniques.

EAAI Journal 2023 Journal Article

Application of Quantum Particle Swarm Optimization for task scheduling in Device-Edge-Cloud Cooperative Computing

  • Bo Wang
  • Zhifeng Zhang
  • Ying Song
  • Ming Chen
  • Yangyang Chu

Swarm intelligence and evolutionary algorithms (SI&EAs) have been widely applied to various fields. In this paper, we make the first attempt, to the best of our knowledge, to apply an SI&EA, Quantum Particle Swarm Optimization (QPSO), to the task scheduling problem in Device-Edge-Cloud Cooperative Computing (DE3C), which is one of the newest and most widespread computing paradigms. We first formulate the problem and propose a QPSO-based method to solve it within a reasonable time. Then we summarize the existing variants of QPSO, which exploit various improvement schemes for QPSO. Finally, we conduct simulated experiments to evaluate the performance of QPSO and its variants on the task scheduling problem of DE3C, and have the following findings. (1) QPSO outperforms several up-to-date heuristics and SI&EAs in both user satisfaction and resource efficiency. (2) Existing improvement methods have no appreciable effect on QPSO for solving large-scale problems. (3) The performance of an improvement for QPSO depends mostly on the randomness of the offset added to particle movements.
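For readers unfamiliar with QPSO, the following is a minimal sketch of the commonly cited textbook update rule, in which each particle is resampled around an attractor with a random offset. It is not the paper's scheduler: the function name `qpso_minimize`, the parameter values, the bounds, and the toy sphere cost are all assumptions chosen for illustration, and no DE3C scheduling model is included.

```python
import math
import random


def qpso_minimize(cost, dim, n_particles=20, iters=100, beta=0.75,
                  bounds=(-5.0, 5.0), seed=0):
    """Textbook-style QPSO on a continuous cost function (illustrative only)."""
    rng = random.Random(seed)
    lo, hi = bounds
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]            # personal best positions
    pcost = [cost(x) for x in xs]         # personal best costs
    g = min(range(n_particles), key=pcost.__getitem__)
    gbest, gcost = pbest[g][:], pcost[g]  # global best
    initial_cost = gcost
    for _ in range(iters):
        # Mean of all personal bests ("mbest") sets the scale of the offset.
        mbest = [sum(p[d] for p in pbest) / n_particles for d in range(dim)]
        for i in range(n_particles):
            for d in range(dim):
                phi = rng.random()
                attractor = phi * pbest[i][d] + (1 - phi) * gbest[d]
                u = 1.0 - rng.random()    # in (0, 1], avoids log of infinity
                offset = beta * abs(mbest[d] - xs[i][d]) * math.log(1.0 / u)
                x = attractor + offset if rng.random() < 0.5 else attractor - offset
                xs[i][d] = min(max(x, lo), hi)  # keep inside the search box
            c = cost(xs[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = xs[i][:], c
                if c < gcost:
                    gbest, gcost = xs[i][:], c
    return gbest, gcost, initial_cost


def sphere(x):
    """Toy cost standing in for a scheduling objective (assumption)."""
    return sum(v * v for v in x)


best, best_cost, start_cost = qpso_minimize(sphere, dim=3)
```

The random `offset` term is the part that finding (3) in the abstract refers to; a scheduling application would additionally need a mapping from continuous positions to task assignments, which is omitted here.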

EAAI Journal 2023 Journal Article

Distributional prediction of short-term traffic using neural networks

  • Bo Wang
  • Hai L. Vu
  • Inhi Kim
  • Chen Cai

Neural network (NN)-based models have recently achieved outstanding results in short-term traffic prediction. However, most of these are based on the regression approach and trained to generate a single data point as a predicted value for future timesteps, which does not provide information on prediction uncertainty and limits its performance under different traffic conditions. To solve this problem, this study proposes a novel, high-dimensional distributional prediction (HDP) framework. This method has been validated by a series of experiments using the Caltrans Performance Measurement System dataset and four widely used NN models. The results suggest that the proposed HDP scheme can help existing NN structures to (1) generate adaptive distributional predictions for quantifying the uncertainty of multiple targets, and (2) gain better point prediction in terms of accuracy and robustness. Furthermore, we demonstrate that predicted speed distributions can be used for travel time estimation, outperforming other traditional methods in unexpected traffic conditions such as traffic incidents.

NeurIPS Conference 2023 Conference Paper

DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets

  • Lazar Atanackovic
  • Alexander Tong
  • Bo Wang
  • Leo J Lee
  • Yoshua Bengio
  • Jason S. Hartford

One of the grand challenges of cell biology is inferring the gene regulatory network (GRN) which describes interactions between genes and their products that control gene expression and cellular function. We can treat this as a causal discovery problem but with two non-standard challenges: (1) regulatory networks are inherently cyclic so we should not model a GRN as a directed acyclic graph (DAG), and (2) observations have significant measurement noise so for typical sample sizes, there will always be a large equivalence class of graphs that are likely given the data, and we want methods that capture this uncertainty. Existing methods either focus on challenge (1), identifying cyclic structure from dynamics, or on challenge (2) learning complex Bayesian posteriors over directed acyclic graphs, but not both. In this paper we leverage the fact that it is possible to estimate the "velocity" of the expression of a gene with RNA velocity techniques to develop an approach that addresses both challenges. Because we have access to velocity information, we can treat the Bayesian structure learning problem as a problem of sparse identification of a dynamical system, capturing cyclic feedback loops through time. We leverage Generative Flow Networks (GFlowNets) to estimate the posterior distribution over the combinatorial space of possible sparse dependencies. Our results indicate that our method learns posteriors that better encapsulate the distributions of cyclic structures compared to counterpart state-of-the-art Bayesian structure learning approaches.

IJCAI Conference 2023 Conference Paper

Explainable Text Classification via Attentive and Targeted Mixing Data Augmentation

  • Songhao Jiang
  • Yan Chu
  • Zhengkui Wang
  • Tianxing Ma
  • Hanlin Wang
  • Wenxuan Lu
  • Tianning Zang
  • Bo Wang

Mixing data augmentation methods have been widely used in text classification recently. However, existing methods do not control the quality of augmented data and have low model explainability. To tackle these issues, this paper proposes an explainable text classification solution based on attentive and targeted mixing data augmentation, ATMIX. Instead of selecting data for augmentation without control, ATMIX focuses on the misclassified training samples as the target for augmentation to better improve the model's capability. Meanwhile, to generate meaningful augmented samples, it adopts a self-attention mechanism to understand the importance of the subsentences in a text, and cuts and mixes the subsentences between the misclassified and correctly classified samples accordingly. Furthermore, it employs a novel dynamic augmented data selection framework based on the loss function gradient to dynamically optimize the augmented samples for model training. In the end, we develop a new model explainability evaluation method based on subsentence attention and conduct extensive evaluations over multiple real-world text datasets. The results indicate that ATMIX is more effective with higher explainability than the typical classification models, hidden-level, and input-level mixup models.

NeurIPS Conference 2023 Conference Paper

Spatially Resolved Gene Expression Prediction from Histology Images via Bi-modal Contrastive Learning

  • Ronald Xie
  • Kuan Pang
  • Sai Chung
  • Catia Perciani
  • Sonya MacParland
  • Bo Wang
  • Gary Bader

Histology imaging is an important tool in medical diagnosis and research, enabling the examination of tissue structure and composition at the microscopic level. Understanding the underlying molecular mechanisms of tissue architecture is critical in uncovering disease mechanisms and developing effective treatments. Gene expression profiling provides insight into the molecular processes underlying tissue architecture, but the process can be time-consuming and expensive. We present BLEEP (Bi-modaL Embedding for Expression Prediction), a bi-modal embedding framework capable of generating spatially resolved gene expression profiles of whole-slide Hematoxylin and eosin (H&E) stained histology images. BLEEP uses contrastive learning to construct a low-dimensional joint embedding space from a reference dataset using paired image and expression profiles at micrometer resolution. With this approach, the gene expression of any query image patch can be imputed using the expression profiles from the reference dataset. We demonstrate BLEEP’s effectiveness in gene expression prediction by benchmarking its performance on a human liver tissue dataset captured using the 10x Visium platform, where it achieves significant improvements over existing methods. Our results demonstrate the potential of BLEEP to provide insights into the molecular mechanisms underlying tissue architecture, with important implications in diagnosis and research of various diseases. The proposed approach can significantly reduce the time and cost associated with gene expression profiling, opening up new avenues for high-throughput analysis of histology images for both research and clinical applications.

AAMAS Conference 2023 Conference Paper

Think Twice: A Human-like Two-stage Conversational Agent for Emotional Response Generation

  • Yushan Qian
  • Bo Wang
  • Shangzhao Ma
  • Wu Bin
  • Shuo Zhang
  • Dongming Zhao
  • Kun Huang
  • Yuexian Hou

Towards human-like dialogue systems, current emotional dialogue approaches jointly model emotion and semantics with a unified neural network. This strategy tends to generate safe responses due to the mutual restriction between emotion and semantics, and requires the rare large-scale emotion-annotated dialogue corpus. Inspired by the "think twice" behavior in human intelligent dialogue, we propose a two-stage conversational agent for the generation of emotional dialogue. Firstly, a dialogue model trained without the emotion-annotated dialogue corpus generates a prototype response that meets the contextual semantics. Secondly, the first-stage prototype is modified by a controllable emotion refiner with the empathy hypothesis. Experimental results on the DailyDialog and EmpatheticDialogues datasets demonstrate that the proposed conversational agent outperforms the compared models in the emotion generation and maintains the semantic performance in the automatic and human evaluations.

NeurIPS Conference 2022 Conference Paper

BigBio: A Framework for Data-Centric Biomedical Natural Language Processing

  • Jason Fries
  • Leon Weber
  • Natasha Seelam
  • Gabriel Altay
  • Debajyoti Datta
  • Samuele Garda
  • Sunny Kang
  • Rosaline Su

Training and evaluating language models increasingly requires the construction of meta-datasets -- diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a variety of novel instruction tuning tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBio, a community library of 126+ biomedical NLP datasets, currently covering 13 task categories and 10+ languages. BigBio facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero shot language model evaluation. We discuss our process for task schema harmonization, data auditing, contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBio is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical

JBHI Journal 2021 Journal Article

Attention-Guided Deep Neural Network With Multi-Scale Feature Fusion for Liver Vessel Segmentation

  • Qingsen Yan
  • Bo Wang
  • Wei Zhang
  • Chuan Luo
  • Wei Xu
  • Zhengqing Xu
  • Yanning Zhang
  • Qinfeng Shi

Liver vessel segmentation is fast becoming a key instrument in the diagnosis and surgical planning of liver diseases. In clinical practice, liver vessels are normally manually annotated by clinicians on each slice of CT images, which is extremely laborious. Several deep learning methods exist for liver vessel segmentation; however, improving segmentation performance remains a major challenge due to the large variations and complex structure of liver vessels. Previous methods mainly use the existing UNet architecture, but not all features of the encoder are useful for segmentation and some even cause interference. To overcome this problem, we propose a novel deep neural network for liver vessel segmentation, called LVSNet, which employs special designs to obtain the accurate structure of the liver vessel. Specifically, we design an Attention-Guided Concatenation (AGC) module to adaptively select the useful context features from low-level features guided by high-level features. The proposed AGC module focuses on capturing rich complementary information to obtain more details. In addition, we introduce an innovative multi-scale fusion block by constructing hierarchical residual-like connections within one single residual block, which is of great importance for effectively linking the local blood vessel fragments together. Furthermore, we construct a new dataset containing 40 thin-thickness cases (0.625 mm) which consist of CT volumes and annotated vessels. To evaluate the effectiveness of the method with minor vessels, we also propose an automatic stratification method to split major and minor liver vessels. Extensive experimental results demonstrate that the proposed LVSNet outperforms previous methods on liver vessel segmentation datasets. Additionally, we conduct a series of ablation studies that comprehensively support the superiority of the underlying concepts.

JBHI Journal 2021 Journal Article

Boundary Aware U-Net for Retinal Layers Segmentation in Optical Coherence Tomography Images

  • Bo Wang
  • Wei Wei
  • Shuang Qiu
  • Shengpei Wang
  • Dan Li
  • Huiguang He

Retinal layers segmentation in optical coherence tomography (OCT) images is a critical step in the diagnosis of numerous ocular diseases. Automatic layers segmentation requires separating each individual layer instance with accurate boundary detection, but remains a challenging task since it suffers from speckle noise, intensity inhomogeneity, and low contrast around boundaries. In this work, we propose a boundary aware U-Net (BAU-Net) for retinal layers segmentation by detecting accurate boundaries. Based on an encoder-decoder architecture, we design a dual-task framework with low-level outputs for boundary detection and high-level outputs for layers segmentation. Specifically, we first use the multi-scale input strategy to enrich the spatial information in the deep features of the encoder. For low-level features from the encoder, we design an edge aware (EA) module in the skip connection to extract the pure edge features. Then, a U-structure feature enhanced (UFE) module is designed in all skip connections to enlarge the receptive fields of the features from the encoder. Besides, a canny edge fusion (CEF) module is introduced into the aforementioned architecture, which can fuse the prior edge information from the segmentation task into the boundary detection branch for a better prediction. Furthermore, we model each boundary as a vertical coordinates distribution for boundary detection. Based on this distribution, a topology guarantee loss with combined A-scan regression loss and structure loss is proposed to make an accurate and guaranteed topological boundary set. The method is evaluated on two public datasets and the results demonstrate that the BAU-Net achieves promising performance compared with other state-of-the-art methods.

JBHI Journal 2021 Journal Article

CSU-Net: A Context Spatial U-Net for Accurate Blood Vessel Segmentation in Fundus Images

  • Bo Wang
  • Shengpei Wang
  • Shuang Qiu
  • Wei Wei
  • Haibao Wang
  • Huiguang He

Blood vessel segmentation in fundus images is a critical procedure in the diagnosis of ophthalmic diseases. Recent deep learning methods achieve high accuracy in vessel segmentation but still struggle to segment the microvasculature and detect the vessel boundary. This is due to the fact that common Convolutional Neural Networks (CNNs) are unable to preserve rich spatial information and a large receptive field simultaneously. Besides, CNN models for vessel segmentation are usually trained with an equal pixel-level cross-entropy loss, which tends to miss fine vessel structures. In this paper, we propose a novel Context Spatial U-Net (CSU-Net) for blood vessel segmentation. Compared with the other U-Net based models, we design a two-channel encoder: a context channel with multi-scale convolution to capture a larger receptive field and a spatial channel with a large kernel to retain spatial information. Also, to combine and strengthen the features extracted from the two paths, we introduce a feature fusion module (FFM) and an attention skip module (ASM). Furthermore, we propose a structure loss, which adds a spatial weight to the cross-entropy loss and guides the network to focus more on the thin vessels and boundaries. We evaluated this model on three public datasets: DRIVE, CHASE-DB1 and STARE. The results show that the CSU-Net achieves higher segmentation accuracy than the current state-of-the-art methods.
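The idea of adding a spatial weight to a pixel-level cross-entropy loss can be sketched as follows. This is only an illustration of the general technique: the abstract does not say how the weight map is computed, so the function name `weighted_bce` and the hand-written weight values below are assumptions, not the paper's structure loss.

```python
import math


def weighted_bce(pred, target, weight, eps=1e-7):
    """Binary cross-entropy averaged with a per-pixel weight map.

    pred, target and weight are equally sized 2D lists; the weight map is
    a hypothetical stand-in for whatever spatial weighting a model uses.
    """
    total, norm = 0.0, 0.0
    for p_row, t_row, w_row in zip(pred, target, weight):
        for p, t, w in zip(p_row, t_row, w_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            norm += w
    return total / norm


# One thin-vessel pixel (top-left) that the prediction mostly misses.
target = [[1, 0], [0, 0]]
pred = [[0.3, 0.1], [0.1, 0.1]]

uniform = weighted_bce(pred, target, [[1, 1], [1, 1]])
emphasized = weighted_bce(pred, target, [[5, 1], [1, 1]])  # assumed weights
```

With equal weights the missed vessel pixel is diluted by the many easy background pixels; raising its weight makes the same error contribute more to the loss, which is the effect the abstract attributes to the spatial weighting.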

ICRA Conference 2021 Conference Paper

Fast Light Show Design Platform for K-12 Children

  • Pengda Mao
  • Yan Gao 0018
  • Bo Wang
  • An Yan
  • Xiaoyu Chi
  • Quan Quan

This paper presents a drone swarm light show design platform to support STEAM (science, technology, engineering, art and mathematics) education for K-12 children. With this platform, children can design a drone swarm light show easily. To this end, the architecture of this platform contains three layers: UI layer, command layer, and physical layer. The UI layer has an easy-to-use interface for children. Children can feed parameters about the light show by clicking buttons and dragging sliders of four tracks. All actions designed for the swarm in the UI layer will be generated automatically in the form of the drones' desired trajectories through the command layer. The physical layer includes a router for communication and a drone swarm for the light show. Our experimental results demonstrate that this platform works efficiently and is suitable for real STEAM education.

AAAI Conference 2021 Conference Paper

Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos

  • Yu Cheng
  • Bo Wang
  • Bo Yang
  • Robby T. Tan

Despite the recent progress, 3D multi-person pose estimation from monocular videos is still challenging due to the commonly encountered problem of missing information caused by occlusion, partially out-of-frame target persons, and inaccurate person detection. To tackle this problem, we propose a novel framework integrating graph convolutional networks (GCNs) and temporal convolutional networks (TCNs) to robustly estimate camera-centric multi-person 3D poses that does not require camera parameters. In particular, we introduce a human-joint GCN, which unlike the existing GCN, is based on a directed graph that employs the 2D pose estimator’s confidence scores to improve the pose estimation results. We also introduce a human-bone GCN, which models the bone connections and provides more information beyond human joints. The two GCNs work together to estimate the spatial frame-wise 3D poses, and can make use of both visible joint and bone information in the target frame to estimate the occluded or missing human-part information. To further refine the 3D pose estimation, we use our temporal convolutional networks (TCNs) to enforce the temporal and human-dynamics constraints. We use a joint-TCN to estimate person-centric 3D poses across frames, and propose a velocity-TCN to estimate the speed of 3D joints to ensure the consistency of the 3D pose estimation in consecutive frames. Finally, to estimate the 3D human poses for multiple persons, we propose a root-TCN that estimates camera-centric 3D poses without requiring camera parameters. Quantitative and qualitative evaluations demonstrate the effectiveness of the proposed method. Our code and models are available at https://github.com/3dpose/GnTCN.

NeurIPS Conference 2021 Conference Paper

OctField: Hierarchical Implicit Functions for 3D Modeling

  • Jia-Heng Tang
  • Weikai Chen
  • Jie Yang
  • Bo Wang
  • Songrun Liu
  • Bo Yang
  • Lin Gao

Recent advances in localized implicit functions have enabled neural implicit representation to be scalable to large scenes. However, the regular subdivision of 3D space employed by these approaches fails to take into account the sparsity of the surface occupancy and the varying granularities of geometric details. As a result, its memory footprint grows cubically with the input volume, leading to a prohibitive computational cost even at a moderately dense decomposition. In this work, we present a learnable hierarchical implicit representation for 3D surfaces, coded OctField, that allows high-precision encoding of intricate surfaces with low memory and computational budget. The key to our approach is an adaptive decomposition of 3D scenes that only distributes local implicit functions around the surface of interest. We achieve this goal by introducing a hierarchical octree structure to adaptively subdivide the 3D space according to the surface occupancy and the richness of part geometry. As octree is discrete and non-differentiable, we further propose a novel hierarchical network that models the subdivision of octree cells as a probabilistic process and recursively encodes and decodes both octree structure and surface geometry in a differentiable manner. We demonstrate the value of OctField for a range of shape modeling and reconstruction tasks, showing superiority over alternative approaches.

IJCAI Conference 2021 Conference Paper

Two-stage Training for Learning from Label Proportions

  • Jiabin Liu
  • Bo Wang
  • Xin Shen
  • Zhiquan Qi
  • YingJie Tian

Learning from label proportions (LLP) aims at learning an instance-level classifier with label proportions in grouped training data. Existing deep learning based LLP methods utilize end-to-end pipelines to obtain the proportional loss with Kullback-Leibler divergence between the bag-level prior and posterior class distributions. However, the unconstrained optimization on this objective can hardly reach a solution in accordance with the given proportions. Besides, concerning the probabilistic classifier, this strategy unavoidably results in high-entropy conditional class distributions at the instance level. These issues further degrade the performance of the instance-level classification. In this paper, we regard these problems as noisy pseudo labeling, and instead impose the strict proportion consistency on the classifier with a constrained optimization as a continuous training stage for existing LLP classifiers. In addition, we introduce the mixup strategy and symmetric cross-entropy to further reduce the label noise. Our framework is model-agnostic, and demonstrates compelling performance improvement in extensive experiments, when incorporated into other deep LLP models as a post-hoc phase.
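The bag-level proportional loss described above can be sketched in a few lines. This is a minimal illustration, assuming the bag posterior is the mean of the instance-level class probabilities; the function names and the example numbers are hypothetical, and the constrained second training stage proposed in the paper is not shown.

```python
import math


def bag_posterior(instance_probs):
    """Average the instance-level class probabilities within one bag."""
    n = len(instance_probs)
    k = len(instance_probs[0])
    return [sum(p[c] for p in instance_probs) / n for c in range(k)]


def proportion_kl(prior, instance_probs, eps=1e-12):
    """KL divergence from the given label proportions to the bag posterior."""
    post = bag_posterior(instance_probs)
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(prior, post) if p > 0)


# A bag of four instances with fairly uncertain two-class predictions.
bag = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5]]

loss_match = proportion_kl([0.5, 0.5], bag)  # averaged prediction matches
loss_off = proportion_kl([0.9, 0.1], bag)    # averaged prediction is far off
```

In this toy bag the loss is essentially zero for the proportions `[0.5, 0.5]` even though several instance predictions are close to uniform, which illustrates why matching bag-level proportions alone does not force confident instance-level predictions.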

AAAI Conference 2020 Conference Paper

3D Human Pose Estimation Using Spatio-Temporal Networks with Explicit Occlusion Training

  • Yu Cheng
  • Bo Yang
  • Bo Wang
  • Robby T. Tan

Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress that has been made in the recent years. Generally, the performance of existing methods drops when the target person is too small/large, or the motion is too fast/slow relative to the scale and speed of the training data. Moreover, to our knowledge, many of these methods are not designed or trained under severe occlusion explicitly, making their performance on handling occlusion compromised. Addressing these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear in different scales and have various motion speeds, we apply multi-scale spatial features for 2D joints or keypoints prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and becomes robust to various degrees of occlusion. As there are limited 3D ground truth data, we further utilize 2D video data to inject a semisupervised learning capability to our network. Experiments on public data sets validate the effectiveness of our method, and our ablation studies show the strengths of our network’s individual submodules.
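The explicit occlusion training mentioned above, where some keypoints are masked out during training, might look roughly like the sketch below. The masking scheme is an assumption for illustration: the abstract does not specify how keypoints are selected or what value replaces them, so random selection, zeroed coordinates, and the name `mask_keypoints` are all hypothetical.

```python
import random


def mask_keypoints(keypoints, mask_ratio, rng):
    """Hide a random subset of 2D keypoints to imitate occlusion.

    Returns the masked keypoints and a 0/1 visibility flag per joint.
    Zeroing hidden joints is an assumed convention, not the paper's.
    """
    n_mask = round(mask_ratio * len(keypoints))
    hidden = set(rng.sample(range(len(keypoints)), n_mask))
    masked = [(0.0, 0.0) if i in hidden else kp
              for i, kp in enumerate(keypoints)]
    visible = [0 if i in hidden else 1 for i in range(len(keypoints))]
    return masked, visible


rng = random.Random(0)
pose = [(float(i), float(2 * i)) for i in range(17)]  # 17 dummy joints

# Sweep from minor to severe occlusion.
light, light_vis = mask_keypoints(pose, 0.1, rng)
heavy, heavy_vis = mask_keypoints(pose, 0.6, rng)
```

Varying `mask_ratio` across training samples gives the range from minor to severe occlusion that the abstract describes.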

AAAI Conference 2020 Conference Paper

Diversity Transfer Network for Few-Shot Learning

  • Mengting Chen
  • Yuxin Fang
  • Xinggang Wang
  • Heng Luo
  • Yifeng Geng
  • Xinyu Zhang
  • Chang Huang
  • Wenyu Liu

Few-shot learning is a challenging task that aims at training a classifier for unseen classes with only a few training examples. The main difficulty of few-shot learning lies in the lack of intra-class diversity within insufficient training samples. To alleviate this problem, we propose a novel generative framework, Diversity Transfer Network (DTN), that learns to transfer latent diversities from known categories and composite them with support features to generate diverse samples for novel categories in feature space. The learning problem of the sample generation (i.e., diversity transfer) is solved via minimizing an effective meta-classification loss in a single-stage network, instead of the generative loss in previous works. Besides, an organized auxiliary task co-training over known categories is proposed to stabilize the meta-training process of DTN. We perform extensive experiments and ablation studies on three datasets, i.e., miniImageNet, CIFAR100 and CUB. The results show that DTN, with single-stage training and faster convergence speed, obtains the state-of-the-art results among the feature generation based few-shot learning methods. Code and supplementary material are available at: https://github.com/Yuxin-CV/DTN.

AAAI Conference 2020 Conference Paper

Progressive Feature Polishing Network for Salient Object Detection

  • Bo Wang
  • Quan Chen
  • Min Zhou
  • Zhiqiang Zhang
  • Xiaogang Jin
  • Kun Gai

Feature matters for salient object detection. Existing methods mainly focus on designing a sophisticated structure to incorporate multi-level features and filter out cluttered features. We present Progressive Feature Polishing Network (PFPN), a simple yet effective framework to progressively polish the multi-level features to be more accurate and representative. By employing multiple Feature Polishing Modules (FPMs) in a recurrent manner, our approach is able to detect salient objects with fine details without any post-processing. An FPM updates the features of each level in parallel by directly incorporating all higher level context information. Moreover, it can keep the dimensions and hierarchical structures of the feature maps, which makes it flexible to be integrated with any CNN-based models. Empirical experiments show that our results improve monotonically with an increasing number of FPMs. Without bells and whistles, PFPN outperforms the state-of-the-art methods significantly on five benchmark datasets under various evaluation metrics. Our code is available at: https://github.com/chenquan-cq/PFPN.

NeurIPS Conference 2019 Conference Paper

Learning from Label Proportions with Generative Adversarial Networks

  • Jiabin Liu
  • Bo Wang
  • Zhiquan Qi
  • YingJie Tian
  • Yong Shi

In this paper, we leverage generative adversarial networks (GANs) to derive an effective algorithm LLP-GAN for learning from label proportions (LLP), where only the bag-level proportional information in labels is available. Endowed with end-to-end structure, LLP-GAN performs approximation in the light of an adversarial learning mechanism, without imposing restricted assumptions on distribution. Accordingly, we can directly induce the final instance-level classifier upon the discriminator. Under mild assumptions, we give the explicit generative representation and prove the global optimality for LLP-GAN. Additionally, compared with existing methods, our work empowers LLP solver with capable scalability inheriting from deep models. Several experiments on benchmark datasets demonstrate vivid advantages of the proposed approach.
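
To make the learning-from-label-proportions setting concrete, the sketch below computes a bag-level loss: the cross-entropy between a bag's known class proportions and the average of the instance-level predicted probabilities. This is a common formulation for LLP in general and is an assumption here; it is not claimed to be the LLP-GAN objective, and the name `bag_proportion_loss` is hypothetical.

```python
import math


def bag_proportion_loss(instance_probs, bag_proportions, eps=1e-12):
    """Cross-entropy between known bag proportions and mean predictions.

    instance_probs: list of per-instance class-probability lists for one bag.
    bag_proportions: known fraction of each class in the bag (sums to 1).
    eps guards against log(0); its value is an illustrative choice.
    """
    n = len(instance_probs)
    k = len(bag_proportions)
    # Average the instance-level predictions to get a predicted proportion.
    mean_probs = [sum(p[c] for p in instance_probs) / n for c in range(k)]
    return -sum(q * math.log(m + eps) for q, m in zip(bag_proportions, mean_probs))


# A bag known to be half class 0 and half class 1.
matched = bag_proportion_loss([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
mismatched = bag_proportion_loss([[1.0, 0.0], [1.0, 0.0]], [0.5, 0.5])
```

The loss is lower when the averaged predictions agree with the bag's proportions, which is the only supervision available when instance labels are hidden.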

IS Journal 2019 Journal Article

Political Homophily in Independence Movements: Analyzing and Classifying Social Media Users by National Identity

  • Arkaitz Zubiaga
  • Bo Wang
  • Maria Liakata
  • Rob Procter

Social media and data mining are increasingly being used to analyze political and societal issues. Here, we undertake the classification of social media users as supporting or opposing ongoing independence movements in their territories. Independence movements occur in territories whose citizens have conflicting national identities; users with opposing national identities will then support or oppose the sense of being part of an independent nation that differs from the officially recognized country. We describe a methodology that relies on users’ self-reported location to build large-scale datasets for three territories—Catalonia, the Basque Country, and Scotland. An analysis of these datasets shows that homophily plays an important role in determining who people connect with, as users predominantly choose to follow and interact with others from the same national identity. We show that a classifier relying on users’ follow networks can achieve accurate, language-independent classification performances ranging from 85% to 97% for the three territories.
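
As an illustration of how homophily in a follow network might be quantified, the sketch below computes the fraction of follow edges that connect users with the same national-identity label. The function name, the edge-list representation, and the decision to skip edges with an unlabeled endpoint are assumptions for illustration; the abstract does not state how the authors measured homophily.

```python
def same_identity_edge_fraction(edges, identity):
    """Fraction of follow edges whose endpoints share an identity label.

    edges: iterable of (follower, followee) pairs.
    identity: dict mapping user -> label; users missing from it are unlabeled.
    Returns None when no edge has both endpoints labeled.
    """
    labeled = [(u, v) for u, v in edges if u in identity and v in identity]
    if not labeled:
        return None
    same = sum(1 for u, v in labeled if identity[u] == identity[v])
    return same / len(labeled)


# Hypothetical toy network with two identity labels, "X" and "Y".
follows = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
labels = {"a": "X", "b": "X", "c": "Y"}
fraction = same_identity_edge_fraction(follows, labels)
```

A value well above what random mixing of the labels would produce is the kind of signal that would let a classifier infer a user's identity from whom they follow.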

AAAI Conference 2018 Conference Paper

Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

  • Bo Wang
  • Youjiang Xu
  • Yahong Han
  • Richang Hong

Movies provide us with a mass of visual content as well as attracting stories. Existing methods have illustrated that understanding movie stories through only visual content is still a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content by the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. Particularly, we firstly extract words and sentences from the training movie subtitles. Then the hierarchically formed movie representations, which are learned from LMN, not only encode the correspondence between words and visual content inside frames, but also encode the temporal alignment between sentences and frames inside movie clips. We also extend our LMN model into three variant frameworks to illustrate the good extendable capabilities. We conduct extensive experiments on the MovieQA dataset. With only visual content as inputs, LMN with frame-level representation obtains a large performance improvement. When incorporating subtitles into LMN to form the clip-level representation, we achieve the state-of-the-art performance on the online evaluation task of ‘Video+Subtitles’. The good performance successfully demonstrates that the proposed framework of LMN is effective and the hierarchically formed movie representations have good potential for the applications of movie question answering.

NeurIPS Conference 2016 Conference Paper

Unsupervised Learning from Noisy Networks with Applications to Hi-C Data

  • Bo Wang
  • Junjie Zhu
  • Armin Pourshafeie
  • Oana Ursu
  • Serafim Batzoglou
  • Anshul Kundaje

Complex networks play an important role in a plethora of disciplines in natural sciences. Cleaning up noisy observed networks poses an important challenge in network analysis. Existing methods utilize labeled data to alleviate the noise effect in the network. However, labeled data is usually expensive to collect while unlabeled data can be gathered cheaply. In this paper, we propose an optimization framework to mine useful structures from noisy networks in an unsupervised manner. The key feature of our optimization framework is its ability to utilize local structures as well as global patterns in the network. We extend our method to incorporate multi-resolution networks in order to add further resistance to high levels of noise. We also generalize our framework to utilize partial labels to enhance the performance. We specifically focus our method on multi-resolution Hi-C data by recovering clusters of genomic regions that co-localize in 3D space. Additionally, we use Capture-C-generated partial labels to further denoise the Hi-C network. We empirically demonstrate the effectiveness of our framework in denoising the network and improving community detection results.

YNIMG Journal 2013 Journal Article

Noninvasive real time tomographic imaging of epileptic foci and networks

  • Liangzhong Xiang
  • Lijun Ji
  • Tao Zhang
  • Bo Wang
  • Jianjun Yang
  • Qizhi Zhang
  • Max S. Jiang
  • Junli Zhou

While brain imaging and electrophysiology play a central role in neuroscience research and in the evaluation of neurological disorders, a single noninvasive modality that offers both high spatial and temporal resolution is currently not available. Here we show in an acute epilepsy rat model that photoacoustic tomography (PAT) can noninvasively track seizure brain dynamics with both high spatial and temporal resolution, and at a depth that is clinically relevant. The noninvasive yet whole surface and depth capabilities of the PAT system allowed us to actually see what is happening during ictogenesis in terms of seizure onset and spread. Both seizure onset and propagation were tomographically detected at a spatial resolution of 150 μm and a temporal resolution of 300 ms, respectively. The current study lends support to the theory that seizure onset and spread involves a rich interplay between multiple cortical and subcortical brain areas during the onset and spread of epileptic seizures. Dynamical changes of vasculature during epileptiform events were also detected with high spatiotemporal resolution. Together, these findings suggest that PAT represents a powerful tool for noninvasively mapping seizure onset and propagation patterns, and the ‘functional’ connectivity within epileptic brain networks.

YNICL Journal 2012 Journal Article

Neuroimaging of structural pathology and connectomics in traumatic brain injury: Toward personalized outcome prediction

  • Andrei Irimia
  • Bo Wang
  • Stephen R. Aylward
  • Marcel W. Prastawa
  • Danielle F. Pace
  • Guido Gerig
  • David A. Hovda
  • Ron Kikinis

Recent contributions to the body of knowledge on traumatic brain injury (TBI) favor the view that multimodal neuroimaging using structural and functional magnetic resonance imaging (MRI and fMRI, respectively) as well as diffusion tensor imaging (DTI) has excellent potential to identify novel biomarkers and predictors of TBI outcome. This is particularly the case when such methods are appropriately combined with volumetric/morphometric analysis of brain structures and with the exploration of TBI-related changes in brain network properties at the level of the connectome. In this context, our present review summarizes recent developments on the roles of these two techniques in the search for novel structural neuroimaging biomarkers that have TBI outcome prognostication value. The themes being explored cover notable trends in this area of research, including (1) the role of advanced MRI processing methods in the analysis of structural pathology, (2) the use of brain connectomics and network analysis to identify outcome biomarkers, and (3) the application of multivariate statistics to predict outcome using neuroimaging metrics. The goal of the review is to draw the community's attention to these recent advances on TBI outcome prediction methods and to encourage the development of new methodologies whereby structural neuroimaging can be used to identify biomarkers of TBI outcome.