Arrow Research search

Author name cluster

Sen Wang

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches; it is not a full identity-disambiguation profile.

31 papers
2 author rows

Possible papers

31

EAAI Journal 2026 Journal Article

A printed circuit board surface defect detection method for long-tail and multi-scale scenarios

  • Xuangang Li
  • Sen Wang
  • Liying Zhu
  • Aiping Shen
  • Dianlu Hu

Surface defects on Printed Circuit Boards (PCBs) in industrial production occur randomly, with uneven category distributions, variable scales, and minute dimensions, increasing the difficulty of quality inspection. To achieve multi-scale defect detection on PCB surfaces in long-tail, small-target scenarios, the Long-Tail Dynamic Multi-Scale Printed Circuit Board (LDM-PCB) detection approach proposed in this paper employs the Long-Tail Feature Extraction Network (LTFE-Net) as the backbone, enhances the representation of tail defects through the Adaptive Tail Attention (ATA) module, improves the model's ability to quickly capture low-frequency defect features, and effectively addresses the imbalance in feature learning under long-tail data distributions. The Dynamic Multi-Scale Fusion (DMS-Fuse) architecture dynamically adjusts feature-fusion weights for defects of varying sizes through adaptive weighting strategies, enabling feature interaction across scales. A dynamic prediction layer is designed to preserve high-resolution defect features, directly outputting dynamic information to mitigate detail degradation in deep networks and improve localization accuracy for subtle defects. On a self-built long-tail defect dataset, LDM-PCB achieves 99.1% mean Average Precision at an Intersection-over-Union threshold of 0.5 (mAP0.5) with only 8.61 million (M) parameters, surpassing baseline models by 1.8 percentage points. The detection speed reaches 100 frames per second (FPS), balancing accuracy and speed with results superior to other algorithms. Generalization experiments on public PCB datasets further demonstrate the optimal performance of LDM-PCB, and deployment results on edge devices indicate industrial deployment potential.

EAAI Journal 2026 Journal Article

Zero-velocity update-aided navigation method for miniature quadruped robot based on adapted virtual inertial measurement unit

  • Siwei Tang
  • Weixing Qian
  • Sen Wang
  • Feng Yang
  • Xinyuan Wang
  • Weinan Gao
  • Pengyu Liu

Addressing the challenges associated with installing inertial measurement units (IMUs) on the feet of miniature quadruped robots, this paper proposes a zero-velocity update (ZUPT) method based on an adaptive virtual inertial measurement unit (VIMU). This approach eliminates the reliance of existing ZUPT methods for inertial navigation systems on foot-mounted IMUs and gait recognition. Using the IMU outputs from the legs and feet of a quadruped robot as the training dataset, an innovative Convolutional Neural Network (CNN)-Bidirectional Gated Recurrent Unit (BiGRU)-Attention hybrid network is constructed to establish a nonlinear mapping between the multiple IMUs. In practical applications, the foot-mounted VIMU can be generated solely from the leg-mounted IMU data, and the corrected navigation parameters are then output through a ZUPT algorithm to achieve accurate positioning of the quadruped robot. Experimental results demonstrate that the positioning error of this method is about 1.34% of the total path under diverse terrain conditions, including slopes, stairs, and grasslands, outperforming gait-recognition-dependent methods in accuracy. This approach effectively implements the inertial navigation function of quadruped robots and enhances the adaptability of the ZUPT method to unstructured and unknown terrains. It has great potential to improve the Global Navigation Satellite System (GNSS)-denied positioning performance of quadruped robots in complex environments without the assistance of visual sensors or Light Detection and Ranging (LiDAR).

EAAI Journal 2025 Journal Article

A lightweight vision transformer with embedded hybrid attention for quick response code defect classification

  • Dianlu Hu
  • Lun Zhao
  • Yu Ren
  • Sen Wang
  • Xuanlin Ye
  • Haohan Zhang
  • Changqing Peng

Quick Response (QR) code label printing quality is crucial to product quality control. Automated visual inspection faces challenges due to the limited number of defect samples, unclear features, and the need to detect a large number of labels in real time. For efficient and accurate automated visual defect recognition in printed QR code production, we propose a lightweight Vision Transformer network, the Vision Transformer with Embedded Hybrid Attention (ViT-EHA). First, the Mixed Depthwise Convolution Block (MDConvBlock) is introduced to capture QR code defect details and feature information while also reducing the number of model parameters and the computational cost. Furthermore, the LeAttention-Local Convolution-Multilayer Perceptron (LeALCM) module is proposed to enhance the model's ability to capture global information and improve the recognition of minor defects. Finally, a hybrid attention (HA) module is integrated to enhance the processing of low-level image features and strengthen the interplay between shallow and deep features. Experimental results verify the validity and generalization of the model: the proposed ViT-EHA achieves an accuracy of 99.00% with a parameter count of 4.198 million (M) on the self-constructed dataset Code-10 (QR Code Dataset with 10 Classes), and accuracies of 98.33% and 97.73% on the public datasets NEU-CLS (Northeastern University Classification Dataset) and NEU-CLS-64 (Northeastern University Classification Dataset with 64 × 64 images), respectively.

IROS Conference 2025 Conference Paper

An Inflatable Deployable Origami Grasper for Adaptive and High-Load Grasping

  • Peng Yan
  • Guang Liang
  • Sen Wang
  • Hailin Huang
  • Wei Wang
  • Xu Li
  • Bing Li

Robotic graspers are essential for enhancing the efficiency and versatility of robots in grasping tasks. In this paper, we propose a novel inflatable deployable origami grasper with a rigid-flexible coupling structure. The proposed grasper can achieve multiple deployment configurations under a single pneumatic actuation, enabling both deployment and grasping operations while also allowing for passive self-folding during deflation. The design and fabrication of the grasper are presented. Then, the stiffness model for the inflatable deployable origami unit is developed based on the equivalent truss method. Experimental results show that the grasper successfully grasps objects of various shapes and sizes in both enveloping and fingertip grasping modes, using either two or four fingers. With its simple mechanical system and high deploy/fold ratio, the proposed grasper holds significant potential for applications in industrial automation and space exploration.

NeurIPS Conference 2025 Conference Paper

DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

  • Jingyi Tian
  • Le Wang
  • Sanping Zhou
  • Sen Wang
  • Gang Hua

Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.

IJCAI Conference 2025 Conference Paper

Multimodal Retina Image Analysis Survey: Datasets, Tasks and Methods

  • Hongwei Sheng
  • Heming Du
  • Xin Shen
  • Sen Wang
  • Xin Yu

Retina images provide a noninvasive view of the central nervous system and microvasculature, making them essential for clinical applications. Changes in the retina often indicate both ophthalmic and systemic diseases, aiding in diagnosis and early intervention. While deep learning algorithms have advanced retina image analysis, a comprehensive review of related datasets, tasks, and benchmarking is still lacking. In this survey, we systematically categorize existing retina image datasets based on their available data modalities, and review the tasks these datasets support in multimodal retina image analysis. We also explain key evaluation metrics used in various retina image analysis benchmarks. By thoroughly examining current datasets and methods, we highlight the challenges and limitations in existing benchmarks and discuss potential research topics in the field. We hope this work will guide future retina analysis methods and promote the shared use of existing data across different tasks.

NeurIPS Conference 2025 Conference Paper

SAMPO: Scale-wise Autoregression with Motion Prompt for Generative World Models

  • Sen Wang
  • Jingyi Tian
  • Le Wang
  • Zhimin Liao
  • Huaiyi Dong
  • Kun Xia
  • Sanping Zhou
  • Wei Tang

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose Scale-wise Autoregression with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4× faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

EAAI Journal 2025 Journal Article

Subtle Defect Detection Network: More accurately detect subtle defects on the Printed Circuit Board surface

  • Liying Zhu
  • Sen Wang
  • Mingfang Chen
  • Yang Zhu
  • Kaizhe Xing
  • Aiping Shen

Printed circuit boards (PCBs) are the hardware foundation of large-scale integrated circuits, where surface quality inspection plays a critical role in manufacturing reliability. We propose a subtle defect detection network (SDD-Net) to address the difficulties of PCB surface defect detection, such as complex backgrounds, difficulty in distinguishing foreground from background, and the random shape, area, and position of defects. A lightweight receptive field augmentation network (LRFA-Net) is proposed as the backbone; it effectively augments the receptive field, reduces parameters, and enhances feature extraction. A more lightweight multi-scale feature and coordinate information interaction mechanism is designed to enhance the network's capacity to discern small targets against complex backgrounds. The combination of Varifocal Loss and Complete Intersection over Union (CIoU) Loss addresses the foreground-background distinction and adapts to PCB surface defects with variable shapes and positions. A lightweight omni-dimensional dynamic convolutional prediction head (OD-Head) introduces multi-dimensional attention to effectively perceive small defects on the PCB surface. Compared with other algorithms, SDD-Net achieves a mean average precision (mAP0.5) of 99.6% on the PCB Defect Augmented dataset at a detection speed of 53 frames per second, balancing accuracy and speed with results better than other algorithms. SDD-Net is also experimentally verified on a real PCB surface welding defect dataset, where it likewise achieves the best detection performance.

NeurIPS Conference 2025 Conference Paper

When One Moment Isn't Enough: Multi-Moment Retrieval with Cross-Moment Interactions

  • Zhuo Cao
  • Heming Du
  • Bingqing Zhang
  • Xin Yu
  • Xue Li
  • Sen Wang

Existing moment retrieval (MR) methods focus on Single-Moment Retrieval (SMR). However, in real-world applications one query can correspond to multiple relevant moments, making existing datasets and methods insufficient for video temporal grounding. Revisiting the gap between current MR tasks and real-world applications, we introduce a high-quality dataset called the QVHighlights Multi-Moment Dataset (QV-M$^2$), along with new evaluation metrics tailored for multi-moment retrieval (MMR). QV-M$^2$ consists of 2,212 annotations covering 6,384 video segments. Building on existing efforts in MMR, we propose a framework called FlashMMR. Specifically, we propose a multi-moment post-verification module to refine moment boundaries: we introduce constrained temporal adjustment and subsequently leverage a verification module to re-evaluate the candidate segments. Through this filtering pipeline, low-confidence proposals are pruned and robust multi-moment alignment is achieved. We retrain and evaluate 6 existing MR methods on QV-M$^2$ and QVHighlights under both SMR and MMR settings. Results show that QV-M$^2$ serves as an effective benchmark for training and evaluating MMR models, while FlashMMR provides a strong baseline. Specifically, on QV-M$^2$ it improves over the prior SOTA method by 3.00% on G-mAP, 2.70% on mAP@3+tgt, and 2.56% on mR@3. The proposed benchmark and method establish a foundation for advancing research in more realistic and challenging video temporal grounding scenarios. Code is released at https://github.com/Zhuo-Cao/QV-M2.

IJCAI Conference 2024 Conference Paper

Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts

  • Haodong Hong
  • Sen Wang
  • Zi Huang
  • Qi Wu
  • Jiajun Liu

Current Vision-and-Language Navigation (VLN) tasks mainly employ textual instructions to guide agents. However, being inherently abstract, the same textual instruction can be associated with different visual signals, causing severe ambiguity and limiting the transfer of prior knowledge in the vision domain from the user to the agent. To fill this gap, we propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task augmenting traditional VLN by integrating both natural language and images in instructions. VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts. Possible forms of visual prompts include both exact and similar object images, providing adaptability and versatility in diverse navigation scenarios. To evaluate VLN-MP under a unified framework, we implement a new benchmark that offers: (1) a training-free pipeline to transform textual instructions into multi-modal forms with landmark images; (2) diverse datasets with multi-modal instructions for different downstream tasks; (3) a novel module designed to process various image prompts for seamless integration with state-of-the-art VLN models. Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts significantly boosts navigation performance. While maintaining efficiency with text-only prompts, VLN-MP enables agents to navigate in the pre-explore setting and outperform text-based models, showing its broader applicability. Code is available at https://github.com/honghd16/VLN-MP.

ICRA Conference 2023 Conference Paper

BAMF-SLAM: Bundle Adjusted Multi-Fisheye Visual-Inertial SLAM Using Recurrent Field Transforms

  • Wei Zhang 0334
  • Sen Wang
  • Xingliang Dong
  • Rongwei Guo
  • Norbert Haala

In this paper, we present BAMF-SLAM, a novel multi-fisheye visual-inertial SLAM system that utilizes Bundle Adjustment (BA) and recurrent field transforms (RFT) to achieve accurate and robust state estimation in challenging scenarios. First, our system directly operates on raw fisheye images, enabling us to fully exploit the wide Field-of-View (FoV) of fisheye cameras. Second, to overcome the low-texture challenge, we explore the tightly-coupled integration of multi-camera inputs and complementary inertial measurements via a unified factor graph and jointly optimize the poses and dense depth maps. Third, for global consistency, the wide FoV of the fisheye camera allows the system to find more potential loop closures, and powered by the broad convergence basin of RFT, our system can perform very wide baseline loop closing with little overlap. Furthermore, we introduce a semi-pose-graph BA method to avoid the expensive full global BA. By combining relative pose factors with loop closure factors, the global states can be adjusted efficiently with a modest memory footprint while maintaining high accuracy. Evaluations on the TUM-VI, Hilti-Oxford and Newer College datasets show the superior performance of the proposed system over prior works. In the Hilti SLAM Challenge 2022, our VIO version achieves second place. In a subsequent submission, our complete system, including the global BA backend, outperforms the winning approach.

ECAI Conference 2023 Conference Paper

MonoSKD: General Distillation Framework for Monocular 3D Object Detection via Spearman Correlation Coefficient

  • Sen Wang
  • Jin Zheng

Monocular 3D object detection is an inherently ill-posed problem, as it is challenging to predict accurate 3D localization from a single image. Existing monocular 3D detection knowledge distillation methods usually project the LiDAR onto the image plane and train the teacher network accordingly. Transferring LiDAR-based model knowledge to RGB-based models is more complex, so a general distillation strategy is needed. To alleviate the cross-modal problem, we propose MonoSKD, a novel knowledge distillation framework for monocular 3D detection based on the Spearman correlation coefficient, which learns the relative correlation between cross-modal features. Considering the large gap between these features, strict alignment of features may mislead the training, so we propose a looser Spearman loss. Furthermore, by selecting appropriate distillation locations and removing redundant modules, our scheme saves more GPU resources and trains faster than existing methods. Extensive experiments verify the effectiveness of our framework on the challenging KITTI 3D object detection benchmark. Our method achieves state-of-the-art performance as of submission, with no additional inference computational cost. Our code is available at https://github.com/Senwang98/MonoSKD.
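As a sketch of the core idea, a Spearman-style loss compares the rank ordering of student and teacher features rather than their raw values, so it tolerates scale differences between modalities. The function name and NumPy rank trick below are illustrative, not the paper's implementation, which would additionally need a differentiable ranking to train with:

```python
import numpy as np

def spearman_loss(student_feat, teacher_feat):
    """Illustrative Spearman-style distillation loss: 1 minus the rank
    correlation between flattened student and teacher feature vectors."""
    s = np.asarray(student_feat, dtype=float).ravel()
    t = np.asarray(teacher_feat, dtype=float).ravel()

    def ranks(x):
        # Assign each value its position in the sorted order (no tie handling).
        r = np.empty_like(x)
        r[np.argsort(x)] = np.arange(len(x), dtype=float)
        return r

    # Pearson correlation of the ranks equals the Spearman correlation.
    rho = np.corrcoef(ranks(s), ranks(t))[0, 1]
    return 1.0 - rho

# Perfectly rank-aligned features give zero loss despite different scales.
print(spearman_loss([0.1, 0.5, 0.9], [10, 50, 90]))  # → 0.0
```

Because only the ordering matters, the loss is a "looser" alignment target than an L2 match between raw cross-modal features.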

NeurIPS Conference 2023 Conference Paper

RVD: A Handheld Device-Based Fundus Video Dataset for Retinal Vessel Segmentation

  • Md Wahiduzzaman Khan
  • Hongwei Sheng
  • Hu Zhang
  • Heming Du
  • Sen Wang
  • Minas Coroneo
  • Farshid Hajati
  • Sahar Shariflou

Retinal vessel segmentation is generally grounded in image-based datasets collected with bench-top devices. The static images naturally lose the dynamic characteristics of retina fluctuation, resulting in diminished dataset richness, and the usage of bench-top devices further restricts dataset scalability due to their limited accessibility. Considering these limitations, we introduce the first video-based retinal dataset, acquired with handheld devices. The dataset comprises 635 smartphone-based fundus videos collected from four different clinics, involving 415 patients aged 50 to 75. It delivers comprehensive and precise annotations of retinal structures in both spatial and temporal dimensions, aiming to advance the landscape of vasculature segmentation. Specifically, the dataset provides three levels of spatial annotations: binary vessel masks for overall retinal structure delineation, general vein-artery masks for distinguishing veins from arteries, and fine-grained vein-artery masks for further characterizing the granularities of each artery and vein. In addition, the dataset offers temporal annotations that capture vessel pulsation characteristics, assisting in detecting ocular diseases that require fine-grained recognition of hemodynamic fluctuation. In application, our dataset exhibits a significant domain shift with respect to data captured by bench-top devices, posing great challenges to existing methods. Thanks to its rich annotations and data scale, our dataset potentially paves the way for more advanced retinal analysis and accurate disease diagnosis. In the experiments, we provide evaluation metrics and benchmark results on our dataset, reflecting both the potential and the challenges it offers for vessel segmentation tasks. We hope this challenging dataset will significantly contribute to the development of eye disease diagnosis and early prevention.

NeurIPS Conference 2022 Conference Paper

Improved Feature Distillation via Projector Ensemble

  • Yudong Chen
  • Sen Wang
  • Jiajun Liu
  • Xuwei Xu
  • Frank de Hoog
  • Zi Huang

In knowledge distillation, previous feature distillation methods mainly focus on the design of loss functions and the selection of the distilled layers, while the effect of the feature projector between the student and the teacher remains under-explored. In this paper, we first discuss a plausible mechanism of the projector with empirical evidence and then propose a new feature distillation method based on a projector ensemble for further performance improvement. We observe that the student network benefits from a projector even if the feature dimensions of the student and the teacher are the same. Training a student backbone without a projector can be considered a multi-task learning process: achieving discriminative feature extraction for classification and feature matching between the student and the teacher for distillation at the same time. We hypothesize and empirically verify that without a projector, the student network tends to overfit the teacher's feature distributions despite having a different architecture and weight initialization. This degrades the quality of the student's deep features that are eventually used in classification. Adding a projector, on the other hand, disentangles the two learning tasks and helps the student network focus on the main feature extraction task while still being able to utilize teacher features as guidance through the projector. Motivated by the positive effect of the projector in feature distillation, we propose an ensemble of projectors to further improve the quality of student features. Experimental results on different datasets with a series of teacher-student pairs illustrate the effectiveness of the proposed method. Code is available at https://github.com/chenyd7/PEFD.
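The projector-ensemble idea can be sketched as mapping a student feature through several projectors and averaging the outputs before feature matching. The dimensions and the plain linear projectors below are hypothetical placeholders for illustration, not the paper's trained modules:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_with_ensemble(student_feat, projectors):
    """Average the outputs of several projectors applied to one student
    feature, mimicking an ensemble-of-projectors distillation setup."""
    outs = [W @ student_feat for W in projectors]
    return np.mean(outs, axis=0)

# Hypothetical sizes: 64-dim student features mapped into the teacher's
# 128-dim space by an ensemble of 3 (here random, untrained) projectors.
d_student, d_teacher, n_proj = 64, 128, 3
projectors = [rng.standard_normal((d_teacher, d_student)) for _ in range(n_proj)]
student_feat = rng.standard_normal(d_student)
teacher_feat = rng.standard_normal(d_teacher)

projected = project_with_ensemble(student_feat, projectors)
# A simple L2 feature-matching loss between projected student and teacher.
loss = np.sum((projected - teacher_feat) ** 2)
print(projected.shape)  # → (128,)
```

In training, each projector would be a small learned network and the averaged output would be matched to the teacher feature by the distillation loss.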

IJCAI Conference 2021 Conference Paper

Self-Supervised Adversarial Distribution Regularization for Medication Recommendation

  • Yanda Wang
  • Weitong Chen
  • Dechang Pi
  • Lin Yue
  • Sen Wang
  • Miao Xu

Medication recommendation is a significant healthcare application due to its promise in effectively prescribing medications. Avoiding fatal side effects related to Drug-Drug Interaction (DDI) is among the critical challenges. Most existing methods try to mitigate the problem by providing models with extra DDI knowledge, making them complicated, while treating all patients with different DDI properties as a single cohort places strict demands on models' generalization performance. In pursuit of a valuable model for safe recommendation, we propose the Self-Supervised Adversarial Regularization Model for Medication Recommendation (SARMR). SARMR obtains the target distribution associated with safe medication combinations from raw patient records for adversarial regularization. In this way, the model can shape the distributions of patient representations to achieve DDI reduction. To obtain accurate self-supervision information, SARMR models interactions between physicians and patients by building a key-value memory neural network and carrying out multi-hop reading to obtain contextual information for patient representations. SARMR outperforms all baseline methods in experiments on a real-world clinical dataset. The model achieves DDI reduction across different numbers of DDI types, demonstrating the robustness of adversarial regularization for safe medication recommendation.

AAAI Conference 2020 Conference Paper

Adaptive Two-Dimensional Embedded Image Clustering

  • Zhihui Li
  • Lina Yao
  • Sen Wang
  • Salil Kanhere
  • Xue Li
  • Huaxiang Zhang

With the rapid development of mobile devices, people generate huge volumes of image data every day for sharing on social media, drawing much research attention to understanding the contents of images. Image clustering plays an important role in image understanding systems. Most existing image clustering algorithms flatten digital images, originally represented as matrices, into 1D vectors as the image representation for subsequent learning. The drawbacks of vector-based algorithms include limited consideration of the spatial relationships between pixels and high computational complexity, both attributable to the simple vectorized representation. To overcome these drawbacks, we propose a novel image clustering framework that works directly on image matrices instead of flattened vectors. Specifically, the proposed algorithm simultaneously learns the clustering results and preserves the original correlation information within the image matrix. To solve the challenging objective function, we propose a fast iterative solution. Extensive experiments on various benchmark datasets confirm the superiority of the proposed algorithm.

AAAI Conference 2020 Conference Paper

One-Shot Learning for Long-Tail Visual Relation Detection

  • Weitao Wang
  • Meng Wang
  • Sen Wang
  • Guodong Long
  • Lina Yao
  • Guilin Qi
  • Yang Chen

The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene and how they relate to each other, in the form of <subject, predicate, object> triplets. This ability is vital for image captioning, visual question answering, and many other applications. However, visual relationships have long-tailed distributions and, thus, the limited availability of training samples is hampering the practicability of conventional detection approaches. With this in mind, we designed a novel model for visual relation detection that works in one-shot settings. The embeddings of objects and predicates are extracted through a network that includes a feature-level attention mechanism. Attention alleviates some of the problems with feature sparsity, and the resulting representations capture more discriminative latent features. The core of our model is a dual graph neural network that passes and aggregates the context information of predicates and objects in an episodic training scheme to improve recognition of the one-shot predicates and then generate the triplets. To the best of our knowledge, we are the first to center on the viability of one-shot learning for visual relation detection. Extensive experiments on two newly-constructed datasets show that our model significantly improved performance on the two tasks PredCls and SGCls, by 2.8% to 12.2% compared with state-of-the-art baselines.

IJCAI Conference 2020 Conference Paper

Quadratic Sparse Gaussian Graphical Model Estimation Method for Massive Variables

  • Jiaqi Zhang
  • Meng Wang
  • Qinchi Li
  • Sen Wang
  • Xiaojun Chang
  • Beilun Wang

We consider the problem of estimating a sparse Gaussian Graphical Model with a special graph topological structure and more than a million variables. Most previous scalable estimators still contain expensive calculation steps (e.g., matrix inversion or Hessian matrix calculation) and become infeasible in high-dimensional scenarios, where p (the number of variables) is larger than n (the number of samples). To overcome this challenge, we propose a novel method, called the Fast and Scalable Inverse Covariance Estimator by Thresholding (FST). FST first obtains a graph structure by applying a generalized threshold to the sample covariance matrix. Then, it solves multiple block-wise subproblems via element-wise thresholding. By using matrix thresholding instead of matrix inversion as the computational bottleneck, FST reduces its computational complexity to a much lower order of magnitude, O(p²). We show that FST obtains the same sharp convergence rate O(√(log max{p, n}/n)) as other state-of-the-art methods. We validate the method empirically, on multiple simulated datasets and one real-world dataset, and show that FST is two times faster than the four baselines while achieving a lower error rate under both the Frobenius norm and the max norm.
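A minimal sketch of the first FST step, assuming element-wise soft-thresholding as the "generalized threshold" (one common choice of thresholding operator; the paper may use a different one). Small covariance entries are zeroed, exposing a sparse graph structure without any matrix inversion:

```python
import numpy as np

def soft_threshold(A, lam):
    """Element-wise soft-thresholding: shrink each entry toward zero by lam,
    setting entries with magnitude below lam to exactly zero."""
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

# Toy 3x3 sample covariance matrix (illustrative values).
S = np.array([[1.00, 0.40, 0.05],
              [0.40, 1.00, 0.02],
              [0.05, 0.02, 1.00]])

# Thresholding kills the weak (0.05, 0.02) entries: variable 2 is
# disconnected from the others in the recovered graph structure.
T = soft_threshold(S, lam=0.1)
print(T)
```

The nonzero pattern of `T` defines the graph, whose connected blocks can then be handled as independent subproblems.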

IROS Conference 2020 Conference Paper

Robot Calligraphy using Pseudospectral Optimal Control in Conjunction with a Novel Dynamic Brush Model

  • Sen Wang
  • Jiaqi Chen
  • Xuanliang Deng
  • Seth Hutchinson 0001
  • Frank Dellaert

Chinese calligraphy is a unique art form with great artistic value but difficult to master. In this paper, we formulate the calligraphy writing problem as a trajectory optimization problem, and propose an improved virtual brush model for simulating the real writing process. Our approach is inspired by pseudospectral optimal control in that we parameterize the actuator trajectory for each stroke as a Chebyshev polynomial. The proposed dynamic virtual brush model plays a key role in formulating the objective function to be optimized. Our approach shows excellent performance in drawing aesthetically pleasing characters, and does so much more efficiently than previous work, opening up the possibility to achieve real-time closed-loop control.
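The Chebyshev parameterization of an actuator trajectory can be sketched with NumPy's Chebyshev utilities; the coefficients below are arbitrary illustrative values, not fitted strokes, and in the pseudospectral setting they would be the decision variables of the optimizer:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Hypothetical example: a 2-D stroke parameterized by low-order Chebyshev
# coefficients, evaluated on a normalized time grid t in [-1, 1].
coeffs_x = [0.0, 1.0, 0.0, -0.2]  # assumed coefficients, for illustration
coeffs_y = [0.5, 0.0, 0.3]

t = np.linspace(-1.0, 1.0, 50)
x = C.chebval(t, coeffs_x)       # evaluate the Chebyshev series at each t
y = C.chebval(t, coeffs_y)
trajectory = np.stack([x, y], axis=1)  # 50 waypoints of the stroke
print(trajectory.shape)  # → (50, 2)
```

Optimizing a handful of coefficients per stroke, instead of every waypoint, is what keeps the trajectory optimization compact.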

AAAI Conference 2019 Conference Paper

Distributionally Robust Semi-Supervised Learning for People-Centric Sensing

  • Kaixuan Chen
  • Lina Yao
  • Dalin Zhang
  • Xiaojun Chang
  • Guodong Long
  • Sen Wang

Semi-supervised learning is crucial for alleviating labelling burdens in people-centric sensing. However, human-generated data inherently suffer from distribution shift in semi-supervised learning due to the diverse biological conditions and behavior patterns of humans. To address this problem, we propose a generic distributionally robust model for semi-supervised learning on distributionally shifted data. Considering both the discrepancy and the consistency between the labeled data and the unlabeled data, we learn the latent features that reduce person-specific discrepancy and preserve task-specific consistency. We evaluate our model in a variety of people-centric recognition tasks on real-world datasets, including intention recognition, activity recognition, muscular movement recognition and gesture recognition. The experiment results demonstrate that the proposed model outperforms the state-of-the-art methods.

NeurIPS Conference 2019 Conference Paper

Learning Object Bounding Boxes for 3D Instance Segmentation on Point Clouds

  • Bo Yang
  • Jianan Wang
  • Ronald Clark
  • Qingyong Hu
  • Sen Wang
  • Andrew Markham
  • Niki Trigoni

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of per-point multilayer perceptrons (MLPs). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is remarkably computationally efficient as, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both ScanNet and S3DIS datasets while being approximately 10x more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.
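The per-point MLP design philosophy that 3D-BoNet follows can be sketched as a shared MLP applied to every point independently, with a symmetric pooling producing a global feature. The layer sizes and random weights below are arbitrary stand-ins; the real backbone and the two prediction branches are far richer.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(size=(1024, 3))            # toy point cloud, N x 3

# Shared per-point MLP: the same weights act on every point, so the
# features are computed independently of point order.
W1 = rng.normal(scale=0.1, size=(3, 64))
W2 = rng.normal(scale=0.1, size=(64, 32))
F = np.maximum(np.maximum(P @ W1, 0.0) @ W2, 0.0)  # per-point features, N x 32
g = F.max(axis=0)                                   # global feature via max-pool
```

The max-pool is a symmetric function, so the global feature is invariant to any permutation of the input points — the property that makes per-point MLPs suitable for unordered point clouds.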

IJCAI Conference 2018 Conference Paper

3D-PhysNet: Learning the Intuitive Physics of Non-Rigid Object Deformations

  • Zhihua Wang
  • Stefano Rosa
  • Bo Yang
  • Sen Wang
  • Niki Trigoni
  • Andrew Markham

The ability to interact and understand the environment is a fundamental prerequisite for a wide range of applications from robotics to augmented reality. In particular, predicting how deformable objects will react to applied forces in real time is a significant challenge. This is further confounded by the fact that shape information about encountered objects in the real world is often impaired by occlusions, noise and missing regions, e.g., a robot manipulating an object will only be able to observe a partial view of the entire solid. In this work we present a framework, 3D-PhysNet, which is able to predict how a three-dimensional solid will deform under an applied force using intuitive physics modelling. In particular, we propose a new method to encode the physical properties of the material and the applied force, enabling generalisation over materials. The key is to combine deep variational autoencoders with adversarial training, conditioned on the applied force and the material properties. We further propose a cascaded architecture that takes a single 2.5D depth view of the object and predicts its deformation. Training data is provided by a physics simulator. The network is fast enough to be used in real-time applications from partial views. Experimental results show the viability and the generalisation properties of the proposed architecture.

IJCAI Conference 2018 Conference Paper

A Comparative Study of Transactional and Semantic Approaches for Predicting Cascades on Twitter

  • Yunwei Zhao
  • Can Wang
  • Chi-Hung Chi
  • Kwok-Yan Lam
  • Sen Wang

The availability of massive social media data has enabled the prediction of people's future behavioral trends at an unprecedented large scale. Information cascades study on Twitter has been an integral part of behavior analysis. A number of methods based on transactional features (such as keyword frequency) and semantic features (such as sentiment) have been proposed to predict future cascading trends. However, an in-depth understanding of the pros and cons of semantic and transactional models is lacking. This paper conducts a comparative study of both approaches in predicting information diffusion with three mechanisms: retweet cascade, url cascade, and hashtag cascade. Experiments on Twitter data show that the semantic model outperforms the transactional model if the exterior pattern is less directly observable (i.e., hashtag cascade). When it becomes more directly observable, the semantic method delivers only comparable accuracy (url cascade) or even worse accuracy (retweet cascade). Further, we demonstrate that the transactional and semantic models are not independent, and the performance is greatly enhanced when combining both.

AAAI Conference 2018 Conference Paper

Cascade and Parallel Convolutional Recurrent Neural Networks on EEG-based Intention Recognition for Brain Computer Interface

  • Dalin Zhang
  • Lina Yao
  • Xiang Zhang
  • Sen Wang
  • Weitong Chen
  • Robert Boots
  • Boualem Benatallah

Brain-Computer Interface (BCI) is a system empowering humans to communicate with or control the outside world with exclusively brain intentions. Electroencephalography (EEG) based BCIs are promising solutions due to their convenient and portable instruments. Despite the extensive research of EEG in recent years, it is still challenging to interpret EEG signals effectively due to the massive noise in EEG signals (e.g., low signal-to-noise ratio and incomplete EEG signals) and difficulties in capturing the inconspicuous relationships between EEG signals and certain brain activities. Most existing works either only consider EEG as chain-like sequences, neglecting the complex dependencies between adjacent signals, or require preprocessing such as transforming EEG waves into images. In this paper, we introduce both cascade and parallel convolutional recurrent neural network models for precisely identifying human intended movements and instructions by effectively learning the compositional spatio-temporal representations of raw EEG streams. Extensive experiments on a large scale movement intention EEG dataset (108 subjects, 3,145,160 EEG records) have demonstrated that both models achieve high accuracy near 98.3% and outperform a set of baseline methods and most recent deep learning based EEG recognition models, yielding a significant accuracy increase of 18% in the cross-subject validation scenario. The developed models are further evaluated with a real-world BCI and achieve a recognition accuracy of 93% over five instruction intentions. This suggests the proposed models are able to generalize over different kinds of intentions and BCI systems.
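The cascade idea — convolutional features extracted per EEG frame, then a recurrent layer aggregating across frames — can be sketched in plain NumPy. The kernel bank, the Elman cell, and all shapes below are illustrative stand-ins for the paper's convolutional and recurrent layers, not the trained architecture.

```python
import numpy as np

def spatial_conv(frame, kernels):
    """Valid 1-D cross-correlation of one EEG frame (electrode values)
    with a bank of kernels; a stand-in for the convolutional stage."""
    k = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(frame, k)
    return (windows @ kernels.T).ravel()

def elman_rnn(features, Wx, Wh):
    """Minimal Elman recurrence over per-frame feature vectors;
    a stand-in for the recurrent stage of the cascade model."""
    h = np.zeros(Wh.shape[0])
    for x in features:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

rng = np.random.default_rng(2)
eeg = rng.normal(size=(20, 64))               # 20 frames, 64 electrodes
kernels = rng.normal(scale=0.1, size=(4, 5))  # 4 kernels of width 5
feats = np.array([spatial_conv(f, kernels) for f in eeg])
Wx = rng.normal(scale=0.1, size=(16, feats.shape[1]))
Wh = rng.normal(scale=0.1, size=(16, 16))
h = elman_rnn(feats, Wx, Wh)                  # final hidden state, fed to a classifier
```

The spatial stage captures dependencies between adjacent electrodes within each frame, while the recurrent stage composes those features over time — the "compositional spatio-temporal representation" the abstract refers to.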

IJCAI Conference 2018 Conference Paper

Multi-modality Sensor Data Classification with Selective Attention

  • Xiang Zhang
  • Lina Yao
  • Chaoran Huang
  • Sen Wang
  • Mingkui Tan
  • Guodong Long
  • Can Wang

Multimodal wearable sensor data classification plays an important role in ubiquitous computing and has a wide range of applications in various scenarios, from healthcare to entertainment. However, most of the existing work in this field employs domain-specific approaches and is thus ineffective in complex situations where multi-modality sensor data is collected. Moreover, wearable sensor data is less informative than conventional data such as texts or images. In this paper, to improve the adaptability of such classification methods across different application contexts, we turn this classification task into a game and apply a deep reinforcement learning scheme to dynamically deal with complex situations. We also introduce a selective attention mechanism into the reinforcement learning scheme to focus on the crucial dimensions of the data. This mechanism helps to capture extra information from the signal and can thus significantly improve the discriminative power of the classifier. We carry out several experiments on three wearable sensor datasets and demonstrate competitive performance of the proposed approach compared to several state-of-the-art baselines.

IJCAI Conference 2018 Conference Paper

NeuRec: On Nonlinear Transformation for Personalized Ranking

  • Shuai Zhang
  • Lina Yao
  • Aixin Sun
  • Sen Wang
  • Guodong Long
  • Manqing Dong

Modeling user-item interaction patterns is an important task for personalized recommendations. Many recommender systems are based on the assumption that there exists a linear relationship between users and items, neglecting the intricacy and non-linearity of real-life historical interactions. In this paper, we propose a neural network based recommendation model (NeuRec) that untangles the complexity of user-item interactions and establishes an integrated network to combine non-linear transformation with latent factors. We further design two variants of NeuRec, user-based NeuRec and item-based NeuRec, by focusing on different aspects of the interaction matrix. Extensive experiments on four real-world datasets demonstrate their superior performance on the personalized ranking task.
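The user-based variant can be sketched as a forward pass in which an MLP maps each user's row of the interaction matrix to a latent vector, which is then scored against item factors. The layer sizes, the single hidden layer, and the random weights below are illustrative assumptions, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
R = (rng.random((6, 8)) > 0.7).astype(float)  # toy binary user-item matrix

W1 = rng.normal(scale=0.1, size=(8, 16))      # hidden layer weights
W2 = rng.normal(scale=0.1, size=(16, 4))      # projection to latent space
Q = rng.normal(scale=0.1, size=(8, 4))        # item latent factors

H = np.maximum(R @ W1, 0.0)                   # non-linear hidden layer (ReLU)
U = H @ W2                                    # user latent vectors
scores = U @ Q.T                              # predicted ranking scores, 6 x 8
```

The ReLU between the interaction row and the latent vector is what lets the model express the non-linear user-item relationships that a purely linear latent factor model cannot.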

IJCAI Conference 2018 Conference Paper

Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

  • Tao Shen
  • Tianyi Zhou
  • Guodong Long
  • Jing Jiang
  • Sen Wang
  • Chengqi Zhang

Many natural language processing tasks solely rely on sparse dependencies between a few tokens in a sentence. Soft attention mechanisms show promising performance in modeling local/global dependencies by soft probabilities between every two tokens, but they are not effective and efficient when applied to long sentences. By contrast, hard attention mechanisms directly select a subset of tokens but are difficult and inefficient to train due to their combinatorial nature. In this paper, we integrate both soft and hard attention into one context fusion model, "reinforced self-attention (ReSA)", for the mutual benefit of each other. In ReSA, a hard attention trims a sequence for a soft self-attention to process, while the soft attention feeds reward signals back to facilitate the training of the hard one. For this purpose, we develop a novel hard attention called "reinforced sequence sampling (RSS)", selecting tokens in parallel and trained via policy gradient. Using two RSS modules, ReSA efficiently extracts the sparse dependencies between each pair of selected tokens. We finally propose an RNN/CNN-free sentence-encoding model, "reinforced self-attention network (ReSAN)", solely based on ReSA. It achieves state-of-the-art performance on both the Stanford Natural Language Inference (SNLI) and the Sentences Involving Compositional Knowledge (SICK) datasets.
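The trim-then-attend interaction between the hard and soft mechanisms can be sketched as follows. The fixed boolean mask stands in for the learned RSS sampler (which in the paper is trained via policy gradient), and the single scaled dot-product attention stands in for the full ReSA block.

```python
import numpy as np

def soft_self_attention(H):
    """Scaled dot-product soft self-attention over a token sequence."""
    logits = H @ H.T / np.sqrt(H.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ H

rng = np.random.default_rng(4)
H = rng.normal(size=(12, 8))        # 12 tokens, 8-dim features

# Stand-in for the RSS hard sampler: a boolean mask that trims the
# sequence before the soft attention runs over it.
keep = rng.random(12) > 0.5
out = soft_self_attention(H[keep])  # soft attention over selected tokens only
```

Because the soft attention only ever sees the trimmed sequence, its cost is quadratic in the number of *selected* tokens rather than the full sentence length — the efficiency gain the hybrid design targets.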

AAAI Conference 2018 Conference Paper

Trace Ratio Optimization With Feature Correlation Mining for Multiclass Discriminant Analysis

  • Forough Rezaei Boroujeni
  • Sen Wang
  • Zhihui Li
  • Nicholas West
  • Bela Stantic
  • Lina Yao
  • Guodong Long

Fisher’s linear discriminant analysis is a widely accepted dimensionality reduction method, which aims to find a transformation matrix to convert the feature space to a smaller space by maximising the between-class scatter matrix while minimising the within-class scatter matrix. Although the fast and easy process of finding the transformation matrix has made this method attractive, overemphasizing the large class distances makes the criterion of this method suboptimal. In this case, close class pairs tend to overlap in the subspace. Although different weighting methods have been developed to overcome this problem, there is still room for improvement. In this work, we study a weighted trace ratio by maximising the harmonic mean of the multiple objective reciprocals. To further improve the performance, we enforce the ℓ2,1-norm on the developed objective function. Additionally, we propose an iterative algorithm to optimise this objective function. The proposed method avoids the domination problem of the largest objective, and guarantees that no objectives will be too small. This method can be more beneficial if the number of classes is large. The extensive experiments on different datasets show the effectiveness of our proposed method when compared with four state-of-the-art methods.
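The scatter matrices and the trace-ratio quantity underlying this line of work can be sketched in NumPy. This shows only the standard Fisher ingredients; the weighted harmonic-mean objective, the ℓ2,1 regularization, and the iterative solver proposed in the paper are not reproduced here.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class (Sb) and within-class (Sw) scatter matrices.

    They satisfy Sb + Sw = St, the total scatter about the global mean.
    """
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)   # class-mean spread
        Sw += (Xc - mc).T @ (Xc - mc)                # within-class spread
    return Sb, Sw

def trace_ratio(W, Sb, Sw):
    """tr(W'SbW) / tr(W'SwW): the criterion a transformation W maximizes."""
    return np.trace(W.T @ Sb @ W) / np.trace(W.T @ Sw @ W)
```

Maximizing a single trace ratio lets one large between-class distance dominate; the paper's harmonic-mean weighting counters exactly this, keeping every pairwise objective bounded away from zero.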

AAAI Conference 2017 Conference Paper

Multi-View Correlated Feature Learning by Uncovering Shared Component

  • Xiaowei Xue
  • Feiping Nie
  • Sen Wang
  • Xiaojun Chang
  • Bela Stantic
  • Min Yao

Learning multiple heterogeneous features from different data sources is challenging. One research topic is how to exploit and utilize the correlations among various features across multiple views with the aim of improving the performance of learning tasks, such as classification. In this paper, we propose a new multi-view feature learning algorithm that simultaneously analyzes features from different views. Compared to most of the existing subspace learning methods that only focus on exploiting a shared latent subspace, our algorithm not only learns individual information in each view but also captures feature correlations among multiple views by learning a shared component. By assuming that such a component is shared by all views, we simultaneously exploit the shared component and individual information of each view in a batch mode. Since the objective function is non-smooth and difficult to solve, we propose an efficient iterative algorithm for optimization with guaranteed convergence. Extensive experiments are conducted on several benchmark datasets. The results demonstrate that our proposed algorithm performs better than all the compared multi-view learning algorithms.

AAAI Conference 2017 Conference Paper

VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem

  • Ronald Clark
  • Sen Wang
  • Hongkai Wen
  • Andrew Markham
  • Niki Trigoni

In this paper we present an on-manifold sequence-to-sequence learning approach to motion estimation using visual and inertial sensors. It is, to the best of our knowledge, the first end-to-end trainable method for visual-inertial odometry which performs fusion of the data at an intermediate feature-representation level. Our method has numerous advantages over traditional approaches. Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU as well as eliminating the need for manual calibration between the IMU and camera. A further advantage is that our model naturally and elegantly incorporates domain specific information which significantly mitigates drift. We show that our approach is competitive with state-of-the-art traditional methods when accurate calibration data is available and can be trained to outperform them in the presence of calibration and synchronization errors.