Arrow Research search

Author name cluster

Gen Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

34 papers
2 author rows

Possible papers


TMLR Journal 2026 Journal Article

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

  • Mingyu Cao
  • Gen Li
  • Jie Ji
  • Jiaqi Zhang
  • Ajay Jaiswal
  • Li Shen
  • Xiaolong Ma
  • Shiwei Liu

Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated, such as DeepSeekMoE and QwenMoE. We demonstrate the effectiveness of our method. Specifically, for the DeepSeekMoE-16B model, our approach maintains 90% of the average accuracy while reducing memory usage by 27.5% and increasing inference speed by 1.26 times. Moreover, we show that by applying lightweight expert fine-tuning—only to the condensed layers—and using 5 hours on a single 80G A100 GPU, we can successfully recover 98% of the original performance.

AAAI Conference 2026 Conference Paper

Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

  • Liyang Chen
  • Tianxiang Ma
  • Jiawei Liu
  • Bingchuan Li
  • Zhuowei Chen
  • Lijie Liu
  • Xu He
  • Gen Li

Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, images, and audio. Existing methods struggle to effectively coordinate these heterogeneous modalities due to two challenges: the scarcity of modality-complete data and the difficulty of jointly modeling triplet conditions without performance degradation. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct an incomplete-yet-complementary dataset for improved data utilization efficiency and training scalability. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies at each stage. In the first stage, to balance the text-following and subject-preservation abilities, we adopt the minimal-invasive image injection strategy. In the second stage, to enhance audio-visual sync, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllabilities across multi-modal inputs, we progressively incorporate the audio-visual sync task, building on previously acquired capabilities. During inference, for flexible and fine-grained multimodal control, we design a stage-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG.

AAAI Conference 2026 Conference Paper

Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

  • Gen Li
  • Bo Zhao
  • Jianfei Yang
  • Laura Sevilla-Lara

Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

AAAI Conference 2026 Conference Paper

New Synthetic Goldmine: Hand Joint Angle-Driven EMG Data Generation Framework for Micro-Gesture Recognition

  • Nana Wang
  • Suli Wang
  • Gen Li
  • Pengfei Ren
  • Hao Su

Electromyography (EMG)-based gesture recognition has emerged as a promising approach for human-computer interaction. However, its performance is often limited by the scarcity of labeled EMG data, significant cross-user variability, and poor generalization to unseen gestures. To address these challenges, we propose SeqEMG-GAN, a conditional, sequence-driven generative framework that synthesizes high-fidelity EMG signals from hand joint angle sequences. Our method introduces a context-aware architecture composed of an angle encoder, a dual-layer context encoder featuring the novel Ang2Gist unit, a deep convolutional EMG generator, and a discriminator, all jointly optimized via adversarial learning. By conditioning on joint kinematic trajectories, SeqEMG-GAN is capable of generating semantically consistent EMG sequences, even for previously unseen gestures, thereby enhancing data diversity and physiological plausibility. Experimental results show that classifiers trained solely on synthetic data experience only a slight accuracy drop (from 57.77% to 55.71%). In contrast, training with a combination of real and synthetic data improves accuracy to 60.53%, outperforming real-only training by 2.76 percentage points. These findings demonstrate the effectiveness of our framework, which also achieves state-of-the-art performance in augmenting EMG datasets and enhancing gesture recognition for applications such as neural robotic hand control, AI/AR glasses, and gesture-based virtual gaming systems.

NeurIPS Conference 2025 Conference Paper

Breaking AR’s Sampling Bottleneck: Provable Acceleration via Diffusion Language Models

  • Gen Li
  • Changxiao Cai

Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models allow for parallel sampling, offering a promising path to accelerate generation and eliminate the left-to-right generation constraints. Despite their empirical success, the theoretical understanding of diffusion language models remains underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations $T$ and scales linearly with the mutual information between tokens in the target text sequence. Crucially, our theory covers the regime $T<L$, where $L$ is the text sequence length. This justifies that high-quality samples can be generated with fewer iterations than $L$, thereby breaking the fundamental sampling bottleneck of $L$ steps required by AR models. We further establish matching upper and lower bounds, up to some constant factor, that show the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.

YNIMG Journal 2025 Journal Article

Image-based meta- and mega-analysis (IBMMA): A unified framework for large-scale, multi-site, neuroimaging data analysis

  • Nick Steele
  • Ashley A. Huggins
  • Rajendra A. Morey
  • Ahmed Hussain
  • Courtney Russell
  • Benjamin Suarez-Jimenez
  • Elena Pozzi
  • Hadis Jameei

The increasing scale and complexity of neuroimaging datasets aggregated from multiple study sites present substantial analytic challenges, as existing statistical analysis tools struggle to handle missing voxel-data, suffer from limited computational speed and inefficient memory allocation, and are restricted in the types of statistical designs they are able to model. We introduce Image-Based Meta- & Mega-Analysis (IBMMA), a novel software package implemented in R and Python that provides a unified framework for analyzing diverse neuroimaging features, efficiently handles large-scale datasets through parallel processing, offers flexible statistical modeling options, and properly manages missing voxel-data commonly encountered in multi-site studies. IBMMA successfully analyzed a large-n dataset of several thousand participants and revealed findings in brain regions that some traditional software overlooked due to missing voxel-data resulting in gaps in brain coverage. IBMMA has the potential to accelerate discoveries in neuroscience and enhance the clinical utility of neuroimaging findings.

NeurIPS Conference 2025 Conference Paper

Improving Diffusion-based Inverse Algorithms under Few-Step Constraint via Linear Extrapolation

  • Jiawei Zhang
  • Ziyuan Liu
  • Leon Yan
  • Gen Li
  • Yuantao Gu

Diffusion-based inverse algorithms have shown remarkable performance across various inverse problems, yet their reliance on numerous denoising steps incurs high computational costs. While recent developments of fast diffusion ODE solvers offer effective acceleration for diffusion sampling without observations, their application in inverse problems remains limited due to the heterogeneous formulations of inverse algorithms and their prevalent use of approximations and heuristics, which often introduce significant errors that undermine the reliability of analytical solvers. In this work, we begin with an analysis of ODE solvers for inverse problems that reveals a linear combination structure of approximations for the inverse trajectory. Building on this insight, we propose a canonical form that unifies a broad class of diffusion-based inverse algorithms and facilitates the design of more generalizable solvers. Inspired by the linear subspace search strategy, we propose Learnable Linear Extrapolation (LLE), a lightweight approach that universally enhances the performance of any diffusion-based inverse algorithm conforming to our canonical form. LLE optimizes the combination coefficients to refine current predictions using previous estimates, alleviating the sensitivity of analytical solvers for inverse algorithms. Extensive experiments demonstrate consistent improvements of the proposed LLE method across multiple algorithms and tasks, indicating its potential for more efficient solutions and boosted performance of diffusion-based inverse algorithms with limited steps. Codes for reproducing our experiments are available at https://github.com/weigerzan/LLE_inverse_problem.

NeurIPS Conference 2025 Conference Paper

MoRIC: A Modular Region-based Implicit Codec for Image Compression

  • Gen Li
  • Haotian Wu
  • Deniz Gunduz

We introduce Modular Region-Based Implicit Codec (MoRIC), a novel image compression algorithm that relies on implicit neural representations (INRs). Unlike previous INR-based codecs that model the entire image with a single neural network, MoRIC assigns dedicated models to distinct regions in the image, each tailored to its local distribution. This region-wise design enhances adaptation to local statistics and enables flexible, single-object compression with fine-grained rate-distortion (RD) control. MoRIC allows regions of arbitrary shapes, and provides the contour information for each region separately. In particular, it incorporates adaptive chain coding for lossy and lossless contour compression, and a shared global modulator that injects multi-scale global context into local overfitting processes in a coarse-to-fine manner. MoRIC achieves state-of-the-art performance in single-object compression with significantly lower decoding complexity than existing learned neural codecs, which results in a highly efficient compression approach for fixed-background scenarios, e.g., for surveillance cameras. It also sets a new benchmark among overfitted codecs for standard image compression. Additionally, MoRIC naturally supports semantically meaningful layered compression through selective region refinement, paving the way for scalable and flexible INR-based codecs.

JMLR Journal 2025 Journal Article

O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions

  • Gen Li
  • Yuling Yan

Score-based diffusion models, which generate new data by learning to reverse a diffusion process that perturbs data from the target distribution into noise, have achieved remarkable success across various generative tasks. Despite their superior empirical performance, existing theoretical guarantees are often constrained by stringent assumptions or suboptimal convergence rates. In this paper, we establish a fast convergence theory for the denoising diffusion probabilistic model (DDPM), a widely used SDE-based sampler, under minimal assumptions. Our analysis shows that, provided $\ell_{2}$-accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by $O(d/T)$ (ignoring logarithmic factors), where $d$ is the data dimensionality and $T$ is the number of steps. This result holds for any target distribution with finite first-order moment. Moreover, we show that with careful coefficient design, the convergence rate improves to $O(k/T)$, where $k$ is the intrinsic dimension of the target data distribution. This highlights the ability of DDPM to automatically adapt to unknown low-dimensional structures, a common feature of natural image distributions. These results are achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.

IROS Conference 2025 Conference Paper

Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts

  • Yizhou Huang
  • Fan Yang 0063
  • Guoliang Zhu
  • Gen Li
  • Hao Shi 0004
  • Yukun Zuo
  • Wenrui Chen
  • Zhiyong Li 0001

Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model parameters, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary modality depth image directly as a prompt to the primary modality RGB image, embedding it into the primary modality encoder without introducing additional encoders. This reduces the model’s parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on the public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric while reducing model parameters by 88.8%, demonstrating practical application value. The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align.

NeurIPS Conference 2024 Conference Paper

A Single-Step, Sharpness-Aware Minimization is All You Need to Achieve Efficient and Accurate Sparse Training

  • Jie Ji
  • Gen Li
  • Jingjing Fu
  • Fatemeh Afghah
  • Linke Guo
  • Xiaoyong Yuan
  • Xiaolong Ma

Sparse training stands as a landmark approach in addressing the considerable training resource demands imposed by the continuously expanding size of Deep Neural Networks (DNNs). However, the training of a sparse DNN encounters great challenges in achieving optimal generalization ability despite the efforts from the state-of-the-art sparse training methodologies. To unravel the mysterious reason behind the difficulty of sparse training, we connect the network sparsity with the structure of neural loss functions, and identify that the cause of this difficulty lies in a chaotic loss surface. In light of such revelation, we propose $S^{2}$-SAM, characterized by a **S**ingle-step **S**harpness-**A**ware **M**inimization that is tailored for **S**parse training. For the first time, $S^{2}$-SAM innovates the traditional SAM-style optimization by approximating sharpness perturbation through prior gradient information, incurring *zero extra cost*. Therefore, $S^{2}$-SAM not only exhibits the capacity to improve generalization but also aligns with the efficiency goal of sparse training. Additionally, we study the generalization result of $S^{2}$-SAM and provide theoretical proof for convergence. Through extensive experiments, $S^{2}$-SAM demonstrates its universally applicable plug-and-play functionality, enhancing accuracy across various sparse training methods. Code available at https://github.com/jjsrf/SSAM-NEURIPS2024.
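The idea of reusing prior gradient information for the sharpness perturbation can be illustrated with a toy sketch. This is not the authors' implementation: the quadratic loss, the choice of the previous step's gradient as the perturbation direction, and all names (`CURVATURE`, `single_step_sam`, `rho`, `lr`) are assumptions made for illustration. The point it shows is that each iteration needs only one gradient evaluation, whereas standard SAM needs two.

```python
import math

# Hypothetical toy loss: 0.5 * sum(c_i * w_i**2), chosen only for illustration.
CURVATURE = [1.0, 10.0]

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, w))

def grad(w):
    return [c * x for c, x in zip(CURVATURE, w)]

def single_step_sam(w, steps=100, lr=0.05, rho=0.01):
    """Assumed variant: perturb along the previous step's gradient,
    so each iteration costs exactly one gradient evaluation."""
    prev_grad = None
    grad_calls = 0
    for _ in range(steps):
        if prev_grad is None:
            # No prior gradient on the first step: plain gradient step.
            perturbed = list(w)
        else:
            norm = math.sqrt(sum(g * g for g in prev_grad)) or 1.0
            perturbed = [x + rho * g / norm for x, g in zip(w, prev_grad)]
        g = grad(perturbed)  # the only gradient evaluation in this iteration
        grad_calls += 1
        w = [x - lr * gi for x, gi in zip(w, g)]
        prev_grad = g
    return w, grad_calls

final_w, calls = single_step_sam([1.0, 1.0])
```

Here `calls` equals the number of steps, which is the sense in which the perturbation adds no extra gradient cost in this sketch.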

NeurIPS Conference 2024 Conference Paper

Adapting to Unknown Low-Dimensional Structures in Score-Based Diffusion Models

  • Gen Li
  • Yuling Yan

This paper investigates score-based diffusion models when the underlying target distribution is concentrated on or near low-dimensional manifolds within the higher-dimensional space in which they formally reside, a common characteristic of natural image distributions. Despite previous efforts to understand the data generation process of diffusion models, existing theoretical support remains highly suboptimal in the presence of low-dimensional structure, which we strengthen in this paper. For the popular Denoising Diffusion Probabilistic Model (DDPM), we find that the dependency of the error incurred within each denoising step on the ambient dimension $d$ is in general unavoidable. We further identify a unique design of coefficients that yields a convergence rate of order $O(k^{2}/\sqrt{T})$ (up to log factors), where $k$ is the intrinsic dimension of the target distribution and $T$ is the number of steps. This represents the first theoretical demonstration that the DDPM sampler can adapt to unknown low-dimensional structures in the target distribution, highlighting the critical importance of coefficient design. All of this is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner.

AAAI Conference 2024 Conference Paper

Removing Interference and Recovering Content Imaginatively for Visible Watermark Removal

  • Yicheng Leng
  • Chaowei Fang
  • Gen Li
  • Yixiang Fang
  • Guanbin Li

Visible watermarks, while instrumental in protecting image copyrights, frequently distort the underlying content, complicating tasks like scene interpretation and image editing. Visible watermark removal aims to eliminate the interference of watermarks and restore the background content. However, existing methods often implement watermark component removal and background restoration tasks within a singular branch, leading to residual watermarks in the predictions and ignoring cases where watermarks heavily obscure the background. To address these limitations, this study introduces the Removing Interference and Recovering Content Imaginatively (RIRCI) framework. RIRCI embodies a two-stage approach: the initial phase centers on discerning and segregating the watermark component, while the subsequent phase focuses on background content restoration. To achieve meticulous background restoration, our proposed model employs a dual-path network capable of fully exploring the intrinsic background information beneath semi-transparent watermarks and peripheral contextual information from unaffected regions. Moreover, a Global and Local Context Interaction module is built upon multi-layer perceptrons and bidirectional feature transformation for comprehensive representation modeling in the background restoration phase. The efficacy of our approach is empirically validated across two large-scale datasets, and our findings reveal a marked enhancement over existing watermark removal techniques.

NeurIPS Conference 2024 Conference Paper

Unleashing the Denoising Capability of Diffusion Prior for Solving Inverse Problems

  • Jiawei Zhang
  • Jiaxin Zhuang
  • Cheng Jin
  • Gen Li
  • Yuantao Gu

The recent emergence of diffusion models has significantly advanced the precision of learnable priors, presenting innovative avenues for addressing inverse problems. Previous works have endeavored to integrate diffusion priors into the maximum a posteriori estimation (MAP) framework and design optimization methods to solve the inverse problem. However, prevailing optimization-based algorithms primarily exploit the prior information within the diffusion models while neglecting their denoising capability. To bridge this gap, this work leverages the diffusion process to reframe noisy inverse problems as a two-variable constrained optimization task by introducing an auxiliary optimization variable that represents a 'noisy' sample at an equivalent denoising step. The projection gradient descent method is efficiently utilized to solve the corresponding optimization problem by truncating the gradient through the $\mu$-predictor. The proposed algorithm, termed ProjDiff, effectively harnesses the prior information and the denoising capability of a pre-trained diffusion model within the optimization framework. Extensive experiments on the image restoration tasks and source separation and partial generation tasks demonstrate that ProjDiff exhibits superior performance across various linear and nonlinear inverse problems, highlighting its potential for practical applications. Code is available at https://github.com/weigerzan/ProjDiff/.

YNICL Journal 2023 Journal Article

A preliminary study on corticospinal tract morphology in incidental and symptomatic insular low-grade glioma: implications for post-surgical motor outcomes

  • Zuo-Cheng Yang
  • Chuan-Dong Yin
  • Fang-Cheng Yeh
  • Bo-Wen Xue
  • Xin-Yu Song
  • Gen Li
  • Zheng-Hai Deng
  • Sheng-Jun Sun

OBJECTIVE: Our study aimed to investigate the shape and diffusion properties of the corticospinal tract (CST) in patients with insular incidental and symptomatic low-grade gliomas (LGGs), especially those in the incidental group, and evaluate their association with post-surgical motor function. METHODS: We performed automatic fiber tracking on 41 LGG patients, comparing macroscopic shape and microscopic diffusion properties of CST between ipsilateral and contralateral tracts in both incidental and symptomatic groups. A correlation analysis was conducted between properties of CST and post-operative motor strength grades. RESULTS: , respectively). CONCLUSIONS: We found a significant correlation between CST shape measures and post-operative motor function outcomes in patients with incidental insular LGGs. CST morphology shows promise as a potential prognostic factor for identifying functional deficits in this patient population.

NeurIPS Conference 2023 Conference Paper

Dynamic Sparsity Is Channel-Level Sparsity Learner

  • Lu Yin
  • Gen Li
  • Meng Fang
  • Li Shen
  • Tianjin Huang
  • Zhangyang "Atlas" Wang
  • Vlado Menkovski
  • Xiaolong Ma

Sparse training has received an upsurging interest in machine learning due to its tantalizing saving potential for both the entire training process as well as the inference. Dynamic sparse training (DST) as a leading approach can train deep neural networks at high sparsity from scratch to match the performance of their dense counterparts. However, most if not all DST prior arts demonstrate their effectiveness on unstructured sparsity with highly irregular sparse patterns, which receives limited support in common hardware. This limitation hinders the usage of DST in practice. In this paper, we propose Channel-aware dynamic sparse (Chase), which for the first time seamlessly translates the promise of unstructured dynamic sparsity to GPU-friendly channel-level sparsity (not fine-grained N:M or group sparsity) during one end-to-end training process, without any ad-hoc operations. The resulting small sparse networks can be directly accelerated by commodity hardware, without using any particularly sparsity-aware hardware accelerators. This appealing outcome is partially motivated by a hidden phenomenon of dynamic sparsity: off-the-shelf unstructured DST implicitly involves biased parameter reallocation across channels, with a large fraction of channels (up to 60%) being sparser than others. By progressively identifying and removing these channels during training, our approach transfers unstructured sparsity to channel-wise sparsity. Our experimental results demonstrate that Chase achieves 1.7x inference throughput speedup on common GPU devices without compromising accuracy with ResNet-50 on ImageNet. We release our code at https://github.com/luuyin/chase.
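The step of identifying the sparsest channels can be sketched in a few lines. This is an illustrative toy, not the Chase code: the 0/1 mask layout, the ranking by per-channel sparsity alone, and the names `channel_sparsity` and `channels_to_remove` are assumptions; the abstract does not specify the actual selection criterion or removal schedule.

```python
def channel_sparsity(mask):
    """mask: list of channels, each a list of 0/1 entries (1 = weight kept).
    Returns the fraction of zeroed weights per channel."""
    return [1.0 - sum(channel) / len(channel) for channel in mask]

def channels_to_remove(mask, fraction):
    """Indices of the sparsest channels, `fraction` of the total (rounded down)."""
    sparsity = channel_sparsity(mask)
    k = int(len(mask) * fraction)
    ranked = sorted(range(len(mask)), key=lambda i: sparsity[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical unstructured mask for a layer with four channels.
mask = [
    [1, 1, 1, 0],  # 25% of weights zeroed
    [0, 0, 0, 1],  # 75% of weights zeroed
    [1, 0, 1, 1],  # 25% of weights zeroed
    [0, 0, 0, 0],  # all weights zeroed
]
removed = channels_to_remove(mask, 0.5)
```

Dropping whole channels in this way leaves a smaller dense layer, which is why the result can run on commodity hardware without sparse kernels.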

TIST Journal 2023 Journal Article

MC²: Unsupervised Multiple Social Network Alignment

  • Li Sun
  • Zhongbao Zhang
  • Gen Li
  • Pengxin Ji
  • Sen Su
  • Philip S. Yu

Social network alignment, identifying social accounts of the same individual across different social networks, shows fundamental importance in a wide spectrum of applications, such as link prediction and information diffusion. Individuals more often than not join multiple social networks, and it is in fact much too expensive or even impossible to acquire supervision for guiding the alignment. To the best of our knowledge, few methods in the literature can align multiple social networks without supervision. In this article, we propose to study the problem of unsupervised multiple social network alignment. To address this problem, we propose a novel unsupervised model of joint Matrix factorization with a diagonal Cone under orthogonal Constraint, referred to as MC². Its core idea is to embed and align multiple social networks in the common subspace via an unsupervised approach. Specifically, in the MC² model, we first design a matrix optimization to infer the common subspace from different social networks. To address the nonconvex optimization, we then design an efficient alternating algorithm by leveraging its inherent functional property. Through extensive experiments on real-world datasets, we demonstrate that the proposed MC² model significantly outperforms the state-of-the-art methods.

YNIMG Journal 2023 Journal Article

Neuroimaging-based classification of PTSD using data-driven computational approaches: A multisite big data study from the ENIGMA-PGC PTSD consortium

  • Xi Zhu
  • Yoojean Kim
  • Orren Ravid
  • Xiaofu He
  • Benjamin Suarez-Jimenez
  • Sigal Zilcha-Mano
  • Amit Lazarov
  • Seonjoo Lee

BACKGROUND: Recent advances in data-driven computational approaches have been helpful in devising tools to objectively diagnose psychiatric disorders. However, current machine learning studies are limited to small homogeneous samples, different methodologies, and different imaging collection protocols, which limits the ability to directly compare and generalize their results. Here we aimed to classify individuals with PTSD versus controls and assess the generalizability using a large heterogeneous brain dataset from the ENIGMA-PGC PTSD Working group. METHODS: We analyzed brain MRI data from 3,477 structural-MRI; 2,495 resting state-fMRI; and 1,952 diffusion-MRI. First, we identified the brain features that best distinguish individuals with PTSD from controls using traditional machine learning methods. Second, we assessed the utility of the denoising variational autoencoder (DVAE) and evaluated its classification performance. Third, we assessed the generalizability and reproducibility of both models using a leave-one-site-out cross-validation procedure for each modality. RESULTS: We found lower performance in classifying PTSD vs. controls with data from over 20 sites (60% test AUC for s-MRI, 59% for rs-fMRI and 56% for d-MRI), as compared to other studies run on single-site data. The performance increased when classifying PTSD from healthy controls (HC) without trauma history in each modality (75% AUC). The classification performance remained intact when applying the DVAE framework, which reduced the number of features. Finally, we found that the DVAE framework achieved better generalization to unseen datasets compared with the traditional machine learning frameworks, albeit with performance only slightly above chance. CONCLUSION: These results have the potential to provide a baseline classification performance for PTSD when using large scale neuroimaging datasets. Our findings show that the control group used can heavily affect classification performance. The DVAE framework provided better generalizability for the multi-site data. This may be more significant in clinical practice since the neuroimaging-based diagnostic DVAE classification models are much less site-specific, rendering them more generalizable.
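The leave-one-site-out cross-validation mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not the study's pipeline: the function name `leave_one_site_out` and the toy site labels are hypothetical. Each fold holds out every sample from one site for testing and trains on all remaining sites, so the test score reflects generalization to an unseen site.

```python
def leave_one_site_out(site_labels):
    """Yield (held_out_site, train_indices, test_indices) for each distinct site."""
    for site in sorted(set(site_labels)):
        test = [i for i, s in enumerate(site_labels) if s == site]
        train = [i for i, s in enumerate(site_labels) if s != site]
        yield site, train, test

# Hypothetical site label for each of six scans.
sites = ["A", "A", "B", "C", "C", "C"]
splits = list(leave_one_site_out(sites))
```

With three distinct sites this produces three folds, and no scan from the held-out site ever appears in that fold's training indices.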

NeurIPS Conference 2023 Conference Paper

Regret-Optimal Model-Free Reinforcement Learning for Discounted MDPs with Short Burn-In Time

  • Xiang Ji
  • Gen Li

A crucial problem in reinforcement learning is learning the optimal policy. We study this in tabular infinite-horizon discounted Markov decision processes under the online setting. The existing algorithms either fail to achieve regret optimality or have to incur a high memory and computational cost. In addition, existing optimal algorithms all require a long burn-in time in order to achieve optimal sample efficiency, i.e., their optimality is not guaranteed unless sample size surpasses a high threshold. We address both open problems by introducing a model-free algorithm that employs variance reduction and a novel technique that switches the execution policy in a slow-yet-adaptive manner. This is the first regret-optimal model-free algorithm in the discounted setting, with the additional benefit of a low burn-in time.

NeurIPS Conference 2023 Conference Paper

Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

  • Gen Li
  • Wenhao Zhan
  • Jason D. Lee
  • Yuejie Chi
  • Yuxin Chen

This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic exploration and offline RL, we design a three-stage hybrid RL algorithm that beats the best of both worlds --- pure offline RL and pure online RL --- in terms of sample complexities. The proposed algorithm does not require any reward information during data collection. Our theory is developed based on a new notion called single-policy partial concentrability, which captures the trade-off between distribution mismatch and miscoverage and guides the interplay between offline and online data.

NeurIPS Conference 2023 Conference Paper

The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model

  • Laixi Shi
  • Gen Li
  • Yuting Wei
  • Yuxin Chen
  • Matthieu Geist
  • Yuejie Chi

This paper investigates model robustness in reinforcement learning (RL) via the framework of distributionally robust Markov decision processes (RMDPs). Despite recent efforts, the sample complexity of RMDPs is much less understood regardless of the uncertainty set in use; in particular, there exist large gaps between existing upper and lower bounds, and it is unclear if distributional robustness bears any statistical implications when benchmarked against standard RL. In this paper, assuming access to a generative model, we derive the sample complexity of RMDPs---when the uncertainty set is measured via either total variation or $\chi^2$ divergence over the full range of uncertainty levels---using a model-based algorithm called distributionally robust value iteration, and develop minimax lower bounds to benchmark its tightness. Our results not only strengthen the prior art in both directions of upper and lower bounds, but also deliver surprising messages that learning RMDPs is not necessarily easier or more difficult than standard MDPs. In the case of total variation, we establish the minimax-optimal sample complexity of RMDPs which is always smaller than that of standard MDPs. In the case of $\chi^2$ divergence, we establish the sample complexity of RMDPs that is tight up to polynomial factors of the effective horizon, and grows linearly with respect to the uncertainty level when it approaches infinity.

AAAI Conference 2023 Conference Paper

WIERT: Web Information Extraction via Render Tree

  • Zimeng Li
  • Bo Shao
  • Linjun Shou
  • Ming Gong
  • Gen Li
  • Daxin Jiang

Web information extraction (WIE) is a fundamental problem in web document understanding, with a significant impact on various applications. Visual information plays a crucial role in WIE tasks, as the nodes containing relevant information are often visually distinct from the other nodes, for example by being in a larger font size or having a brighter color. However, rendering visual information of a web page can be computationally expensive. Previous works have mainly focused on the Document Object Model (DOM) tree, which lacks visual information. To efficiently exploit visual information, we propose leveraging the render tree, which combines the DOM tree and Cascading Style Sheets Object Model (CSSOM) tree, and contains not only content and layout information but also rich visual information at little additional acquisition cost compared to the DOM tree. In this paper, we present WIERT, a method that effectively utilizes the render tree of a web page based on a pretrained language model. We evaluate WIERT on the Klarna product page dataset, a manually labeled dataset of renderable e-commerce web pages, demonstrating its effectiveness and robustness.

YNIMG Journal 2022 Journal Article

A comparison of methods to harmonize cortical thickness measurements across scanners and sites

  • Delin Sun
  • Gopalkumar Rakesh
  • Courtney C. Haswell
  • Mark Logue
  • C. Lexi Baird
  • Erin N. O'Leary
  • Andrew S. Cotton
  • Hong Xie

Results of neuroimaging datasets aggregated from multiple sites may be biased by site-specific profiles in participants’ demographic and clinical characteristics, as well as MRI acquisition protocols and scanning platforms. We compared the impact of four different harmonization methods on results obtained from analyses of cortical thickness data: (1) linear mixed-effects model (LME) that models site-specific random intercepts (LMEINT), (2) LME that models both site-specific random intercepts and age-related random slopes (LMEINT+SLP), (3) ComBat, and (4) ComBat with a generalized additive model (ComBat-GAM). Our test case for comparing harmonization methods was cortical thickness data aggregated from 29 sites, which included 1,340 cases with posttraumatic stress disorder (PTSD) (6.2–81.8 years old) and 2,057 trauma-exposed controls without PTSD (6.3–85.2 years old). We found that, compared to the other data harmonization methods, data processed with ComBat-GAM was more sensitive to the detection of significant case-control differences (χ²(3) = 63.704, p < 0.001) as well as case-control differences in age-related cortical thinning (χ²(3) = 12.082, p = 0.007). Both ComBat and ComBat-GAM outperformed LME methods in detecting sex differences (χ²(3) = 9.114, p = 0.028) in regional cortical thickness. ComBat-GAM also led to stronger estimates of age-related declines in cortical thickness (corrected p-values < 0.001), stronger estimates of case-related cortical thickness reduction (corrected p-values < 0.001), weaker estimates of age-related declines in cortical thickness in cases than controls (corrected p-values < 0.001), stronger estimates of cortical thickness reduction in females than males (corrected p-values < 0.001), and stronger estimates of cortical thickness reduction in females relative to males in cases than controls (corrected p-values < 0.001). Our results support the use of ComBat-GAM to minimize confounds and increase statistical power when harmonizing data with non-linear effects, and the use of either ComBat or ComBat-GAM for harmonizing data with linear effects.

NeurIPS Conference 2022 Conference Paper

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

  • Yanze Wu
  • Xintao Wang
  • Gen Li
  • Ying Shan

This paper studies the problem of real-world video super-resolution (VSR) for animation videos, and reveals three key improvements for practical animation VSR. First, recent real-world super-resolution approaches typically rely on degradation simulation using basic operators without any learning capability, such as blur, noise, and compression. In this work, we propose to learn such basic operators from real low-quality animation videos, and incorporate the learned ones into the degradation generation pipeline. Such neural-network-based basic operators could help to better capture the distribution of real degradations. Second, a large-scale high-quality animation video dataset, AVC, is built to facilitate comprehensive training and evaluations for animation VSR. Third, we further investigate an efficient multi-scale network structure. It takes advantage of the efficiency of unidirectional recurrent networks and the effectiveness of sliding-window-based methods. Thanks to the above delicate designs, our method, AnimeSR, is capable of restoring real-world low-quality animation videos effectively and efficiently, achieving superior performance to previous state-of-the-art methods.

NeurIPS Conference 2022 Conference Paper

Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model

  • Gen Li
  • Yuejie Chi
  • Yuting Wei
  • Yuxin Chen

This paper studies multi-agent reinforcement learning in Markov games, with the goal of learning Nash equilibria or coarse correlated equilibria (CCE) sample-optimally. All prior results suffer from at least one of the two obstacles: the curse of multiple agents and the barrier of long horizon, regardless of the sampling protocol in use. We take a step towards settling this problem, assuming access to a flexible sampling mechanism: the generative model. Focusing on non-stationary finite-horizon Markov games, we develop a fast learning algorithm called Q-FTRL and an adaptive sampling scheme that leverage the optimism principle in online adversarial learning (particularly the Follow-the-Regularized-Leader (FTRL) method). Our algorithm learns an $\varepsilon$-approximate CCE in a general-sum Markov game using $$ \widetilde{O}\bigg( \frac{H^4 S \sum_{i=1}^m A_i}{\varepsilon^2} \bigg) $$ samples, where $m$ is the number of players, $S$ indicates the number of states, $H$ is the horizon, and $A_i$ denotes the number of actions for the $i$-th player. This is minimax-optimal (up to log factor) when $m$ is fixed. When applied to two-player zero-sum Markov games, our algorithm provably finds an $\varepsilon$-approximate Nash equilibrium with a minimal number of samples. Along the way, we derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities, which might be of independent interest.

YNICL Journal 2022 Journal Article

Reduced sensitivity to delayed time and delayed reward of the post-operative insular glioma patients in delay discounting

  • Wenjin Fu
  • Zhenxing Huang
  • Jun Li
  • Qi Dong
  • Yang Li
  • Gen Li
  • Yaokai Xu
  • Bowen Xue

Previous studies have shown that the insula is closely related to addiction, and the structure's role in delay discounting can be measured by a specific task, but the specific role of the insula has been less studied. In this study, we first conducted a lesion study in which we recruited healthy controls (n = 30) and patients with unilateral insula injury (n = 16) to complete a behavioral delay discounting task. Then we conducted a functional magnetic resonance imaging (fMRI) study, in which a separate group of healthy volunteers (n = 51) completed a delay discounting task during the fMRI scan. The lesion study showed a significant difference between the two groups in the delay discounting task, which revealed that insula injury was associated with impaired decision making. The fMRI study revealed choice-sensitive insula activation that was modulated by delayed time and delayed reward, indicating an important role of the insula in delay discounting. Overall, our results provide evidence for a role of the insular lobe in delay discounting and suggest that this structure may be considered an important factor in the future treatment and diagnosis of addiction disorders.

NeurIPS Conference 2021 Conference Paper

Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning

  • Gen Li
  • Laixi Shi
  • Yuxin Chen
  • Yuantao Gu
  • Yuejie Chi

Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4 \, \mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\, \mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves --- by at least a factor of $S^5A^3$ --- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.

NeurIPS Conference 2021 Conference Paper

Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting

  • Gen Li
  • Yuxin Chen
  • Yuejie Chi
  • Yuantao Gu
  • Yuting Wei

Low-complexity models such as linear function representation play a pivotal role in enabling sample-efficient reinforcement learning (RL). The current paper pertains to a scenario with value-based linear representation, which postulates linear realizability of the optimal Q-function (also called the "linear $Q^{\star}$ problem"). While linear realizability alone does not allow for sample-efficient solutions in general, the presence of a large sub-optimality gap is a potential game changer, depending on the sampling mechanism in use. Informally, sample efficiency is achievable with a large sub-optimality gap when a generative model is available, but is unfortunately infeasible when we turn to standard online RL settings. We make progress towards understanding this linear $Q^{\star}$ problem by investigating a new sampling protocol, which draws samples in an online/exploratory fashion but allows one to backtrack and revisit previous states. This protocol is more flexible than the standard online RL setting, while being practically relevant and far more restrictive than the generative model. We develop an algorithm tailored to this setting, achieving a sample complexity that scales polynomially with the feature dimension, the horizon, and the inverse sub-optimality gap, but not the size of the state/action space. Our findings underscore the fundamental interplay between sampling protocols and low-complexity function representation in RL.

AIIM Journal 2021 Journal Article

Seizure detection from multi-channel EEG using entropy-based dynamic graph embedding

  • Gen Li
  • Jason J. Jung

An epileptic seizure is a chronic disease with sudden abnormal discharge of brain neurons, which leads to transient brain dysfunction. To detect epileptic seizures, we propose a novel idea based on a dynamic graph embedding model. The dynamic graph is built by identifying the correlation among the multi-channel EEG signals. Graph entropy measurement is exploited to calculate the similarity among the graphs at each time interval and construct the graph embedding space. Since abnormal electrical brain activity causes the epileptic seizure, the graph entropy during the seizure time interval is different from that of other time intervals. Therefore, we propose an entropy-based dynamic graph embedding model to cluster the graphs, so that the graphs with epileptic seizures are discriminated. We applied the proposed approach to the Children's Hospital Boston-Massachusetts Institute of Technology Scalp EEG database. The results show that the proposed approach outperformed the baselines by 1.4% with respect to accuracy.

NeurIPS Conference 2020 Conference Paper

Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

  • Gen Li
  • Yuting Wei
  • Yuejie Chi
  • Yuantao Gu
  • Yuxin Chen

We investigate the sample efficiency of reinforcement learning in a $\gamma$-discounted infinite-horizon Markov decision process (MDP) with state space S and action space A, assuming access to a generative model. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, prior results suffer from a sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $ |S| |A| / (1-\gamma)^2 $ (up to some log factor). The current paper overcomes this barrier by certifying the minimax optimality of model-based reinforcement learning as soon as the sample size exceeds the order of $ |S| |A| / (1-\gamma) $ (modulo some log factor). More specifically, a perturbed model-based planning algorithm provably finds an $\epsilon$-optimal policy with an order of $ |S| |A| / ((1-\gamma)^3\epsilon^2 ) $ samples (up to log factor) for any $0< \epsilon < 1/(1-\gamma)$. Along the way, we derive improved (instance-dependent) guarantees for model-based policy evaluation. To the best of our knowledge, this work provides the first minimax-optimal guarantee in a generative model that accommodates the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically impossible).

NeurIPS Conference 2020 Conference Paper

Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

  • Gen Li
  • Yuting Wei
  • Yuejie Chi
  • Yuantao Gu
  • Yuxin Chen

Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space S and action space A, we demonstrate that the $ \ell_{\infty} $-based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise $\epsilon$-accurate estimate of the Q-function --- is at most on the order of $ \frac{1}{ \mu_{\min}(1-\gamma)^5 \epsilon^2 }+ \frac{ t_{\mathsf{mix}} }{ \mu_{\min}(1-\gamma) } $ up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $ t_{\mathsf{mix}} $ and $ \mu_{\min} $ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the expense taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A|. Further, the scaling on the discount complexity can be improved by means of variance reduction.

AAAI Conference 2020 Conference Paper

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

  • Gen Li
  • Nan Duan
  • Yuejian Fang
  • Ming Gong
  • Daxin Jiang

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrowing ideas from cross-lingual pretrained models, such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), both visual and linguistic contents are fed into a multi-layer Transformer (Vaswani et al. 2017) for the cross-modal pre-training, where three pre-training tasks are employed, including Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks and show the powerful ability of the cross-modal pre-training.

NeurIPS Conference 2019 Conference Paper

Nonconvex Low-Rank Tensor Completion from Noisy Data

  • Changxiao Cai
  • Gen Li
  • H. Vincent Poor
  • Yuxin Chen

We study a completion problem of broad practical interest: the reconstruction of a low-rank symmetric tensor from highly incomplete and randomly corrupted observations of its entries. While a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for large-scale applications, or come with sub-optimal statistical guarantees. Focusing on "incoherent" and well-conditioned tensors of a constant CP rank, we propose a two-stage nonconvex algorithm --- (vanilla) gradient descent following a rough initialization --- that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all low-rank tensor factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e., minimal sample complexity and optimal $\ell_2$ and $\ell_{\infty}$ statistical accuracy). The insights conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.

IJCAI Conference 2018 Conference Paper

MASTER: across Multiple social networks, integrate Attribute and STructure Embedding for Reconciliation

  • Sen Su
  • Li Sun
  • Zhongbao Zhang
  • Gen Li
  • Jielun Qu

Recently, reconciling social networks has received significant attention. Most of the existing studies have limitations in the following three aspects: multiplicity, comprehensiveness and robustness. To address these three limitations, we rethink this problem and propose the MASTER framework, i.e., across Multiple social networks, integrate Attribute and STructure Embedding for Reconciliation. In this framework, we first design a novel Constrained Dual Embedding model that simultaneously embeds and reconciles multiple social networks, formulating our problem as a unified optimization. To solve this optimization, we then design an effective algorithm called NS-Alternating. We also prove that this algorithm converges to KKT points. Through extensive experiments on real-world datasets, we demonstrate that MASTER outperforms the state-of-the-art approaches.