Arrow Research search

Author name cluster

Yitong Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers

21

AAAI Conference 2026 Conference Paper

Real Garment Benchmark (RGBench): A Comprehensive Benchmark for Robotic Garment Manipulation Featuring a High-Fidelity Scalable Simulator

  • Wenkang Hu
  • Xincheng Tang
  • Yanzhi E
  • Yitong Li
  • Zhengjie Shu
  • Wei Li
  • Huamin Wang
  • Ruigang Yang

While there has been significant progress in using simulated data to learn robotic manipulation of rigid objects, applying its success to deformable objects has been hindered by the lack of both deformable object models and realistic non-rigid body simulators. In this paper, we present Real Garment Benchmark (RGBench), a comprehensive benchmark for robotic manipulation of garments. It features a diverse set of over 6000 garment mesh models, a new high-performance simulator, and a comprehensive protocol to evaluate garment simulation quality with carefully measured real garment dynamics. Our experiments demonstrate that our simulator outperforms currently available cloth simulators by a large margin, reducing simulation error by 20% while running 3 times faster. We will publicly release RGBench to accelerate future research in robotic garment manipulation.

NeurIPS Conference 2025 Conference Paper

AmorLIP: Efficient Language-Image Pretraining via Amortization

  • Haotian Sun
  • Yitong Li
  • Yuchen Zhuang
  • Niao He
  • Hanjun Dai
  • Bo Dai

Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.

IS Journal 2025 Journal Article

Boundaries Matters: A Novel Multibranch Semisupervised Semantic Segmentation Method

  • Yitong Li
  • Changlun Zhang
  • Hengyou Wang

In recent years, semisupervised semantic segmentation (SSS) research has been progressing rapidly. Existing methods usually ignore the classification of detailed pixels, such as boundaries, resulting in degraded segmentation performance. To overcome this challenge, we propose a new multibranch SSS framework, BoundaryMatch, which combines image- and feature-level perturbation branches with a boundary detail guidance branch, all utilizing a shared encoder. Specifically, this boundary module enhances segmentation by integrating the learning of spatial information into low-level layers in a single-stream manner. Finally, the low-level features and deeper features are fused together to predict the final segmentation result and achieve accurate correction of boundary pixels. This multibranch approach improves on the shortcomings of consistency regularization methods that focus only on maintaining the global consistency of the image. Extensive experiments conducted on the Cityscapes and PASCAL VOC 2012 datasets demonstrate that the method proposed in this article effectively enhances the performance of SSS.

ICML Conference 2025 Conference Paper

Mixture of Lookup Experts

  • Shibo Jie
  • Yehui Tang 0001
  • Kai Han 0002
  • Yitong Li
  • Duyu Tang
  • Zhi-Hong Deng 0001
  • Yunhe Wang 0001

Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the expert’s computation results based on input ids and load them into VRAM, and thus the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE. Code: https://github.com/JieShibo/MoLE.

ICRA Conference 2025 Conference Paper

Neural Dynamics Augmented Diffusion Policy

  • Ruihai Wu
  • Haozhe Chen
  • Mingtong Zhang 0003
  • Haoran Lu
  • Yitong Li
  • Yunzhu Li

Imitation learning has been proven effective in mimicking demonstrations across various robotic manipulation tasks. However, to develop robust policies, current imitation methods, such as diffusion policy, require training on extensive demonstrations, making data collection labor-intensive. In contrast, model-based planning with dynamics models can effectively cover a sufficient range of configurations using only off-policy data. Yet, without the guidance of expert demonstrations, many tasks are difficult and time-consuming to plan using the dynamics models. Therefore, we take the best of both model learning and imitation learning, and propose neural dynamics augmented imitation learning that covers a large range of scene configurations with few-shot demonstrations. This method trains a robust diffusion policy in a local support region using few-shot demonstrations and rearranges objects outside this region into it using offline-trained neural dynamics models. Extensive experiments across various tasks in both simulations and real-world scenarios, including granular manipulation, contact-rich tasks, and multi-object interaction tasks, demonstrate that, trained with only 1 to 30 demonstrations, our proposed method can robustly cover a significantly larger area than the policy trained purely from the demonstrations. Our project page is available at: https://dynamics-dp.github.io.

NeurIPS Conference 2025 Conference Paper

Vision‑Language‑Vision Auto‑Encoder: Scalable Knowledge Distillation from Diffusion Models

  • Tiezheng Zhang
  • Yitong Li
  • Yu-Cheng Chou
  • Jieneng Chen
  • Alan Yuille
  • Chen Wei
  • Junfei Xiao

Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.

NeurIPS Conference 2024 Conference Paper

Diffusion Spectral Representation for Reinforcement Learning

  • Dmitry Shribak
  • Chen-Xiao Gao
  • Yitong Li
  • Chenjun Xiao
  • Bo Dai

Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite existing methods being promising, the key challenge of extending existing methods for broader real-world applications lies in the computational cost at inference time, i.e., sampling from a diffusion model is considerably slow as it often requires tens to hundreds of iterations to generate even one sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion models and energy-based models, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithm framework that enables extracting sufficient representations for value functions in Markov decision processes (MDP) and partially observable Markov decision processes (POMDP). We further demonstrate how Diff-SR facilitates efficient policy optimization and practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies to verify the benefits of Diff-SR in delivering robust and advantageous performance across various benchmarks with both fully and partially observable settings.

NeurIPS Conference 2024 Conference Paper

GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation

  • Haoran Lu
  • Ruihai Wu
  • Yitong Li
  • Sijie Li
  • Ziyu Zhu
  • Chuanruo Ning
  • Yan Shen
  • Longzan Luo

Manipulating garments and fabrics has long been a critical endeavor in the development of home-assistant robots. However, due to complex dynamics and topological structures, garment manipulations pose significant challenges. Recent successes in reinforcement learning and vision-based methods offer promising avenues for learning garment manipulation. Nevertheless, these approaches are severely constrained by current benchmarks, which offer limited diversity of tasks and unrealistic simulation behavior. Therefore, we present GarmentLab, a content-rich benchmark and realistic simulation designed for deformable object and garment manipulation. Our benchmark encompasses a diverse range of garment types, robotic systems and manipulators. The abundant tasks in the benchmark further explore the interactions between garments, deformable objects, rigid bodies, fluids, and the human body. Moreover, by incorporating multiple simulation methods such as FEM and PBD, along with our proposed sim-to-real algorithms and real-world benchmark, we aim to significantly narrow the sim-to-real gap. We evaluate state-of-the-art vision methods, reinforcement learning, and imitation learning approaches on these tasks, highlighting the challenges faced by current algorithms, notably their limited generalization capabilities. Our proposed open-source environments and comprehensive analysis show a promising boost to future research in garment manipulation by unlocking the full potential of these methods. We guarantee that we will open-source our code as soon as possible. You can watch the videos in supplementary files to learn more about the details of our work.

AAAI Conference 2024 Conference Paper

Harnessing the Power of SVD: An SVA Module for Enhanced Signal Classification

  • Lei Zhai
  • Shuyuan Yang
  • Yitong Li
  • Zhixi Feng
  • Zhihao Chang
  • Quanwei Gao

Deep learning methods have achieved outstanding performance in various signal tasks. However, due to degraded signals in real electromagnetic environments, it is crucial to seek methods that can improve the representation of signal features. In this paper, a Singular Value decomposition-based Attention (SVA) is proposed to explore the structure of signal data for adaptively enhancing intrinsic features. Using a deep neural network as a base model, SVA performs feature semantic subspace learning through a decomposition layer and combines it with an attention layer to achieve adaptive enhancement of signal features. Moreover, we consider the gradient explosion problem brought by SVA and optimize SVA to improve the stability of training. Extensive experimental results demonstrate that applying SVA to a generalized classification model can significantly improve its ability in representations, making its recognition performance competitive with, or even better than, the state-of-the-art task-specific models.

NeurIPS Conference 2024 Conference Paper

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

  • Jiajun Wang
  • Morteza Ghahremani
  • Yitong Li
  • Björn Ommer
  • Christian Wachinger

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet. The project link and code are available at https://github.com/ai-med/StablePose.

AAAI Conference 2023 Conference Paper

KPT: Keyword-Guided Pre-training for Grounded Dialog Generation

  • Qi Zhu
  • Fei Mi
  • Zheng Zhang
  • Yasheng Wang
  • Yitong Li
  • Xin Jiang
  • Qun Liu
  • Xiaoyan Zhu

Incorporating external knowledge into the response generation process is essential to building more helpful and reliable dialog agents. However, collecting knowledge-grounded conversations is often costly, calling for a better pre-trained model for grounded dialog generation that generalizes well w.r.t. different types of knowledge. In this work, we propose KPT (Keyword-guided Pre-Training), a novel self-supervised pre-training method for grounded dialog generation without relying on extra knowledge annotation. Specifically, we use a pre-trained language model to extract the most uncertain tokens in the dialog as keywords. With these keywords, we construct two kinds of knowledge and pre-train a knowledge-grounded response generation model, aiming at handling two different scenarios: (1) the knowledge should be faithfully grounded; (2) it can be selectively used. For the former, the grounding knowledge consists of keywords extracted from the response. For the latter, the grounding knowledge is additionally augmented with keywords extracted from other utterances in the same dialog. Since the knowledge is extracted from the dialog itself, KPT can be easily performed on a large volume and variety of dialogue data. We considered three data sources (open-domain, task-oriented, conversational QA) with a total of 2.5M dialogues. We conduct extensive experiments on various few-shot knowledge-grounded generation tasks, including grounding on dialog acts, knowledge graphs, persona descriptions, and Wikipedia passages. Our comprehensive experiments and analyses demonstrate that KPT consistently outperforms state-of-the-art methods on these tasks with diverse grounding knowledge.

AAAI Conference 2023 Conference Paper

Towards Diverse, Relevant and Coherent Open-Domain Dialogue Generation via Hybrid Latent Variables

  • Bin Sun
  • Yitong Li
  • Fei Mi
  • Weichao Wang
  • Yiwei Li
  • Kan Li

Conditional variational models, using either continuous or discrete latent variables, are powerful for open-domain dialogue response generation. However, previous works show that continuous latent variables tend to reduce the coherence of generated responses. In this paper, we also found that discrete latent variables have difficulty capturing more diverse expressions. To tackle these problems, we combine the merits of both continuous and discrete latent variables and propose a Hybrid Latent Variable (HLV) method. Specifically, HLV constrains the global semantics of responses through discrete latent variables and enriches responses with continuous latent variables. Thus, we diversify the generated responses while maintaining relevance and coherence. In addition, we propose Conditional Hybrid Variational Transformer (CHVT) to construct and to utilize HLV with transformers for dialogue generation. Through fine-grained symbolic-level semantic information and additive Gaussian mixing, we construct the distribution of continuous variables, prompting the generation of diverse expressions. Meanwhile, to maintain the relevance and coherence, the discrete latent variable is optimized by self-separation training. Experimental results on two dialogue generation datasets (DailyDialog and Opensubtitles) show that CHVT is superior to traditional transformer-based variational mechanism w.r.t. diversity, relevance and coherence metrics. Moreover, we also demonstrate the benefit of applying HLV to fine-tuning two pre-trained dialogue models (PLATO and BART-base).

AAAI Conference 2022 Conference Paper

CINS: Comprehensive Instruction for Few-Shot Learning in Task-Oriented Dialog Systems

  • Fei Mi
  • Yasheng Wang
  • Yitong Li

As the labeling cost for different modules in task-oriented dialog (ToD) systems is high, a major challenge is to learn different tasks with the least amount of labeled data. Recently, pre-trained language models (PLMs) have shown promising results for few-shot learning in ToD. To better utilize the power of PLMs, this paper proposes Comprehensive Instruction (CINS) that exploits PLMs with extra task-specific instructions. We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD, i.e., intent classification, dialog state tracking, and natural language generation. A sequence-to-sequence model (T5) is adopted to solve these three tasks in a unified framework. Extensive experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data. Empirical results demonstrate that the proposed CINS approach consistently improves techniques that finetune PLMs with raw input or short prompt.

JMLR Journal 2021 Journal Article

Estimating Uncertainty Intervals from Collaborating Networks

  • Tianhui Zhou
  • Yitong Li
  • Yuan Wu
  • David Carlson

Effective decision making requires understanding the uncertainty inherent in a prediction. In regression, this uncertainty can be estimated by a variety of methods; however, many of these methods are laborious to tune, generate overconfident uncertainty intervals, or lack sharpness (give imprecise intervals). We address these challenges by proposing a novel method to capture predictive distributions in regression by defining two neural networks with two distinct loss functions. Specifically, one network approximates the cumulative distribution function, and the second network approximates its inverse. We refer to this method as Collaborating Networks (CN). Theoretical analysis demonstrates that a fixed point of the optimization is at the idealized solution, and that the method is asymptotically consistent to the ground truth distribution. Empirically, learning is straightforward and robust. We benchmark CN against several common approaches on two synthetic and six real-world datasets, including forecasting A1c values in diabetic patients from electronic health records, where uncertainty is critical. In the synthetic data, the proposed approach essentially matches ground truth. In the real-world datasets, CN improves results on many performance metrics, including log-likelihood estimates, mean absolute errors, coverage estimates, and prediction interval widths.

ICRA Conference 2021 Conference Paper

Two-stream 2D/3D Residual Networks for Learning Robot Manipulations from Human Demonstration Videos

  • Xin Xu
  • Kun Qian 0005
  • Bo Zhou 0017
  • Shenghao Chen
  • Yitong Li

Learning manipulation skills from observing human demonstration videos is a promising aspect for intelligent robotic systems. Recent advances in video to command provide an end-to-end approach to translate a video into robot plans. However, the general video captioning methods focus more on the understanding of the full frame, while they lack the consideration of the spatio-temporal features in videos. In this paper, we propose the two-stream 2D/3D residual networks for robots to learn manipulation tasks from human demonstration videos. We integrate spatial features with 2D residual network and temporal features with 3D residual network as inputs for RNN layers. An encoder-decoder architecture is then used to encode the spatio-temporal features and sequentially generate the command words. Experimental results on an extended manipulation dataset show that our approach outperforms the state-of-the-art methods. Real-world experimental results on a Baxter robotic arm indicate that our method could produce more accurate commands from video demonstrations.

AAAI Conference 2020 Conference Paper

Dynamic Embedding on Textual Networks via a Gaussian Process

  • Pengyu Cheng
  • Yitong Li
  • Xinyuan Zhang
  • Liqun Chen
  • David Carlson
  • Lawrence Carin

Textual network embedding aims to learn low-dimensional representations of text-annotated nodes in a graph. Prior work in this area has typically focused on fixed graph structures; however, real-world networks are often dynamic. We address this challenge with a novel end-to-end node-embedding model, called Dynamic Embedding for Textual Networks with a Gaussian Process (DetGP). After training, DetGP can be applied efficiently to dynamic graphs without re-training or backpropagation. The learned representation of each node is a combination of textual and structural embeddings. Because the structure is allowed to be dynamic, our method uses the Gaussian process to take advantage of its non-parametric properties. To use both local and global graph structures, diffusion is used to model multiple hops between neighbors. The relative importance of global versus local structure for the embeddings is learned automatically. With the nonparametric nature of the Gaussian process, updating the embeddings for a changed graph structure requires only a forward pass through the learned model. Considering link prediction and node classification, experiments demonstrate the empirical effectiveness of our method compared to baseline approaches. We further show that DetGP can be straightforwardly and efficiently applied to dynamic textual networks.

NeurIPS Conference 2019 Conference Paper

Kernel-Based Approaches for Sequence Modeling: Connections to Neural Methods

  • Kevin Liang
  • Guoyin Wang
  • Yitong Li
  • Ricardo Henao
  • Lawrence Carin

We investigate time-dependent data analysis from the perspective of recurrent kernel machines, from which models with hidden units and gated memory cells arise naturally. By considering dynamic gating of the memory cell, a model closely related to the long short-term memory (LSTM) recurrent neural network is derived. Extending this setup to $n$-gram filters, the convolutional neural network (CNN), Gated CNN, and recurrent additive network (RAN) are also recovered as special cases. Our analysis provides a new perspective on the LSTM, while also extending it to $n$-gram convolutional filters. Experiments are performed on natural language processing tasks and on analysis of local field potentials (neuroscience). We demonstrate that the variants we derive from kernels perform on par with or even better than traditional neural methods. For the neuroscience application, the new models demonstrate significant improvements relative to the prior state of the art.

NeurIPS Conference 2018 Conference Paper

Diffusion Maps for Textual Network Embedding

  • Xinyuan Zhang
  • Yitong Li
  • Dinghan Shen
  • Lawrence Carin

Textual network embedding leverages rich text information associated with the network to learn low-dimensional vectorial representations of vertices. Rather than using typical natural language processing (NLP) approaches, recent research exploits the relationship of texts on the same edge to graphically embed text. However, these models neglect to measure the complete level of connectivity between any two texts in the graph. We present diffusion maps for textual network embedding (DMTE), integrating global structural information of the graph to capture the semantic relatedness between texts, with a diffusion-convolution operation applied on the text inputs. In addition, a new objective function is designed to efficiently preserve the high-order proximity using the graph diffusion. Experimental results show that the proposed approach outperforms state-of-the-art methods on the vertex-classification and link-prediction tasks.

NeurIPS Conference 2018 Conference Paper

Extracting Relationships by Multi-Domain Matching

  • Yitong Li
  • Michael Murias
  • Geraldine Dawson
  • David Carlson

In many biological and medical contexts, we construct a large labeled corpus by aggregating many sources to use in target prediction tasks. Unfortunately, many of the sources may be irrelevant to our target task, so ignoring the structure of the dataset is detrimental. This work proposes a novel approach, the Multiple Domain Matching Network (MDMN), to exploit this structure. MDMN embeds all data into a shared feature space while learning which domains share strong statistical relationships. These relationships are often insightful in their own right, and they allow domains to share strength without interference from irrelevant data. This methodology builds on existing distribution-matching approaches by assuming that source domains are varied and outcomes multi-factorial. Therefore, each domain should only match a relevant subset. Theoretical analysis shows that the proposed approach can have a tighter generalization bound than existing multiple-domain adaptation approaches. Empirically, we show that the proposed methodology handles higher numbers of source domains (up to 21 empirically), and provides state-of-the-art performance on image, text, and multi-channel time series classification, including clinically relevant data of a novel treatment of Autism Spectrum Disorder.

AAAI Conference 2018 Conference Paper

Video Generation From Text

  • Yitong Li
  • Martin Min
  • Dinghan Shen
  • David Carlson
  • Lawrence Carin

Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called “gist,” are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.

NeurIPS Conference 2017 Conference Paper

Targeting EEG/LFP Synchrony with Neural Nets

  • Yitong Li
  • Michael Murias
  • Samantha Major
  • Geraldine Dawson
  • Kafui Dzirasa
  • Lawrence Carin
  • David Carlson

We consider the analysis of Electroencephalography (EEG) and Local Field Potential (LFP) datasets, which are “big” in terms of the size of recorded data but rarely have sufficient labels required to train complex models (e.g., conventional deep learning methods). Furthermore, in many scientific applications, the goal is to be able to understand the underlying features related to the classification, which prohibits the blind application of deep networks. This motivates the development of a new model based on parameterized convolutional filters guided by previous neuroscience research; the filters learn relevant frequency bands while targeting synchrony, which are frequency-specific power and phase correlations between electrodes. This results in a highly expressive convolutional neural network with only a few hundred parameters, applicable to smaller datasets. The proposed approach is demonstrated to yield competitive (often state-of-the-art) predictive performance during our empirical tests while yielding interpretable features. Furthermore, a Gaussian process adapter is developed to combine analysis over distinct electrode layouts, allowing the joint processing of multiple datasets to address overfitting and improve generalizability. Finally, it is demonstrated that the proposed framework effectively tracks neural dynamics on children in a clinical trial on Autism Spectrum Disorder.