Arrow Research search

Author name cluster

Dong Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers

21

NeurIPS Conference 2025 Conference Paper

FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency

  • Yifei Su
  • Ning Liu
  • Dong Chen
  • Zhen Zhao
  • Kun Wu
  • Meng Li
  • Zhiyuan Xu
  • Zhengping Che

Generative modeling-based visuomotor policies have been widely adopted in robotic manipulation, owing to their ability to model multimodal action distributions. However, the high inference cost of multi-step sampling limits their applicability in real-time robotic systems. Existing approaches accelerate sampling in generative modeling-based visuomotor policies by adapting techniques originally developed to speed up image generation. However, a major distinction exists: image generation typically produces independent samples without temporal dependencies, while robotic manipulation requires generating action trajectories with continuity and temporal coherence. To this end, we propose FreqPolicy, a novel approach that first imposes frequency consistency constraints on flow-based visuomotor policies. Our work enables the action model to capture temporal structure effectively while supporting efficient, high-quality one-step action generation. Concretely, we introduce a frequency consistency constraint objective that enforces alignment of frequency-domain action features across different timesteps along the flow, thereby promoting convergence of one-step action generation toward the target distribution. In addition, we design an adaptive consistency loss to capture structural temporal variations inherent in robotic manipulation tasks. We assess FreqPolicy on $53$ tasks across $3$ simulation benchmarks, proving its superiority over existing one-step action generators. We further integrate FreqPolicy into the vision-language-action (VLA) model and achieve acceleration without performance degradation on $40$ tasks of Libero. We also demonstrate its efficiency and effectiveness in real-world robotic scenarios, with an inference frequency of $93.5$ Hz.
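The core idea of comparing action trajectories in the frequency domain can be sketched in a few lines; the loss form, magnitude comparison, and names below are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def freq_consistency_loss(actions_a, actions_b):
    """Mean squared difference between the FFT magnitudes of two
    action trajectories of shape (T, action_dim). Illustrative only."""
    fa = np.abs(np.fft.rfft(actions_a, axis=0))
    fb = np.abs(np.fft.rfft(actions_b, axis=0))
    return float(((fa - fb) ** 2).mean())

# A toy 2-DoF action trajectory over 32 timesteps.
t = np.linspace(0, 1, 32)
traj = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)

print(freq_consistency_loss(traj, traj))          # 0.0 for identical trajectories
print(freq_consistency_loss(traj, traj + 0.1) > 0)
```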

IJCAI Conference 2025 Conference Paper

Logic Distillation: Learning from Code Function by Function for Decision-making Tasks

  • Dong Chen
  • Shilin Zhang
  • Fei Gao
  • Yueting Zhuang
  • Siliang Tang
  • Qidong Liu
  • Mingliang Xu

Large language models (LLMs) have garnered increasing attention owing to their powerful comprehension and generation capabilities. Generally, larger LLMs (L-LLMs) that require paid interfaces exhibit significantly superior performance compared to smaller LLMs (S-LLMs) that can be deployed on a variety of devices. Knowledge distillation (KD) aims to empower S-LLMs with the capabilities of L-LLMs, but S-LLMs merely mimic the outputs of L-LLMs, failing to acquire the powerful decision-making capability needed for new situations. Consequently, S-LLMs are helpless when it comes to continuous decision-making tasks that require logical reasoning. To tackle the identified challenges, we propose a novel framework called Logic Distillation (LD). Initially, LD employs L-LLMs to instantiate complex instructions into discrete functions and illustrates their usage to establish a function base. Subsequently, LD fine-tunes S-LLMs based on the function base to learn the logic employed by L-LLMs in decision-making. During testing, S-LLMs will yield decision-making outcomes, function by function, based on current states. Experiments demonstrate that with the assistance of LD, S-LLMs can achieve outstanding results in continuous decision-making tasks, comparable to, or even surpassing, those of L-LLMs. The code and data for the proposed method are provided for research purposes at https://github.com/Anfeather/Logic-Distillation.

IROS Conference 2025 Conference Paper

Model-Free Catheter Delivery Strategy for Robotic Transcatheter Tricuspid Valve Replacement

  • Haichuan Lin
  • Yiping Xie
  • Ziqi Wang
  • Dong Chen
  • Longyue Tan
  • Weizhao Wang
  • Yuen Chiu Ng
  • Xilong Hou 0001

Transcatheter tricuspid valve replacement (TTVR) has emerged as a promising minimally invasive procedure for treating severe tricuspid regurgitation (TR). However, accurate catheter delivery remains a significant challenge, primarily due to the reliance on 2D vision feedback, complex catheter kinematics, and camera-to-robot pose calibration, which are difficult to generalize across patients. To address these issues, this paper presents a model-free robotic catheter delivery strategy for TTVR using Data-Enabled Predictive Control (DeePC). This approach leverages data-driven control to optimize catheter positioning without prior knowledge of the system's dynamics, eliminating the need for complex kinematic models or camera calibration. The proposed method incorporates environmental constraints to ensure the safety of the procedure, delivering the catheter to the desired location with high accuracy across varying catheters and camera poses. Experimental results demonstrate the effectiveness and versatility of the approach, suggesting its potential for broader applications in robotic-assisted surgeries. This work presents a new perspective for vision-based robotic TTVR, as well as other clinical interventions involving robotic catheter control.

IROS Conference 2025 Conference Paper

RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

  • Liudi Yang
  • Yang Bai
  • George Eskandar
  • Fengyi Shen
  • Mohammad Altillawi
  • Dong Chen
  • Soumajit Majumder
  • Ziyuan Liu

We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulation in both the generated video and the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions; a second diffusion model then interpolates between each pair of generated frames, achieving the long-horizon video. 2) We propose a semantics-preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.

IROS Conference 2025 Conference Paper

RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping

  • Yang Bai
  • Liudi Yang
  • George Eskandar
  • Fengyi Shen
  • Dong Chen
  • Mohammad Altillawi
  • Ziyuan Liu
  • Gitta Kutyniok

Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping the robotic arm in one video with another, a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their respective advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism, and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.

NeurIPS Conference 2024 Conference Paper

A Retrospective on the Robot Air Hockey Challenge: Benchmarking Robust, Reliable, and Safe Learning Techniques for Real-world Robotics

  • Puze Liu
  • Jonas Günster
  • Niklas Funk
  • Simon Gröger
  • Dong Chen
  • Haitham Bou-Ammar
  • Julius Jankowski
  • Ante Marić

Machine learning methods have a groundbreaking impact in many application domains, but their application on real robotic platforms is still limited. Despite the many challenges associated with combining machine learning technology with robotics, robot learning remains one of the most promising directions for enhancing the capabilities of robots. When deploying learning-based approaches on real robots, extra effort is required to address the challenges posed by various real-world factors. To investigate the key factors influencing real-world deployment and to encourage original solutions from different researchers, we organized the Robot Air Hockey Challenge at the NeurIPS 2023 conference. We selected the air hockey task as a benchmark, encompassing low-level robotics problems and high-level tactics. Different from other machine learning-centric benchmarks, participants need to tackle practical challenges in robotics, such as the sim-to-real gap, low-level control issues, safety problems, real-time requirements, and the limited availability of real-world data. Furthermore, we focus on a dynamic environment, removing the typical assumption of quasi-static motions of other real-world benchmarks. The competition's results show that solutions combining learning-based approaches with prior knowledge outperform those relying solely on data when real-world deployment is challenging. Our ablation study reveals which real-world factors may be overlooked when building a learning-based solution. The successful real-world air hockey deployment of best-performing agents sets the foundation for future competitions and follow-up research directions.

ICML Conference 2024 Conference Paper

AegisFL: Efficient and Flexible Privacy-Preserving Byzantine-Robust Cross-silo Federated Learning

  • Dong Chen
  • Hongyuan Qu
  • Guangwu Xu

Privacy attacks and poisoning attacks are two of the thorniest problems in federated learning (FL). Homomorphic encryption (HE), which allows certain mathematical operations to be performed in the ciphertext state, provides a way to solve these two problems simultaneously. However, existing Paillier-based and CKKS-based privacy-preserving Byzantine-robust FL (PBFL) solutions not only suffer from low efficiency but also expose the final model to the server. Additionally, these methods are limited to one robust aggregation algorithm (AGR) and are therefore vulnerable to AGR-tailored poisoning attacks. In this paper, we present AegisFL, an efficient PBFL system that provides the flexibility to change the AGR. We first observe that the core of the existing advanced AGRs is to calculate inner products, $L_2$ norms, and mean values of vectors. Based on this observation, we tailor a packing scheme for PBFL which fits perfectly with RLWE-based fully homomorphic encryption. Under this packing scheme, the server only needs to perform one ciphertext multiplication to construct any required AGR, while the global model belongs only to honest clients. Finally, we conduct extensive experiments on different datasets and adversary settings, which confirm the effectiveness and efficiency of our scheme.

AAAI Conference 2024 Conference Paper

Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance

  • Dong Chen
  • Yueting Zhuang
  • Shuo Zhang
  • Jinfeng Liu
  • Su Dong
  • Siliang Tang

Pretrained large models, particularly large language models, have garnered increasing attention, as they have demonstrated remarkable abilities through contextual learning. Pretrained large models are increasingly recognized as fundamental tools for solving various tasks. However, the substantial computational demands of large models have dissuaded most product teams and individuals from running them. In such scenarios, to leverage the exceptional performance of large models, one must solely depend on costly APIs, further burdening product teams and individuals. On the other hand, despite the overall inferior performance of small models compared to large models, there are certain distributions where small models can achieve comparable or even superior results. For instance, during training, small models may become trapped in a local optimum that is unique to certain distributions, leading to superior performance. Hence, we propose Data Shunt (DS), a general paradigm for collaboration of small and large models. DS not only substantially reduces the cost associated with deploying large models but also effectively enhances overall performance. Specifically, DS determines the shunting direction by evaluating the confidence level of small models. When the confidence level falls below a specific threshold, the input data is forwarded to large models. To further leverage the advantages of the small and large models, we introduce Prompt Pruning (PP) and 2-Stage Confidence Distillation (2CD), which facilitate mutual collaboration, leading to better results and less cost. The remarkable performance across diverse modalities and tasks demonstrates the superiority of the proposed DS over large models. For instance, ChatGPT achieves an accuracy of 94.43% on Amazon Product sentiment analysis, and DS achieves an accuracy of 95.64%, while the cost has been reduced to only 31.18%. 
The code for the proposed method is provided for research purposes at https://github.com/Anfeather/Data-Shunt.
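The confidence-based shunting rule described above can be sketched as follows; the threshold value, model interfaces, and toy stand-in models are illustrative assumptions, not the paper's implementation:

```python
def data_shunt(x, small_model, large_model, threshold=0.9):
    """Route an input to the small model unless its confidence is low.

    `small_model` and `large_model` are assumed to return
    (label, confidence) pairs; the 0.9 threshold is illustrative.
    """
    label, confidence = small_model(x)
    if confidence >= threshold:
        return label, "small"      # cheap path: small model is confident
    label, _ = large_model(x)      # fall back to the costly large model
    return label, "large"

# Toy stand-ins for the two models: the "small" model is only
# confident on short inputs.
small = lambda x: ("positive", 0.95) if len(x) < 20 else ("positive", 0.4)
large = lambda x: ("negative", 0.99)

print(data_shunt("great product", small, large))
print(data_shunt("a long ambiguous review text here", small, large))
```

Only inputs whose small-model confidence falls below the threshold ever incur the large-model cost, which is the source of the reported savings.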

IROS Conference 2024 Conference Paper

Enhancing 3D Single Object Tracking with Efficient Point Cloud Segmentation

  • Yushi Yang
  • Baojie Fan
  • Yuyu Jiang
  • Wuyang Zhou
  • Dong Chen
  • Hongxin Xu

3D single object tracking (SOT) based on point cloud has attracted much attention due to its important role in machine vision and autonomous driving. Recently, M²-Track proposed a two-stage tracking structure centered on motion, but it ignores the effect of segmentation errors in sparse point cloud scenarios, which hinders the ability of networks to accurately represent tracking targets. To solve these problems, we propose an efficient 3D single object tracker (abbr. EST) that can effectively segment point cloud features. Firstly, the proposed fusion segmentation module makes up for the feature loss caused by the downsampling strategy and enhances the ability of the network to recognize foreground points. In addition, the global embedded module is used to further focus on the crucial features of the target. This module provides global information by using residual networks and adding background information. Numerous experiments conducted on the KITTI and NuScenes benchmarks show that EST achieves superior point cloud tracking in both performance and efficiency.

AAAI Conference 2024 Conference Paper

EPSD: Early Pruning with Self-Distillation for Efficient Model Compression

  • Dong Chen
  • Ning Liu
  • Yichen Zhu
  • Zhengping Che
  • Rui Ma
  • Fachao Zhang
  • Xiaofeng Mou
  • Yi Chang

Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. Recent work `Prune, then Distill' reveals that a pruned student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. In addition to compressing models, recent compression techniques also emphasize the aspect of efficiency. Early pruning demands significantly less computational cost than conventional pruning methods, as it does not require a large pre-trained model. Likewise, a special case of KD, known as self-distillation (SD), is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose a framework named Early Pruning with Self-Distillation (EPSD), which identifies and preserves distillable weights during early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network's trainability for compression. Instead of simply combining pruning and SD, EPSD enables the pruned network to favor SD by keeping more distillable weights before training, ensuring better distillation of the pruned network. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.

NeurIPS Conference 2024 Conference Paper

GaussianCube: A Structured and Explicit Radiance Representation for 3D Generative Modeling

  • Bowen Zhang
  • Yiji Cheng
  • Jiaolong Yang
  • Chunyu Wang
  • Feng Zhao
  • Yansong Tang
  • Dong Chen
  • Baining Guo

We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling. Existing radiance representations either require an implicit feature decoder, which significantly degrades the modeling power of the representation, or are spatially unstructured, making them difficult to integrate with mainstream 3D diffusion methods. We derive GaussianCube by first using a novel densification-constrained Gaussian fitting algorithm, which yields high-accuracy fitting using a fixed number of free Gaussians, and then rearranging these Gaussians into a predefined voxel grid via Optimal Transport. Since GaussianCube is a structured grid representation, it allows us to use a standard 3D U-Net as our backbone in diffusion modeling without elaborate designs. More importantly, the high-accuracy fitting of the Gaussians allows us to achieve a high-quality representation with one to two orders of magnitude fewer parameters than previous structured representations of comparable quality. The compactness of GaussianCube greatly eases the difficulty of 3D generative modeling. Extensive experiments conducted on unconditional and class-conditioned object generation, digital avatar creation, and text-to-3D synthesis all show that our model achieves state-of-the-art generation results both qualitatively and quantitatively, underscoring the potential of GaussianCube as a highly accurate and versatile radiance representation for 3D generative modeling.

NeurIPS Conference 2024 Conference Paper

Scaling the Codebook Size of VQ-GAN to 100,000 with a Utilization Rate of 99%

  • Lei Zhu
  • Fangyun Wei
  • Yanye Lu
  • Dong Chen

In the realm of image quantization exemplified by VQGAN, the process encodes images into discrete tokens drawn from a codebook with a predefined size. Recent advancements, particularly with LLAMA 3, reveal that enlarging the codebook significantly enhances model performance. However, VQGAN and its derivatives, such as VQGAN-FC (Factorized Codes) and VQGAN-EMA, continue to grapple with challenges related to expanding the codebook size and enhancing codebook utilization. For instance, VQGAN-FC is restricted to learning a codebook with a maximum size of 16,384, maintaining a typically low utilization rate of less than 12% on ImageNet. In this work, we propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000, achieving a utilization rate exceeding 99%. Unlike previous methods that optimize each codebook entry, our approach begins with a codebook initialized with 100,000 features extracted by a pre-trained vision encoder. Optimization then focuses on training a projector that aligns the entire codebook with the feature distributions of the encoder in VQGAN-LC. We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion- and flow-based generative models.
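The quantization step behind a frozen codebook plus a trained projector can be sketched with numpy; the sizes, the linear projector, and nearest-neighbor lookup below are illustrative assumptions rather than the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen codebook of features from a pre-trained encoder (toy sizes;
# the paper scales this to 100,000 entries).
codebook = rng.normal(size=(1000, 16))
proj = rng.normal(size=(16, 16))   # the only trained component in this sketch

def quantize(z):
    """Map encoder features z of shape (N, 16) to their nearest
    projected codebook entries, returning (quantized, indices)."""
    c = codebook @ proj                                   # project whole codebook
    d = ((z[:, None, :] - c[None, :, :]) ** 2).sum(-1)    # squared distances (N, 1000)
    idx = d.argmin(axis=1)                                # nearest entry per feature
    return c[idx], idx

z = rng.normal(size=(4, 16))
zq, idx = quantize(z)
print(zq.shape, idx.shape)   # (4, 16) (4,)
```

Because the codebook itself stays fixed, every entry remains a plausible encoder feature, which is one intuition for the high utilization rate.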

AAAI Conference 2023 Conference Paper

FreeEnricher: Enriching Face Landmarks without Additional Cost

  • Yangyu Huang
  • Xi Chen
  • Jongyoo Kim
  • Hao Yang
  • Chong Li
  • Jiaolong Yang
  • Dong Chen

Recent years have witnessed significant growth in face alignment. Though dense facial landmarks are highly demanded in various scenarios, e.g., cosmetic medicine and facial beautification, most works only consider sparse face alignment. To address this problem, we present a framework that can enrich landmark density from existing sparse landmark datasets, e.g., 300W with 68 points and WFLW with 98 points. Firstly, we observe that the local patches along each semantic contour are highly similar in appearance. Then, we propose a weakly-supervised idea of learning the refinement ability on original sparse landmarks and adapting this ability to enriched dense landmarks. Meanwhile, several operators are devised and organized together to implement the idea. Finally, the trained model is applied as a plug-and-play module to existing face alignment networks. To evaluate our method, we manually label the dense landmarks on the 300W testset. Our method yields state-of-the-art accuracy not only on the newly-constructed dense 300W testset but also on the original sparse 300W and WFLW testsets, without additional cost.

IJCAI Conference 2023 Conference Paper

Locate, Refine and Restore: A Progressive Enhancement Network for Camouflaged Object Detection

  • Xiaofei Li
  • Jiaxin Yang
  • Shuohao LI
  • Jun Lei
  • Jun Zhang
  • Dong Chen

Camouflaged Object Detection (COD) aims to segment objects that blend in with their surroundings. Most existing methods mainly tackle this issue with a single-stage framework, which tends to degrade performance in the face of small objects, low-contrast objects, and objects with diverse appearances. In this paper, we propose a novel Progressive Enhancement Network (PENet) for COD by imitating the human visual detection system, which follows a three-stage detection process: locate objects, refine textures, and restore boundaries. Specifically, our PENet contains three key modules, i.e., the object location module (OLM), the group attention module (GAM), and the context feature restoration module (CFRM). The OLM is designed to position the object globally, the GAM is developed to refine both high-level semantic and low-level texture feature representations, and the CFRM is leveraged to effectively aggregate multi-level features for progressively restoring the clear boundary. Extensive results demonstrate that our PENet significantly outperforms 32 state-of-the-art methods on four widely used benchmark datasets.

AAAI Conference 2023 Conference Paper

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

  • Xiaoyi Dong
  • Jianmin Bao
  • Ting Zhang
  • DongDong Chen
  • Weiming Zhang
  • Lu Yuan
  • Dong Chen
  • Fang Wen

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.

IJCAI Conference 2022 Conference Paper

I²R-Net: Intra- and Inter-Human Relation Network for Multi-Person Pose Estimation

  • Yiwei Ding
  • Wenjin Deng
  • Yinglin Zheng
  • Pengfei Liu
  • Meihong Wang
  • Xuan Cheng
  • Jianmin Bao
  • Dong Chen

In this paper, we present the Intra- and Inter-Human Relation Networks I²R-Net for Multi-Person Pose Estimation. It involves two basic modules. First, the Intra-Human Relation Module operates on a single person and aims to capture Intra-Human dependencies. Second, the Inter-Human Relation Module considers the relation between multiple instances and focuses on capturing Inter-Human interactions. The Inter-Human Relation Module can be made very lightweight by reducing the resolution of the feature map, yet it learns useful relation information that significantly boosts the performance of the Intra-Human Relation Module. Even without bells and whistles, our method can compete with or outperform current competition winners. We conduct extensive experiments on the COCO, CrowdPose, and OCHuman datasets. The results demonstrate that the proposed model surpasses all the state-of-the-art methods. Concretely, the proposed method achieves 77.4% AP on the CrowdPose dataset and 67.8% AP on the OCHuman dataset, respectively, outperforming existing methods by a large margin. Additionally, the ablation study and visualization analysis also prove the effectiveness of our model.

IROS Conference 2021 Conference Paper

Reinforcement Learning based Negotiation-aware Motion Planning of Autonomous Vehicles

  • Zhitao Wang
  • Yuzheng Zhuang
  • Qiang Gu
  • Dong Chen
  • Hongbo Zhang
  • Wulong Liu

For autonomous vehicles integrating onto roadways with human traffic participants, it is necessary to understand and adapt to the participants' intentions by responding in predictable ways. This paper proposes a reinforcement learning based negotiation-aware motion planning framework, which adopts RL to adjust the driving style of the planner by adaptively modifying the prediction horizon length of the motion planner in real time. The framework models the interaction between the autonomous vehicle and other traffic participants as a Markov Decision Process. A temporal sequence of occupancy grid maps is taken as input for the RL module to embed implicit intention reasoning. Curriculum learning is employed to enhance the training efficiency and the robustness of the algorithm. We applied our method to narrow lane navigation in both simulation and the real world, demonstrating that the proposed method outperforms the common alternative due to its advantage in alleviating the social dilemma problem with proper negotiation skills.

IJCAI Conference 2020 Conference Paper

A Speech-to-Knowledge-Graph Construction System

  • Xiaoyi Fu
  • Jie Zhang
  • Hao Yu
  • Jiachen Li
  • Dong Chen
  • Jie Yuan
  • Xindong Wu

This paper presents HAO-Graph, a system that generates and visualizes knowledge graphs from a speech in real time. When a user speaks to the system, HAO-Graph transforms the voice into knowledge graphs with key phrases from the original speech as nodes and edges. Different from language-to-language systems, such as Chinese-to-English and English-to-English, HAO-Graph converts a speech into graphs, and is the first of its kind. The effectiveness of our HAO-Graph system is verified by a satisfaction survey conducted during a two-hour chairman's talk in front of two thousand participants at an annual meeting.

NeurIPS Conference 2020 Conference Paper

GreedyFool: Distortion-Aware Sparse Adversarial Attack

  • Xiaoyi Dong
  • DongDong Chen
  • Jianmin Bao
  • Chuan Qin
  • Lu Yuan
  • Weiming Zhang
  • Nenghai Yu
  • Dong Chen

Modern deep neural networks (DNNs) are vulnerable to adversarial samples. Sparse adversarial samples are a special branch of adversarial samples that can fool the target model by perturbing only a few pixels. The existence of the sparse adversarial attack shows that DNNs are much more vulnerable than people believed, which also offers a new aspect for analyzing DNNs. However, current sparse adversarial attack methods still have some shortcomings in both sparsity and invisibility. In this paper, we propose a novel two-stage distortion-aware greedy-based method dubbed ''GreedyFool''. Specifically, it first selects the most effective candidate positions to modify by considering both the gradient (for adversarial effect) and the distortion map (for invisibility), then drops some less important points in the reduce stage. Experiments demonstrate that, compared with the state-of-the-art method, we only need to modify 3 times fewer pixels under the same sparse perturbation setting. For targeted attacks, the success rate of our method is 9.96% higher than that of the state-of-the-art method under the same pixel budget.
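The two-stage greedy idea can be illustrated with a toy sketch that scores pixels by gradient magnitude against a distortion cost, greedily picks the best candidates, and then drops the least useful ones; the scoring rule and drop criterion here are assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

grad = rng.random((8, 8))              # |gradient| per pixel (adversarial benefit)
distortion = rng.random((8, 8)) + 0.1  # perceptual cost per pixel (invisibility)

def greedy_select(grad, distortion, n_add=10, n_drop=3):
    """Pick pixel indices with the best benefit/cost ratio (add stage),
    then drop the weakest of them (reduce stage)."""
    score = (grad / distortion).ravel()
    picked = list(np.argsort(score)[::-1][:n_add])  # add stage: top candidates
    picked.sort(key=lambda i: score[i])             # reduce stage:
    return sorted(picked[n_drop:])                  # drop least useful points

pixels = greedy_select(grad, distortion)
print(len(pixels))   # 7 pixels kept out of 10 selected
```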

NeurIPS Conference 1989 Conference Paper

Higher Order Recurrent Networks and Grammatical Inference

  • C. Giles
  • Guo-Zheng Sun
  • Hsing-Hen Chen
  • Yee-Chun Lee
  • Dong Chen

A higher order single layer recursive network easily learns to simulate a deterministic finite state machine and recognize regular grammars. When an enhanced version of this neural net state machine is connected through a common error term to an external analog stack memory, the combination can be interpreted as a neural net pushdown automaton. The neural net finite state machine is given the primitives push and pop, and is able to read the top of the stack. Through a gradient descent learning rule derived from the common error function, the hybrid network learns to effectively use the stack actions to manipulate the stack memory and to learn simple context-free grammars.

INTRODUCTION. Biological networks readily and easily process temporal information; artificial neural networks should do the same. Recurrent neural network models permit the encoding and learning of temporal sequences. There are many recurrent neural net models; for example see [Jordan 1986, Pineda 1987, Williams & Zipser 1988]. Nearly all encode the current state representation of the model in the activity of the neurons, and the next state is determined by the current state and input. From an automata perspective, this dynamical structure is a state machine. One formal model of sequences, and of the machines that generate and recognize them, is formal grammars and their respective automata. These models formalize some of the foundations of computer science. In the Chomsky hierarchy of formal grammars [Hopcroft & Ullman 1979], the simplest level of complexity is defined by the finite state machine and its regular grammars. (All machines and grammars described here are deterministic.) The next level of complexity is described by pushdown automata and their associated context-free grammars. The pushdown automaton is a finite state machine with the added power to use a stack memory.

Neural networks should be able to perform the same type of computation and thus solve such learning problems as grammatical inference [Fu 1982]. Simple grammatical inference is defined as the problem of finding (learning) a grammar from a finite set of strings, often called the teaching sample. Recall that a grammar (phrase-structured) is defined as a 4-tuple (N, V, P, S), where N and V are nonterminal and terminal vocabularies, P is a finite set of production rules, and S is the start symbol. Here grammatical inference is also defined as the learning of the machine that recognizes the teaching and testing samples. Potential applications of grammatical inference include such varied areas as pattern recognition, information retrieval, programming language design, translation and compiling, and graphics languages [Fu 1982]. There has been a great deal of interest in teaching neural nets to recognize grammars and simulate automata [Allen 1989, Jordan 1986, Pollack 1989, Servan-Schreiber et al. 1989, Williams & Zipser 1988]. Some important extensions of that work are discussed here. In particular, we construct recurrent higher order neural net state machines which have no hidden layers and seem to be at least as powerful as any neural net multilayer state machine discussed so far; for example, the learning time and training sample size are significantly reduced. In addition, we integrate this neural net finite state machine with an external stack memory and inform the network, through a common objective function, that it has at its disposal the symbol at the top of the stack and the operation primitives of push and pop. By devising a common error function which integrates the stack and the neural net state machine, this hybrid structure learns to effectively use the stack to recognize context-free grammars. In the interesting work of [Williams & Zipser 1988], a recurrent net learns only the state machine part of a Turing Machine, since the associated move, read, and write operations for each input string are known and are given as part of the training set. However, the model we present learns how to manipulate the push, pop, and read primitives of an external stack memory, plus learns the additional necessary state operations and structure.

HIGHER ORDER RECURRENT NETWORK. The recurrent neural network utilized can be considered a higher order modification of the network model developed by [Williams & Zipser 1988]. Recall that in a recurrent net the activation state S of the neurons at time (t+1) is defined as in a state machine automaton: S(t+1) = F(S(t), I(t); W) (1), where F maps the state S and the input I at time t to the next state. The weight matrix W forms the mapping and is usually learned. We use a higher order form for this mapping.
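A common instantiation of such a higher order mapping is a second-order weight tensor acting on the product of the current state and the current input. The numpy sketch below assumes that form, with illustrative sizes and a sigmoid nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_inputs = 4, 3
W = rng.normal(size=(n_states, n_states, n_inputs))  # second-order weights W_ijk

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(S, I):
    """Second-order update: S_i(t+1) = sigmoid(sum_{j,k} W_ijk * S_j(t) * I_k(t))."""
    return sigmoid(np.einsum("ijk,j,k->i", W, S, I))

S = np.zeros(n_states)
S[0] = 1.0                        # one-hot start state
for symbol in [0, 1, 0]:          # a short input string over a 3-symbol alphabet
    I = np.eye(n_inputs)[symbol]  # one-hot input encoding
    S = step(S, I)
print(S.shape)   # (4,)
```

Because the update multiplies state and input activations directly, a one-hot state and one-hot input select a single slice of W, mirroring a finite state machine's transition table.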