Arrow Research

Author name cluster

Long Ma

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

21 papers
2 author rows

Possible papers

21

AAAI Conference 2026 Conference Paper

Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

  • Yuankun Xie
  • Ruibo Fu
  • Xiaopeng Wang
  • Zhiyong Wang
  • Songjun Cao
  • Long Ma
  • Haonan Cheng
  • Long Ye

The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458× fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieves the best performance, with an average EER of 3.58% across all evaluation sets.
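
As a rough illustration of the PT-SSL idea, the sketch below prepends a small set of learnable prompt tokens to a frozen SSL front-end so that only the prompts are trained. The class, dimensions, and the stand-in transformer layer are assumptions for illustration; the wavelet component of WPT is omitted.

```python
import torch
import torch.nn as nn

class PromptTunedSSL(nn.Module):
    def __init__(self, ssl_encoder: nn.Module, dim: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = ssl_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False              # SSL front-end stays frozen
        # only these tokens are trained -- the source of the parameter savings
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, dim) frame-level audio features
        prompts = self.prompts.expand(feats.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, feats], dim=1))

# usage with a stand-in transformer layer playing the role of the SSL encoder
enc = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
out = PromptTunedSSL(enc, dim=256)(torch.randn(2, 100, 256))  # (2, 108, 256)
```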

AAAI Conference 2026 Conference Paper

Learning 3D Occupancy from Beam Overlap in 2D Rotating mmWave Radar

  • Yu Du
  • Ruifeng Nie
  • Long Ma
  • Chengpei Xu
  • Yu Liu
  • Weimin Wang

Robust 3D perception under adverse weather is critical for autonomous systems. While mmWave Radars are inherently weather-resistant, conventional 2D rotating Radar sensors lack direct elevation resolution, limiting their 3D perception ability. Although 4D imaging radars can provide elevation information, they typically suffer from limited coverage and range. In this work, we exploit a key observation about mechanically rotating 2D mmWave Radars: in each sweep, an overlap exists between adjacent azimuth beam coverage due to the width of the main lobe, which makes the reflected intensity difference imply object materials and geometric shapes, including elevation. With this observation, we propose a method that learns 3D occupancy by disentangling bird’s-eye view (BEV) layout and elevation estimation from a single-frame Radar scan. Specifically, we partition one sweep into two interleaved subsets, corresponding to overlapping beam directions, and utilize them to infer coarse geometric structure through spatial differences and intensity patterns. Extensive quantitative and qualitative evaluations on two real-world datasets demonstrate that our proposed method outperforms existing baselines. The code will be made publicly available.
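
To make the interleaved-subset idea concrete, here is a hedged sketch that splits one sweep into even/odd azimuth beams and computes their intensity difference as the elevation-bearing cue; the array shapes and the even/odd split are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def split_sweep(sweep: np.ndarray):
    """sweep: (n_azimuth, n_range) intensity map from one 2D Radar rotation."""
    even = sweep[0::2]                   # beams at azimuth indices 0, 2, 4, ...
    odd = sweep[1::2]                    # overlapping neighbours at 1, 3, 5, ...
    # adjacent main lobes overlap, so this difference reflects materials and
    # geometry (including elevation) rather than pure noise
    diff = even - odd[: even.shape[0]]
    return even, odd, diff

even, odd, diff = split_sweep(np.random.rand(400, 512))
print(diff.shape)                        # (200, 512)
```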

ICML Conference 2025 Conference Paper

Behavior-agnostic Task Inference for Robust Offline In-context Reinforcement Learning

  • Long Ma
  • Fangwei Zhong
  • Yizhou Wang 0001

The ability to adapt to new environments with noisy dynamics and unseen objectives is crucial for AI agents. In-context reinforcement learning (ICRL) has emerged as a paradigm to build adaptive policies, employing a context trajectory of the test-time interactions to infer the true task and the corresponding optimal policy efficiently without gradient updates. However, ICRL policies heavily rely on context trajectories, making them vulnerable to distribution shifts from training to testing and degrading performance, particularly in offline settings where the training data is static. In this paper, we highlight that most existing offline ICRL methods are trained for approximate Bayesian inference based on the training distribution, rendering them vulnerable to distribution shifts at test time and resulting in poor generalization. To address this, we introduce Behavior-agnostic Task Inference (BATI) for ICRL, a model-based maximum-likelihood solution to infer the task representation robustly. In contrast to previous methods that rely on a learned encoder as the approximate posterior, BATI focuses purely on dynamics, thus insulating itself against the behavior of the context collection policy. Experiments on MuJoCo environments demonstrate that BATI effectively interprets out-of-distribution contexts and outperforms other methods, even in the presence of significant environmental noise.
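
A minimal sketch of the behavior-agnostic inference step: pick the task representation z that maximizes the likelihood of the observed transitions under a learned dynamics model, so the context-collection policy never enters the objective. The dynamics network, dimensions, and Gaussian likelihood (reduced to a squared error) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# learned dynamics model p(s' | s, a, z); dims are placeholders
dynamics = nn.Sequential(nn.Linear(4 + 2 + 8, 64), nn.ReLU(), nn.Linear(64, 4))

def infer_task(states, actions, next_states, steps=200, lr=1e-2):
    z = torch.zeros(8, requires_grad=True)          # task representation
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        zb = z.expand(states.size(0), -1)
        pred = dynamics(torch.cat([states, actions, zb], dim=-1))
        # Gaussian log-likelihood up to a constant: only dynamics matter,
        # so the behavior that produced the context is irrelevant
        loss = ((pred - next_states) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()

z_hat = infer_task(torch.randn(64, 4), torch.randn(64, 2), torch.randn(64, 4))
```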

AAAI Conference 2025 Conference Paper

EchoDiffusion: Waveform Conditioned Diffusion Models for Echo-Based Depth Estimation

  • Wenjie Zhang
  • Jun Yin
  • Long Ma
  • Peng Yu
  • Xiaoheng Jiang
  • Zhen Tian
  • Mingliang Xu

To extract spatial information, depth estimation using conventional echo-based methods typically employs models with encoder-decoder architectures, such as UNet. However, these methods may face challenges in extracting fine details from echo waveforms and handling multi-scale feature extraction with high precision. To address these challenges, we introduce EchoDiffusion, a framework that incorporates diffusion models conditioned on waveform embeddings for echo-based depth estimation. This framework employs the Multi-Scale Adaptive Latent Feature Network (MALF-Net) to extract multi-scale spatial features and perform adaptive fusion, encoding the echo spectrograms into the latent space. Additionally, we propose the Echo Waveform Detail Embedder (EWDE), which leverages a pre-trained Wav2Vec model to extract detailed spatial information from echo waveforms, using these details as conditional inputs to guide the reverse diffusion process in the latent space. By embedding the echo waveforms into the reverse diffusion process, we can more accurately guide the generation of depth maps. Our extensive evaluations on the Replica and Matterport3D datasets demonstrate that EchoDiffusion establishes new benchmarks for state-of-the-art performance in echo-based depth estimation.
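
As a hedged sketch of waveform-conditioned denoising, the snippet below feeds a waveform embedding to the denoiser alongside the noisy latent and recovers the standard x0 estimate from the predicted noise. The denoiser, dimensions, and noise level are placeholders, not EchoDiffusion's actual modules.

```python
import torch
import torch.nn as nn

# placeholder denoiser conditioned on the echo-waveform embedding
denoiser = nn.Sequential(nn.Linear(64 + 32, 128), nn.ReLU(), nn.Linear(128, 64))

def predict_clean_latent(z_t, wave_emb, alpha_bar=0.5):
    # z_t: (batch, 64) noisy latent; wave_emb: (batch, 32) Wav2Vec-style features
    eps_hat = denoiser(torch.cat([z_t, wave_emb], dim=-1))
    # standard diffusion x0 estimate from the predicted noise
    return (z_t - (1 - alpha_bar) ** 0.5 * eps_hat) / alpha_bar ** 0.5

z0 = predict_clean_latent(torch.randn(2, 64), torch.randn(2, 32))
```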

NeurIPS Conference 2025 Conference Paper

Enhancing Infrared Vision: Progressive Prompt Fusion Network and Benchmark

  • Jinyuan Liu
  • Zihang Chen
  • Zhu Liu
  • Zhiying Jiang
  • Long Ma
  • Xin Fan
  • Risheng Liu

We engage with the relatively underexplored task of thermal infrared image enhancement. Existing infrared image enhancement methods primarily focus on tackling individual degradations, such as noise, contrast, and blurring, making it difficult to handle coupled degradations. Meanwhile, all-in-one enhancement methods, commonly applied to RGB sensors, often demonstrate limited effectiveness due to the significant differences in imaging models. In light of this, we first revisit the imaging mechanism and introduce a Progressive Prompt Fusion Network (PPFN). Specifically, the PPFN initially establishes prompt pairs based on the thermal imaging process. For each type of degradation, we fuse the corresponding prompt pairs to modulate the model's features, providing adaptive guidance that enables the model to better address specific degradations under single or multiple conditions. In addition, a selective recurrent training mechanism is introduced to gradually refine the model's handling of composite cases to align the enhancement process, which not only allows the model to remove camera noise and retain key structural details, but also enhances the overall contrast of the thermal image. Furthermore, we introduce the most comprehensive high-quality infrared benchmark covering a wide range of scenarios. Extensive experiments substantiate that our approach not only delivers promising visual results under specific degradation but also significantly improves performance on complex degradation scenes, achieving a notable 8.76% improvement.

ICML Conference 2025 Conference Paper

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

  • Xiong Wang
  • Yangze Li
  • Chaoyou Fu
  • Yike Zhang
  • Yunhang Shen
  • Lei Xie 0001
  • Ke Li 0015
  • Xing Sun 0001

GPT-4o’s excellent duplex speech interaction ability has given users an impressive experience. Researchers have recently proposed several multimodal LLMs to achieve user-agent speech-to-speech conversations. In this paper, we propose a novel speech-text multimodal LLM architecture called Freeze-Omni, and our main contribution is that the speech input and output modalities can be easily connected to a textual LLM while keeping the LLM’s parameters frozen throughout the training process. We effectively ensure that the intelligence of Freeze-Omni in the speech modality is at the same level as that in the text modality of its backbone LLM, while achieving low latency in the end-to-end spoken response. In addition, we also design a method to achieve duplex dialogue ability through multitask training, giving Freeze-Omni a more natural style of dialogue between users and agents. In summary, Freeze-Omni holds great potential to conduct speech-to-speech dialogue based on a multimodal LLM under the condition of a frozen LLM, avoiding the catastrophic forgetting problem caused by limited data and training resources.
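
The central training trick is easy to state in code: freeze every LLM parameter and update only the speech-side adapters. The modules below are crude stand-ins, not Freeze-Omni's actual components.

```python
import torch.nn as nn

llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2)                      # stand-in for the text backbone
speech_in = nn.Linear(80, 512)         # speech features -> LLM embedding space
speech_out = nn.Linear(512, 80)        # hidden states -> speech-token space

for p in llm.parameters():
    p.requires_grad = False            # the LLM never changes during training

# only the adapters are handed to the optimizer
trainable = [p for m in (speech_in, speech_out) for p in m.parameters()]
print(sum(p.numel() for p in trainable), "trainable parameters")
```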

NeurIPS Conference 2025 Conference Paper

From Specificity to Generality: Revisiting Generalizable Artifacts in Detecting Face Deepfakes

  • Long Ma
  • Zhiyuan Yan
  • Jin Xu
  • Yize Chen
  • Qinglang Guo
  • Zhen Bi
  • Yong Liao
  • Hui Lin

Detecting deepfakes has been an increasingly important topic, especially given the rapid development of AI generation techniques. In this paper, we ask: How can we build a universal detection framework that is effective for most facial deepfakes? One significant challenge is the wide variety of deepfake generators available, resulting in varying forgery artifacts (e.g., lighting inconsistency, color mismatch, etc.). But should we "teach" the detector to learn all these artifacts separately? It is impossible and impractical to elaborate on them all. So the core idea is to pinpoint the more common and general artifacts across different deepfakes. Accordingly, we categorize deepfake artifacts into two distinct yet complementary types: Face Inconsistency Artifacts (FIA) and Up-Sampling Artifacts (USA). FIA arise from the challenge of generating all intricate details, inevitably causing inconsistencies between the complex facial features and relatively uniform surrounding areas. USA, on the other hand, are the inevitable traces left by the generator's decoder during the up-sampling process. This categorization stems from the observation that all existing deepfakes typically exhibit one or both of these artifacts. To achieve this, we propose a new data-level pseudo-fake creation framework that constructs fake samples with only the FIA and USA, without introducing extra, less-general artifacts. Specifically, we employ super-resolution to simulate the USA, while utilising image-level self-blending on diverse facial regions to create the FIA. We surprisingly found that, with this intuitive design, a standard image classifier trained only with our pseudo-fake data can non-trivially generalize well to previously unseen deepfakes.
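
A toy version of the FIA/USA pseudo-fake recipe, with bicubic resampling standing in for a super-resolution model and a brightness tweak standing in for the self-blending transform; the fixed box and parameters are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def make_pseudo_fake(img: Image.Image, box=(64, 64, 192, 192)) -> Image.Image:
    # USA: a down/up-sampling round trip leaves interpolation traces,
    # crudely mimicking a generator decoder's up-sampling artifacts
    w, h = img.size
    out = img.resize((w // 2, h // 2), Image.BICUBIC).resize((w, h), Image.BICUBIC)
    # FIA: perturb a face-like region and blend it back, creating an
    # inconsistency between the region and its surroundings
    region = Image.eval(out.crop(box), lambda v: min(255, int(v * 1.08)))
    out.paste(region, box[:2])
    return out

rgb = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
fake = make_pseudo_fake(Image.fromarray(rgb))
```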

NeurIPS Conference 2025 Conference Paper

Reinforced Context Order Recovery for Adaptive Reasoning and Planning

  • Long Ma
  • Fangwei Zhong
  • Yizhou Wang

Modern causal language models, together with rapid developments in discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens in a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that current causal and diffusion models encounter difficulties in problems that require adaptive token generation orders to solve tractably, which we characterize with the $\mathcal{V}$-information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework to extract adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
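
The selection principle can be illustrated with a simple confidence rule: fill whichever remaining position the model predicts with the lowest entropy. This is only a heuristic stand-in; ReCOR itself learns the ordering policy with reinforcement learning.

```python
import torch

def next_position(logits: torch.Tensor, filled: torch.Tensor) -> int:
    # logits: (seq_len, vocab) predictions for every slot; filled: bool mask
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    entropy[filled] = float("inf")       # never revisit filled slots
    return int(entropy.argmin())         # generate the easiest token next

pos = next_position(torch.randn(10, 50), torch.zeros(10, dtype=torch.bool))
```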

ICLR Conference 2025 Conference Paper

Simulating Human-like Daily Activities with Desire-driven Autonomy

  • Yiding Wang
  • Yuxuan Chen
  • Fangwei Zhong
  • Long Ma
  • Yizhou Wang 0001

Desires motivate humans to interact autonomously with the complex world. In contrast, current AI agents require explicit task specifications, such as instructions or reward functions, which constrain their autonomy and behavioral diversity. In this paper, we introduce a Desire-driven Autonomous Agent (D2A) that enables a large language model (LLM) to autonomously propose and select tasks, motivated by satisfying its multi-dimensional desires. Specifically, the motivational framework of D2A is mainly constructed around a dynamic Value System, inspired by the Theory of Needs. It incorporates an understanding of human-like desires, such as the need for social interaction, personal fulfillment, and self-care. At each step, the agent evaluates the value of its current state, proposes a set of candidate activities, and selects the one that best aligns with its intrinsic motivations. We conduct experiments on Concordia, a text-based simulator, to demonstrate that our agent generates coherent, contextually relevant daily activities while exhibiting variability and adaptability similar to human behavior. A comparative analysis with other LLM-based agents demonstrates that our approach significantly enhances the rationality of the simulated activities.
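
A toy rendering of the desire-driven loop: track multi-dimensional desire levels and pick the candidate activity that best restores the most deficient ones. The dimensions, activities, and linear value rule are illustrative, not D2A's actual Value System.

```python
desires = {"social": 0.4, "fulfillment": 0.7, "self_care": 0.2}  # current levels
effects = {                                          # candidate activities
    "call a friend": {"social": 0.5},
    "read a book": {"fulfillment": 0.3},
    "take a shower": {"self_care": 0.6},
}

def pick_activity():
    def value(activity):
        # an activity is valuable in proportion to the deficits it fills
        return sum((1 - desires[d]) * gain for d, gain in effects[activity].items())
    return max(effects, key=value)

print(pick_activity())   # "take a shower": self_care is the most deficient desire
```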

NeurIPS Conference 2025 Conference Paper

Social World Model-Augmented Mechanism Design Policy Learning

  • Xiaoyuan Zhang
  • Yizhe Huang
  • Chengdong Ma
  • Zhixun Chen
  • Long Ma
  • Yali Du
  • Song-Chun Zhu
  • Yaodong Yang

Designing adaptive mechanisms to align individual and collective interests remains a central challenge in artificial social intelligence. Existing methods often struggle with modeling heterogeneous agents possessing persistent latent traits (e.g., skills, preferences) and dealing with complex multi-agent system dynamics. These challenges are compounded by the critical need for high sample efficiency due to costly real-world interactions. World Models, by learning to predict environmental dynamics, offer a promising pathway to enhance mechanism design in heterogeneous and complex systems. In this paper, we introduce a novel method named SWM-AP (Social World Model-Augmented Mechanism Design Policy Learning), which learns a social world model that hierarchically models agents' behavior to enhance mechanism design. Specifically, the social world model infers agents' traits from their interaction trajectories and learns a trait-based model to predict agents' responses to the deployed mechanisms. The mechanism design policy collects extensive training trajectories by interacting with the social world model, while concurrently inferring agents' traits online during real-world interactions to further boost policy learning efficiency. Experiments in diverse settings (tax policy design, team coordination, and facility location) demonstrate that SWM-AP outperforms established model-based and model-free RL baselines in cumulative rewards and sample efficiency.
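
A minimal sketch of the two pieces of the social world model: infer a persistent latent trait from an interaction trajectory, then predict the agent's response to a mechanism conditioned on that trait. All modules and shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

trait_enc = nn.GRU(input_size=6, hidden_size=16, batch_first=True)
response = nn.Sequential(nn.Linear(16 + 3, 32), nn.ReLU(), nn.Linear(32, 6))

def predict_response(trajectory, mechanism):
    # trajectory: (batch, steps, 6) interactions; mechanism: (batch, 3) parameters
    _, h = trait_enc(trajectory)
    trait = h[-1]                               # persistent latent trait
    return response(torch.cat([trait, mechanism], dim=-1))

pred = predict_response(torch.randn(4, 30, 6), torch.randn(4, 3))   # (4, 6)
```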

IJCAI Conference 2025 Conference Paper

TextMEF: Text-guided Prompt Learning for Multi-exposure Image Fusion

  • Jinyuan Liu
  • Qianjun Huang
  • Guanyao Wu
  • Di Wang
  • Zhiying Jiang
  • Long Ma
  • Risheng Liu
  • Xin Fan

Multi-exposure image fusion (MEF) aims to integrate a set of low dynamic range images, producing a single image with a higher dynamic range than any individual input. Despite significant advancements, current MEF approaches still struggle to handle extremely over- or under-exposed conditions, resulting in unsatisfactory visual effects such as hallucinated details and distorted color tones. In this regard, we propose TextMEF, a text-guided fusion method enhanced by prompt learning, for multi-exposure image fusion. Specifically, we learn a set of prompts based on text-image similarity among negative and positive samples (over-exposed, under-exposed images, and well-exposed ones). These learned prompts are seamlessly integrated into the loss function, providing high-level guidance for constraining non-uniform exposure regions. Furthermore, we develop an attention Mamba module that effectively translates over-/under-exposed regional features into an exposure-invariant space and builds efficient long-range dependencies with the high dynamic range image. Extensive experimental results on three publicly available benchmarks demonstrate that our TextMEF significantly outperforms state-of-the-art approaches in both visual inspection and objective analysis.
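
To show the shape of such a prompt-learning loss, the sketch below scores a fused image against positive ("well-exposed") and negative prompt embeddings and pushes it toward the positive one, CLIP-style. The encoders are assumed to exist upstream; everything here is illustrative rather than TextMEF's exact loss.

```python
import torch
import torch.nn.functional as F

def prompt_loss(img_feat, pos_prompt, neg_prompt, temp=0.07):
    # img_feat and prompts: (batch, dim) embeddings in a shared text-image space
    sims = torch.stack([
        F.cosine_similarity(img_feat, pos_prompt, dim=-1),
        F.cosine_similarity(img_feat, neg_prompt, dim=-1),
    ], dim=-1) / temp
    # cross-entropy toward the "well-exposed" prompt (class 0)
    target = torch.zeros(img_feat.size(0), dtype=torch.long)
    return F.cross_entropy(sims, target)

loss = prompt_loss(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
```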

NeurIPS Conference 2025 Conference Paper

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

  • Chaoyou Fu
  • Haojia Lin
  • Xiong Wang
  • Yifan Zhang
  • Yunhang Shen
  • Xiaoyu Liu
  • Haoyu Cao
  • Zuwei Long

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing against state-of-the-art counterparts across benchmarks for image, video, and speech, we demonstrate that our omni model is equipped with both strong visual and speech capabilities, enabling omni understanding and interaction.

ICML Conference 2024 Conference Paper

Fast Peer Adaptation with Context-aware Exploration

  • Long Ma
  • Yuanfei Wang
  • Fangwei Zhong
  • Song-Chun Zhu
  • Yizhou Wang 0001

Fast adaptation to unknown peers (partners or opponents) with different strategies is a key challenge in multi-agent games. To achieve this, it is crucial for the agent to probe and identify the peer’s strategy efficiently, as this is the prerequisite for carrying out the best response in adaptation. However, exploring the strategies of unknown peers is difficult, especially when the games are partially observable and have a long horizon. In this paper, we propose a peer identification reward, which rewards the learning agent based on how well it can identify the behavior pattern of the peer over the historical context, such as observations over multiple episodes. This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation, i.e., to actively seek and collect informative feedback from peers when uncertain about their policies and to exploit the context to perform the best response when confident. We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods.
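
A toy version of the peer identification reward: a probe classifier reads the historical context and tries to identify the peer; the agent is rewarded when its context makes that identification easy. The probe, the five-peer setup, and the shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

probe = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 5)                  # 5 known peer strategies

def peer_id_reward(context: torch.Tensor, peer_label: torch.Tensor):
    # context: (batch, steps, 16) observations gathered over episodes
    _, h = probe(context)
    logits = head(h[-1])
    # higher log-probability of the true peer => more informative context
    return -F.cross_entropy(logits, peer_label, reduction="none")

r = peer_id_reward(torch.randn(4, 20, 16), torch.randint(0, 5, (4,)))
```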

AAAI Conference 2024 Conference Paper

Hybrid-Supervised Dual-Search: Leveraging Automatic Learning for Loss-Free Multi-Exposure Image Fusion

  • Guanyao Wu
  • Hongming Fu
  • Jinyuan Liu
  • Long Ma
  • Xin Fan
  • Risheng Liu

Multi-exposure image fusion (MEF) has emerged as a prominent solution to address the limitations of digital imaging in representing varied exposure levels. Despite its advancements, the field grapples with challenges, notably the reliance on manual designs for network structures and loss functions, and the constraints of utilizing simulated reference images as ground truths. Consequently, current methodologies often suffer from color distortions and exposure artifacts, further complicating the quest for authentic image representation. In addressing these challenges, this paper presents a Hybrid-Supervised Dual-Search approach for MEF, dubbed HSDS-MEF, which introduces a bi-level optimization search scheme for the automatic design of both network structures and loss functions. More specifically, we harness a unique dual search mechanism rooted in a novel weighted structure refinement architecture search. Besides, a hybrid supervised contrast constraint seamlessly guides and integrates with the search process, facilitating a more adaptive and comprehensive search for optimal loss functions. We realize state-of-the-art performance in comparison to various competitive schemes, yielding a 10.61% and 4.38% improvement in Visual Information Fidelity (VIF) for general and no-reference scenarios, respectively, while providing results with high contrast, rich details, and colors. The code is available at https://github.com/RollingPlain/HSDS_MEF.

AAAI Conference 2024 Conference Paper

Trash to Treasure: Low-Light Object Detection via Decomposition-and-Aggregation

  • Xiaohan Cui
  • Long Ma
  • Tengyu Ma
  • Jinyuan Liu
  • Xin Fan
  • Risheng Liu

Object detection in low-light scenarios has attracted much attention in the past few years. A mainstream and representative scheme introduces enhancers as the pre-processing for regular detectors. However, because of the disparity in task objectives between the enhancer and detector, this paradigm cannot realize its full potential. In this work, we aim to unlock the potential of the enhancer + detector paradigm. Different from existing works, we extend illumination-based enhancers (our newly designed or existing ones) into a scene decomposition module, whose removed illumination is exploited as an auxiliary input in the detector for extracting detection-friendly features. A semantic aggregation module is further established for integrating multi-scale scene-related semantic information in the context space. In effect, our scheme successfully transforms the "trash" (i.e., the illumination ignored by the detector) into "treasure" for the detector. Extensive experiments are conducted to reveal our superiority against other state-of-the-art methods. The code will be public if it is accepted.
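
A crude sketch of the decomposition step: estimate an illumination map, divide it out to get a lightened image, and keep the "removed" illumination as an extra detector input instead of discarding it. The channel-max estimator is a stand-in, not the paper's enhancer.

```python
import numpy as np

def decompose(img: np.ndarray, eps=1e-4):
    # img: (H, W, 3) low-light image in [0, 1]
    illumination = img.max(axis=-1, keepdims=True)           # rough estimate
    reflectance = np.clip(img / np.maximum(illumination, eps), 0, 1)
    return reflectance, illumination                         # keep BOTH

img = np.random.rand(32, 32, 3) * 0.2
refl, illum = decompose(img)
detector_input = np.concatenate([refl, illum], axis=-1)      # (32, 32, 4)
```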

IJCAI Conference 2024 Conference Paper

Where Elegance Meets Precision: Towards a Compact, Automatic, and Flexible Framework for Multi-modality Image Fusion and Applications

  • Jinyuan Liu
  • Guanyao Wu
  • Zhu Liu
  • Long Ma
  • Risheng Liu
  • Xin Fan

Multi-modality image fusion aims to integrate images from multiple sensors, producing an image that is visually appealing and offers more comprehensive information than any single one. To ensure high visual quality and facilitate accurate subsequent perception tasks, previous methods have often cascaded networks using weighted loss functions. However, such simplistic strategies struggle to truly achieve the "Best of Both Worlds", and the adjustment of numerous hand-crafted parameters becomes burdensome. To address these challenges, this paper introduces a Compact, Automatic and Flexible framework, dubbed CAF, designed for infrared and visible image fusion, along with subsequent tasks. Concretely, we recast the combined problem of fusion and perception into a single objective, allowing mutual optimization of information from both tasks. We also utilize the perception task to inform the design of fusion loss functions, facilitating the automatic identification of optimal fusion objectives tailored to the task. Furthermore, CAF supports seamless integration with existing approaches, offering flexibility in adapting to various tasks and network structures. Extensive experiments demonstrate the superiority of CAF, which not only produces visually admirable fused results but also realizes 1.7 higher detection mAP@.5 and 2.0 higher segmentation mIoU than state-of-the-art methods. The code is available at https://github.com/RollingPlain/CAF_IVIF.

IJCAI Conference 2023 Conference Paper

Bi-level Dynamic Learning for Jointly Multi-modality Image Fusion and Beyond

  • Zhu Liu
  • Jinyuan Liu
  • Guanyao Wu
  • Long Ma
  • Xin Fan
  • Risheng Liu

Recently, multi-modality scene perception tasks, e.g., image fusion and scene understanding, have attracted widespread attention for intelligent vision systems. However, early efforts typically boost a single task unilaterally while neglecting others, seldom investigating their underlying connections for joint promotion. To overcome these limitations, we establish a hierarchical dual-task-driven deep model to bridge these tasks. Concretely, we first construct an image fusion module to fuse complementary characteristics and cascade dual task-related modules, including a discriminator for visual effects and a semantic network for feature measurement. We provide a bi-level perspective to formulate image fusion and follow-up downstream tasks. To incorporate distinct task-related responses for image fusion, we consider image fusion as a primary goal and the dual modules as learnable constraints. Furthermore, we develop an efficient first-order approximation to compute the corresponding gradients and present dynamic weighted aggregation to balance the gradients for fusion learning. Extensive experiments demonstrate the superiority of our method, which not only produces visually pleasant fused results but also achieves significant gains in detection and segmentation over state-of-the-art approaches.
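
One way to picture the dynamic weighted aggregation step: balance the per-parameter gradients coming from the visual and semantic modules by their magnitudes before updating the fusion network. The inverse-norm weighting below is an assumed stand-in for the paper's scheme.

```python
import torch

def aggregate(grads_vis, grads_sem, eps=1e-8):
    aggregated = []
    for gv, gs in zip(grads_vis, grads_sem):
        wv = 1.0 / (gv.norm() + eps)     # down-weight whichever task dominates
        ws = 1.0 / (gs.norm() + eps)
        aggregated.append((wv * gv + ws * gs) / (wv + ws))
    return aggregated

g = aggregate([torch.randn(3, 3)], [0.01 * torch.randn(3, 3)])
```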

IJCAI Conference 2022 Conference Paper

Hierarchical Bilevel Learning with Architecture and Loss Search for Hadamard-based Image Restoration

  • Guijing Zhu
  • Long Ma
  • Xin Fan
  • Risheng Liu

In the past few decades, Hadamard-based image restoration problems (e.g., low-light image enhancement) have attracted wide attention in multiple areas related to artificial intelligence. However, existing works mostly focus on heuristically defining the architecture and loss based on engineering experience drawn from extensive practice. This approach incurs expensive verification costs when seeking the optimal solution. To this end, we develop a novel hierarchical bilevel learning scheme to discover the architecture and loss simultaneously for different Hadamard-based image restoration tasks. More concretely, we first establish a new Hadamard-inspired neural unit to aggregate domain knowledge into the network design. Then we model a triple-level optimization that consists of the architecture, loss, and parameter optimizations to deliver a macro perspective for network learning. Next, we introduce a new hierarchical bilevel learning scheme for solving the built triple-level model to progressively generate the desired architecture and loss. We also define an architecture search space consisting of a series of simple operations and an image quality-oriented loss search space. Extensive experiments on three Hadamard-based image restoration tasks (including low-light image enhancement, single image haze removal, and underwater image enhancement) fully verify our superiority against state-of-the-art methods.
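
The "Hadamard-based" imaging model behind these tasks is simply an element-wise product, e.g. low-light imaging as I = R ∘ L (reflectance times illumination); restoring R from an estimated L is then an element-wise division, as in this illustrative sketch.

```python
import numpy as np

def restore(observed: np.ndarray, layer: np.ndarray, eps=1e-4):
    # observed = clean * layer (Hadamard product), all in [0, 1]
    return np.clip(observed / np.maximum(layer, eps), 0.0, 1.0)

I = np.random.rand(8, 8, 3) * 0.3        # dark observation
L = np.full((8, 8, 3), 0.3)              # estimated illumination layer
R = restore(I, L)                        # brightened reflectance estimate
```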

AAAI Conference 2019 Conference Paper

Task Embedded Coordinate Update: A Realizable Framework for Multivariate Non-Convex Optimization

  • Yiyang Wang
  • Risheng Liu
  • Long Ma
  • Xiaoliang Song

In this paper, we propose a realizable framework, TECU, which embeds task-specific strategies into the update schemes of coordinate descent for optimizing multivariate non-convex problems with coupled objective functions. On one hand, TECU is capable of improving algorithmic efficiency by embedding productive numerical algorithms for optimizing univariate sub-problems with nice properties. On the other hand, it also increases the probability of obtaining desired results by embedding advanced techniques in the optimization of realistic tasks. Integrating numerical algorithms and advanced techniques, TECU is proposed as a unified framework for solving a class of non-convex problems. Although the task-embedded strategies bring inaccuracies into sub-problem optimization, we provide a realizable criterion to control the errors while ensuring robust performance with rigorous theoretical analyses. By respectively embedding ADMM and a residual-type CNN in our algorithmic framework, the experimental results verify both the efficiency and effectiveness of embedding task-oriented strategies in coordinate descent for solving practical problems.
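
To ground the framework, here is a tiny coordinate-descent loop for a coupled objective min over x, y of ||Ax + y - b||^2 + 0.2||y||_1, where each block update could be swapped for a task-specific routine (the paper embeds ADMM or a residual CNN). The problem and both solvers are illustrative.

```python
import numpy as np

A = np.random.rand(10, 5)
b = np.random.rand(10)

def solve_x(y):
    # embedded solver for the x-subproblem: min_x ||Ax - (b - y)||^2
    return np.linalg.lstsq(A, b - y, rcond=None)[0]

def solve_y(x):
    # y-subproblem: min_y ||y - (b - Ax)||^2 + 0.2||y||_1 => soft-threshold
    r = b - A @ x
    return np.sign(r) * np.maximum(np.abs(r) - 0.1, 0.0)

x, y = np.zeros(5), np.zeros(10)
for _ in range(50):                      # alternate the two block updates
    x = solve_x(y)
    y = solve_y(x)
```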

NeurIPS Conference 2018 Conference Paper

A Bridging Framework for Model Optimization and Deep Propagation

  • Risheng Liu
  • Shichao Cheng
  • Xiaokun Liu
  • Long Ma
  • Xin Fan
  • Zhongxuan Luo

Optimizing task-related mathematical models is one of the most fundamental methodologies in statistics and learning. However, generically designed schematic iterations may struggle to capture complex data distributions in real-world applications. Recently, training deep propagations (i.e., networks) has achieved promising performance in some particular tasks. Unfortunately, existing networks are often built in heuristic manners and thus lack principled interpretations and solid theoretical support. In this work, we provide a new paradigm, named Propagation and Optimization based Deep Model (PODM), to bridge the gap between these different mechanisms (i.e., model optimization and deep propagation). On the one hand, we utilize PODM as a deeply trained solver for model optimization. Different from existing network-based iterations, which often lack theoretical investigation, we provide strict convergence analysis for PODM in the challenging nonconvex and nonsmooth scenarios. On the other hand, by relaxing the model constraints and performing end-to-end training, we also develop a PODM-based strategy to integrate domain knowledge (formulated as models) and real data distributions (learned by networks), resulting in a generic ensemble framework for challenging real-world applications. Extensive experiments verify our theoretical results and demonstrate the superiority of PODM against state-of-the-art approaches.
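
The bridging pattern can be sketched as an unrolled proximal-gradient step in which a small network replaces the hand-crafted proximal operator. The data term, step size, and learned prox are illustrative; PODM's contribution is the convergence analysis layered on this kind of construction.

```python
import torch
import torch.nn as nn

class UnrolledStep(nn.Module):
    def __init__(self, A: torch.Tensor, step: float = 0.1):
        super().__init__()
        self.A, self.step = A, step
        self.prox = nn.Sequential(nn.Linear(A.shape[1], 32), nn.ReLU(),
                                  nn.Linear(32, A.shape[1]))   # learned prox

    def forward(self, x, b):
        grad = self.A.T @ (self.A @ x - b)       # gradient of 0.5||Ax - b||^2
        return self.prox(x - self.step * grad)   # network as the proximal map

A, b = torch.randn(10, 5), torch.randn(10)
step = UnrolledStep(A)
x = torch.zeros(5)
for _ in range(3):                               # a 3-step deep propagation
    x = step(x, b)
```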