Author name cluster

Yilin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers

18

AAAI Conference 2026 Conference Paper

Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

  • Chun-Hsiao Yeh
  • Yilin Wang
  • Nanxuan Zhao
  • Richard Zhang
  • Yuheng Li
  • Yi Ma
  • Krishna Kumar Singh

Recent diffusion-based image editing methods have made great strides in text-guided tasks but often struggle with complex, indirect instructions. Additionally, current models frequently exhibit poor identity preservation or unintended edits, or rely on manual masks. To overcome these limitations, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that bridges user intent with editing model capabilities. X-Planner uses chain-of-thought reasoning to systematically break down complex instructions into simpler sub-instructions. For each one, X-Planner automatically generates precise edit types and segmentation masks, enabling localized, identity-preserving edits without applying external tools or models during inference. To enable the training of such a planner, we also introduce a fully automated, reproducible pipeline for generating large-scale, high-quality training data. Our complete system achieves state-of-the-art results on both existing and newly proposed complex instruction-based editing benchmarks.
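
As a rough illustration of the planning step this abstract describes, the sketch below shows how an MLLM could be prompted to decompose a complex instruction into typed sub-edits. The MLLMClient interface, the prompt wording, the edit-type vocabulary, and the JSON schema are our assumptions, not the authors' released API.

import json

EDIT_TYPES = {"add", "remove", "replace", "attribute_change", "background"}

def plan_edits(mllm_client, image, instruction):
    """Decompose a complex edit instruction into simple sub-edits.

    The MLLM is asked to reason step by step and emit one JSON object per
    sub-instruction: {"instruction": ..., "edit_type": ..., "region": ...}.
    """
    prompt = (
        "Think step by step, then split the request into simple edits. "
        "Return a JSON list of {instruction, edit_type, region}.\n"
        f"Request: {instruction}"
    )
    raw = mllm_client.generate(image=image, prompt=prompt)
    plan = json.loads(raw)
    # Keep only well-formed sub-edits; each region would later be turned
    # into a segmentation mask that localizes the edit.
    return [p for p in plan if p.get("edit_type") in EDIT_TYPES]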

AAAI Conference 2026 Conference Paper

TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering

  • Dongxing Mao
  • Yilin Wang
  • Linjie Li
  • Zhengyuan Yang
  • Alex Jinpeng Wang

Despite recent advances in text-to-image (T2I) generation, models still struggle to accurately render prompt-specified text with correct spatial layout—especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality. To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes. This enables fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct a benchmark with stratified layout complexity to evaluate both open-source and proprietary models in a zero-shot setting. In addition, we introduce two layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering. Our results show that models trained on TextGround4M outperform strong baselines in text fidelity, spatial accuracy, and prompt consistency, highlighting the importance of fine-grained layout supervision for grounded T2I generation.
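
The span-token strategy lends itself to a small illustration. The sketch below appends span text and quantized box coordinates to a training prompt; the <span>/<box> token format and the 1000-bin coordinate quantization are our assumptions, not the dataset's documented encoding.

def with_layout_tokens(prompt, spans, bins=1000):
    """Append layout-aware span tokens to a T2I training prompt.

    spans: list of (text, (x0, y0, x1, y1)) with coordinates in [0, 1].
    """
    parts = [prompt]
    for text, box in spans:
        q = [min(bins - 1, int(v * bins)) for v in box]  # quantize coords
        parts.append(f"<span> {text} <box> {q[0]} {q[1]} {q[2]} {q[3]}")
    return " ".join(parts)

# e.g. with_layout_tokens("a poster that says SALE",
#                         [("SALE", (0.2, 0.1, 0.8, 0.3))])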

JBHI Journal 2025 Journal Article

Identifying Acute Thoracolumbar Vertebral Compression Fractures From Low-Quality Small-Sample X-Ray Images: A Transfer Learning-Based Approach

  • Yilin Wang
  • Weijun Li
  • Siyu Chen
  • Yang Yang
  • Aidi Fan
  • Chenhao Lei
  • Yuhui Kou
  • Na Han

Timely and accurate diagnosis of acute thoracolumbar vertebral compression fractures in X-ray images is critical for initiating prompt and effective treatment, preventing potential neurological damage and long-term disability. Recent advancements in artificial intelligence (AI) have significantly improved medical imaging analysis, providing sophisticated tools to assist clinicians in diagnosing acute thoracolumbar vertebral compression fractures. Nonetheless, detecting these fractures through imaging remains challenging due to the complex overlapping of bony structures in the thoracolumbar region, variability in fracture patterns, and the often subtle nature of these injuries. Additionally, the limited availability and sometimes poor quality of medical images further complicate accurate AI-based detection. Addressing these challenges, this study introduces a transfer learning model optimized for recognizing acute thoracolumbar vertebral compression fractures from a small set of low-quality X-ray images. The model starts with a feature extraction stage that analyzes multiple texture features of X-ray images. It then employs a Vision Transformer Detector (ViTDet) combined with a Faster Region-based Convolutional Neural Network (Faster R-CNN) to recognize fractures efficiently. To enhance its performance on small datasets, the model is trained with a transfer learning approach. Extensive experiments on a large dataset of real-world images show that this model can effectively recognize acute thoracolumbar vertebral compression fractures from low-quality images, outperforming professionals with specialized knowledge in some cases.
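
For readers who want a concrete starting point, a standard transfer-learning recipe along these lines can be set up with torchvision. Note the paper pairs ViTDet with Faster R-CNN; this sketch uses torchvision's stock ResNet-50 backbone as a stand-in.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # start from pretrained weights

# Freeze the backbone so a small X-ray dataset only fine-tunes the heads.
for p in model.backbone.parameters():
    p.requires_grad = False

# Replace the classifier head: background + "acute compression fracture".
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)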

ICML Conference 2025 Conference Paper

ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset

  • Yilin Wang
  • Peixuan Lei
  • Jie Song
  • Yuzhe Hao
  • Tao Chen
  • Yuxuan Zhang
  • Lei Jia
  • Yuanxiang Li

Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a significant improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: https://pandalin98.github.io/itformer_site/.
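
A minimal sketch of the bridging idea, assuming a frozen time-series encoder and a frozen LLM: a small trainable query-attention adapter projects temporal features into the LLM's embedding space, so only a tiny fraction of parameters is trained. This is our paraphrase of the abstract, not the released ITFormer code.

import torch
import torch.nn as nn

class TimeSeriesAdapter(nn.Module):
    def __init__(self, ts_dim, llm_dim, n_query=32, n_heads=4):
        super().__init__()  # ts_dim must be divisible by n_heads
        self.queries = nn.Parameter(torch.randn(n_query, ts_dim))
        self.attn = nn.MultiheadAttention(ts_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(ts_dim, llm_dim)

    def forward(self, ts_feats):  # (B, T, ts_dim) from the frozen encoder
        q = self.queries.expand(ts_feats.size(0), -1, -1)
        fused, _ = self.attn(q, ts_feats, ts_feats)  # queries attend over time
        return self.proj(fused)   # (B, n_query, llm_dim) soft prompts

# The soft prompts are concatenated with the question's token embeddings and
# fed to the frozen LLM; only the adapter receives gradients.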

ICLR Conference 2025 Conference Paper

MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer

  • Yilin Wang
  • Chuan Guo 0002
  • Yuxuan Mu
  • Muhammad Gohar Javed
  • Xinxin Zuo
  • Juwei Lu
  • Hai Jiang 0001
  • Li Cheng 0001

Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn motion internal patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods, which are typically GAN- or diffusion-based, in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, crowd motion synthesis, and beat-aligned dance generation, all using a single reference motion. Our implementation, learned models and results are to be made publicly available upon paper acceptance.
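
The sliding-window local attention is easy to picture in code. Below is a minimal mask builder under our own conventions (window size, boolean mask semantics); it is not the authors' implementation.

import torch

def local_attention_mask(seq_len, window):
    idx = torch.arange(seq_len)
    # True marks positions to mask out (outside the +/- window band).
    return (idx[None, :] - idx[:, None]).abs() > window

mask = local_attention_mask(seq_len=196, window=8)
# Pass as attn_mask to e.g. torch.nn.MultiheadAttention, which treats True
# entries as disallowed attention links.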

AAAI Conference 2025 Conference Paper

SigStyle: Signature Style Transfer via Personalized Text-to-Image Models

  • Ye Wang
  • Tongyuan Bai
  • Xuping Xie
  • Zili Yi
  • Yilin Wang
  • Rui Ma

Style transfer enables the seamless integration of artistic styles from a style image into a content image, resulting in visually striking and aesthetically enriched outputs. Despite numerous advances in this field, existing methods do not explicitly focus on the signature style, which represents the distinct and recognizable visual traits of an image, such as geometric and structural patterns, color palettes, and brush strokes. In this paper, we introduce SigStyle, a framework that leverages the semantic priors embedded in a personalized text-to-image diffusion model to capture the signature style representation. This style capture process is powered by a hypernetwork that efficiently fine-tunes the diffusion model for any given single style image. Style transfer is then conceptualized as reconstructing the content image through the style tokens learned by the personalized diffusion model. Additionally, to ensure content consistency throughout the style transfer process, we introduce a time-aware attention swapping technique that incorporates content information from the original image into the early denoising steps of target image generation. Beyond enabling high-quality signature style transfer across a wide range of styles, SigStyle supports multiple interesting applications, such as local style transfer, texture transfer, style fusion, and style-guided text-to-image generation. Quantitative and qualitative evaluations demonstrate that our approach outperforms existing style transfer methods in recognizing and transferring signature styles.
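
A schematic of the time-aware attention swapping, as we read it: inject the content image's self-attention during the early fraction of denoising steps, then let the style-personalized model run freely. The unet, scheduler, content_attn, and swap_self_attention names are hypothetical stand-ins for diffusers-style attention-processor machinery, and the 0.6 cutoff is an arbitrary choice.

def stylized_generation(unet, scheduler, z_T, content_attn,
                        swap_self_attention, swap_until=0.6):
    z = z_T
    n = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        if i < swap_until * n:
            # Early steps fix layout and content: inject the content image's
            # self-attention maps recorded during inversion.
            swap_self_attention(unet, content_attn[t])
        noise = unet(z, t)  # style is carried by the learned style tokens
        z = scheduler.step(noise, t, z).prev_sample
    return z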

AAAI Conference 2024 Conference Paper

Amodal Scene Analysis via Holistic Occlusion Relation Inference and Generative Mask Completion

  • Bowen Zhang
  • Qing Liu
  • Jianming Zhang
  • Yilin Wang
  • Liyang Liu
  • Zhe Lin
  • Yifan Liu

Amodal scene analysis entails interpreting the occlusion relationship among scene elements and inferring the possible shapes of the invisible parts. Existing methods typically frame this task as an extended instance segmentation or a pair-wise object de-occlusion problem. In this work, we propose a new framework, which comprises a Holistic Occlusion Relation Inference (HORI) module followed by an instance-level Generative Mask Completion (GMC) module. Unlike previous approaches, which rely on mask completion results for occlusion reasoning, our HORI module directly predicts an occlusion relation matrix in a single pass. This approach is much more efficient than the pair-wise de-occlusion process and it naturally handles mutual occlusion, a common but often neglected situation. Moreover, we formulate the mask completion task as a generative process and use a diffusion-based GMC module for instance-level mask completion. This improves mask completion quality and provides multiple plausible solutions. We further introduce a large-scale amodal segmentation dataset with high-quality human annotations, including mutual occlusions. Experiments on our dataset and two public benchmarks demonstrate the advantages of our method. Code is publicly available at https://github.com/zbwxp/Amodal-AAAI.
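
To make the single-pass relation prediction concrete, here is a toy pairwise head (names and architecture are ours, not the paper's): it scores every ordered instance pair at once, so mutual occlusion falls out naturally.

import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, feats):             # feats: (N, dim), one row per instance
        n = feats.size(0)
        a = feats.unsqueeze(1).expand(n, n, -1)
        b = feats.unsqueeze(0).expand(n, n, -1)
        pair = torch.cat([a, b], dim=-1)  # (N, N, 2*dim) ordered pairs
        return self.mlp(pair).squeeze(-1) # (N, N) logits for "i occludes j"

# Because the matrix is predicted jointly in one pass, entries (i, j) and
# (j, i) can both be positive, expressing mutual occlusion directly.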

IJCAI Conference 2023 Conference Paper

A Canonicalization-Enhanced Known Fact-Aware Framework For Open Knowledge Graph Link Prediction

  • Yilin Wang
  • Minghao Hu
  • Zhen Huang
  • Dongsheng Li
  • Wei Luo
  • Dong Yang
  • Xicheng Lu

Open knowledge graph (OpenKG) link prediction aims to predict missing factual triples in the form of (head noun phrase, relation phrase, tail noun phrase). Since triples are not canonicalized, previous methods either focus on canonicalizing noun phrases (NPs) to reduce graph sparsity, or utilize textual forms to improve type compatibility. However, they neglect to canonicalize relation phrases (RPs) and triples, leaving the OpenKG highly sparse and impeding performance. To address these issues, we propose a Canonicalization-Enhanced Known Fact-Aware (CEKFA) framework that boosts link prediction performance through sparsity reduction of RPs and triples. First, we propose a similarity-driven RP canonicalization method to reduce RPs' sparsity by sharing knowledge of semantically similar ones. Second, to reduce the sparsity of triples, a known fact-aware triple canonicalization method is designed to retrieve relevant known facts from training data. Finally, these two types of canonical information are integrated into a general two-stage re-ranking framework that can be applied to most existing knowledge graph embedding methods. Experimental results on two OpenKG datasets, ReVerb20K and ReVerb45K, show that our approach achieves state-of-the-art results. Extensive experimental analyses illustrate the effectiveness and generalization ability of the proposed framework.
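
One simple way to realize similarity-driven RP canonicalization is greedy embedding clustering, sketched below; the embedding function and the 0.85 threshold are placeholders rather than the paper's actual choices.

import numpy as np

def canonicalize_rps(rps, embed, tau=0.85):
    """Greedy clustering: each RP joins the first cluster whose centroid it
    matches above cosine similarity tau, otherwise it starts a new cluster."""
    centroids, assignment = [], {}
    for rp in rps:
        v = embed(rp)
        v = v / np.linalg.norm(v)
        sims = [float(v @ c) for c in centroids]
        if sims and max(sims) >= tau:
            assignment[rp] = int(np.argmax(sims))
        else:
            assignment[rp] = len(centroids)
            centroids.append(v)
    return assignment  # RPs in one cluster share a canonical relation id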

UAI Conference 2023 Conference Paper

On the Role of Generalization in Transferability of Adversarial Examples

  • Yilin Wang
  • Farzan Farnia

Black-box adversarial attacks, which design adversarial examples for unseen deep neural networks (DNNs), have received great attention over the past years. However, the underlying factors driving the transferability of black-box adversarial examples remain poorly understood. In this paper, we aim to demonstrate the role of the generalization behavior of the substitute classifier used for generating adversarial examples in the transferability of the attack scheme to unobserved DNN classifiers. To do this, we apply the max-min adversarial example game framework and show how the substitute DNN's generalization from training to test data affects the success of the black-box attack scheme against different DNN classifiers. We prove theoretical generalization bounds on the difference between the attack transferability rates on training and test samples. Our bounds suggest that operator norm-based regularization methods could improve the transferability of the designed adversarial examples. We support our theoretical results by performing several numerical experiments showing the role of the substitute network's generalization in generating transferable adversarial examples. Our empirical results indicate the power of Lipschitz regularization and early stopping methods in improving the transferability of designed adversarial examples.
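
If one wants to act on the operator-norm suggestion in practice, PyTorch's built-in spectral normalization is the most direct tool; the toy substitute network below is ours, not the paper's experimental setup.

import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

substitute = nn.Sequential(
    spectral_norm(nn.Linear(784, 256)),  # each weight matrix is rescaled to
    nn.ReLU(),                           # unit spectral norm, capping the
    spectral_norm(nn.Linear(256, 10)),   # network's Lipschitz constant
)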

NeurIPS Conference 2023 Conference Paper

PHOTOSWAP: Personalized Subject Swapping in Images

  • Jing Gu
  • Yilin Wang
  • Nanxuan Zhao
  • Tsu-Jui Fu
  • Wei Xiong
  • Qing Liu
  • Zhifei Zhang
  • He Zhang

In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of Photoswap in personalized subject swapping. Furthermore, Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing.
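
A conceptual outline of the swapping procedure, under our own naming: a reconstruction branch for the source image runs alongside a generation branch for the learned subject, and attention internals are copied across for the early steps. The return_attention/inject_attention hooks are hypothetical, not the released Photoswap code.

def subject_swap(unet, scheduler, z_src, z_gen, copy_steps):
    for i, t in enumerate(scheduler.timesteps):
        eps_src, attn = unet(z_src, t, return_attention=True)
        if i < copy_steps:
            # Early attention from the source branch pins down pose,
            # layout, and background in the generation branch.
            eps_gen = unet(z_gen, t, inject_attention=attn)
        else:
            eps_gen = unet(z_gen, t)
        z_src = scheduler.step(eps_src, t, z_src).prev_sample
        z_gen = scheduler.step(eps_gen, t, z_gen).prev_sample
    return z_gen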

IJCAI Conference 2023 Conference Paper

Scalable Optimal Margin Distribution Machine

  • Yilin Wang
  • Nan Cao
  • Teng Zhang
  • Xuanhua Shi
  • Hai Jin

Optimal margin Distribution Machine (ODM) is a newly proposed statistical learning framework rooted in the novel margin theory, which demonstrates better generalization performance than traditional large-margin counterparts. Nonetheless, like other kernel methods, it suffers from the ubiquitous scalability problem in both computation time and memory. This paper proposes a scalable ODM, which achieves nearly a tenfold speedup over the original ODM training method. For nonlinear kernels, we propose a novel distribution-aware partition method that makes the local ODM trained on each partition close to, and converge faster to, the global one. When a linear kernel is applied, we extend a communication-efficient SVRG method to further accelerate training. Extensive empirical studies validate that the proposed method is highly computationally efficient and almost never worsens generalization.
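
The partition-then-train idea can be sketched in a few lines; KMeans here is our stand-in for the paper's distribution-aware partition, and train_odm is a placeholder for the local ODM solver.

from sklearn.cluster import KMeans

def scalable_odm(X, y, n_parts, train_odm):
    parts = KMeans(n_clusters=n_parts, n_init=10).fit_predict(X)
    models = []
    for k in range(n_parts):
        m = parts == k
        # Each local model is trained on one partition; a distribution-aware
        # split keeps every local ODM close to the global solution.
        models.append(train_odm(X[m], y[m]))
    return parts, models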

NeurIPS Conference 2020 Conference Paper

A Feasible Level Proximal Point Method for Nonconvex Sparse Constrained Optimization

  • Digvijay Boob
  • Qi Deng
  • Guanghui Lan
  • Yilin Wang

Nonconvex sparse models have received significant attention in high-dimensional machine learning. In this paper, we study a new model consisting of a general convex or nonconvex objective and a variety of continuous nonconvex sparsity-inducing constraints. For this constrained model, we propose a novel proximal point algorithm that solves a sequence of convex subproblems with gradually relaxed constraint levels. Each subproblem, having a proximal point objective and a convex surrogate constraint, can be solved efficiently based on a fast routine for projection onto the surrogate constraint. We establish the asymptotic convergence of the proposed algorithm to Karush-Kuhn-Tucker (KKT) solutions. We also establish new convergence complexities for achieving an approximate KKT solution when the objective can be smooth/nonsmooth, deterministic/stochastic, and convex/nonconvex, with complexity on a par with gradient descent for unconstrained optimization problems in the respective cases. To the best of our knowledge, this is the first study of first-order methods with complexity guarantees for nonconvex sparse-constrained problems. We perform numerical experiments to demonstrate the effectiveness of the new model and the efficiency of the proposed algorithm on large-scale problems.
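
A schematic of the outer loop, as we read the abstract: each iteration solves a proximal subproblem under the current constraint level, then relaxes the level. Here grad_f and project_level (projection onto {x : g(x) <= eta}) are placeholder oracles, and all step sizes are arbitrary choices of ours.

def feasible_level_prox(x0, grad_f, project_level, eta0,
                        relax=1.05, gamma=1.0, outer=50, inner=100, lr=0.01):
    x, eta = x0, eta0
    for _ in range(outer):
        center = x  # proximal center of this convex subproblem
        for _ in range(inner):
            # proximal-gradient step on f(x) + (gamma/2)||x - center||^2,
            # then projection back onto the level set {x : g(x) <= eta}
            step = grad_f(x) + gamma * (x - center)
            x = project_level(x - lr * step, eta)
        eta *= relax  # gradually relax the constraint level (relax >= 1)
    return x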

NeurIPS Conference 2018 Conference Paper

Generalizing Graph Matching beyond Quadratic Assignment Model

  • Tianshu Yu
  • Junchi Yan
  • Yilin Wang
  • Wei Liu
  • Baoxin Li

Graph matching has received persistent attention over decades, which can be formulated as a quadratic assignment problem (QAP). We show that a large family of functions, which we define as Separable Functions, can approximate discrete graph matching in the continuous domain asymptotically by varying the approximation controlling parameters. We also study the properties of global optimality and devise convex/concave-preserving extensions to the widely used Lawler's QAP form. Our theoretical findings show the potential for deriving new algorithms and techniques for graph matching. We deliver solvers based on two specific instances of Separable Functions, and the state-of-the-art performance of our method is verified on popular benchmarks.
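
As a toy illustration of the separable-function idea (our construction, not one of the paper's two instances): replace the hard assignment with an elementwise surrogate whose sharpness parameter anneals the continuous objective toward the discrete QAP score.

import numpy as np

def separable_objective(X, K, sigma):
    """X: (n, n) soft assignment; K: (n*n, n*n) affinity in Lawler QAP form."""
    x = X.ravel()
    s = np.exp(-((1.0 - x) ** 2) / sigma ** 2)  # elementwise separable surrogate
    return s @ K @ s

# As sigma -> 0, each entry of s approaches the 0/1 indicator of a hard
# assignment, so the surrogate score approaches the discrete QAP objective.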

AAAI Conference 2017 Conference Paper

CLARE: A Joint Approach to Label Classification and Tag Recommendation

  • Yilin Wang
  • Suhang Wang
  • Jiliang Tang
  • Guojun Qi
  • Huan Liu
  • Baoxin Li

Data classification and tag recommendation are both important and challenging tasks in social media. These two tasks are often considered independently and most efforts have been made to tackle them separately. However, labels in data classification and tags in tag recommendation are inherently related. For example, a YouTube video annotated with NCAA, stadium, pac12 is likely to be labeled as football, while a video/image with the class label of coast is likely to be tagged with beach, sea, water and sand. The existence of relations between labels and tags motivates us to jointly perform classification and tag recommendation for social media data in this paper. In particular, we provide a principled way to capture the relations between labels and tags, and propose a novel framework CLARE, which fuses data CLAssification and tag REcommendation into a coherent model. With experiments on three social media datasets, we demonstrate that the proposed framework CLARE achieves superior performance on both tasks compared to the state-of-the-art methods.
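
The joint formulation can be pictured as a shared encoder with two heads, sketched below; this is our simplification of CLARE, which couples the tasks through explicit label-tag relations rather than only through shared features.

import torch.nn as nn

class JointLabelTag(nn.Module):
    def __init__(self, in_dim, hidden, n_labels, n_tags):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.label_head = nn.Linear(hidden, n_labels)  # softmax classification
        self.tag_head = nn.Linear(hidden, n_tags)      # multi-label tag scores

    def forward(self, x):
        h = self.shared(x)
        return self.label_head(h), self.tag_head(h)

# Training would sum a cross-entropy loss on labels and a binary
# cross-entropy loss on tags, coupling the two tasks through `shared`.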

IJCAI Conference 2015 Conference Paper

Exploring Implicit Hierarchical Structures for Recommender Systems

  • Suhang Wang
  • Jiliang Tang
  • Yilin Wang
  • Huan Liu

Items in real-world recommender systems exhibit certain hierarchical structures. Similarly, user preferences also present hierarchical structures. Recent studies show that incorporating the explicit hierarchical structures of items or user preferences can improve the performance of recommender systems. However, explicit hierarchical structures are usually unavailable, especially those of user preferences. Thus, there is a gap between the importance of hierarchical structures and their availability. In this paper, we investigate the problem of exploring implicit hierarchical structures for recommender systems when they are not explicitly available. We propose a novel recommendation framework, HSR, to bridge the gap, which enables us to capture the implicit hierarchical structures of users and items simultaneously. Experimental results on two real-world datasets demonstrate the effectiveness of the proposed framework.
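
Reading the abstract as nested factorization, a toy version looks like the following; we use plain SVD for brevity where HSR would use a constrained factorization, so this only illustrates the shape of the idea.

import numpy as np

rng = np.random.default_rng(0)
R = rng.random((100, 80))        # toy user-item rating matrix

# R ~ U V^T, then U ~ U2 U1: rows of U2 softly assign the 100 users to a
# few latent groups, exposing a two-level user hierarchy; items would be
# treated symmetrically.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
U = U[:, :10] * s[:10]           # rank-10 user factors
U2, s2, U1t = np.linalg.svd(U, full_matrices=False)
U2 = U2[:, :3] * s2[:3]          # 3 coarse groups over the 10 fine factors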

IJCAI Conference 2015 Conference Paper

Unsupervised Sentiment Analysis for Social Media Images

  • Yilin Wang
  • Suhang Wang
  • Jiliang Tang
  • Huan Liu
  • Baoxin Li

Recently, text-based sentiment prediction has been extensively studied, while image-centric sentiment analysis has received much less attention. In this paper, we study the problem of understanding human sentiments from large-scale social media images, considering both visual content and contextual information, such as comments on the images, captions, etc. The challenge of this problem lies in the “semantic gap” between low-level visual features and higher-level image sentiments. Moreover, the lack of proper annotations/labels in the majority of social media images presents another challenge. To address these two challenges, we propose a novel Unsupervised SEntiment Analysis (USEA) framework for social media images. Our approach exploits relations between visual content and relevant contextual information to bridge the “semantic gap” in the prediction of image sentiments. With experiments on two large-scale datasets, we show that the proposed method is effective in addressing the two challenges.