Arrow Research search

Author name cluster

Lu Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers

30

AAAI Conference 2026 Conference Paper

Comprehensive Urban Region Representation Learning via Multi-View Joint Learning and Contrastive Learning

  • Yingde Lin
  • Yuanbo Xu
  • Lu Jiang
  • Pengyang Wang

Urban region embedding, which learns dense vector representations for urban zones, plays a foundational role in data-driven urban intelligence. These representations are critical for downstream applications like public safety management and infrastructure development, requiring a nuanced understanding of urban functionality. A core challenge remains the effective fusion of multi-view data (e.g., human mobility flows and static regional attributes) into unified zone representations. To this end, we propose MVJC, a Multi-view Joint Learning and Contrastive Learning framework, which employs: (1) a Multi-view Joint Learning (MVJL) layer that models intra-view dependencies to extract view-specific features and (2) a Multi-view Contrastive Learning (MVCL) layer that performs cross-region aggregation to derive consensus representations while capturing regional complementarity. We further introduce a structure-aware contrastive loss that mitigates false negatives by aligning representations through region topology instead of instance identity. Extensive experiments on New York City datasets demonstrate MVJC’s superiority: it reduces crime prediction MAE by 9.1% (vs. a 66.9 baseline) and improves land use clustering F-measure by 55.6% (vs. a 0.45 baseline) over the state-of-the-art method; this gain is attributed to MVJC’s synergy of joint and contrastive learning, yielding representations that are simultaneously predictive and semantically discriminative.

TIST Journal 2026 Journal Article

Hierarchical Reinforcement Learning for Energy-Efficient Urban Planning

  • Yanan Xiao
  • Lu Jiang
  • Steven Jige Quan
  • Pengfei Wang
  • Minghao Yin
  • Pengyang Wang

Urban planning plays a pivotal role in fostering sustainable, energy-efficient communities amidst the escalating challenges of climate change and energy crises. However, traditional methods often overlook the hierarchical interdependencies inherent in urban structures, land use configurations, and building design, leading to suboptimal outcomes. Here, we introduce a Hierarchical Energy-Efficient Urban Planning (HEEUP) framework, leveraging hierarchical reinforcement learning (HRL) to integrate decision-making across three tiers: urban structure, land use configuration, and building design. Our approach facilitates collaboration between cascading agents, enabling energy-conscious planning decisions that align with functional and contextual urban demands. Evaluations across real-world datasets from Miami and Albuquerque demonstrate a 7.3% improvement in energy efficiency compared to baseline methods. Furthermore, HEEUP achieves an average 24.52% reduction in computational training time compared to traditional optimization algorithms and generative machine learning models. These findings underscore the potential of HEEUP to significantly advance energy-efficient urban planning, providing a scalable, adaptable paradigm for addressing contemporary sustainability challenges.

NeurIPS Conference 2025 Conference Paper

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

  • Shanchuan Lin
  • Ceyuan Yang
  • Hao He
  • Jianwen Jiang
  • Yuxi Ren
  • Xin Xia
  • Yang Zhao
  • Xuefeng Xiao

Existing large-scale video generation models are computationally intensive, preventing adoption in real-time and interactive applications. In this work, we propose autoregressive adversarial post-training (AAPT) to turn a pre-trained latent video diffusion model into a real-time, interactive, streaming video generator. Our model autoregressively generates one latent frame at a time using a single neural function evaluation (1NFE). The model can stream the result to the user in real time and receive interactive responses as control signals to generate the next latent frame. Unlike existing approaches, our method explores adversarial training as an effective paradigm for autoregressive generation. This allows us to design a more efficient architecture for one-step generation and to train the model in a student-forcing way to mitigate error accumulation. The adversarial approach also enables us to train the model for long-duration generation while fully utilizing the KV cache. As a result, our 8B model achieves real-time, 24fps, nonstop, streaming video generation at 736x416 resolution on a single H100, or 1280x720 on 8xH100, for up to a minute (1440 frames).

AAAI Conference 2025 Conference Paper

Beyond Prompt Engineering: A Reinforced Token-Level Input Refinement for Large Language Models

  • Guang Huang
  • Yanan Xiao
  • Lu Jiang
  • Minghao Yin
  • Pengyang Wang

In the rapidly developing field of automatic text generation and understanding, the quality of input data has been shown to be a key factor affecting the efficiency and accuracy of large language model (LLM) output. With the advent of advanced tools such as ChatGPT, input refinement work has mainly focused on prompt engineering. However, existing methods are often too dependent on specific contexts and are easily affected by individual expert experience and potential biases, limiting their wide applicability in diverse real-world applications. To address this problem, this study develops a Reinforced Token-Level Input Refinement method, called RTLIR. We choose to optimize the input data at the fine-grained level of tokens, thereby preserving the original text structure. Operationally, each state is defined by the token set of the current text, and each action is a binary decision on whether to retain a specific token. The agent automatically calculates and determines the selection probability of each token based on the current state, thereby optimizing the entire decision process. Through continuous exploration and learning, the agent can autonomously learn to identify the key inputs that have the greatest impact on the generation results and achieve refinement of the input data. In addition, RTLIR is a plug-and-play, LLM-agnostic module that can be used for a wide range of tasks and models. Experimental results show that RTLIR improves the performance of LLMs in various input scenarios and tasks, with an average accuracy increase of 6%.
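
As a rough illustration of the token-level decision process the abstract describes (state = current tokens, action = keep/drop per token), here is a minimal REINFORCE-style sketch; the policy architecture, `reward_fn`, and all names are illustrative assumptions, not the RTLIR implementation.

```python
import torch
import torch.nn as nn

class TokenKeepPolicy(nn.Module):
    """Toy per-token keep/drop policy (illustrative, not the RTLIR architecture)."""

    def __init__(self, embed_dim, hidden=64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, token_embeddings):
        # token_embeddings: (seq_len, embed_dim) -> keep probability per token.
        return torch.sigmoid(self.scorer(token_embeddings)).squeeze(-1)

def reinforce_step(policy, token_embeddings, reward_fn, optimizer):
    """One REINFORCE update: sample a keep/drop mask, score the refined input."""
    probs = policy(token_embeddings)
    mask = torch.bernoulli(probs)            # binary action per token
    reward = reward_fn(mask)                 # placeholder: e.g., downstream task accuracy
    log_prob = (mask * probs.clamp_min(1e-8).log()
                + (1 - mask) * (1 - probs).clamp_min(1e-8).log()).sum()
    loss = -reward * log_prob                # policy-gradient surrogate objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```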

ICML Conference 2025 Conference Paper

Diffusion Adversarial Post-Training for One-Step Video Generation

  • Shanchuan Lin
  • Xin Xia 0014
  • Yuxi Ren
  • Ceyuan Yang
  • Xuefeng Xiao 0001
  • Lu Jiang

Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model can generate two-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.

NeurIPS Conference 2025 Conference Paper

Prompt-guided Disentangled Representation for Action Recognition

  • Tianci Wu
  • Guangming Zhu
  • Lu Jiang
  • Siyuan Wang
  • Ning Wang
  • Nuoye Xiong
  • Liang Zhang

Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces a Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Extensive experiments on two complex video action datasets, Charades and SportsHHI, demonstrate the effectiveness of our approach against state-of-the-art methods. Our code can be found at https://github.com/iamsnaping/ProDA.git.

AAAI Conference 2025 Conference Paper

RATT: A Thought Structure for Coherent and Correct LLM Reasoning

  • Jinghan Zhang
  • Xiting Wang
  • Weijieying Ren
  • Lu Jiang
  • Dongjie Wang
  • Kunpeng Liu

Large Language Models (LLMs) gain substantial reasoning and decision-making capabilities from thought structures. However, existing methods such as Tree of Thought and Retrieval Augmented Thoughts often fall short in complex tasks due to the limitations of insufficient local retrieval of factual knowledge and inadequate global selection of strategies. These limitations make it challenging for these methods to balance factual accuracy and comprehensive logical optimization effectively. To address these limitations, we introduce the Retrieval Augmented Thought Tree (RATT), a novel thought structure that considers both overall logical soundness and factual correctness at each step of the thinking process. Specifically, at every point of a thought branch, RATT performs planning and lookahead to explore and evaluate multiple potential reasoning steps, and integrates the fact-checking ability of Retrieval-Augmented Generation (RAG) with the LLM's ability to assess overall strategy. Through this combination of factual knowledge and strategic feasibility, RATT adjusts and integrates the thought tree structure to search for the most promising branches within the search space. This thought structure significantly enhances the model's coherence in logical inference and efficiency in decision-making, and thus raises the limit on an LLM's capacity to generate reliable inferences and decisions based on thought structures. A broad range of experiments on different types of tasks shows that the RATT structure significantly outperforms existing methods in factual correctness and logical coherence.

AAAI Conference 2025 Conference Paper

Reducing AUV Energy Consumption Through Dynamic Sensor Directions Switching via Deep Reinforcement Learning

  • Jiawei Liu
  • Yuanbo Xu
  • Shanshan Song
  • Lu Jiang

Autonomous underwater vehicles (AUVs) are crucial for marine applications such as ocean data collection, pollution monitoring, and navigation. However, their limited energy resources constrain their operational duration, posing a significant challenge for long-term operations. Due to the complex and unpredictable nature of the underwater environment, AUVs allocate energy to their sensing systems to sense the surrounding environment and avoid obstacles. Existing methods focus on reducing the energy consumption of AUV computing and movement, neglecting sensing energy consumption; few attempts have been made to balance AUV energy and sensing ability with a flexible sensing system. Along these lines, we consider both AUV energy consumption and flexible sensing ability, and propose a deep reinforcement learning-based method to Reduce Energy Consumption by the AUV Sensing system (RECS). Specifically, we build an AUV sensing system in a 2-dimensional space, with controllable 8-direction sensing abilities to collect environment information dynamically. Then we divide the underwater environment into several areas and assign weights to the edges of areas based on the AUV's planned path. Additionally, we dynamically switch the sensors in different directions and radii to sense the edges of the area where the AUV is located. The Artificial Potential Field (APF) method is employed to re-plan the AUV path to avoid obstacles and reach the target point effectively. Experimental results demonstrate that, compared to keeping all sensors on, our method reduces energy consumption by 53.48% and is capable of generalizing to varying environments and varying sensing system radii.
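
The Artificial Potential Field (APF) planner mentioned in the abstract is a classical method; the following is a minimal 2D sketch of its force computation, with illustrative gain values and no claim to match the paper's exact formulation.

```python
import numpy as np

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=100.0, d0=5.0):
    """Net artificial-potential-field force at 2D position `pos` (toy sketch).

    Attractive force pulls toward the goal; repulsive forces push away from
    obstacles within influence distance d0. Gains are illustrative.
    """
    pos, goal = np.asarray(pos, float), np.asarray(goal, float)
    force = k_att * (goal - pos)  # gradient of 0.5 * k_att * ||goal - pos||^2
    for obs in obstacles:
        diff = pos - np.asarray(obs, float)
        d = np.linalg.norm(diff)
        if 0 < d <= d0:
            # Repulsive term grows sharply as the vehicle nears an obstacle.
            force += k_rep * (1.0 / d - 1.0 / d0) * (1.0 / d**2) * (diff / d)
    return force

# The next waypoint can be taken as a small step along the (normalized) force.
```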

NeurIPS Conference 2025 Conference Paper

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

  • Yuanxin Liu
  • Rui Zhu
  • Shuhuai Ren
  • Jiacong Wang
  • Haoyuan Guo
  • Xu Sun
  • Lu Jiang

With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 18 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation.

NeurIPS Conference 2025 Conference Paper

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

  • Jiaming Han
  • Hao Chen
  • Yang Zhao
  • Hanyu Wang
  • Qi Zhao
  • Ziyan Yang
  • Hao He
  • Xiangyu Yue

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. All code, models, and data will be made publicly available.

NeurIPS Conference 2024 Conference Paper

A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

  • Gwanghyun Kim
  • Alonso Martinez
  • Yu-Chuan Su
  • Brendan Jou
  • José Lezama
  • Agrim Gupta
  • Lijun Yu
  • Lu Jiang

Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task, which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space. Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across modalities of the inputs. This formulation offers flexibility to introduce variable noise levels for various portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: neurips13025.github.io

IJCAI Conference 2024 Conference Paper

Hierarchical Reinforcement Learning for Point of Interest Recommendation

  • Yanan Xiao
  • Lu Jiang
  • Kunpeng Liu
  • Yuanbo Xu
  • Pengyang Wang
  • Minghao Yin

With the increasing popularity of location-based services, accurately recommending points of interest (POIs) has become a critical task. Although existing technologies are proficient in processing time-series data, they fall short when it comes to accommodating the diversity and dynamism in users' POI selections, particularly in extracting key signals from complex historical behaviors. To address this challenge, we introduce the Hierarchical Reinforcement Learning Preprocessing Framework (HRL-PRP), a framework that can be integrated into existing recommendation models to effectively optimize user profiles. The HRL-PRP framework employs a two-tiered decision-making process, where the high-level process determines the necessity of modifying profiles, and the low-level process focuses on selecting POIs within the profiles. Through evaluations on multiple real-world datasets, we demonstrate that HRL-PRP surpasses existing state-of-the-art methods on various recommendation performance metrics.

IJCAI Conference 2024 Conference Paper

Hierarchical Reinforcement Learning on Multi-Channel Hypergraph Neural Network for Course Recommendation

  • Lu Jiang
  • Yanan Xiao
  • Xinxin Zhao
  • Yuanbo Xu
  • Shuli Hu
  • Pengyang Wang
  • Minghao Yin

With the widespread popularity of massive open online courses, personalized course recommendation has become increasingly important for enhancing users' learning efficiency. While achieving promising performance, current works suffer from the variation across users and other MOOC entities. To address this problem, we propose hierarchical reinforcement learning with a multi-channel hypergraph neural network for course recommendation (called HHCoR). Specifically, we first construct an online course hypergraph as the environment to capture the complex relationships and historical information by considering all entities. Then, we design a multi-channel propagation mechanism to aggregate embeddings in the online course hypergraph and extract user interest through an attention layer. Besides, we employ two-level decision-making: the low level focuses on rating courses, while the high level integrates these considerations to finalize the decision. Furthermore, for co-optimization, we design a joint reward function to improve the policies of the two levels of agents. Finally, we conducted extensive experiments on two real-world datasets, and the quantitative results demonstrate the effectiveness of the proposed method.

AAAI Conference 2024 Conference Paper

Spatial-Temporal Interplay in Human Mobility: A Hierarchical Reinforcement Learning Approach with Hypergraph Representation

  • Zhaofan Zhang
  • Yanan Xiao
  • Lu Jiang
  • Dingqi Yang
  • Minghao Yin
  • Pengyang Wang

In the realm of human mobility, the decision-making process for selecting the next-visit location is intricately influenced by a trade-off between spatial and temporal constraints, which are reflective of individual needs and preferences. This trade-off, however, varies across individuals, making the modeling of these spatial-temporal dynamics a formidable challenge. To address the problem, in this work, we introduce the "Spatial-temporal Induced Hierarchical Reinforcement Learning" (STI-HRL) framework, for capturing the interplay between spatial and temporal factors in human mobility decision-making. Specifically, STI-HRL employs a two-tiered decision-making process: the low-level focuses on disentangling spatial and temporal preferences using dedicated agents, while the high-level integrates these considerations to finalize the decision. To complement the hierarchical decision setting, we construct a hypergraph to organize historical data, encapsulating the multi-aspect semantics of human mobility. We propose a cross-channel hypergraph embedding module to learn the representations as the states to facilitate the decision-making cycle. Our extensive experiments on two real-world datasets validate the superiority of STI-HRL over state-of-the-art methods in predicting users' next visits across various performance metrics.

IJCAI Conference 2024 Conference Paper

TFWT: Tabular Feature Weighting with Transformer

  • Xinhao Zhang
  • Zaitian Wang
  • Lu Jiang
  • Wanfu Gao
  • Pengfei Wang
  • Kunpeng Liu

In this paper, we propose a novel feature weighting method to address the limitations of existing feature processing methods for tabular data. Typically, existing methods assume equal importance across all samples and features in one dataset. This simplified processing overlooks the unique contributions of each feature and thus may miss important feature information, leading to suboptimal performance on complex datasets with rich features. To address this problem, we introduce Tabular Feature Weighting with Transformer (TFWT), a novel feature weighting approach for tabular data. Our method adopts a Transformer to capture complex feature dependencies and contextually assign appropriate weights to discrete and continuous features. Besides, we employ a reinforcement learning strategy to further fine-tune the weighting process. Our extensive experimental results across various real-world datasets and diverse downstream tasks show the effectiveness of TFWT and highlight its potential for enhancing feature weighting in tabular data analysis.

TMLR Journal 2024 Journal Article

VideoGLUE: Video General Understanding Evaluation of Foundation Models

  • Liangzhe Yuan
  • Nitesh Bharadwaj Gundavarapu
  • Long Zhao
  • Hao Zhou
  • Yin Cui
  • Lu Jiang
  • Xuan Yang
  • Menglin Jia

We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks. Furthermore, we jointly profile FMs' efficacy and efficiency when adapting to general video understanding tasks using cost measurements during both training and inference. Our main findings are as follows. First, task-specialized models significantly outperform the seven FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data mainly contains the video modality, are generally better than image-native FMs in classifying motion-rich videos, localizing actions in time, and understanding a video of more than one action. Third, video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs. Our code is released at https://github.com/tensorflow/models/tree/master/official/projects/videoglue

AAAI Conference 2023 Conference Paper

Multi-View MOOC Quality Evaluation via Information-Aware Graph Representation Learning

  • Lu Jiang
  • Yibin Wang
  • Jianan Wang
  • Pengyang Wang
  • Minghao Yin

In this paper, we study the problem of MOOC quality evaluation, which is essential for improving course materials, promoting students' learning efficiency, and benefiting user services. While achieving promising performances, current works still suffer from the complicated interactions and relationships of entities in MOOC platforms. To tackle the challenges, we formulate the problem as a course representation learning task and develop an Information-aware Graph Representation Learning (IaGRL) framework for multi-view MOOC quality evaluation. Specifically, we first build a MOOC Heterogeneous Information Network (HIN) to represent the interactions and relationships among entities in MOOC platforms. Then we decompose the MOOC HIN into multiple single-relation graphs based on meta-paths to depict the multi-view semantics of courses. Course representation learning can then be converted to a multi-view graph representation task. Different from traditional graph representation learning, the learned course representations are expected to satisfy three types of validity: (1) agreement on expressiveness between the raw course portfolio and the learned course representations; (2) consistency between the representations in each view and the unified representations; and (3) alignment between the course and MOOC platform representations. Therefore, we propose to exploit mutual information to preserve the validity of course representations. We conduct extensive experiments on real-world MOOC datasets to demonstrate the effectiveness of our proposed method.

TIST Journal 2023 Journal Article

Reinforced Explainable Knowledge Concept Recommendation in MOOCs

  • Lu Jiang
  • Kunpeng Liu
  • Yibin Wang
  • Dongjie Wang
  • Pengyang Wang
  • Yanjie Fu
  • Minghao Yin

In this article, we study knowledge concept recommendation in Massive Open Online Courses (MOOCs) in an explainable manner. Knowledge concepts, composing course units (e.g., videos) in MOOCs, refer to topics and skills that students are expected to master. Compared to traditional course recommendation in MOOCs, knowledge concept recommendation has drawn more attention because students’ interests over knowledge concepts can better reveal students’ real intention in a more refined granularity. However, there are three unique challenges in knowledge concept recommendation: (1) How to design an appropriate data structure to capture complex relationships between knowledge concepts, course units, and other participants (e.g., students, teachers)? (2) How to model interactions between students and knowledge concepts? (3) How to make explainable recommendation results to students? To tackle these challenges, we formulate the knowledge concept recommendation as a reinforcement learning task integrated with a MOOC knowledge graph (KG). Specifically, we first construct the MOOC KG as the environment to capture all the relationships and behavioral histories by considering all the entities (e.g., students, teachers, videos, courses, and knowledge concepts) on the MOOC provider. Then, to model the interactions between students and knowledge concepts, we train an agent to mimic students’ learning behavioral patterns facing the complex environment. Moreover, to provide explainable recommendation results, we generate recommended knowledge concepts in the format of a path from the MOOC KG to indicate semantic reasons. Finally, we conduct extensive experiments on a real-world MOOC dataset to demonstrate the effectiveness of our proposed method.

NeurIPS Conference 2023 Conference Paper

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

  • Lijun Yu
  • Yong Cheng
  • Zhiruo Wang
  • Vivek Kumar
  • Wolfgang Macherey
  • Yanping Huang
  • David Ross
  • Irfan Essa

In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the rich semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT-3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.

NeurIPS Conference 2023 Conference Paper

StyleDrop: Text-to-Image Synthesis of Any Style

  • Kihyuk Sohn
  • Lu Jiang
  • Jarred Barber
  • Kimin Lee
  • Nataniel Ruiz
  • Dilip Krishnan
  • Huiwen Chang
  • Yuanzhen Li

Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language, and out-of-distribution effects make it hard to synthesize arbitrary image styles, leveraging a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. StyleDrop is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. StyleDrop works by efficiently learning a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters), and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image specifying the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io.

JAIR Journal 2021 Journal Article

Confident Learning: Estimating Uncertainty in Dataset Labels

  • Curtis Northcutt
  • Lu Jiang
  • Isaac Chuang

Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here, we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL which is provably consistent and experimentally performant. We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.
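
For readers unfamiliar with the counting step described above, here is a minimal numpy sketch of estimating the confident joint with per-class probabilistic thresholds; variable names are illustrative assumptions and this is not the cleanlab implementation.

```python
import numpy as np

def confident_joint(labels, pred_probs):
    """Count the confident joint C[noisy_label, suspected_true_label] (toy sketch).

    labels: (n,) int array of given (possibly noisy) labels.
    pred_probs: (n, k) array of out-of-sample predicted probabilities.
    """
    n, k = pred_probs.shape
    # Per-class threshold: mean predicted probability of class j over examples
    # whose given label is j (their "self-confidence").
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])

    C = np.zeros((k, k), dtype=int)
    for i in range(n):
        # Classes whose probability clears their threshold for this example.
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if above.size == 0:
            continue
        # Suspected true label: the above-threshold class with highest probability.
        j = above[np.argmax(pred_probs[i, above])]
        C[labels[i], j] += 1
    return C

# Examples counted off the diagonal of C are candidate label errors.
```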

AAAI Conference 2017 System Paper

An Event Reconstruction Tool for Conflict Monitoring Using Social Media

  • Junwei Liang
  • Desai Fan
  • Han Lu
  • Poyao Huang
  • Jia Chen
  • Lu Jiang
  • Alexander Hauptmann

What happened during the Boston Marathon in 2013? Nowadays, at any major event, lots of people take videos and share them on social media. To fully understand exactly what happened in these major events, researchers and analysts often have to examine thousands of these videos manually. To reduce this manual effort, we present an investigative system that automatically synchronizes these videos to a global timeline and localizes them on a map. In addition to alignment in time and space, our system combines various functions for analysis, including gunshot detection, crowd size estimation, 3D reconstruction and person tracking. To our best knowledge, this is the first time a unified framework has been built for comprehensive event reconstruction for social media videos.

AAAI Conference 2017 System Paper

Visual Memory QA: Your Personal Photo and Video Search Agent

  • Lu Jiang
  • LiangLiang Cao
  • Yannis Kalantidis
  • Sachin Farfade
  • Alex Hauptmann

The boom of mobile devices and cloud services has led to an explosion of personal photo and video data. However, due to missing user-generated metadata such as titles or descriptions, it usually takes a user a lot of swipes to find a particular video on the cell phone. To solve this problem, we present an innovative idea called Visual Memory QA, which allows a user not only to search but also to ask questions about her daily life captured in personal videos. The proposed system automatically analyzes the content of personal videos without user-generated metadata, and offers a conversational interface to accept and answer questions. To the best of our knowledge, it is the first to answer personal questions discovered in personal photos or videos. Example questions are “what was the last time we went hiking in the forest near San Francisco?”; “did we have pizza last week?”; “with whom did I have dinner at AAAI 2015?”.

AAAI Conference 2017 System Paper

Webly-Supervised Learning of Multimodal Video Detectors

  • Junwei Liang
  • Lu Jiang
  • Alexander Hauptmann

Given any complicated or specialized video content search query, e.g. “Batkid (a kid in batman costume)” or “destroyed buildings”, existing methods require manually labeled data to build detectors for searching. We present a demonstration of an artificial intelligence application, Webly-labeled Learning (WELL), that enables learning of ad-hoc concept detectors over unlimited Internet videos without any manual annotations. A considerable number of videos on the web are associated with rich but noisy contextual information, such as the title, which provides a type of weak annotation or label of the video content. To leverage this information, our system employs state-of-the-art webly-supervised learning (WELL) (Liang et al.). WELL considers multi-modal information, including deep-learning-based visual, audio, and speech features, to automatically learn accurate video detectors based on the user query. The detectors learned from a large number of web videos allow users to search relevant videos over their personal video archives without requiring any textual metadata, yet as conveniently as searching on YouTube.

IJCAI Conference 2016 Conference Paper

Learning to Detect Concepts from Webly-Labeled Video Data

  • Junwei Liang
  • Lu Jiang
  • Deyu Meng
  • Alexander Hauptmann

Learning detectors that can recognize concepts, such as human actions, objects, etc., in video content is an interesting but challenging problem. In this paper, we study the problem of automatically learning detectors from the big video data on the web without any additional manual annotations. The contextual information available on the web provides noisy labels for the video content. To leverage the noisy web labels, we propose a novel method called WEbly-Labeled Learning (WELL). It builds on two learning theories, curriculum learning and self-paced learning, and exhibits useful properties that can be theoretically verified. We provide compelling insights on the latent non-convex robust loss that is being minimized on the noisy data. In addition, we propose two novel techniques that not only enable WELL to be applied to big data but also lead to more accurate results. The efficacy and the scalability of WELL have been extensively demonstrated on two public benchmarks, including the largest multimedia dataset and the largest manually labeled video set. Experimental results show that WELL significantly outperforms the state-of-the-art methods. To the best of our knowledge, WELL achieves by far the best reported performance on these two webly-labeled big video datasets.

AAAI Conference 2015 Conference Paper

Self-Paced Learning for Matrix Factorization

  • Qian Zhao
  • Deyu Meng
  • Lu Jiang
  • Qi Xie
  • Zongben Xu
  • Alexander Hauptmann

Matrix factorization (MF) has been attracting much attention due to its wide applications. However, since MF models are generally non-convex, most existing methods easily get stuck in bad local minima, especially in the presence of outliers and missing data. To alleviate this deficiency, in this study we present a new MF learning methodology that gradually includes matrix elements into MF training from easy to complex. This corresponds to a recently proposed learning fashion called self-paced learning (SPL), which has been demonstrated to be beneficial in avoiding bad local minima. We also generalize the conventional binary (hard) weighting scheme for SPL to a more effective real-valued (soft) weighting manner. The effectiveness of the proposed self-paced MF method is substantiated by a series of experiments on synthetic, structure-from-motion, and background subtraction data.
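
As a rough sketch of the easy-to-complex scheme the abstract describes, the following toy implementation alternates between per-entry soft weights and factor updates, using a linear real-valued weighting rule that is one common SPL choice; the weighting scheme, update rule, and hyperparameters are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def spl_matrix_factorization(X, mask, rank=5, lam=1.0, mu=1.2,
                             outer_iters=10, inner_iters=20, lr=0.01):
    """Self-paced matrix factorization sketch with linear soft weighting.

    X: (m, n) data matrix; mask: (m, n) 0/1 observed-entry indicator.
    lam is the SPL "age" parameter, grown by mu each outer iteration so that
    harder (higher-loss) entries are gradually included in training.
    """
    m, n = X.shape
    U = 0.1 * np.random.randn(m, rank)
    V = 0.1 * np.random.randn(n, rank)
    for _ in range(outer_iters):
        # Weight step: per-entry losses mapped to soft weights in [0, 1].
        loss = (X - U @ V.T) ** 2
        v = mask * np.clip(1.0 - loss / lam, 0.0, 1.0)  # linear soft weighting
        # Factor step: weighted gradient descent on U and V.
        for _ in range(inner_iters):
            R = v * (U @ V.T - X)
            U -= lr * (R @ V)
            V -= lr * (R.T @ U)
        lam *= mu  # admit harder entries in the next round
    return U, V
```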

NeurIPS Conference 2014 Conference Paper

Self-Paced Learning with Diversity

  • Lu Jiang
  • Deyu Meng
  • Shoou-I Yu
  • Zhenzhong Lan
  • Shiguang Shan
  • Alexander Hauptmann

Self-paced learning (SPL) is a recently proposed learning regime inspired by the learning process of humans and animals that gradually incorporates easy to more complex samples into training. Existing methods are limited in that they ignore an important aspect in learning: diversity. To incorporate this information, we propose an approach called self-paced learning with diversity (SPLD) which formalizes the preference for both easy and diverse samples into a general regularizer. This regularization term is independent of the learning objective, and thus can be easily generalized into various learning tasks. Albeit non-convex, the optimization of the variables included in this SPLD regularization term for sample selection can be globally solved in linearithmic time. We demonstrate that our method significantly outperforms the conventional SPL on three real-world datasets. Specifically, SPLD achieves the best MAP so far reported in literature on the Hollywood2 and Olympic Sports datasets.
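
The group-wise selection the abstract refers to is commonly stated as a rank-dependent threshold within each cluster of samples; below is a minimal sketch under that assumption, with illustrative variable names and pre-computed cluster assignments.

```python
import numpy as np

def spld_select(losses, groups, lam, gamma):
    """Binary sample selection for self-paced learning with diversity (sketch).

    losses: (n,) per-sample losses; groups: (n,) cluster ids.
    Returns a 0/1 vector favoring samples that are both easy (low loss)
    and spread across groups (diversity).
    """
    losses = np.asarray(losses, float)
    groups = np.asarray(groups)
    v = np.zeros(losses.shape)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        order = idx[np.argsort(losses[idx])]              # easiest first within group
        ranks = np.arange(1, len(order) + 1)              # 1-based rank i
        # Rank-dependent threshold: later ranks in a group face a stricter bar,
        # which spreads selected samples across groups.
        threshold = lam + gamma / (np.sqrt(ranks) + np.sqrt(ranks - 1))
        v[order] = (losses[order] < threshold).astype(float)
    return v
```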

AAAI Conference 2010 Conference Paper

Learning to Surface Deep Web Content

  • Zhaohui Wu
  • Lu Jiang
  • Qinghua Zheng
  • Jun Liu

We propose a novel deep web crawling framework based on reinforcement learning. The crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and submits a selected action (query) to the environment according to its Q-value. Based on the framework, we develop an adaptive crawling method. Experimental results show that it outperforms state-of-the-art methods in crawling capability and breaks through the assumption of full-text search implied by existing methods.
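
The framing above (crawler as agent, queries as actions) maps onto standard tabular Q-learning; the following toy sketch shows the Q-value update, with the state and reward definitions left as illustrative placeholders rather than the paper's exact design.

```python
import random
from collections import defaultdict

class QueryAgent:
    """Toy Q-learning agent that picks keyword queries to surface deep web records."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # Q[(state, action)] -> estimated value
        self.actions = actions        # candidate query keywords (illustrative)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # Epsilon-greedy selection over Q-values.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning backup: Q <- Q + alpha * (r + gamma * max_a' Q' - Q).
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

# Usage sketch: the reward could be the number of new records a query returns,
# and the state a summary of the crawl so far (both are illustrative choices).
```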