Author name cluster

Han Lin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

2 author rows

AAAI Conference 2026 Conference Paper

DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

Zun Wang
Jialu Li
Han Lin
Jaehong Yoon
Mohit Bansal

Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLM for high-level planning to decompose a story into scene-level descriptions, which are then independently generated and stitched together. However, these approaches struggle with generating high-quality videos aligned with the complex single-scene description, as visualizing such complex description involves coherent composition of multiple objects/events, complex motion synthesis and character customization with sequential motions. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning as well as fine-grained object-level layout planning. Next, DREAMRUNNER presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos, thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module SR3AI for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DREAMRUNNER with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DREAMRUNNER exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we demonstrate DREAMRUNNER’s ability to generate multi-character interactions with qualitative examples.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Han Lin
Jaemin Cho
Amir Zadeh
Chuan Li
Mohit Bansal

There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices. Project page: https: //bifrost-1. github. io.

PDF Details

ICLR Conference 2025 Conference Paper

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Han Lin
Jaemin Cho 0001
Abhay Zala
Mohit Bansal

ControlNets are widely used for adding spatial control to text-to-image diffusion models. However, when it comes to controllable video generation, ControlNets cannot be directly integrated into new backbones due to feature space mismatches, and training ControlNets for new backbones can be a significant burden for many users. Furthermore, applying ControlNets independently to different frames can not effectively maintain object temporal consistency. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion models through the adaptation of pretrained ControlNets. Ctrl-Adapter offers strong and diverse capabilities, including image and video control, sparse-frame video control, fine-grained patch-level multi-condition control, zero-shot adaptation to unseen conditions, and supports a variety of downstream tasks beyond spatial control, including video editing, video style transfer, and text-guided motion control. With six diverse U-Net/DiT-based image/video diffusion models (SDXL, PixArt-α, I2VGen-XL, SVD, Latte, Hotshot-XL), Ctrl-Adapter matches the performance of pretrained ControlNets on COCO and achieves the state-of-the-art on DAVIS 2017 with significantly lower computation (< 10 GPU hours).

Details

ICLR Conference 2025 Conference Paper

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

Han Lin
Tushar Nagarajan
Nicolas Ballas
Mahmoud Assran
Mojtaba Komeili
Mohit Bansal
Koustuv Sinha

Procedural video representation learning is an active research area where the objective is to learn an agent which can anticipate and forecast the future given the present video input, typically in conjunction with textual annotations. Prior works often rely on large-scale pretraining of visual encoders and prediction models with language supervision. However, the necessity and effectiveness of extending compute intensive pretraining to learn video clip sequences with noisy text supervision have not yet been fully validated by previous works. In this work, we show that a strong off-the-shelf frozen pretrained visual encoder, along with a well designed prediction model, can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning without the need for pretraining the prediction model, nor requiring additional supervision from language or ASR. Instead of learning representations from pixel space, our method utilizes the latent embedding space of publicly available vision encoders. By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting through iterative denoising —leveraging the recent advances in diffusion transformers (Peebles & Xie, 2023). Empirical studies over a total of five procedural learning tasks across four datasets (NIV, CrossTask, COIN and Ego4D-v2) show that our model advances the strong baselines in long-horizon action anticipation (+2.6% in Verb ED@20, +3.1% in Noun ED@20), and significantly improves the SoTA in step forecasting (+5.0%), task classification (+3.8%), and procedure planning tasks (up to +2.28% in success rate, +3.39% in mAcc, and +0.90% in mIoU).

Details

NeurIPS Conference 2024 Conference Paper

Fast Tree-Field Integrators: From Low Displacement Rank to Topological Transformers

Krzysztof Choromanski
Arijit Sehanobish
Somnath B. Chowdhury
Han Lin
Avinava Dubey
Tamas Sarlos
Snigdha Chaturvedi

We present a new class of fast polylog-linear algorithms based on the theory of structured matrices (in particular low displacement rank ) for integrating tensor fields defined on weighted trees. Several applications of the resulting fast tree-field integrators (FTFIs) are presented, including: (a) approximation of graph metrics with tree metrics, (b) graph classification, (c) modeling on meshes, and finally (d) Topological Transformers (TTs) (Choromanski et al. , 2022) for images. For Topological Transformers, we propose new relative position encoding (RPE) masking mechanisms with as few as three extra learnable parameters per Transformer layer, leading to 1. 0-1. 5\%+ accuracy gains. Importantly, most of FTFIs are exact methods, thus numerically equivalent to their brute-force counterparts. When applied to graphs with thousands of nodes, those exact algorithms provide 5. 7-13x speedups. We also provide an extensive theoretical analysis of our methods.

PDF Details DOI

IJCAI Conference 2024 Conference Paper

MCM: Multi-condition Motion Synthesis Framework

Zeyu Ling
Bo Han
Yongkang Wong
Han Lin
Mohan Kankanhalli
Weidong Geng

Conditional human motion synthesis (HMS) aims to generate human motion sequences that conform to specific conditions. Text and audio represent the two predominant modalities employed as HMS control conditions. While existing research has primarily focused on single conditions, the multi-condition human motion synthesis remains underexplored. In this study, we propose a multi-condition HMS framework, termed MCM, based on a dual-branch structure composed of a main branch and a control branch. This framework effectively extends the applicability of the diffusion model, which is initially predicated solely on textual conditions, to auditory conditions. This extension encompasses both music-to-dance and co-speech HMS while preserving the intrinsic quality of motion and the capabilities for semantic association inherent in the original model. Furthermore, we propose the implementation of a Transformer-based diffusion model, designated as MWNet, as the main branch. This model adeptly apprehends the spatial intricacies and inter-joint correlations inherent in motion sequences, facilitated by the integration of multi-wise self-attention modules. Extensive experiments show that our method achieves competitive results in single-condition and multi-condition HMS tasks.

PDF Details DOI

ICML Conference 2024 Conference Paper

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying
Fanqing Meng
Jin Wang
Zhiqian Li
Han Lin
Yue Yang
Hao Zhang 0117
Wenbo Zhang 0009

Large Vision-Language Models (LVLMs) show significant strides in general-propose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises $31, 325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $20$ publicly available LVLMs such as the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.

Details

ICML Conference 2023 Conference Paper

Efficient Graph Field Integrators Meet Point Clouds

Krzysztof Choromanski
Arijit Sehanobish
Han Lin
Yunfan Zhao
Eli Berger
Tetiana Parshakova
Alvin Pan
David Watkins

We present two new classes of algorithms for efficient field integration on graphs encoding point cloud data. The first class, $\mathrm{SeparatorFactorization}$ (SF), leverages the bounded genus of point cloud mesh graphs, while the second class, $\mathrm{RFDiffusion}$ (RFD), uses popular $\epsilon$-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Methods (FMMs), which have had a tremendous impact on efficient integration, but for non-Euclidean spaces. We focus on geometries induced by distributions of walk lengths between points (e. g. shortest-path distance). We provide an extensive theoretical analysis of our algorithms, obtaining new results in structural graph theory as a byproduct. We also perform exhaustive empirical evaluation, including on-surface interpolation for rigid and deformable objects (in particular for mesh-dynamics modeling) as well as Wasserstein distance computations for point clouds, including the Gromov-Wasserstein variant.

Details

ICRA Conference 2023 Conference Paper

TANDEM3D: Active Tactile Exploration for 3D Object Recognition

Jingxi Xu 0002
Han Lin
Shuran Song
Matei Ciocarlie

Tactile recognition of 3D objects remains a challenging task. Compared to 2D shapes, the complex geometry of 3D surfaces requires richer tactile signals, more dexterous actions, and more advanced encoding techniques. In this work, we propose TANDEM3D, a method that applies a co-training framework for exploration and decision making to 3D object recognition with tactile signals. Starting with our previous work, which introduced a co-training paradigm for 2D recognition problems, we introduce a number of advances that enable us to scale up to 3D. TANDEM3D is based on a novel encoder that builds 3D object representation from contact positions and normals using PointNet++. Furthermore, by enabling 6DOF movement, TANDEM3D explores and collects discriminative touch information with high efficiency. Our method is trained entirely in simulation and validated with real-world experiments. Compared to state-of-the-art baselines, TANDEM3D achieves higher accuracy and a lower number of actions in recognizing 3D objects and is also shown to be more robust to different types and amounts of sensor noise.

Details

ICML Conference 2022 Conference Paper

From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers

Krzysztof Choromanski
Han Lin
Haoxian Chen 0002
Tianyi Zhang
Arijit Sehanobish
Valerii Likhosherstov
Jack Parker-Holder
Tamás Sarlós

In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al. , 2021) and log-linear RPE-attention (Luo et al. , 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.

Details

ICLR Conference 2022 Conference Paper

Hybrid Random Features

Krzysztof Choromanski
Han Lin
Haoxian Chen 0002
Arijit Sehanobish
Yuanzhe Ma
Deepali Jain
Jake Varley
Andy Zeng 0001

We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide most accurate approximation in the defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric (Rahimi & Recht, 2007) or (recently introduced in the context of linear-attention Transformers) positive random features (Choromanski et al., 2021). By generalizing Bochner’s Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF-mechanism provides strong theoretical guarantees - unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct exhaustive empirical evaluation of HRF ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating its quality in a wide spectrum of machine learning problems.

Details

NeurIPS Conference 2020 Conference Paper

Demystifying Orthogonal Monte Carlo and Beyond

Han Lin
Haoxian Chen
Krzysztof M. Choromanski
Tianyi Zhang
Clement Laroche

Orthogonal Monte Carlo (OMC) is a very effective sampling algorithm imposing structural geometric conditions (orthogonality) on samples for variance reduction. Due to its simplicity and superior performance as compared to its Quasi Monte Carlo counterparts, OMC is used in a wide spectrum of challenging machine learning applications ranging from scalable kernel methods to predictive recurrent neural networks, generative models and reinforcement learning. However theoretical understanding of the method remains very limited. In this paper we shed new light on the theoretical principles behind OMC, applying theory of negatively dependent random variables to obtain several new concentration results. As a corollary, we manage to obtain first uniform convergence results for OMCs and consequently, substantially strengthen best known downstream guarantees for kernel ridge regression via OMCs. We also propose novel extensions of the method leveraging theory of algebraic varieties over finite fields and particle algorithms, called Near-Orthogonal Monte Carlo (NOMC). We show that NOMC is the first algorithm consistently outperforming OMC in applications ranging from kernel methods to approximating distances in probabilistic metric spaces.

PDF Details

AAMAS Conference 2016 Conference Paper

A Kinect-based Interactive Game to Improve the Cognitive Inhibition of the Elderly (Demonstration)

Siyuan Liu
Zhiqi Shen
Han Yu
Han Lin
Zhengjin Guo
Zhengxiang Pan
Chunyan Miao
Cryil Leung

Cognitive abilities, including cognitive inhibition, degenerate with the aging process. In this demonstration, we present a Kinect-based interactive game which aims to improve the cognitive inhibition ability of the elderly. The game is designed in the table tennis theme, and the adoption of Kinect makes it convenient for the elderly to use. The players’ in-game behaviour data are recorded for the health advisor agent to conduct personalization, analysis, and decision making. A pilot study has been conducted to investigate the relationship between the players’ cognitive inhibition abilities and their in-game performance. The study results suggest that the in-game performance can reflect a player’s cognitive inhibition ability, and indicate that the game can be used to improve the cognitive inhibition ability of the elderly in the future.

PDF

AAAI Conference 2008 Conference Paper

Within-problem Learning for Efficient Lower Bound Computation in Max-SAT Solving

Han Lin

This paper focuses on improving branch-and-bound Max-SAT solvers by speeding up the lower bound computation. We notice that the existing propagation-based computing methods and the resolution-based computing methods, which have been studied intensively, both suffer from several drawbacks. In order to overcome these drawbacks, we propose a new method with a nice property that guarantees the increment of lower bounds. The new method exploits within-problem learning techniques. More specifically, at each branch point in the search-tree, the current node is enabled to inherit inconsistencies from its parent and learn information about effectiveness of the lower bound computing procedure from previous nodes. Furthermore, after branching on a new variable, the inconsistencies may shrink by applying unit propagation to them, and such process increases the probability of getting better lower bounds. We graft the new techniques into maxsatz and the experimental results demonstrate that the new solver outperforms the best state-of-the-art solvers on a wide range of instances including random and structured ones.

PDF Details

IJCAI Conference 2007 Conference Paper

Han Lin
Kaile Su

In this paper we present a general logical framework for (weighted) MAX-SAT problem, and study properties of inference rules for branch and bound MAX-SAT solver. Several new rules, which are not equivalent but $\Lambda$-equivalent, are proposed, and we show that $\Lambda$-equivalent rules are also sound. As an example, we show how to exploit inference rules to achieve a new lower bound function for a MAX-2-SAT solver. Our new function is admissible and consistently better than the well-known lower bound function. Based on the study of inference rules, we implement an efficient solver and the experimental results demonstrate that our solver outperforms the most efficient solver that has been implemented very recently[Heras and Larrosa, 2006], especially for large instances.

PDF Details

AAAI Conference 2007 Conference Paper

A Modal Logic for Beliefs and Pro Attitudes

Su K
Han Lin

Agents’ pro attitudes such as goals, intentions, desires, wishes, and judgements of satisfactoriness play an important role in how agents act rationally. To provide a natural and satisfying formalization of these attitudes is a longstanding problem in the community of agent theory. Most of existing modal logic approaches are based on Kripke structures and have to face the so-called side-effect problem. This paper presents a new modal logic formalizing agents’ pro attitudes, based on neighborhood models. There are three distinguishing features of this logic. Firstly, this logic naturally satisﬁes Bratman’s requirements for agents’ beliefs and pro attitudes, as well as some interesting properties that have not been discussed before. Secondly, we give a sound and complete axiom system for characterizing all the valid properties of beliefs and pro attitudes. We introduce for the ﬁrst time the notion of linear neighborhood frame for obtaining the semantic model, and this brings a new member to the family of non-normal modal logics. Finally, we argue that the present logic satisﬁes an important requirement proposed from the viewpoint of computation, that is, computational grounding, which means that properties in this logic can be given an interpretation in terms of some concrete computational model. Indeed, the presented neighborhood frame can be naturally derived from probabilistic programming with utilities.

PDF Details