Arrow Research search

Author name cluster

Max Ku

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers

6

TMLR Journal 2024 Journal Article

AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

  • Max Ku
  • Cong Wei
  • Weiming Ren
  • Huan Yang
  • Wenhu Chen

In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V can also support any video length. Our evaluation shows that AnyV2V achieved CLIP Scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks.

NeurIPS Conference 2024 Conference Paper

GenAI Arena: An Open Evaluation Platform for Generative Models

  • Dongfu Jiang
  • Max Ku
  • Tianle Li
  • Yuansheng Ni
  • Shizhuo Sun
  • Rongqi Fan
  • Wenhu Chen

Generative AI has made remarkable strides to revolutionize fields such as image and video generation. These advancements are driven by innovative algorithms, architecture, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, FVD, etc often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes an open platform GenAI-Arena to evaluate different image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three tasks of text-to-image generation, text-to-video generation, and image editing respectively. Currently, we cover a total of 35 open-source generative models. GenAI-Arena has been operating for seven months, amassing over 9000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote the research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt the existing multi-modal models like Gemini, and GPT-4o to mimic human voting. We compute the accuracy by comparing the model voting with the human voting to understand their judging abilities. Our results show existing multimodal models are still lagging in assessing the generated visual content, even the best model GPT-4o only achieves an average accuracy of $49. 19\%$ across the three generative tasks. Open-source MLLMs perform even worse due to the lack of instruction-following and reasoning ability in complex vision scenarios.

ICLR Conference 2024 Conference Paper

ImagenHub: Standardizing the evaluation of conditional image generation models

  • Max Ku
  • Tianle Li
  • Kai Zhang 0033
  • Yujie Lu
  • Xingyu Fu
  • Wenwen Zhuang
  • Wenhu Chen

Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions: datasets, inference, and evaluation metrics -- render fair comparisons difficult. This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we built a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves a high inter-worker agreement of Krippendorff’s alpha on 76\% models with a value higher than 0.4. We comprehensively evaluated a total of around 30 models and observed three key takeaways: (1) the existing models’ performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74\% models achieving an overall score lower than 0.5. (2) we examined the claims from published papers and found 83\% of them hold with a few exceptions. (3) None of the existing automatic metrics has a Spearman's correlation higher than 0.2 except subject-driven image generation. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation.

TMLR Journal 2024 Journal Article

Mantis: Interleaved Multi-Image Instruction Tuning

  • Dongfu Jiang
  • Xuan He
  • Huaye Zeng
  • Cong Wei
  • Max Ku
  • Qian Liu
  • Wenhu Chen

Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K multi-image instruction data to train a family of Mantis models. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training, instead, they can be gained by low-cost instruction tuning. The training and evaluation of Mantis has paved the road for future work to improve LMMs' multi-image abilities.

NeurIPS Conference 2024 Conference Paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

  • Yubo Wang
  • Xueguang Ma
  • Ge Zhang
  • Yuansheng Ni
  • Abhranil Chandra
  • Shiguang Guo
  • Weiming Ren
  • Aaran Arulraj

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates part of the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16\% to 33\% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5\% in MMLU to just 2\% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is more discriminative benchmark to better track progress in the field.

TMLR Journal 2023 Journal Article

DreamEdit: Subject-driven Image Editing

  • Tianle Li
  • Max Ku
  • Cong Wei
  • Wenhu Chen

Subject-driven image generation aims at generating images containing customized subjects, which has recently drawn enormous attention from the research community. Nevertheless, the previous works cannot precisely control the background and position of the target subject. In this work, we aspire to fill the void of the existing subject-driven generation tasks. To this end, we propose two novel subject-driven editing sub-tasks, i.e., Subject Replacement and Subject Addition. The new tasks are challenging in multiple aspects: replacing a subject with a customized one can totally change its shape, texture, and color, while adding a target subject to a designated position in a provided scene necessitates a rational context-aware posture of the subject. To conquer these two novel tasks, we first manually curate a new dataset called DreamEditBench containing 22 different types of subjects, and 440 source images, which cover diverse scenarios with different difficulty levels. We plan to host DreamEditBench as a platform and hire trained evaluators for standardized human evaluation. We also devise an innovative method DreamEditor to resolve these tasks by performing iterative generation, which enables a smooth adaptation to the customized subject. In this project, we conduct automatic and human evaluations to understand the performance of our DreamEditor and baselines on DreamEditBench. We found that the new tasks are challenging for the existing models. For Subject Replacement, we found that the existing models are particularly sensitive to the shape and color of the original subject. When the original subject and the customized subject are highly different, the model failure rate will dramatically increase. For Subject Addition, we found that the existing models cannot easily blend the customized subjects into the background smoothly, which causes noticeable artifacts in the generated image. We hope that DreamEditBench can become a standardized platform to enable future investigations towards building more controllable subject-driven image editing. Our project and benchmark homepage is https://dreameditbenchteam.github.io/