Arrow Research search

Author name cluster

Xuan He

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers

6

ICLR Conference 2025 Conference Paper

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

  • Jiacheng Chen
  • Tianhao Liang
  • Sherman Siu
  • Zhengqing Wang
  • Kai Wang 0068
  • Yubo Wang 0019
  • Yuansheng Ni
  • Ziyan Jiang

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MM-Bench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, \LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.

IJCAI Conference 2024 Conference Paper

CF-Deformable DETR: An End-to-End Alignment-Free Model for Weakly Aligned Visible-Infrared Object Detection

  • Haolong Fu
  • Jin Yuan
  • Guojin Zhong
  • Xuan He
  • Jiacheng Lin
  • Zhiyong Li

Weakly aligned visible-infrared object detection poses significant challenges due to the imprecise alignment between visible and infrared images. Most existing methods explore the alignment strategies between visible and infrared images, yielding unbearable computation costs. This paper first proposes an end-to-end alignment-free architecture Cross-modal Fusion Deformable DEtection TRansformer (``CF-Deformable DETR'') for weakly aligned visible-infrared object detection. Abandoning the traditional image alignment, CF-Deformable DETR introduces a simple yet effective cross-modal deformable attention mechanism to directly implement automatic cross-modal point mapping, generating well-aligned bimodal features with high efficiency. Moreover, we design a Point-level Feature Consistency Loss to guide the cross-modal point mapping, ensuring the consistency of paired features to support the following fusion. Extensive experiments are conducted on three benchmark datasets. The experimental results demonstrate that CF-Deformable DETR achieves close accuracy on weakly aligned and strictly aligned data as well as maintains stable performance to a certain extent against various offset degrees of weakly aligned data. Code is available at https: //github. com/116508/CF-Deformable-DETR.

IJCAI Conference 2024 Conference Paper

LocMoE: A Low-overhead MoE for Large Language Model Training

  • Jing Li
  • Zhijie Sun
  • Xuan He
  • Li Zeng
  • Yi Lin
  • Entong Li
  • Binfan Zheng
  • Rongqian Zhao

The Mixtures-of-Experts (MoE) model is a widespread distributed and integrated learning method for large language models (LLM), which is favored due to its ability to sparsify and expand models efficiently. However, the performance of MoE is limited by load imbalance and high latency of All-to-All communication, along with relatively redundant computation owing to large expert capacity. Load imbalance may result from existing routing policies that consistently tend to select certain experts. The frequent inter-node communication in the All-to-All procedure also significantly prolongs the training time. To alleviate the above performance problems, we propose a novel routing strategy that combines load balance and locality by converting partial inter-node communication to that of intra-node. Notably, we elucidate that there is a minimum threshold for expert capacity, calculated through the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications on the PanGu-Σ model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experiment results demonstrate that the proposed LocMoE reduces training time per epoch by 12. 68% to 22. 24% compared to classical routers, such as hash router and switch router, without impacting the model accuracy.

TMLR Journal 2024 Journal Article

Mantis: Interleaved Multi-Image Instruction Tuning

  • Dongfu Jiang
  • Xuan He
  • Huaye Zeng
  • Cong Wei
  • Max Ku
  • Qian Liu
  • Wenhu Chen

Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, and Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K multi-image instruction data to train a family of Mantis models. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on 8 multi-image benchmarks and 6 single-image benchmarks. Mantis-Idefics2 can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B by an average of 13 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training, instead, they can be gained by low-cost instruction tuning. The training and evaluation of Mantis has paved the road for future work to improve LMMs' multi-image abilities.

NeurIPS Conference 2024 Conference Paper

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

  • Yubo Wang
  • Xueguang Ma
  • Ge Zhang
  • Yuansheng Ni
  • Abhranil Chandra
  • Shiguang Guo
  • Weiming Ren
  • Aaran Arulraj

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates part of the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16\% to 33\% compared to MMLU, but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5\% in MMLU to just 2\% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is more discriminative benchmark to better track progress in the field.

JBHI Journal 2020 Journal Article

MR-Forest: A Deep Decision Framework for False Positive Reduction in Pulmonary Nodule Detection

  • Hongbo Zhu
  • Hai Zhao
  • Chunhe Song
  • Zijian Bian
  • Yuanguo Bi
  • Tong Liu
  • Xuan He
  • Dongxiang Yang

With the development of deep learning methods such as convolutional neural network (CNN), the accuracy of automated pulmonary nodule detection has been greatly improved. However, the high computational and storage costs of the large-scale network have been a potential concern for the future widespread clinical application. In this paper, an alternative Multi-ringed (MR)-Forest framework, against the resource-consuming neural networks (NN)-based architectures, has been proposed for false positive reduction in pulmonary nodule detection, which consists of three steps. First, a novel multi-ringed scanning method is used to extract the order ring facets (ORFs) from the surface voxels of the volumetric nodule models; Second, Mesh-LBP and mapping deformation are employed to estimate the texture and shape features. By sliding and resampling the multi-ringed ORFs, feature volumes with different lengths are generated. Finally, the outputs of multi-level are cascaded to predict the candidate class. On 1034 scans merging the dataset from the Affiliated Hospital of Liaoning University of Traditional Chinese Medicine (AH-LUTCM) and the LUNA16 Challenge dataset, our framework performs enough competitiveness than state-of-the-art in false positive reduction task (CPM score of 0. 865). Experimental results demonstrate that MR-Forest is a successful solution to satisfy both resource-consuming and effectiveness for automated pulmonary nodule detection. The proposed MR-forest is a general architecture for 3D target detection, it can be easily extended in many other medical imaging analysis tasks, where the growth trend of the targeting object is approximated as a spheroidal expansion.