Arrow Research search

Author name cluster

Hantian Ding

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

ICML 2024 Conference Paper

Bifurcated Attention for Single-Context Large-Batch Sampling

  • Ben Athiwaratkun
  • Sujan Kumar Gonugondla
  • Sanjay Krishna Gouda
  • Haifeng Qian
  • Hantian Ding
  • Qing Sun 0013
  • Jun Wang 0022
  • Jiacheng Guo

In our study, we present bifurcated attention, a method developed for language model inference in single-context batch sampling settings. This approach aims to reduce redundant memory IO costs, a significant contributor to latency at high batch sizes and long context lengths. Bifurcated attention achieves this by dividing the attention mechanism during incremental decoding into two distinct GEMM operations: one over the KV cache from prefill, and one over the KV cache produced during decoding. The method computes attention exactly and maintains the usual computational load (FLOPs) of standard attention, but with reduced memory IO. Bifurcated attention is also compatible with multi-query attention, which is itself known for reducing KV-cache memory IO, further enabling higher batch sizes and context lengths. The resulting efficiency leads to lower latency and improved suitability for real-time applications, e.g., enabling massively parallel answer generation without substantially increasing latency, and enhancing performance when integrated with post-processing techniques such as reranking.
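
The split into two GEMMs can be made concrete with a small example. Below is a minimal NumPy sketch of the idea as the abstract describes it: attention over the shared prefill KV cache is computed once as a single matrix product, while attention over each sample's own decoded tokens is a separate batched product, and the two score blocks share one softmax so the result matches standard attention. All shapes and function names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of bifurcated attention for single-context batch sampling:
# every sample shares one prefill KV cache, so it is read once (GEMM 1)
# instead of once per sample; per-sample decoded KV gets its own GEMM.
import numpy as np

def bifurcated_attention(q, k_prefill, v_prefill, k_decode, v_decode):
    """q: (batch, d) query at the current decode step.
    k_prefill, v_prefill: (ctx, d) KV cache of the single shared context.
    k_decode, v_decode: (batch, steps, d) per-sample KV from decoding."""
    d = q.shape[-1]
    # GEMM 1: all samples attend to the one shared prefill cache.
    scores_prefill = q @ k_prefill.T / np.sqrt(d)                      # (batch, ctx)
    # GEMM 2 (batched): each sample attends to its own decoded tokens.
    scores_decode = np.einsum("bd,bsd->bs", q, k_decode) / np.sqrt(d)  # (batch, steps)
    # One softmax over the concatenated scores keeps the computation exact.
    scores = np.concatenate([scores_prefill, scores_decode], axis=-1)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    ctx = k_prefill.shape[0]
    # Recombine: weighted sum over prefill values plus decoded values.
    return w[:, :ctx] @ v_prefill + np.einsum("bs,bsd->bd", w[:, ctx:], v_decode)
```

The output is identical to attending over the full concatenated KV cache per sample; only the memory traffic pattern changes.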

ICLR 2024 Conference Paper

Code Representation Learning at Scale

  • Dejiao Zhang
  • Wasi Uddin Ahmad
  • Ming Tan
  • Hantian Ding
  • Ramesh Nallapati
  • Dan Roth 0001
  • Xiaofei Ma 0001
  • Bing Xiang

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks such as code generation. However, most existing work on code representation learning trains models at the hundred-million-parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masked language modeling and the implicit structural and semantic aspects of programming languages. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms existing models on a wide variety of downstream tasks by large margins. To understand the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining schemes determine how downstream task performance scales with model size.
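
The second stage's contrastive objective with explicit hard negatives can be illustrated compactly. The sketch below is a hedged PyTorch example of an InfoNCE-style loss in which each anchor is contrasted against its own positive, the other in-batch positives, and a set of hard negatives; the temperature value and batch construction are assumptions made for illustration, not the paper's training code.

```python
# Sketch of contrastive learning with hard negatives (InfoNCE-style).
# anchor/positive/hard_negative come from the encoder; construction of the
# hard pairs (unsupervised, per the abstract) is outside this sketch.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """anchor, positive, hard_negative: (batch, dim) embeddings.
    Pulls each anchor toward its positive; pushes it away from the other
    in-batch positives and from the explicit hard negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(hard_negative, dim=-1)
    # Similarities to all in-batch positives and to all hard negatives.
    logits = torch.cat([a @ p.T, a @ n.T], dim=1) / temperature  # (batch, 2*batch)
    # The i-th anchor's true positive sits at column i.
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```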

ICML 2024 Conference Paper

Fewer Truncations Improve Language Modeling

  • Hantian Ding
  • Zijian Wang 0002
  • Giovanni Paolini
  • Varun Kumar
  • Anoop Deoras
  • Dan Roth 0001
  • Stefano Soatto

In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity: it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., +4.7% on reading comprehension, +16.8% in context following, and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%.
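
The "length-aware combinatorial optimization" here is essentially a bin-packing problem, and a best-fit-decreasing heuristic makes the idea concrete. Below is a minimal Python sketch, assuming every document has already been chunked to at most the sequence length; the function name and greedy strategy follow the standard bin-packing heuristic and are an illustration, not the paper's actual implementation.

```python
# Best-fit-decreasing packing: place each document (longest first) into the
# fullest training sequence that still has room; open a new sequence if none
# fits. No document is ever split, so no truncation occurs.
import bisect

def best_fit_packing(doc_lengths, max_len):
    """Return a list of bins, each a list of (doc_index, length) pairs."""
    bins = []        # contents of each training sequence
    remaining = []   # sorted list of (free_space, bin_index)
    for idx, length in sorted(enumerate(doc_lengths), key=lambda x: -x[1]):
        pos = bisect.bisect_left(remaining, (length, -1))
        if pos < len(remaining):                 # tightest bin that still fits
            free, b = remaining.pop(pos)
            bins[b].append((idx, length))
            bisect.insort(remaining, (free - length, b))
        else:                                    # open a new sequence
            bins.append([(idx, length)])
            bisect.insort(remaining, (max_len - length, len(bins) - 1))
    return bins

# e.g. best_fit_packing([5, 9, 3, 7], max_len=10)
# -> [[(1, 9)], [(3, 7), (2, 3)], [(0, 5)]]  (three sequences, zero truncation)
```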

NeurIPS 2023 Conference Paper

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  • Yangruibo Ding
  • Zijian Wang
  • Wasi Ahmad
  • Hantian Ding
  • Ming Tan
  • Nihal Jain
  • Murali Krishna Ramanathan
  • Ramesh Nallapati

Code completion models have made significant progress in recent years, yet current popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing real-world software development, where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates an in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding this context to the prompt. However, despite such improvements, performance remains far from its ceiling even with the highest-performing model, indicating that CrossCodeEval can also assess a model's capability to leverage extensive context for better code completion. Finally, we benchmark various methods for retrieving cross-file context and show that CrossCodeEval can also be used to measure the capability of code retrievers.
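
One simple way to picture the static-analysis step is to flag positions where an imported symbol is used, since completing code at those positions requires knowledge from another file. The Python sketch below illustrates this with the standard ast module; it is a deliberately simplified stand-in for the benchmark's actual pipeline, which the abstract does not spell out.

```python
# Toy static analysis: find uses of imported names in a source file. A code
# completion cut at any of these positions would need cross-file context.
import ast

def cross_file_usages(source):
    """Return (line, name) pairs where an imported symbol is read."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # "import foo.bar" binds "foo"; "from m import f as g" binds "g".
                imported.add(alias.asname or alias.name.split(".")[0])
    usages = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if node.id in imported:
                usages.append((node.lineno, node.id))
    return sorted(usages)

# e.g. cross_file_usages("from utils import helper\nx = helper(1)\n")
# -> [(2, 'helper')]
```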

ICLR 2023 Conference Paper

Multi-lingual Evaluation of Code Generation Models

  • Ben Athiwaratkun
  • Sanjay Krishna Gouda
  • Zijian Wang 0002
  • Xiaopeng Li 0002
  • Yuchen Tian
  • Ming Tan
  • Wasi Uddin Ahmad
  • Shiqi Wang 0002

We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code completion models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks.
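
The transpilation of prompts and tests can be pictured as template-based rendering per target language. Below is a minimal, hedged Python sketch that converts a toy function signature and test case into a Java-style prompt and assertion; the templates, the single Java target, and the integer-only parameters are illustrative assumptions, since the abstract does not describe the framework's internals.

```python
# Toy prompt/test conversion in the spirit of the framework described above:
# render a Python-style problem (name, args, docstring, test case) into a
# target-language prompt stub and assertion. Java-only and int-only here.
JAVA_PROMPT = """class Problem {{
    // {doc}
    public static int {name}({params}) {{
"""

def to_java_prompt(name, args, doc):
    params = ", ".join(f"int {a}" for a in args)  # int-only for illustration
    return JAVA_PROMPT.format(doc=doc, name=name, params=params)

def to_java_test(name, inputs, expected):
    call = f"Problem.{name}({', '.join(map(str, inputs))})"
    return f"assert {call} == {expected};"

print(to_java_prompt("add", ["a", "b"], "Return the sum of a and b."))
print(to_java_test("add", [2, 3], 5))
```

A real framework would need per-language type mapping and value serialization; the template approach shown here only conveys the overall shape of the conversion.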