Arrow Research search

Author name cluster

Ming Tan

Papers possibly associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

ICLR Conference 2024 Conference Paper

Code Representation Learning at Scale

  • Dejiao Zhang
  • Wasi Uddin Ahmad
  • Ming Tan
  • Hantian Ding
  • Ramesh Nallapati
  • Dan Roth 0001
  • Xiaofei Ma 0001
  • Bing Xiang

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, e.g., code generation. However, most existing work on code representation learning trains models at the hundred-million-parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both the randomness of masked language modeling and the structural and semantic aspects of programming languages. We then enhance the representations via contrastive learning, with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms existing models on a wide variety of downstream tasks by large margins. To understand the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining scheme determines how downstream task performance scales with model size.
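The second-stage objective can be pictured as an InfoNCE-style contrastive loss over paired embeddings. Below is a minimal NumPy sketch, assuming unit-normalized anchor/positive embeddings and in-batch negatives; the `hard_weight` up-weighting of the most similar negative is an illustrative stand-in for the paper's hard-negative construction, not its actual method.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05, hard_weight=2.0):
    """InfoNCE-style contrastive loss with in-batch negatives.

    anchors, positives: (batch, dim) unit-normalized embeddings; row i of
    positives is the positive for row i of anchors, and every other row
    serves as a negative. hard_weight > 1 up-weights the hardest (most
    similar) negative per anchor -- a rough stand-in for hard negatives.
    """
    sim = anchors @ positives.T / temperature      # (batch, batch) cosine logits
    batch = sim.shape[0]
    off_diag = ~np.eye(batch, dtype=bool)          # negatives live off-diagonal
    weights = np.ones_like(sim)
    hardest = np.where(off_diag, sim, -np.inf).argmax(axis=1)
    weights[np.arange(batch), hardest] = hard_weight
    logits = sim + np.log(weights)                 # apply weight inside softmax
    m = logits.max(axis=1, keepdims=True)          # stabilize the log-sum-exp
    lse = (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()
    return float(np.mean(lse - np.diag(sim)))      # -log p(positive | anchor)
```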

NeurIPS Conference 2023 Conference Paper

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

  • Yangruibo Ding
  • Zijian Wang
  • Wasi Ahmad
  • Hantian Ding
  • Ming Tan
  • Nihal Jain
  • Murali Krishna Ramanathan
  • Ramesh Nallapati

Code completion models have made significant progress in recent years, yet popular evaluation datasets, such as HumanEval and MBPP, predominantly focus on code completion tasks within a single file. This over-simplified setting falls short of representing real-world software development, where repositories span multiple files with numerous cross-file dependencies, and accessing and understanding cross-file context is often required to complete the code correctly. To fill this gap, we propose CrossCodeEval, a diverse and multilingual code completion benchmark that necessitates in-depth cross-file contextual understanding to complete the code accurately. CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively licensed repositories in four popular programming languages: Python, Java, TypeScript, and C#. To create examples that strictly require cross-file context for accurate completion, we propose a straightforward yet efficient static-analysis-based approach to pinpoint the use of cross-file context within the current file. Extensive experiments on state-of-the-art code language models like CodeGen and StarCoder demonstrate that CrossCodeEval is extremely challenging when the relevant cross-file context is absent, and we see clear improvements when adding this context to the prompt. However, despite such improvements, even the highest-performing model falls well short of ceiling performance, indicating that CrossCodeEval can also assess a model's capability to leverage extensive context for better code completion. Finally, we benchmark various methods of retrieving cross-file context and show that CrossCodeEval can also be used to measure the capability of code retrievers.
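The static check at the heart of example construction can be sketched in a few lines: collect the names a file binds through imports, then flag completion points that reference them. This is only the flavor of the idea, sketched for Python with the standard `ast` module; the benchmark's actual pipeline covers four languages and is considerably more careful.

```python
import ast

def imported_symbols(source: str) -> set:
    """Names a Python file binds via imports, i.e., candidates whose later
    use makes a completion depend on cross-file (or cross-module) context."""
    symbols = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            symbols.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Import):
            symbols.update((a.asname or a.name).split(".")[0] for a in node.names)
    return symbols

def needs_crossfile_context(completion_line: str, symbols: set) -> bool:
    """True if the line to be completed references an imported name."""
    try:
        tree = ast.parse(completion_line)
    except SyntaxError:                  # partial lines won't always parse
        return any(s in completion_line for s in symbols)
    return any(isinstance(n, ast.Name) and n.id in symbols for n in ast.walk(tree))
```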

ICLR Conference 2023 Conference Paper

Multi-lingual Evaluation of Code Generation Models

  • Ben Athiwaratkun
  • Sanjay Krishna Gouda
  • Zijian Wang 0002
  • Xiaopeng Li 0002
  • Yuchen Tian
  • Ming Tan
  • Wasi Uddin Ahmad
  • Shiqi Wang 0002

We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code completion models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion, and we discover the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks.
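The conversion idea can be illustrated with a toy test-case transpiler. The hypothetical helper below (nowhere near the real framework) parses a single MBPP-style assertion with Python's `ast` module and emits an equivalent Java-flavored check; the actual framework additionally maps types, literals, and function signatures per target language.

```python
import ast

def assert_to_java_check(assert_line: str) -> str:
    """Turn `assert f(<literals>) == <literal>` into a Java-style assertion.

    Handles only literal arguments and results -- a toy version of the
    prompt/test-case conversion described above.
    """
    stmt = ast.parse(assert_line).body[0]            # an ast.Assert node
    call, expected = stmt.test.left, stmt.test.comparators[0]
    args = ", ".join(repr(ast.literal_eval(a)) for a in call.args)
    return (f"if ({call.func.id}({args}) != {repr(ast.literal_eval(expected))}) "
            "throw new AssertionError();")

print(assert_to_java_check("assert add(2, 3) == 5"))
# -> if (add(2, 3) != 5) throw new AssertionError();
```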

NeurIPS Conference 2013 Conference Paper

Direct 0-1 Loss Minimization and Margin Maximization with Boosting

  • Shaodan Zhai
  • Tian Xia
  • Ming Tan
  • Shaojun Wang

We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers by directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted, arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives consistently better results than AdaBoost, LogitBoost, LPBoost with column generation, and BrownBoost, and is noise tolerant when it maximizes an nth-order bottom sample margin.
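The core move DirectBoost builds on is easy to sketch: hold all ensemble weights but one fixed, nudge that one, and keep the change only if the empirical 0-1 error drops. The snippet below is a minimal illustration of that coordinatewise descent over a fixed pool of weak classifiers, not the published algorithm (which searches each coordinate exactly and follows with the margin-maximizing ascent phase).

```python
import numpy as np

def coordinate_descent_01(H, y, sweeps=100, step=0.1):
    """Greedy coordinate descent on empirical 0-1 loss.

    H: (n_samples, n_weak) weak-classifier outputs in {-1, +1};
    y: (n_samples,) labels in {-1, +1}. Per coordinate, try a small
    weight change and keep it only if the training error drops.
    """
    n, m = H.shape
    alpha = np.zeros(m)
    def err(a):
        return np.mean(np.sign(H @ a) * y <= 0)    # ties count as errors
    best = err(alpha)
    for _ in range(sweeps):
        improved = False
        for j in range(m):                         # sweep coordinates greedily
            for delta in (step, -step):
                trial = alpha.copy()
                trial[j] += delta
                e = err(trial)
                if e < best:
                    alpha, best, improved = trial, e, True
        if not improved:                           # coordinatewise local minimum
            break
    return alpha, best
```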

AAAI Conference 1991 Conference Paper

Cost-Sensitive Reinforcement Learning for Adaptive Classification and Control

  • Ming Tan

Standard reinforcement learning methods assume they can identify each state distinctly before making an action decision. In reality, a robot agent has only a limited sensing capability, and identifying each state by extensive sensing can be time-consuming. This paper describes an approach that learns active perception strategies in reinforcement learning and considers sensing costs explicitly. The approach integrates cost-sensitive learning with reinforcement learning to learn an efficient internal state representation and a decision policy simultaneously in a finite, deterministic environment. It not only maximizes the long-term discounted reward per action but also reduces the average sensing cost per state. The initial experimental results in a simulated robot navigation domain are encouraging.
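The trade-off can be pictured as a tabular Q-learning update in which the agent is charged for sensing. The helper below is a hypothetical illustration of that idea, not Tan's algorithm: the environment reward is reduced by the cost paid to identify the current state, so policies that rely on cheaply identified states score higher.

```python
def cost_sensitive_q_update(Q, state, action, reward, sensing_cost,
                            next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step with an explicit sensing charge.

    Q: dict mapping (state, action) -> value. The effective reward is the
    environment reward minus the cost of the sensing used to identify the
    state, so the learned policy trades reward against perception cost.
    """
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    target = (reward - sensing_cost) + gamma * best_next
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (target - old)
    return Q
```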

ICRA Conference 1990 Conference Paper

CSL: a cost-sensitive learning system for sensing and grasping objects

  • Ming Tan

The goal of the reported research is to build a learning robot that can survive in an unknown environment for a long time. Such a robot must learn which sensors to use, where to use them, and how to generate an inexpensive and reliable robot control procedure to accomplish its task. This is beyond the reach of standard machine learning methods, which usually ignore robot execution costs and are ill-prepared to handle failures. A cost-sensitive, noise-tolerant, inductive robot learning system, CSL, that represents the first steps toward this goal is described, with emphasis on the cost and noise issues in learning. CSL has been implemented on a real-world robot for sensing objects and selecting their grasping procedures.

AAAI Conference 1990 Conference Paper

Two Case Studies in Cost-Sensitive Concept Acquisition

  • Ming Tan

This paper explores the problem of learning from examples when feature measurement costs are significant. It then extends two effective and familiar learning methods, ID3 and IBL, to address this problem. The extensions, CS-ID3 and CS-IBL, are described in detail and are tested in a natural robot domain and a synthetic domain. Empirical studies support the hypothesis that the extended methods are indeed sensitive to feature costs: they deal effectively with varying cost distributions and with irrelevant features.
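A representative cost-sensitive split criterion (the squared-gain-over-cost form commonly attributed to CS-ID3; the exact formulation varies across the papers, so treat this as a hedged sketch) makes the trade-off concrete: an attribute wins a split only if its information gain justifies its measurement cost.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a non-empty label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cost_sensitive_gain(rows, labels, feature, cost):
    """Squared information gain divided by measurement cost.

    rows: list of dicts mapping feature name -> value; cost: the price of
    measuring `feature`. Cheap, informative features score highest.
    """
    base = entropy(labels)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    gain = base - remainder
    return gain ** 2 / cost

# pick the split that trades information against measurement cost, e.g.:
# best = max(features, key=lambda f: cost_sensitive_gain(rows, labels, f, costs[f]))
```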