Author name cluster

Ee-Peng Lim

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
1 author row

Possible papers

AAAI Conference 2026 Conference Paper

Towards Provably Unlearnable Examples via Bayes Error Optimization

  • Ruihan Zhang
  • Jun Sun
  • Ee-Peng Lim
  • Peixin Zhang

The recent success of machine learning models, especially large-scale classifiers and language models, relies heavily on training with massive data. These data are often collected from online sources. This raises serious concerns about the protection of user data, as individuals may not have given consent for their data to be used in training. To address this concern, recent studies introduce the concept of unlearnable examples, i.e., data instances that appear natural but are intentionally altered to prevent models from effectively learning from them. While existing methods demonstrate empirical effectiveness, they typically rely on heuristic trials and lack formal guarantees. Besides, when unlearnable examples are mixed with clean data, as is often the case in practice, their unlearnability disappears. In this work, we propose a novel approach to constructing unlearnable examples by systematically maximising the Bayes error, a measurement of irreducible classification error. We develop an optimisation-based approach and provide an efficient solution using projected gradient ascent. Our method provably increases the Bayes error and remains effective when the unlearnable examples are mixed with clean samples. Experimental results across multiple datasets and model architectures are consistent with our theoretical analysis and show that our approach can effectively restrict data learnability in practice.
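
The projected gradient ascent step named in this abstract can be illustrated generically. The sketch below is not the paper's method: it maximizes a toy objective (a hypothetical stand-in for the Bayes error objective) over a perturbation constrained to an L-infinity ball of radius `eps`. The function names, the step size, the constraint set, and the objective are all assumptions made for illustration.

```python
from typing import Callable, List


def project_linf(delta: List[float], eps: float) -> List[float]:
    """Project onto the L-infinity ball by clipping each coordinate to [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]


def projected_gradient_ascent(
    grad: Callable[[List[float]], List[float]],
    dim: int,
    eps: float,
    step: float = 0.1,
    iters: int = 100,
) -> List[float]:
    """Maximize an objective over ||delta||_inf <= eps, given its gradient."""
    delta = [0.0] * dim
    for _ in range(iters):
        g = grad(delta)
        # Take an ascent step, then project back onto the feasible set.
        delta = project_linf([d + step * gi for d, gi in zip(delta, g)], eps)
    return delta


# Toy objective (assumption): f(delta) = -sum((delta_i - t_i)^2), maximized at t.
target = [0.5, -0.02, 2.0]


def toy_grad(delta: List[float]) -> List[float]:
    return [-2.0 * (d - t) for d, t in zip(delta, target)]


result = projected_gradient_ascent(toy_grad, dim=3, eps=0.1)
```

Coordinates whose unconstrained optimum lies outside the ball end up on its boundary, while the coordinate whose optimum is inside the ball converges to it.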

NeurIPS Conference 2024 Conference Paper

Generative Semi-supervised Graph Anomaly Detection

  • Hezhe Qiao
  • Qingsong Wen
  • Xiaoli Li
  • Ee-Peng Lim
  • Guansong Pang

This work considers a practical semi-supervised graph anomaly detection (GAD) scenario, where part of the nodes in a graph are known to be normal, in contrast to the extensively explored unsupervised setting with a fully unlabeled graph. We reveal that having access to the normal nodes, even just a small percentage of normal nodes, helps enhance the detection performance of existing unsupervised GAD methods when they are adapted to the semi-supervised setting. However, their utilization of these normal nodes is limited. In this paper, we propose a novel Generative GAD approach (namely GGAD) for the semi-supervised scenario to better exploit the normal nodes. The key idea is to generate pseudo anomaly nodes, referred to as 'outlier nodes', for providing effective negative node samples in training a discriminative one-class classifier. The main challenge here lies in the lack of ground truth information about real anomaly nodes. To address this challenge, GGAD is designed to leverage two important priors about the anomaly nodes -- asymmetric local affinity and egocentric closeness -- to generate reliable outlier nodes that assimilate anomaly nodes in both graph structure and feature representations. Comprehensive experiments on six real-world GAD datasets are performed to establish a benchmark for semi-supervised GAD and show that GGAD substantially outperforms state-of-the-art unsupervised and semi-supervised GAD methods with varying numbers of training normal nodes.

TIST Journal 2024 Journal Article

Temporal Implicit Multimodal Networks for Investment and Risk Management

  • Gary Ang
  • Ee-Peng Lim

Many deep learning works on financial time-series forecasting focus on predicting future prices/returns of individual assets with numerical price-related information for trading, and hence propose models designed for univariate, single-task, and/or unimodal settings. Forecasting for investment and risk management involves multiple tasks in multivariate settings: forecasts of expected returns and risks of assets in portfolios, and correlations between these assets. As different sources/types of time-series influence future returns, risks, and correlations of assets in different ways, it is also important to capture time-series from different modalities. Hence, this article addresses financial time-series forecasting for investment and risk management in a multivariate, multitask, and multimodal setting. Financial time-series forecasting, however, is challenging due to the low signal-to-noise ratios typical in financial time-series, and as intra-series and inter-series relationships of assets evolve across time. To address these challenges, our proposed Temporal Implicit Multimodal Network (TIME) model learns implicit inter-series relationship networks between assets from multimodal financial time-series at multiple time-steps adaptively. TIME then uses dynamic network and temporal encoding modules to jointly capture such evolving relationships, multimodal financial time-series, and temporal representations. Our experiments show that TIME outperforms other state-of-the-art models on multiple forecasting tasks and investment and risk management applications.

AAAI Conference 2022 System Paper

MWPToolkit: An Open-Source Framework for Deep Learning-Based Math Word Problem Solvers

  • Yihuai Lan
  • Lei Wang
  • Qiyuan Zhang
  • Yunshi Lan
  • Bing Tian Dai
  • Yan Wang
  • Dongxiang Zhang
  • Ee-Peng Lim

While Math Word Problem (MWP) solving has emerged as a popular field of study and made great progress in recent years, most existing methods are benchmarked solely on one or two datasets and implemented with different configurations. In this paper, we introduce the first open-source library for solving MWPs called MWPToolkit, which provides a unified, comprehensive, and extensible framework for research purposes. Specifically, we deploy 17 deep learning-based MWP solvers and 6 MWP datasets in our toolkit. These MWP solvers are advanced models for MWP solving, covering the categories of Seq2seq, Seq2Tree, Graph2Tree, and Pre-trained Language Models. The MWP datasets are popular datasets that are commonly used as benchmarks in existing work. Our toolkit features highly modularized and reusable components, which can help researchers quickly get started and develop their own models. We have released the code and documentation of MWPToolkit at https://github.com/LYH-YF/MWPToolkit.

AAAI Conference 2020 Conference Paper

A Near-Optimal Change-Detection Based Algorithm for Piecewise-Stationary Combinatorial Semi-Bandits

  • Huozhi Zhou
  • Lingda Wang
  • Lav Varshney
  • Ee-Peng Lim

We investigate the piecewise-stationary combinatorial semi-bandit problem. Compared to the original combinatorial semi-bandit problem, our setting assumes the reward distributions of base arms may change in a piecewise-stationary manner at unknown time steps. We propose an algorithm, GLR-CUCB, which incorporates an efficient combinatorial semi-bandit algorithm, CUCB, with an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT). Our analysis shows that the regret of GLR-CUCB is upper bounded by O(√(NKT log T)), where N is the number of piecewise-stationary segments, K is the number of base arms, and T is the number of time steps. As a complement, we also derive a nearly matching regret lower bound on the order of Ω(√(NKT)), for both piecewise-stationary multi-armed bandits and combinatorial semi-bandits, using information-theoretic techniques and judiciously constructed piecewise-stationary bandit instances. Our lower bound is tighter than the best available regret lower bound, which is Ω(√T). Numerical experiments on both synthetic and real-world datasets demonstrate the superiority of GLR-CUCB compared to other state-of-the-art algorithms.
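
To make the change-point detection idea concrete, here is a minimal sketch of a generalized likelihood ratio statistic for a single arm, assuming Bernoulli rewards in {0, 1}. It is not the GLR-CUCB implementation: the function names are hypothetical, and the constant `threshold` is an assumed stand-in for whatever detection threshold the paper actually uses.

```python
import math
from typing import List


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    tiny = 1e-12
    p = min(max(p, tiny), 1.0 - tiny)
    q = min(max(q, tiny), 1.0 - tiny)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def glr_statistic(rewards: List[float]) -> float:
    """Largest GLR statistic over all split points of the observed sequence."""
    n = len(rewards)
    total = sum(rewards)
    mean_all = total / n
    best = 0.0
    prefix = 0.0
    for s in range(1, n):
        prefix += rewards[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        # Compare "one mean before s, another after s" against "a single mean".
        stat = s * bernoulli_kl(mean_left, mean_all) + (n - s) * bernoulli_kl(
            mean_right, mean_all
        )
        best = max(best, stat)
    return best


def detect_change(rewards: List[float], threshold: float) -> bool:
    """Flag a change when the statistic reaches the (assumed, constant) threshold."""
    return len(rewards) >= 2 and glr_statistic(rewards) >= threshold


stationary = [1.0, 0.0] * 20          # mean stays at 0.5 throughout
shifted = [0.0] * 20 + [1.0] * 20     # mean jumps from 0 to 1 halfway
```

In a bandit loop, a detector of this kind would be run on each base arm's observed rewards, and a detection would trigger a restart of the underlying bandit algorithm; that wiring is omitted here.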

IJCAI Conference 2020 Conference Paper

Teacher-Student Networks with Multiple Decoders for Solving Math Word Problem

  • Jipeng Zhang
  • Roy Ka-Wei Lee
  • Ee-Peng Lim
  • Wei Qin
  • Lei Wang
  • Jie Shao
  • Qianru Sun

Math word problem (MWP) solving is challenging due to the limitation in training data, where only one “standard” solution is available. MWP models often simply fit this solution rather than truly understand or solve the problem. The generalization of models (to diverse word scenarios) is thus limited. To address this problem, this paper proposes a novel approach, TSN-MD, which leverages a teacher network to integrate the knowledge of equivalent solution expressions and then to regularize the learning behavior of the student network. In addition, we introduce a multiple-decoder student network that generates multiple candidate solution expressions, from which the final answer is chosen by voting. In experiments, we conduct extensive comparisons and ablation studies on two large-scale MWP benchmarks, and show that TSN-MD surpasses state-of-the-art works by a large margin. More intriguingly, the visualization results demonstrate that TSN-MD not only produces correct final answers but also generates diverse equivalent expressions of the solution.
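
The voting step over candidate solutions can be sketched as a simple majority vote. This is an illustration only: it assumes each decoder's expression has already been evaluated to a number (or `None` if decoding failed), and the function name, the rounding tolerance, and the tie-breaking rule are hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import List, Optional


def vote_answer(
    candidate_answers: List[Optional[float]], ndigits: int = 6
) -> Optional[float]:
    """Return the most common answer among candidates; None marks a failed decode."""
    # Round so that numerically equivalent answers fall into the same bucket.
    valid = [round(a, ndigits) for a in candidate_answers if a is not None]
    if not valid:
        return None
    # Counter.most_common breaks ties in favor of the answer seen first.
    return Counter(valid).most_common(1)[0][0]


majority = vote_answer([12.0, 12.0, 7.5])
no_valid = vote_answer([None, None])
near_equal = vote_answer([3.0, None, 3.0000000001])
```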

TIST Journal 2017 Journal Article

Modeling Topics and Behavior of Microbloggers

  • Tuan-Anh Hoang
  • Ee-Peng Lim

Microblogging encompasses both user-generated content and behavior. When modeling microblogging data, one has to consider personal and background topics, as well as how these topics generate the observed content and behavior. In this article, we propose the Generalized Behavior-Topic (GBT) model for simultaneously modeling background topics and users’ topical interest in microblogging data. GBT considers multiple topical communities (or realms) with different background topical interests while learning the personal topics of each user and the user’s dependence on realms to generate both content and behavior. This differentiates GBT from other previous works that consider either one realm only or content data only. By associating user behavior with the latent background and personal topics, GBT helps to model user behavior by the two types of topics. GBT also distinguishes itself from other earlier works by modeling multiple types of behavior together. Our experiments on two Twitter datasets show that GBT can effectively mine the representative topics for each realm. We also demonstrate that GBT significantly outperforms other state-of-the-art models in modeling content topics and user profiling.

AAAI Conference 2017 Conference Paper

Streaming Classification with Emerging New Class by Class Matrix Sketching

  • Xin Mu
  • Feida Zhu
  • Juan Du
  • Ee-Peng Lim
  • Zhi-Hua Zhou

Streaming classification with emerging new classes is an important problem that poses a significant research challenge and has practical value. In many real applications, the task must handle large matrices, such as textual data in the bag-of-words model and large-scale image analysis. However, existing solutions, most of which involve massive distance calculations, have so far fallen short of meeting real-time requirements. In this paper, the proposed method dynamically maintains two low-dimensional matrix sketches to 1) detect emerging new classes; 2) classify known classes; and 3) update the model in the data stream. The update efficiency is superior to that of existing methods. The empirical evaluation shows that the proposed method not only achieves comparable performance but also strengthens modelling on large-scale data sets.

AAAI Conference 2015 Conference Paper

Are Features Equally Representative? A Feature-Centric Recommendation

  • Chenyi Zhang
  • Ke Wang
  • Ee-Peng Lim
  • Qinneng Xu
  • Jianling Sun
  • Hongkun Yu

Typically a user prefers an item (e.g., a movie) because she likes certain features of the item (e.g., director, genre, producer). This observation motivates us to consider a feature-centric recommendation approach to item recommendation: instead of directly predicting the rating on items, we predict the rating on the features of items, and use such ratings to derive the rating on an item. This approach offers several advantages over the traditional item-centric approach: it incorporates more information about why a user chooses an item, it generalizes better due to the denser feature rating data, and it explains the prediction of item ratings through the predicted feature ratings. Another contribution is turning a principled item-centric solution into a feature-centric solution, instead of inventing a new algorithm that is feature-centric. This approach maximally leverages previous research. We demonstrate this approach by turning the traditional item-centric latent factor model into a feature-centric solution and demonstrate its superiority over item-centric approaches.

JMLR Journal 2014 Journal Article

Detecting Click Fraud in Online Advertising: A Data Mining Approach

  • Richard Oentaryo
  • Ee-Peng Lim
  • Michael Finegold
  • David Lo
  • Feida Zhu
  • Clifton Phua
  • Eng-Yeow Cheu
  • Ghim-Eng Yap

Click fraud--the deliberate clicking on advertisements with no real interest in the product or service offered--is one of the most daunting problems in online advertising. Building an effective fraud detection method is thus pivotal for online advertising businesses. We organized a Fraud Detection in Mobile Advertising (FDMA) 2012 Competition, opening the opportunity for participants to work on real-world fraud data from BuzzCity Pte. Ltd., a global mobile advertising company based in Singapore. In particular, the task is to identify fraudulent publishers who generate illegitimate clicks, and distinguish them from normal publishers. The competition was held from September 1 to September 30, 2012, attracting 127 teams from more than 15 countries. The mobile advertising data are unique and complex, involving heterogeneous information, noisy patterns with missing values, and highly imbalanced class distribution. The competition results provide a comprehensive study on the usability of data mining-based fraud detection approaches in practical settings. Our principal findings are that features derived from fine-grained time-series analysis are crucial for accurate fraud detection, and that ensemble methods offer promising solutions to highly imbalanced nonlinear classification tasks with mixed variable types and noisy/missing patterns. The competition data remain available for further studies at palanteer.sis.smu.edu.sg/fdma2012.

AAAI Conference 2014 Conference Paper

Lifetime Lexical Variation in Social Media

  • Lizi Liao
  • Jing Jiang
  • Ying Ding
  • Heyan Huang
  • Ee-Peng Lim

As the rapid growth of online social media attracts a large number of Internet users, the large volume of content generated by these users also provides us with an opportunity to study the lexical variation of people of different ages. In this paper, we present a latent variable model that jointly models the lexical content of tweets and Twitter users’ ages. Our model inherently assumes that a topic has not only a word distribution but also an age distribution. We propose a Gibbs-EM algorithm to perform inference on our model. Empirical evaluation shows that our model can learn meaningful age-specific topics such as “school” for teenagers and “health” for older people. Our model can also be used for age prediction and performs better than a number of baseline methods.

IS Journal 2005 Journal Article

Agents and Stream Data Mining: A New Perspective

  • Kok-Leong Ong
  • Zili Zhang
  • Wee-Keong Ng
  • Ee-Peng Lim

Many organizations struggle with the massive amount of data they collect. Today, data do more than serve as the ingredients for churning out statistical reports. They help support efficient operations in many organizations, and to some extent, they provide the competitive intelligence organizations need to survive in today's economy. Data mining can't always deliver timely and relevant results because data are constantly changing. However, stream-data processing might be more effective, judging by the Matrix project.