Arrow Research search

Author name cluster

Albert Bifet

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers

22

AAAI Conference 2026 Conference Paper

Binary Split Categorical Feature with Mean Absolute Error Criteria in CART

  • Peng Yu
  • Yike Chen
  • Chao Xu
  • Albert Bifet
  • Jesse Read

In the context of the Classification and Regression Trees (CART) algorithm, the efficient splitting of categorical features using standard criteria like GINI and Entropy is well-established. However, using the Mean Absolute Error (MAE) criterion for categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for MAE criteria. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.
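
For a regression leaf, the median is the constant prediction that minimises MAE, which is what makes categorical splits under the MAE criterion harder than under variance-based ones. A minimal pure-Python sketch (illustrative only, not the paper's algorithm) that evaluates binary categorical splits by brute force:

```python
from itertools import combinations
from statistics import median

def mae(values):
    # The median minimises mean absolute error for a constant prediction.
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)

def best_binary_split(targets_by_category):
    """Brute-force search over binary partitions of the categories,
    scoring each by weighted MAE. Exponential in the number of
    categories; an efficient splitting algorithm avoids this search."""
    cats = list(targets_by_category)
    n = sum(len(v) for v in targets_by_category.values())
    best_score, best_left = float("inf"), None
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            left_vals = [v for c in left for v in targets_by_category[c]]
            right_vals = [v for c in cats if c not in left
                          for v in targets_by_category[c]]
            score = (len(left_vals) * mae(left_vals)
                     + len(right_vals) * mae(right_vals)) / n
            if score < best_score:
                best_score, best_left = score, frozenset(left)
    return best_score, best_left
```

For example, with categories {"a": [1, 1, 2], "b": [10, 11], "c": [1, 2]}, isolating the outlying category "b" on one side gives the lowest weighted MAE.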

AAAI Conference 2026 Conference Paper

Salvador Urban Network Transportation (SUNT): A Landmark Spatiotemporal Dataset for Public Transportation (Abstract Reprint)

  • Marcos V. Ferreira
  • Matheus Souza
  • Tatiane N. Rios
  • Islame F. C. Fernandes
  • Jorge Nery
  • João Gama
  • Albert Bifet
  • Ricardo A. Rios

Efficient public transportation management is essential for the development of large urban centers, providing benefits such as comprehensive coverage of population mobility, reduced transport costs, better control of traffic congestion, and a significant reduction of environmental impact by limiting gas emissions and pollution. Realizing these benefits requires a deep understanding of population and transit patterns, and the adoption of approaches that model multiple relations and characteristics efficiently. This work addresses these challenges by providing a novel dataset that includes various public transportation components from three different systems: regular buses, subway, and BRT (Bus Rapid Transit). Our dataset comprises daily information from about 700,000 passengers in Salvador, one of Brazil’s largest cities, and local public transportation data with approximately 2,000 vehicles operating across nearly 400 lines, connecting almost 3,000 stops and stations. With data collected from March 2024 to March 2025 at intervals shorter than one minute, SUNT stands as one of the largest, most comprehensive, and openly available urban datasets in the literature.

AAAI Conference 2026 Conference Paper

Simulation-Driven Railway Delay Prediction: An Imitation Learning Approach

  • Clément Elliker
  • Jesse Read
  • Sonia Vanier
  • Albert Bifet

Reliable prediction of train delays is essential for enhancing the robustness and efficiency of railway transportation systems. In this work, we reframe delay forecasting as a stochastic simulation task, modeling state-transition dynamics through imitation learning. We introduce Drift-Corrected Imitation Learning (DCIL), a novel self-supervised algorithm that extends DAgger by incorporating distance-based drift correction, thereby mitigating covariate shift during rollouts without requiring access to an external oracle or adversarial schemes. Our approach synthesizes the dynamical fidelity of event-driven models with the representational capacity of data-driven methods, enabling uncertainty-aware forecasting via Monte Carlo simulation. We evaluate DCIL using a comprehensive real-world dataset from Infrabel, the Belgian railway infrastructure manager, which encompasses over three million train movements. Our results, focused on predictions up to 30 minutes ahead, demonstrate superior predictive performance of DCIL over traditional regression models and behavioral cloning on deep learning architectures, highlighting its effectiveness in capturing the sequential and uncertain nature of delay propagation in large-scale networks.

TMLR Journal 2025 Journal Article

BELLA: Black-box model Explanations by Local Linear Approximations

  • Nedeljko Radulovic
  • Albert Bifet
  • Fabian M. Suchanek

Understanding the decision-making process of black-box models has become not just a legal requirement, but also an additional way to assess their performance. However, the state of the art post-hoc explanation approaches for regression models rely on synthetic data generation, which introduces uncertainty and can hurt the reliability of the explanations. Furthermore, they tend to produce explanations that apply to only very few data points. In this paper, we present BELLA, a deterministic model-agnostic post-hoc approach for explaining the individual predictions of regression black-box models. BELLA provides explanations in the form of a linear model trained in the feature space. BELLA maximizes the size of the neighborhood to which the linear model applies so that the explanations are accurate, simple, general, and robust.

ICML Conference 2025 Conference Paper

Branches: Efficiently Seeking Optimal Sparse Decision Trees via AO

  • Ayman Chaouki
  • Jesse Read
  • Albert Bifet

Decision Tree (DT) Learning is a fundamental problem in Interpretable Machine Learning, yet it poses a formidable optimisation challenge. Practical algorithms have recently emerged, primarily leveraging Dynamic Programming and Branch & Bound. However, most of these approaches rely on a Depth-First-Search strategy, which is inefficient when searching for DTs at high depths and requires the definition of a maximum depth hyperparameter. Best-First-Search was also employed by other methods to circumvent these issues. The downside of this strategy is its higher memory consumption; as such, it must be designed efficiently, taking full advantage of the problem’s structure. We formulate the problem as an AND/OR graph search, which we solve with a novel AO*-type algorithm called Branches. We prove both optimality and complexity guarantees for Branches and show that it is more efficient than the state of the art, both theoretically and across a variety of experiments. Furthermore, unlike the other methods, Branches supports non-binary features; we show that this property can induce further gains in computational efficiency.

TMLR Journal 2025 Journal Article

Rethinking Memory in Continual Learning: Beyond a Monolithic Store of the Past

  • Yaqian Zhang
  • Bernhard Pfahringer
  • Eibe Frank
  • Albert Bifet

Memory is a critical component in replay-based continual learning (CL). Prior research has largely treated CL memory as a monolithic store of past data, focusing on how to select and store representative past examples. However, this perspective overlooks the higher-level memory architecture that governs the interaction between old and new data. In this work, we identify and characterize a dual-memory system that is inherently present in both online and offline CL settings. This system comprises: a short-term memory, which temporarily buffers recent data for immediate model updates, and a long-term memory, which maintains a carefully curated subset of past experiences for future replay and consolidation. We propose the memory capacity ratio (MCR), the ratio between short-term memory and long-term memory capacities, to characterize online and offline CL. Based on this framework, we systematically investigate how MCR influences generalization, stability, and plasticity. Across diverse CL settings (class-incremental, task-incremental, and domain-incremental) and multiple data modalities (e.g., image and text classification), we observe that a smaller MCR, characteristic of online CL, can yield comparable or even superior performance relative to a larger one, characteristic of offline CL, when both are evaluated under equivalent computational and data storage budgets. This advantage holds consistently across several state-of-the-art replay strategies, such as ER, DER, and SCR. Theoretical analysis further reveals that a reduced MCR yields a better trade-off between stability and plasticity by lowering a bound on generalization error when learning from non-stationary data streams with limited memory. These findings offer new insights into the role of memory allocation in continual learning and underscore the underexplored potential of online CL approaches.

ICML Conference 2024 Conference Paper

Online Isolation Forest

  • Filippo Leveni
  • Guilherme Weigert Cassales
  • Bernhard Pfahringer
  • Albert Bifet
  • Giacomo Boracchi

The anomaly detection literature abounds with offline methods, which require repeated access to data in memory and impose impractical assumptions when applied in a streaming context. Existing online anomaly detection methods also generally fail to address these constraints, resorting to periodic retraining to adapt to the online context. We propose Online-iForest, a novel method explicitly designed for streaming conditions that seamlessly tracks the data generating process as it evolves over time. Experimental validation on real-world datasets demonstrated that Online-iForest is on par with online alternatives and closely rivals state-of-the-art offline anomaly detection techniques that undergo periodic retraining. Notably, Online-iForest consistently outperforms all competitors in terms of efficiency, making it a promising solution in applications where fast identification of anomalies is of primary importance such as cybersecurity, fraud and fault detection.
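
The isolation idea behind any iForest variant can be shown in a few lines: points that random splits separate from the rest after only a few cuts are likely anomalies. A hedged one-dimensional sketch (the function names are illustrative; this is not the Online-iForest algorithm, just the core intuition it builds on):

```python
import random

def isolation_path_length(x, data, rng, depth=0, max_depth=12):
    """Depth at which point x is isolated by random splits of 1-D data.
    Anomalies tend to be separated after very few splits."""
    if depth >= max_depth or len(data) <= 1:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Recurse into the side of the split that contains x.
    side = ([v for v in data if v < split] if x < split
            else [v for v in data if v >= split])
    return isolation_path_length(x, side, rng, depth + 1, max_depth)

def mean_path_length(x, data, n_trees=200, seed=0):
    # Average over many random trees; LOWER means more anomalous.
    rng = random.Random(seed)
    return sum(isolation_path_length(x, data, rng)
               for _ in range(n_trees)) / n_trees
```

On a cluster near zero plus one far outlier, the outlier's average path length is markedly shorter than that of an inlier.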

IJCAI Conference 2024 Conference Paper

Recurrent Concept Drifts on Data Streams

  • Nuwan Gunasekara
  • Bernhard Pfahringer
  • Heitor Murilo Gomes
  • Albert Bifet
  • Yun Sing Koh

In an era where machine learning permeates every facet of human existence, and data evolves incessantly, the application of machine learning models transcends mere data processing. It involves navigating constant changes exemplified by the phenomenon of concept drift, which often affects model performance. These drifts can be recurrent due to the cyclic nature of the underlying data generation processes, which can be influenced by recurrent phenomena such as weather and time of day. Stream learning on data streams with recurrent concept drifts attempts to learn from such streams. This survey underscores the significance of the field and its practical applications, delving into nuanced definitions of machine learning for data streams afflicted by recurrent concept drifts. It explores diverse methodological approaches, elucidating their key design components. Additionally, it examines various evaluation techniques, benchmark datasets, and available software tailored for simulating and analysing data streams with recurrent concept drifts. Concluding, the survey offers insights into potential avenues for future research in the field.
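
The drift detection this line of work builds on can be illustrated with a toy two-window test: compare the mean of an older window against a recent one and flag drift when the gap exceeds a Hoeffding-style threshold. This sketch is hedged, with hypothetical names; it is in the spirit of detectors like ADWIN but is not any published algorithm:

```python
import math
from collections import deque

class WindowDriftDetector:
    """Toy detector for a bounded stream in [0, 1]: keeps an old and a
    recent window and flags drift when their means differ by more than
    a Hoeffding-style threshold."""
    def __init__(self, window=50, delta=0.01):
        self.old = deque(maxlen=window)
        self.new = deque(maxlen=window)
        self.delta = delta

    def update(self, x):
        if len(self.old) < self.old.maxlen:
            self.old.append(x)          # still filling the reference window
            return False
        self.new.append(x)
        if len(self.new) < self.new.maxlen:
            return False
        n = self.new.maxlen
        # With probability 1 - delta, a window mean deviates < eps from truth.
        eps = math.sqrt(math.log(2 / self.delta) / (2 * n))
        drift = abs(sum(self.old) / n - sum(self.new) / n) > 2 * eps
        if drift:
            # Recent data becomes the new reference after a drift.
            self.old, self.new = self.new, deque(maxlen=n)
        return drift
```

Feeding 50 zeros and then a run of ones triggers a detection once the recent window fills, while a constant stream never does.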

IJCAI Conference 2024 Conference Paper

Time-Evolving Data Science and Artificial Intelligence for Advanced Open Environmental Science (TAIAO) Programme

  • Yun Sing Koh
  • Albert Bifet
  • Karin Bryan
  • Guilherme Cassales
  • Olivier Graffeuille
  • Nick Lim
  • Phil Mourot
  • Ding Ning

New Zealand's unique ecosystems face increasing threats from climate change, impacting biodiversity and posing challenges to safety, livelihoods, and well-being. To tackle these complex issues, advanced data science and artificial intelligence techniques can provide unique solutions. Currently, in its fourth year of a seven-year program, TAIAO focuses on methods for analyzing environmental datasets. Recognizing this urgency, the open-source TAIAO platform was developed. This platform enables new artificial intelligence research for environmental data and offers an open-access repository to enhance reproducibility in the field. This paper will showcase four environmental case studies, artificial intelligence research, platform implementation details, and future development plans.

EWRL Workshop 2023 Workshop Paper

Online Decision Tree Construction with Deep Reinforcement Learning

  • Ayman Chaouki
  • Jesse Read
  • Albert Bifet

Decision Trees are widely used in machine learning and data mining tasks, mainly because they can be easily interpreted; due to their popularity, Decision Trees were adapted for settings with streams of data, commonly in the form of Hoeffding Trees. While these methods are fast and incremental, they are also greedy in the sense that they optimise multiple local criteria (generally based on Entropy or Gini impurity) which makes them prone to suboptimality with respect to a global objective metric. On the other hand, Reinforcement Learning (RL) aims at maximizing a long term objective, and as such, it is a good candidate for alleviating this suboptimality problem of the standard Decision Tree methods. In this work, we show that looking for the most accurate Decision Tree with the lowest depth is equivalent to solving an RL problem, then we implement Deep RL algorithms DQN, Double DQN and Advantage Actor-Critic to seek the optimal Decision Tree, this choice being motivated by the scalability of these methods to problems with large state spaces unlike Q-Learning. We compare these methods with Hoeffding Trees on real-world data sets and show that DQN and Double DQN perform best in general.
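
Hoeffding Trees take their name from the Hoeffding bound, which lets the greedy splitter decide, from a stream, when the locally best attribute is probably the truly best one. A short sketch of the bound and the resulting split test (the function names are illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability at least 1 - delta, the mean observed over n
    samples of a variable with range `value_range` is within eps of
    the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n):
    # A Hoeffding tree splits once the gap between the best and
    # second-best attribute's criterion exceeds eps.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

Because eps shrinks as n grows, a clear winner triggers a split quickly, while close contenders force the tree to wait for more data.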

IJCAI Conference 2023 Conference Paper

Survey on Online Streaming Continual Learning

  • Nuwan Gunasekara
  • Bernhard Pfahringer
  • Heitor Murilo Gomes
  • Albert Bifet

Stream Learning (SL) attempts to learn from a data stream efficiently. A data stream learning algorithm should adapt to input data distribution shifts without sacrificing accuracy. These distribution shifts are known as "concept drifts" in the literature. SL provides many supervised, semi-supervised, and unsupervised methods for detecting and adjusting to concept drift. On the other hand, Continual Learning (CL) attempts to preserve previous knowledge while performing well on the current concept when confronted with concept drift. In Online Continual Learning (OCL), this learning happens online. This survey explores the intersection of those two online learning paradigms to find synergies. We identify this intersection as Online Streaming Continual Learning (OSCL). The study starts with a gentle introduction to SL and then explores CL. Next, it explores OSCL from SL and OCL perspectives to point out new research trends and give directions for future research.

NeurIPS Conference 2022 Conference Paper

A simple but strong baseline for online continual learning: Repeated Augmented Rehearsal

  • Yaqian Zhang
  • Bernhard Pfahringer
  • Eibe Frank
  • Albert Bifet
  • Nick Jin Sean Lim
  • Yunzhe Jia

Online continual learning (OCL) aims to train neural networks incrementally from a non-stationary data stream with a single pass through data. Rehearsal-based methods attempt to approximate the observed input distributions over time with a small memory and revisit them later to avoid forgetting. Despite their strong empirical performance, rehearsal methods still suffer from a poor approximation of past data’s loss landscape with memory samples. This paper revisits the rehearsal dynamics in online settings. We provide theoretical insights on the inherent memory overfitting risk from the viewpoint of biased and dynamic empirical risk minimization, and examine the merits and limits of repeated rehearsal. Inspired by our analysis, a simple and intuitive baseline, repeated augmented rehearsal (RAR), is designed to address the underfitting-overfitting dilemma of online rehearsal. Surprisingly, across four rather different OCL benchmarks, this simple baseline outperforms vanilla rehearsal by 9%–17% and also significantly improves the state-of-the-art rehearsal-based methods MIR, ASER, and SCR. We also demonstrate that RAR successfully achieves an accurate approximation of the loss landscape of past data and high-loss ridge aversion in its learning trajectory. Extensive ablation studies are conducted to study the interplay between repeated and augmented rehearsal, and reinforcement learning (RL) is applied to dynamically adjust the hyperparameters of RAR to balance the stability-plasticity trade-off online.
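
A rehearsal memory of the kind discussed here is often filled with reservoir sampling, which keeps a uniform sample of the whole stream in bounded memory. A hedged sketch (a common pattern for replay buffers, not the paper's specific method):

```python
import random

class ReservoirBuffer:
    """Bounded replay memory: reservoir sampling keeps each item seen
    so far with equal probability capacity / seen."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw a rehearsal mini-batch from the buffer.
        return self.rng.sample(self.items, min(k, len(self.items)))
```

Repeatedly sampling mini-batches from such a buffer for several update steps per incoming batch is the "repeated rehearsal" the abstract analyses.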

NeurIPS Conference 2022 Conference Paper

Linear tree shap

  • Peng Yu
  • Albert Bifet
  • Jesse Read
  • Chao Xu

Decision trees are well-known due to their ease of interpretability. To improve accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefits. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. They provide a linear weighting of features, independent of the tree structure. The rise in popularity is mainly due to TreeShap, which solves a general exponential complexity problem in polynomial time. Following extensive adoption in the industry, more efficient algorithms are required. This paper presents a more efficient and straightforward algorithm: Linear TreeShap. Like TreeShap, Linear TreeShap is exact and requires the same amount of memory.
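
The "general exponential complexity problem" is the Shapley value's definition itself, which averages a feature's marginal contribution over all subsets of the other features. A brute-force sketch of that definition (illustrative; TreeShap and Linear TreeShap exist precisely to avoid this enumeration for trees):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values of a cooperative game `value` (a function
    from a set of features to a number), by enumerating all subsets.
    Exponential in len(features)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(len(others) + 1):
            for s in combinations(others, r):
                # Classic Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (value(frozenset(s) | {f}) - value(frozenset(s)))
        phi[f] = total
    return phi
```

For an additive game the Shapley value of each feature is exactly its own contribution; for a pure two-feature interaction the credit is split evenly.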

JMLR Journal 2021 Journal Article

River: machine learning for streaming data in Python

  • Jacob Montiel
  • Max Halford
  • Saulo Martiello Mastelini
  • Geoffrey Bolmier
  • Raphael Sourty
  • Robin Vaysse
  • Adil Zouitine
  • Heitor Murilo Gomes
  • Jesse Read
  • Talel Abdessalem
  • Albert Bifet

River is a machine learning library for dynamic data streams and continual learning. It provides multiple state-of-the-art learning methods, data generators/transformers, performance metrics and evaluators for different stream learning problems. It is the result of the merger of two popular packages for stream learning in Python: Creme and scikit-multiflow. River introduces a revamped architecture based on the lessons learnt from these seminal packages. River's ambition is to be the go-to library for doing machine learning on streaming data. Additionally, this open source package brings under the same umbrella a large community of practitioners and researchers. The source code is available at https://github.com/online-ml/river.
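
Evaluators in stream learning libraries like River typically follow the prequential ("test-then-train") protocol: each example is first used to test the model, then to train it, so every prediction is made on unseen data. A dependency-free sketch of the protocol with a deliberately trivial model (this mirrors the idea, not River's actual API):

```python
class MajorityClass:
    """Trivial online classifier: predicts the most frequent label so far."""
    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential_accuracy(model, stream):
    """Test-then-train loop over (x, y) pairs: predict first, learn second."""
    correct = total = 0
    for x, y in stream:
        if model.predict_one(x) == y:   # test on the unseen example
            correct += 1
        model.learn_one(x, y)           # then train on it
        total += 1
    return correct / total if total else 0.0
```

Any model exposing `predict_one`/`learn_one` can be dropped into the same loop, which is what makes the protocol a natural backbone for a streaming library.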

ECAI Conference 2020 Conference Paper

Compressed k-Nearest Neighbors Ensembles for Evolving Data Streams

  • Maroua Bahri
  • Albert Bifet
  • Silviu Maniu
  • Rodrigo Fernandes de Mello
  • Nikolaos Tziortziotis

The unbounded and multidimensional nature, the evolution of data distributions with time, and the requirement of single-pass algorithms comprise the main challenges of data stream classification, which makes it impossible to infer learning models in the same manner as for batch scenarios. Data dimensionality reduction arises as a key factor to transform and select only the most relevant features from those streams in order to reduce algorithm space and time demands. In that context, Compressed Sensing (CS) encodes an input signal into lower-dimensional space, guaranteeing its reconstruction up to some distortion factor ϵ. This paper employs CS on data streams as a pre-processing step to support a k-Nearest Neighbors (kNN) classification algorithm, one of the most often used algorithms in the data stream mining area – all this while ensuring the key properties of CS hold. Based on topological properties, we show that our classification algorithm also preserves the neighborhood (within an ϵ factor) of kNN after reducing the stream dimensionality with CS. As a consequence, end-users can set an acceptable error margin while performing such projections for kNN. For further improvements, we incorporate this method into an ensemble classifier, Leveraging Bagging, by combining a set of different CS matrices which increases the diversity inside the ensemble. An extensive set of experiments is performed on various datasets, and the results are compared against those yielded by current state-of-the-art approaches, confirming the good performance of our approaches.
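
The reason a projected kNN can still work is that random sketching matrices approximately preserve pairwise Euclidean distances. A hedged illustration with a plain Gaussian random projection (a Johnson–Lindenstrauss-style sketch, not the paper's CS construction; the tolerance in the check is deliberately loose because preservation is probabilistic):

```python
import math
import random

def random_projection(k, d, seed=0):
    """k x d Gaussian sketching matrix, scaled by 1/sqrt(k) so that
    Euclidean distances are preserved in expectation."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
            for _ in range(k)]

def project(R, x):
    # Matrix-vector product: the k-dimensional sketch of x.
    return [sum(r * xi for r, xi in zip(row, x)) for row in R]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Since nearest neighbours are determined by distances alone, approximately preserving distances is enough for kNN to keep (within an ϵ factor) the same neighbourhoods in the sketched space.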

IJCAI Conference 2020 Conference Paper

Survey on Feature Transformation Techniques for Data Streams

  • Maroua Bahri
  • Albert Bifet
  • Silviu Maniu
  • Heitor Murilo Gomes

Mining high-dimensional data streams poses a fundamental challenge to machine learning as the presence of high numbers of attributes can remarkably degrade any mining task's performance. In the past several years, dimension reduction (DR) approaches have been successfully applied for different purposes (e.g., visualization). Due to their high-computational costs and numerous passes over large data, these approaches pose a hindrance when processing infinite data streams that are potentially high-dimensional. The latter increases the resource-usage of algorithms that could suffer from the curse of dimensionality. To cope with these issues, some techniques for incremental DR have been proposed. In this paper, we provide a survey on reduction approaches designed to handle data streams and highlight the key benefits of using these approaches for stream mining algorithms.

JMLR Journal 2018 Journal Article

Scikit-Multiflow: A Multi-output Streaming Framework

  • Jacob Montiel
  • Jesse Read
  • Albert Bifet
  • Talel Abdessalem

scikit-multiflow is a framework for learning from data streams and multi-output learning in Python. Conceived to serve as a platform to encourage the democratization of stream learning research, it provides multiple state-of-the-art learning methods, data generators and evaluators for different stream learning problems, including single-output, multi-output and multi-label. scikit-multiflow builds upon popular open source frameworks including scikit-learn, MOA and MEKA. Development follows the FOSS principles. Quality is enforced by complying with PEP8 guidelines, using continuous integration and functional testing.

JMLR Journal 2015 Journal Article

SAMOA: Scalable Advanced Massive Online Analysis

  • Gianmarco De Francisci Morales
  • Albert Bifet

SAMOA (Scalable Advanced Massive Online Analysis) is a platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. SAMOA is written in Java, is open source, and is available at samoa-project.net under the Apache Software License version 2.0.

ECAI Conference 2014 Conference Paper

Random Forests of Very Fast Decision Trees on GPU for Mining Evolving Big Data Streams

  • Diego Marron
  • Albert Bifet
  • Gianmarco De Francisci Morales

Random Forest is a classical ensemble method used to improve the performance of single tree classifiers. It is able to obtain superior performance by increasing the diversity of the single classifiers. However, in the more challenging context of evolving data streams, the classifier has also to be adaptive and work under very strict constraints of space and time. Furthermore, the computational load of using a large number of classifiers can make its application extremely expensive.

TIST Journal 2012 Journal Article

Ensembles of Restricted Hoeffding Trees

  • Albert Bifet
  • Eibe Frank
  • Geoff Holmes
  • Bernhard Pfahringer

The success of simple methods for classification shows that it is often not necessary to model complex attribute interactions to obtain good classification accuracy on practical problems. In this article, we propose to exploit this phenomenon in the data stream context by building an ensemble of Hoeffding trees that are each limited to a small subset of attributes. In this way, each tree is restricted to model interactions between attributes in its corresponding subset. Because it is not known a priori which attribute subsets are relevant for prediction, we build exhaustive ensembles that consider all possible attribute subsets of a given size. As the resulting Hoeffding trees are not all equally important, we weigh them in a suitable manner to obtain accurate classifications. This is done by combining the log-odds of their probability estimates using sigmoid perceptrons, with one perceptron per class. We propose a mechanism for setting the perceptrons’ learning rate using a change detection method for data streams, which is also used to reset ensemble members (i.e., Hoeffding trees) when they no longer perform well. Our experiments show that the resulting ensemble classifier outperforms bagging for data streams in terms of accuracy when both are used in conjunction with adaptive naive Bayes Hoeffding trees, at the expense of runtime and memory consumption. We also show that our stacking method can improve the performance of a bagged ensemble.

ECAI Conference 2010 Conference Paper

GNUsmail: Open Framework for On-line Email Classification

  • José M. Carmona-Cejudo
  • Manuel Baena-García
  • José del Campo-Ávila
  • Rafael Morales Bueno
  • Albert Bifet

Real-time classification of massive email data is a challenging task that presents its own particular difficulties. Since email data presents an important temporal component, several problems arise: emails arrive continuously, and the criteria used to classify those emails can change, so the learning algorithms have to be able to deal with concept drift. Our problem is more general than spam detection, which has received much more attention in the literature.

JMLR Journal 2010 Journal Article

MOA: Massive Online Analysis

  • Albert Bifet
  • Geoff Holmes
  • Richard Kirkby
  • Bernhard Pfahringer

Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA includes a collection of offline and online methods as well as tools for evaluation. In particular, it implements boosting, bagging, and Hoeffding Trees, all with and without Naïve Bayes classifiers at the leaves. MOA supports bi-directional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license.