Arrow Research search

Author name cluster

Gal Chechik

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

66 papers
2 author rows

Possible papers

66

AAAI Conference 2026 Conference Paper

Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions

  • Guy Bar-Shalom
  • Fabrizio Frasca
  • Derek Lim
  • Yoav Gelberg
  • Yftah Ziser
  • Ran El-Yaniv
  • Gal Chechik
  • Haggai Maron

The automated detection of hallucinations and training data contamination is pivotal to the safe deployment of Large Language Models (LLMs). These tasks are particularly challenging in settings where no access to model internals is available. Current approaches in this setup typically leverage only the probabilities of actual tokens in the text, relying on simple task-specific heuristics. Crucially, they overlook the information contained in the full sequence of next-token probability distributions. We propose to go beyond hand-crafted decision rules by learning directly from the complete observable output of LLMs — consisting not only of next-token probabilities, but also the full sequence of next-token distributions. We refer to this as the LLM Output Signature (LOS), and treat it as a reference data type for detecting hallucinations and data contamination. To that end, we introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LOS, which can provably approximate a broad class of existing techniques for both tasks. Empirically, LOS-Net achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency. Furthermore, it demonstrates promising transfer capabilities across datasets and LLMs.

ICLR Conference 2025 Conference Paper

Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models

  • Yoad Tewel
  • Rinon Gal
  • Dvir Samuel
  • Yuval Atzmon
  • Lior Wolf
  • Gal Chechik

Adding Object into images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.

ICML Conference 2025 Conference Paper

IT3: Idempotent Test-Time Training

  • Nikita Durasov
  • Assaf Shocher
  • Doruk Öner
  • Gal Chechik
  • Alexei A. Efros
  • Pascal Fua

Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. While existing approaches like domain adaptation and test-time training (TTT) offer partial solutions, they typically require additional data or domain-specific auxiliary tasks. We present Idempotent Test-Time Training (IT3), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance, without any auxiliary task design. Our key insight is that enforcing idempotence—where repeated applications of a function yield the same result—can effectively replace domain-specific auxiliary tasks used in previous TTT methods. We theoretically connect idempotence to prediction confidence and demonstrate that minimizing the distance between successive applications of our model during inference leads to improved out-of-distribution performance. Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT3 consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.

ICLR Conference 2025 Conference Paper

Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models

  • Dvir Samuel
  • Barak Meiri
  • Haggai Maron
  • Yoad Tewel
  • Nir Darshan
  • Shai Avidan
  • Gal Chechik
  • Rami Ben-Ari

Diffusion inversion is the problem of taking an image and a text prompt that describes it and finding a noise latent that would generate the exact same image. Most current deterministic inversion techniques operate by approximately solving an implicit equation and may converge slowly or yield poor reconstructed images. We formulate the problem by finding the roots of an implicit equation and devlop a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. We show that a vanilla application of NR is computationally infeasible while naively transforming it to a computationally tractable alternative tends to converge to out-of-distribution solutions, resulting in poor reconstruction and editing. We therefore derive an efficient guided formulation that fastly converges and provides high-quality reconstructions and editing. We showcase our method on real image editing with three popular open-sourced diffusion models: Stable Diffusion, SDXL-Turbo, and Flux with different deterministic schedulers. Our solution, **Guided Newton-Raphson Inversion**, inverts an image within 0.4 sec (on an A100 GPU) for few-step models (SDXL-Turbo and Flux.1), opening the door for interactive image editing. We further show improved results in image interpolation and generation of rare objects.

ICML Conference 2025 Conference Paper

Policy Gradient with Tree Expansion

  • Gal Dalal
  • Assaf Hallak
  • Gugan Thoppe
  • Shie Mannor
  • Gal Chechik

Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax—a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.

NeurIPS Conference 2025 Conference Paper

Policy Optimized Text-to-Image Pipeline Design

  • Uri Gadot
  • Rinon Gal
  • Yftah Ziser
  • Gal Chechik
  • Shie Mannor

Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines that combine various enhancement tools. While these pipelines significantly improve image quality, their effective design requires substantial expertise. Recent approaches automating this process through large language models (LLMs) have shown promise but suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow prediction training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.

ICML Conference 2024 Conference Paper

Bayesian Uncertainty for Gradient Aggregation in Multi-Task Learning

  • Idan Achituve
  • Idit Diamant
  • Arnon Netzer
  • Gal Chechik
  • Ethan Fetaya

As machine learning becomes more prominent there is a growing demand to perform several inference tasks in parallel. Multi-task learning (MTL) addresses this challenge by learning a single model that solves several tasks simultaneously and efficiently. Often optimizing MTL models entails first computing the gradient of the loss for each task, and then aggregating all the gradients to obtain a combined update direction. However, common methods following this approach do not consider an important aspect, the sensitivity in the dimensions of the gradients. Some dimensions may be more lenient for changes while others may be more restrictive. Here, we introduce a novel gradient aggregation procedure using Bayesian inference. We place a probability distribution over the task-specific parameters, which in turn induce a distribution over the gradients of the tasks. This valuable information allows us to quantify the uncertainty associated with each of the gradients’ dimensions which is factored in when aggregating them. We empirically demonstrate the benefits of our approach in a variety of datasets, achieving state-of-the-art performance.

ICML Conference 2024 Conference Paper

Equivariant Deep Weight Space Alignment

  • Aviv Navon
  • Aviv Shamsian
  • Ethan Fetaya
  • Gal Chechik
  • Nadav Dym
  • Haggai Maron

Permutation symmetries of deep networks make basic operations like model merging and similarity estimation challenging. In many cases, aligning the weights of the networks, i. e. , finding optimal permutations between their weights, is necessary. Unfortunately, weight alignment is an NP-hard problem. Prior research has mainly focused on solving relaxed versions of the alignment problem, leading to either time-consuming methods or sub-optimal solutions. To accelerate the alignment process and improve its quality, we propose a novel framework aimed at learning to solve the weight alignment problem, which we name Deep-Align. To that end, we first prove that weight alignment adheres to two fundamental symmetries and then, propose a deep architecture that respects these symmetries. Notably, our framework does not require any labeled data. We provide a theoretical analysis of our approach and evaluate Deep-Align on several types of network architectures and learning setups. Our experimental results indicate that a feed-forward pass with Deep-Align produces better or equivalent alignments compared to those produced by current optimization algorithms. Additionally, our alignments can be used as an effective initialization for other methods, leading to improved solutions with a significant speedup in convergence.

AAAI Conference 2024 Conference Paper

Generating Images of Rare Concepts Using Pre-trained Diffusion Models

  • Dvir Samuel
  • Rami Ben-Ari
  • Simon Raviv
  • Nir Darshan
  • Gal Chechik

Text-to-image diffusion models can synthesize high quality images, but they have various limitations. Here we highlight a common failure mode of these models, namely, generating uncommon concepts and structured concepts like hand palms. We show that their limitation is partly due to the long-tail nature of their training data: web-crawled data sets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. We characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be correctly generated by carefully selecting suitable generation seeds in the noise space, using a small reference set of images, a technique that we call SeedSelect. SeedSelect does not require retraining or finetuning the diffusion model. We assess the faithfulness, quality and diversity of SeedSelect in creating rare objects and generating complex formations like hand images, and find it consistently achieves superior performance. We further show the advantage of SeedSelect in semantic data augmentation. Generating semantically appropriate images can successfully improve performance in few-shot recognition benchmarks, for classes from the head and from the tail of the training data of diffusion models.

ICML Conference 2024 Conference Paper

Improved Generalization of Weight Space Networks via Augmentations

  • Aviv Shamsian
  • Aviv Navon
  • David W. Zhang
  • Yan Zhang
  • Ethan Fetaya
  • Gal Chechik
  • Haggai Maron

Learning in deep weight spaces (DWS), where neural networks process the weights of other neural networks, is an emerging research direction, with applications to 2D and 3D neural fields (INRs, NeRFs), as well as making inferences about other types of neural networks. Unfortunately, weight space models tend to suffer from substantial overfitting. We empirically analyze the reasons for this overfitting and find that a key reason is the lack of diversity in DWS datasets. While a given object can be represented by many different weight configurations, typical INR training sets fail to capture variability across INRs that represent the same object. To address this, we explore strategies for data augmentation in weight spaces and propose a MixUp method adapted for weight spaces. We demonstrate the effectiveness of these methods in two setups. In classification, they improve performance similarly to having up to 10 times more data. In self-supervised contrastive learning, they yield substantial 5-10% gains in downstream classification.

NeurIPS Conference 2024 Conference Paper

Where's Waldo: Diffusion Features For Personalized Segmentation and Retrieval

  • Dvir Samuel
  • Rami Ben-Ari
  • Matan Levy
  • Nir Darshan
  • Gal Chechik

Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been introduced to these tasks showing comparable results to supervised methods. However, a significant flaw in these models is evident: they struggle to locate a desired instance when other instances within the same class are presented. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM for Personalized Diffusion Features Matching, that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance and segmentation datasets and propose new benchmarks for these tasks.

ICLR Conference 2023 Conference Paper

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

  • Rinon Gal
  • Yuval Alaluf
  • Yuval Atzmon
  • Or Patashnik
  • Amit H. Bermano
  • Gal Chechik
  • Daniel Cohen-Or

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn *our* cat into a painting, or imagine a new product based on *our* favorite toy? Here we present a simple approach that allows such creative freedom. Using only $3$-$5$ images of a user-provided concept, like an object or a style, we learn to represent it through new ``words" in the embedding space of a frozen text-to-image model. These ``words" can be composed into natural language sentences, guiding *personalized* creation in an intuitive way. Notably, we find evidence that a *single* word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available.

ICML Conference 2023 Conference Paper

Auxiliary Learning as an Asymmetric Bargaining Game

  • Aviv Shamsian
  • Aviv Navon
  • Neta Glazer
  • Kenji Kawaguchi
  • Gal Chechik
  • Ethan Fetaya

Auxiliary learning is an effective method for enhancing the generalization capabilities of trained models, particularly when dealing with small datasets. However, this approach may present several difficulties: (i) optimizing multiple objectives can be more challenging, and (ii) how to balance the auxiliary tasks to best assist the main task is unclear. In this work, we propose a novel approach, named AuxiNash, for balancing tasks in auxiliary learning by formalizing the problem as generalized bargaining game with asymmetric task bargaining power. Furthermore, we describe an efficient procedure for learning the bargaining power of tasks based on their contribution to the performance of the main task and derive theoretical guarantees for its convergence. Finally, we evaluate AuxiNash on multiple multi-task benchmarks and find that it consistently outperforms competing methods.

ICML Conference 2023 Conference Paper

Equivariant Architectures for Learning in Deep Weight Spaces

  • Aviv Navon
  • Aviv Shamsian
  • Idan Achituve
  • Ethan Fetaya
  • Gal Chechik
  • Haggai Maron

Designing machine learning architectures for processing neural networks in their raw weight matrix form is a newly introduced research direction. Unfortunately, the unique symmetry structure of deep weight spaces makes this design very challenging. If successful, such architectures would be capable of performing a wide range of intriguing tasks, from adapting a pre-trained network to a new domain to editing objects represented as functions (INRs or NeRFs). As a first step towards this goal, we present here a novel network architecture for learning in deep weight spaces. It takes as input a concatenation of weights and biases of a pre-trained MLP and processes it using a composition of layers that are equivariant to the natural permutation symmetry of the MLP’s weights: Changing the order of neurons in intermediate layers of the MLP does not affect the function it represents. We provide a full characterization of all affine equivariant and invariant layers for these symmetries and show how these layers can be implemented using three basic operations: pooling, broadcasting, and fully connected layers applied to the input in an appropriate manner. We demonstrate the effectiveness of our architecture and its advantages over natural baselines in a variety of learning tasks.

ICML Conference 2023 Conference Paper

Graph Positional Encoding via Random Feature Propagation

  • Moshe Eliasof
  • Fabrizio Frasca
  • Beatrice Bevilacqua
  • Eran Treister
  • Gal Chechik
  • Haggai Maron

Two main families of node feature augmentation schemes have been explored for enhancing GNNs: random features and spectral positional encoding. Surprisingly, however, there is still no clear understanding of the relation between these two augmentation schemes. Here we propose a novel family of positional encoding schemes which draws a link between the above two approaches and improves over both. The new approach, named Random Feature Propagation (RFP), is inspired by the power iteration method and its generalizations. It concatenates several intermediate steps of an iterative algorithm for computing the dominant eigenvectors of a propagation matrix, starting from random node features. Notably, these propagation steps are based on graph-dependent propagation operators that can be either predefined or learned. We explore the theoretical and empirical benefits of RFP. First, we provide theoretical justifications for using random features, for incorporating early propagation steps, and for using multiple random initializations. Then, we empirically demonstrate that RFP significantly outperforms both spectral PE and random features in multiple node classification and graph classification benchmarks.

UAI Conference 2023 Conference Paper

Guided Deep Kernel Learning

  • Idan Achituve
  • Gal Chechik
  • Ethan Fetaya

Combining Gaussian processes with the expressive power of deep neural networks is commonly done nowadays through deep kernel learning (DKL). Unfortunately, due to the kernel optimization process, this often results in losing their Bayesian benefits. In this study, we present a novel approach for learning deep kernels by utilizing infinite-width neural networks. We propose to use the Neural Network Gaussian Process (NNGP) model as a guide to the DKL model in the optimization process. Our approach harnesses the reliable uncertainty estimation of the NNGPs to adapt the DKL target confidence when it encounters novel data points. As a result, we get the best of both worlds, we leverage the Bayesian behavior of the NNGP, namely its robustness to overfitting, and accurate uncertainty estimation, while maintaining the generalization abilities, scalability, and flexibility of deep kernels. Empirically, we show on multiple benchmark datasets of varying sizes and dimensionality, that our method is robust to overfitting, has good predictive performance, and provides reliable uncertainty estimations.

ICML Conference 2023 Conference Paper

Learning to Initiate and Reason in Event-Driven Cascading Processes

  • Yuval Atzmon
  • Eli A. Meirom
  • Shie Mannor
  • Gal Chechik

Training agents to control a dynamic environment is a fundamental task in AI. In many environments, the dynamics can be summarized by a small set of events that capture the semantic behavior of the system. Typically, these events form chains or cascades. We often wish to change the system behavior using a single intervention that propagates through the cascade. For instance, one may trigger a biochemical cascade to switch the state of a cell or, in logistics, reroute a truck to meet an unexpected, urgent delivery. We introduce a new supervised learning setup called Cascade. An agent observes a system with known dynamics evolving from some initial state. The agent is given a structured semantic instruction and needs to make an intervention that triggers a cascade of events, such that the system reaches an alternative (counterfactual) behavior. We provide a test-bed for this problem, consisting of physical objects. We combine semantic tree search with an event-driven forward model and devise an algorithm that learns to efficiently search in exponentially large semantic trees. We demonstrate that our approach learns to follow instructions to intervene in new complex scenes. When provided with an observed cascade of events, it can also reason about alternative outcomes.

NeurIPS Conference 2023 Conference Paper

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

  • Royi Rassin
  • Eran Hirsch
  • Daniel Glickman
  • Shauli Ravfogel
  • Yoav Goldberg
  • Gal Chechik

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one example, a query like ``a pink sunflower and a yellow flamingo'' may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrate significant improvements of SynGen compared with current state of the art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.

NeurIPS Conference 2023 Conference Paper

Norm-guided latent space exploration for text-to-image generation

  • Dvir Samuel
  • Rami Ben-Ari
  • Nir Darshan
  • Haggai Maron
  • Gal Chechik

Text-to-image diffusion models show great potential in synthesizing a large variety of concepts in new compositions and scenarios. However, the latent space of initial seeds is still not well understood and its structure was shown to impact the generation of various concepts. Specifically, simple operations like interpolation and finding the centroid of a set of seeds perform poorly when using standard Euclidean or spherical metrics in the latent space. This paper makes the observation that, in current training procedures, diffusion models observed inputs with a narrow range of norm values. This has strong implications for methods that rely on seed manipulation for image generation, with applications to few-shot and long-tail learning tasks. To address this issue, we propose a novel method for interpolating between two seeds and demonstrate that it defines a new non-Euclidean metric that takes into account a norm-based prior on seeds. We describe a simple yet efficient algorithm for approximating this interpolation procedure and use it to further define centroids in the latent seed space. We show that our new interpolation and centroid techniques significantly enhance the generation of rare concept images. This further leads to state-of-the-art performance on few-shot and long-tail benchmarks, improving prior approaches in terms of generation speed, image quality, and semantic content.

AAAI Conference 2023 Conference Paper

Planning and Learning with Adaptive Lookahead

  • Aviv Rosenberg
  • Assaf Hallak
  • Shie Mannor
  • Gal Chechik
  • Gal Dalal

Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.

NeurIPS Conference 2023 Conference Paper

Point Cloud Completion with Pretrained Text-to-Image Diffusion Models

  • Yoni Kasten
  • Ohad Rahamim
  • Gal Chechik

Point cloud data collected in real-world applications are often incomplete. This is because they are observed from partial viewpoints, which capture only a specific perspective or angle, or due to occlusion and low resolution. Existing completion approaches rely on datasets of specific predefined objects to guide the completion of incomplete, and possibly noisy, point clouds. However, these approaches perform poorly with Out-Of-Distribution (OOD) objects, which are either absent from the dataset or poorly represented. In recent years, the field of text-guided image generation has made significant progress, leading to major breakthroughs in text guided shape generation. We describe an approach called SDS-Complete that uses a pre-trained text-to-image diffusion model and leverages the text semantic of a given incomplete point cloud of an object, to obtain a complete surface representation. SDS-Complete can complete a variety of objects at test time optimization without the need for an expensive collection of 3D information. We evaluate SDS-Complete on incomplete scanned objects, captured by real-world depth sensors and LiDAR scanners, and demonstrate that is effective in handling objects which are typically absent from common datasets.

EWRL Workshop 2023 Workshop Paper

Train Hard, Fight Easy: Robust Meta Reinforcement Learning

  • Ido Greenberg
  • Shie Mannor
  • Gal Chechik
  • Eli Meirom

A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Disturbingly, optimization of analogous robust objectives in RL is known to lead to both *biased gradients* and *data inefficiency*. The gradient bias is proven to disappear in MRL, which further motivates the proposed framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.

NeurIPS Conference 2023 Conference Paper

Train Hard, Fight Easy: Robust Meta Reinforcement Learning

  • Ido Greenberg
  • Shie Mannor
  • Gal Chechik
  • Eli Meirom

A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability since test tasks are not known in advance. In this work, we define a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL is known to lead to both biased gradients and data inefficiency. We prove that the gradient bias disappears in our proposed MRL framework. The data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML achieves robust returns on multiple navigation and continuous control benchmarks.

ICML Conference 2022 Conference Paper

Multi-Task Learning as a Bargaining Game

  • Aviv Navon
  • Aviv Shamsian
  • Idan Achituve
  • Haggai Maron
  • Kenji Kawaguchi
  • Gal Chechik
  • Ethan Fetaya

In Multi-task learning (MTL), a joint model is trained to simultaneously make predictions for several tasks. Joint training reduces computation costs and improves data efficiency; however, since the gradients of these different tasks may conflict, training a joint model for MTL often yields lower performance than its corresponding single-task counterparts. A common method for alleviating this issue is to combine per-task gradients into a joint update direction using a particular heuristic. In this paper, we propose viewing the gradients combination step as a bargaining game, where tasks negotiate to reach an agreement on a joint direction of parameter update. Under certain assumptions, the bargaining problem has a unique solution, known as the Nash Bargaining Solution, which we propose to use as a principled approach to multi-task learning. We describe a new MTL optimization procedure, Nash-MTL, and derive theoretical guarantees for its convergence. Empirically, we show that Nash-MTL achieves state-of-the-art results on multiple MTL benchmarks in various domains.

ICLR Conference 2022 Conference Paper

On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning

  • Guy Tennenholtz
  • Assaf Hallak
  • Gal Dalal
  • Shie Mannor
  • Gal Chechik
  • Uri Shalit

We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward and propose an adjustment of standard imitation learning algorithms to fit this setup. In addition, we discuss the problem of distribution shift between the expert data and the online environment when partial observability is present in the data. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.

ICML Conference 2022 Conference Paper

Optimizing Tensor Network Contraction Using Reinforcement Learning

  • Eli A. Meirom
  • Haggai Maron
  • Shie Mannor
  • Gal Chechik

Quantum Computing (QC) stands to revolutionize computing, but is currently still limited. To develop and test quantum algorithms today, quantum circuits are often simulated on classical computers. Simulating a complex quantum circuit requires computing the contraction of a large network of tensors. The order (path) of contraction can have a drastic effect on the computing cost, but finding an efficient order is a challenging combinatorial optimization problem. We propose a Reinforcement Learning (RL) approach combined with Graph Neural Networks (GNN) to address the contraction ordering problem. The problem is extremely challenging due to the huge search space, the heavy-tailed reward distribution, and the challenging credit assignment. We show how a carefully implemented RL-agent that uses a GNN as the basic policy construct can address these challenges and obtain significant improvements over state-of-the-art techniques in three varieties of circuits, including the largest scale networks used in contemporary QC.

NeurIPS Conference 2022 Conference Paper

Reinforcement Learning with a Terminator

  • Guy Tennenholtz
  • Nadav Merlis
  • Lior Shani
  • Shie Mannor
  • Uri Shalit
  • Gal Chechik
  • Assaf Hallak
  • Gal Dalal

We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. This formulation accounts for numerous real-world situations, such as a human interrupting an autonomous driving agent for reasons of discomfort. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w. r. t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.

EWRL Workshop 2022 Workshop Paper

Reinforcement Learning with a Terminator

  • Tennenholtz Guy
  • Nadav Merlis
  • Lior Shani
  • Shie Mannor
  • Uri Shalit
  • Gal Chechik
  • Assaf Hallak
  • Gal Dalal

We present the problem of reinforcement learning with exogenous termination. We define the Termination Markov Decision Process (TerMDP), an extension of the MDP framework, in which episodes may be interrupted by an external non-Markovian observer. We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret. Motivated by our theoretical analysis, we design and implement a scalable approach, which combines optimism (w. r. t. termination) and a dynamic discount factor, incorporating the termination probability. We deploy our method on high-dimensional driving and MinAtar benchmarks. Additionally, we test our approach on human data in a driving setting. Our results demonstrate fast convergence and significant improvement over various baseline approaches.

ICLR Conference 2021 Conference Paper

Auxiliary Learning by Implicit Differentiation

  • Aviv Navon
  • Idan Achituve
  • Haggai Maron
  • Gal Chechik
  • Ethan Fetaya

Training neural networks with auxiliary tasks is a common practice for improving the performance on a main task of interest. Two main challenges arise in this multi-task learning setting: (i) designing useful auxiliary tasks; and (ii) combining auxiliary tasks into a single coherent loss. Here, we propose a novel framework, AuxiLearn, that targets both challenges based on implicit differentiation. First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function. This network can learn non-linear interactions between tasks. Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task. We evaluate AuxiLearn in a series of tasks and domains, including image segmentation and learning with attributes in the low data regime, and find that it consistently outperforms competing methods.

ICML Conference 2021 Conference Paper

Compositional Video Synthesis with Action Graphs

  • Amir Bar
  • Roei Herzig
  • Xiaolong Wang 0004
  • Anna Rohrbach
  • Gal Chechik
  • Trevor Darrell
  • Amir Globerson

Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation. We train and evaluate AG2Vid on CATER and Something-Something V2 datasets, which results in videos that have better visual quality and semantic consistency compared to baselines. Finally, our model demonstrates zero-shot abilities by synthesizing novel compositions of the learned actions.

ICML Conference 2021 Conference Paper

Controlling Graph Dynamics with Reinforcement Learning and Graph Neural Networks

  • Eli A. Meirom
  • Haggai Maron
  • Shie Mannor
  • Gal Chechik

We consider the problem of controlling a partially-observed dynamic process on a graph by a limited number of interventions. This problem naturally arises in contexts such as scheduling virus tests to curb an epidemic; targeted marketing in order to promote a product; and manually inspecting posts to detect fake news spreading on social networks. We formulate this setup as a sequential decision problem over a temporal graph process. In face of an exponential state space, combinatorial action space and partial observability, we design a novel tractable scheme to control dynamical processes on temporal graphs. We successfully apply our approach to two popular problems that fall into our framework: prioritizing which nodes should be tested in order to curb the spread of an epidemic, and influence maximization on a graph.

ICML Conference 2021 Conference Paper

From Local Structures to Size Generalization in Graph Neural Networks

  • Gilad Yehudai
  • Ethan Fetaya
  • Eli A. Meirom
  • Gal Chechik
  • Haggai Maron

Graph neural networks (GNNs) can process graphs of different sizes, but their ability to generalize across sizes, specifically from small to large graphs, is still not well understood. In this paper, we identify an important type of data where generalization from small to large graphs is challenging: graph distributions for which the local structure depends on the graph size. This effect occurs in multiple important graph learning domains, including social and biological networks. We first prove that when there is a difference between the local structures, GNNs are not guaranteed to generalize across sizes: there are "bad" global minima that do well on small graphs but fail on large graphs. We then study the size-generalization problem empirically and demonstrate that when there is a discrepancy in local structure, GNNs tend to converge to non-generalizing solutions. Finally, we suggest two approaches for improving size generalization, motivated by our findings. Notably, we propose a novel Self-Supervised Learning (SSL) task aimed at learning meaningful representations of local structures that appear in large graphs. Our SSL task improves classification accuracy on several popular datasets.

ICML Conference 2021 Conference Paper

GP-Tree: A Gaussian Process Classifier for Few-Shot Incremental Learning

  • Idan Achituve
  • Aviv Navon
  • Yochai Yemini
  • Gal Chechik
  • Ethan Fetaya

Gaussian processes (GPs) are non-parametric, flexible, models that work well in many tasks. Combining GPs with deep learning methods via deep kernel learning (DKL) is especially compelling due to the strong representational power induced by the network. However, inference in GPs, whether with or without DKL, can be computationally challenging on large datasets. Here, we propose GP-Tree, a novel method for multi-class classification with Gaussian processes and DKL. We develop a tree-based hierarchical model in which each internal node of the tree fits a GP to the data using the P{ó}lya-Gamma augmentation scheme. As a result, our method scales well with both the number of classes and data size. We demonstrate the effectiveness of our method against other Gaussian process training baselines, and we show how our general GP approach achieves improved accuracy on standard incremental few-shot learning benchmarks.

NeurIPS Conference 2021 Conference Paper

Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction

  • Gal Dalal
  • Assaf Hallak
  • Steven Dalton
  • iuri frosio
  • Shie Mannor
  • Gal Chechik

Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: \textit{distribution shift} and \textit{scalability}. We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even when having access to the exact state and reward in future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability issue given by the computational complexity of exhaustive TS that scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, enables also training with TS of depths that were not feasible before. We train DQN agents from scratch using TS and show improvement in several Atari games compared to both the original DQN and the more advanced Rainbow. We will share the code upon publication.

UAI Conference 2021 Conference Paper

Known unknowns: Learning novel concepts using reasoning-by-elimination

  • Harsh Agrawal
  • Eli A. Meirom
  • Yuval Atzmon
  • Shie Mannor
  • Gal Chechik

People can learn new visual concepts without any samples, from information given by language or by deductive reasoning. For instance, people can use elimination to infer the meaning of novel labels from their context. While recognizing novel concepts was intensively studied in zero-shot learning with semantic descriptions, training models to learn by elimination is much less studied. Here we describe the first approach to train an agent to reason-by-elimination, by providing instructions that contain both familiar concepts and unfamiliar ones (“pick the red box and the green wambim”). In our framework, the agent combines a perception module with a reasoning module that includes internal memory. It uses reinforcement learning to construct a reasoning policy that, by considering all available items in a room, can make a correct inference even for never-seen objects or concepts. Furthermore, it can then perform one-shot learning and use newly learned concepts for inferring additional novel concepts. We evaluate this approach in a new set of environments, showing that agents successfully learn to reason by elimination, and can also learn novel concepts and use them for further reasoning. This approach paves the way to handle open-world environments by extending the abundant supervised learning approaches with reasoning frameworks that can handle novel concepts.

ICLR Conference 2021 Conference Paper

Learning the Pareto Front with Hypernetworks

  • Aviv Navon
  • Aviv Shamsian
  • Ethan Fetaya
  • Gal Chechik

Multi-objective optimization (MOO) problems are prevalent in machine learning. These problems have a set of optimal solutions, called the Pareto front, where each point on the front represents a different trade-off between possibly conflicting objectives. Recent MOO methods can target a specific desired ray in loss space however, most approaches still face two grave limitations: (i) A separate model has to be trained for each point on the front; and (ii) The exact trade-off must be known before the optimization process. Here, we tackle the problem of learning the entire Pareto front, with the capability of selecting a desired operating point on the front after training. We call this new setup Pareto-Front Learning (PFL). We describe an approach to PFL implemented using HyperNetworks, which we term Pareto HyperNetworks (PHNs). PHN learns the entire Pareto front simultaneously using a single hypernetwork, which receives as input a desired preference vector and returns a Pareto-optimal model whose loss vector is in the desired ray. The unified model is runtime efficient compared to training multiple models and generalizes to new operating points not used during training. We evaluate our method on a wide set of problems, from multi-task regression and classification to fairness. PHNs learn the entire Pareto front at roughly the same time as learning a single point on the front and at the same time reach a better solution set. PFL opens the door to new applications where models are selected based on preferences that are only available at run time.

IJCAI Conference 2021 Conference Paper

On Learning Sets of Symmetric Elements (Extended Abstract)

  • Haggai Maron
  • Or Litany
  • Gal Chechik
  • Ethan Fetaya

Learning from unordered sets is a fundamental learning setup, recently attracting increasing attention. Research in this area has focused on the case where elements of the set are represented by feature vectors, and far less emphasis has been given to the common case where set elements themselves adhere to their own symmetries. That case is relevant to numerous applications, from deblurring image bursts to multi-view 3D shape recognition and reconstruction. In this paper, we present a principled approach to learning sets of general symmetric elements. We first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of elements, like translation in the case of images. We further show that networks that are composed of these layers, called Deep Sets for Symmetric Elements layers (DSS), are universal approximators of both invariant and equivariant functions, and that these networks are strictly more expressive than Siamese networks. DSS layers are also straightforward to implement. Finally, we show that they improve over existing set-learning architectures in a series of experiments with images, graphs, and point clouds.

ICML Conference 2021 Conference Paper

Personalized Federated Learning using Hypernetworks

  • Aviv Shamsian
  • Aviv Navon
  • Ethan Fetaya
  • Gal Chechik

Personalized federated learning is tasked with training machine learning models for multiple clients, each with its own data distribution. The goal is to train personalized models collaboratively while accounting for data disparities across clients and reducing communication costs. We propose a novel approach to this problem using hypernetworks, termed pFedHN for personalized Federated HyperNetworks. In this approach, a central hypernetwork model is trained to generate a set of models, one model for each client. This architecture provides effective parameter sharing across clients while maintaining the capacity to generate unique and diverse personal models. Furthermore, since hypernetwork parameters are never transmitted, this approach decouples the communication cost from the trainable model size. We test pFedHN empirically in several personalized federated learning challenges and find that it outperforms previous methods. Finally, since hypernetworks share information across clients, we show that pFedHN can generalize better to new clients whose distributions differ from any client observed during training.

NeurIPS Conference 2021 Conference Paper

Personalized Federated Learning With Gaussian Processes

  • Idan Achituve
  • Aviv Shamsian
  • Aviv Navon
  • Gal Chechik
  • Ethan Fetaya

Federated learning aims to learn a global model that performs well on client devices with limited cross-client communication. Personalized federated learning (PFL) further extends this setup to handle data heterogeneity between clients by learning personalized models. A key challenge in this setting is to learn effectively across clients even though each client has unique data that is often limited in size. Here we present pFedGP, a solution to PFL that is based on Gaussian processes (GPs) with deep kernel learning. GPs are highly expressive models that work well in the low data regime due to their Bayesian nature. However, applying GPs to PFL raises multiple challenges. Mainly, GPs performance depends heavily on access to a good kernel function, and learning a kernel requires a large training set. Therefore, we propose learning a shared kernel function across all clients, parameterized by a neural network, with a personal GP classifier for each client. We further extend pFedGP to include inducing points using two novel methods, the first helps to improve generalization in the low data regime and the second reduces the computational cost. We derive a PAC-Bayes generalization bound on novel clients and empirically show that it gives non-vacuous guarantees. Extensive experiments on standard PFL benchmarks with CIFAR-10, CIFAR-100, and CINIC-10, and on a new setup of learning under input noise show that pFedGP achieves well-calibrated predictions while significantly outperforming baseline methods, reaching up to 21% in accuracy gain.

NeurIPS Conference 2020 Conference Paper

A causal view of compositional zero-shot recognition

  • Yuval Atzmon
  • Felix Kreuk
  • Uri Shalit
  • Gal Chechik

People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even if they are not "essential" for the class. This leads to consistent misclassification of samples from a new distribution, like new combinations of known components. Here we describe an approach for compositional generalization that builds on causal ideas. First, we describe compositional zero-shot learning from a causal perspective, and propose to view zero-shot inference as finding "which intervention caused the image? ". Second, we present a causal-inspired embedding model that learns disentangled representations of elementary components of visual objects from correlated (confounded) training data. We evaluate this approach on two datasets for predicting new combinations of attribute-object pairs: A well-controlled synthesized images dataset and a real world dataset which consists of fine-grained types of shoes. We show improvements compared to strong baselines.

ICML Conference 2020 Conference Paper

On Learning Sets of Symmetric Elements

  • Haggai Maron
  • Or Litany
  • Gal Chechik
  • Ethan Fetaya

Learning from unordered sets is a fundamental learning setup, recently attracting increasing attention. Research in this area has focused on the case where elements of the set are represented by feature vectors, and far less emphasis has been given to the common case where set elements themselves adhere to their own symmetries. That case is relevant to numerous applications, from deblurring image bursts to multi-view 3D shape recognition and reconstruction. In this paper, we present a principled approach to learning sets of general symmetric elements. We first characterize the space of linear layers that are equivariant both to element reordering and to the inherent symmetries of elements, like translation in the case of images. We further show that networks that are composed of these layers, called Deep Sets for Symmetric Elements layers (DSS), are universal approximators of both invariant and equivariant functions. DSS layers are also straightforward to implement. Finally, we show that they improve over existing set-learning architectures in a series of experiments with images, graphs, and point-clouds.

NeurIPS Conference 2018 Conference Paper

Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

  • Roei Herzig
  • Moshiko Raboh
  • Gal Chechik
  • Jonathan Berant
  • Amir Globerson

Machine understanding of complex images is a key goal of artificial intelligence. One challenge underlying this task is that visual scenes contain multiple inter-related objects, and that global context plays an important role in interpreting the scene. A natural modeling framework for capturing such effects is structured prediction, which optimizes over complex labels, while modeling within-label interactions. However, it is unclear what principles should guide the design of a structured prediction model that utilizes the power of deep learning components. Here we propose a design principle for such architectures that follows from a natural requirement of permutation invariance. We prove a necessary and sufficient characterization for architectures that follow this invariance, and discuss its implication on model design. Finally, we show that the resulting model achieves new state of the art results on the Visual Genome scene graph labeling benchmark, outperforming all recent approaches.

UAI Conference 2018 Conference Paper

Probabilistic AND-OR Attribute Grouping for Zero-Shot Learning

  • Yuval Atzmon
  • Gal Chechik

In zero-shot learning (ZSL), a classifier is trained to recognize visual classes without any image samples. Instead, it is given semantic information about the class, like a textual description or a set of attributes. Learning from attributes could benefit from explicitly modeling structure of the attribute space. Unfortunately, learning of general structure from empirical samples is hard with typical dataset sizes. Here we describe LAGO1, a probabilistic model designed to capture natural soft andor relations across groups of attributes. We show how this model can be learned end-toend with a deep attribute-detection model. The soft group structure can be learned from data jointly as part of the model, and can also readily incorporate prior knowledge about groups if available. The soft and-or structure succeeds to capture meaningful and predictive structures, improving the accuracy of zero-shot learning on two of three benchmarks. Finally, LAGO reveals a unified formulation over two ZSL approaches: DAP (Lampert et al. , 2009) and ESZSL (Romera-Paredes & Torr, 2015). Interestingly, taking only one singleton group for each attribute, introduces a new soft-relaxation of DAP, that outperforms DAP by ∼40%. 1 A video of the highlights, and code is available at: http: //chechiklab. biu. ac. il/˜yuvval/LAGO/ Figure 1: Classifying a bird species based on attributes from (Wah et al. , 2011). The Mourning Warbler can be distinguished from other species by a combination of a grey head and olive-green underparts. Both human raters and machine learning models may confuse semantically-similar attributes like olive or green wings. These attribute naturally cluster into ”OR” groups, where we aim to recognize this species if the wing is labeled as either green or olive. The LAGO model (Eq. 4) weighs attributes detection, inferring classes based on within-group soft-OR and across-groups soft-AND. In general, OR-groups include alternative choices of a property (wing color: {red, olive, green}) and soft-OR allows to weigh down class-irrelevant attributes (here, wing: red).

IJCAI Conference 2017 Conference Paper

Instance-Level Label Propagation with Multi-Instance Learning

  • Qifan Wang
  • Gal Chechik
  • Chen Sun
  • Bin Shen

Label propagation is a popular semi-supervised learning technique that transfers information from labeled examples to unlabeled examples through a graph. Most label propagation methods construct a graph based on example-to-example similarity, assuming that the resulting graph connects examples that share similar labels. Unfortunately, example-level similarity is sometimes badly defined. For instance, two images may contain two different objects, but have similar overall appearance due to large similar background. In this case, computing similarities based on whole-image would fail propagating information to the right labels. This paper proposes a novel Instance-Level Label Propagation (ILLP) approach that integrates label propagation with multi-instance learning. Each example is treated as containing multiple instances, as in the case of an image consisting of multiple regions. We first construct a graph based on instance-level similarity and then simultaneously identify the instances carrying the labels and propagate the labels across instances in the graph. Optimization is based on an iterative Expectation Maximization (EM) algorithm. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed approach over several state-of-the-art methods.

ICML Conference 2014 Conference Paper

Coordinate-descent for learning orthogonal matrices through Givens rotations

  • Uri Shalit
  • Gal Chechik

Optimizing over the set of orthogonal matrices is a central component in problems like sparse-PCA or tensor decomposition. Unfortunately, such optimization is hard since simple operations on orthogonal matrices easily break orthogonality, and correcting orthogonality usually costs a large amount of computation. Here we propose a framework for optimizing orthogonal matrices, that is the parallel of coordinate-descent in Euclidean spaces. It is based on \em Givens-rotations, a fast-to-compute operation that affects a small number of entries in the learned matrix, and preserves orthogonality. We show two applications of this approach: an algorithm for tensor decompositions used in learning mixture models, and an algorithm for sparse-PCA. We study the parameter regime where a Givens rotation approach converges faster and achieves a superior model on a genome-wide brain-wide mRNA expression dataset.

ICML Conference 2013 Conference Paper

Modeling Musical Influence with Topic Models

  • Uri Shalit
  • Daphna Weinshall
  • Gal Chechik

The role of musical influence has long been debated by scholars and critics in the humanities, but never in a data-driven way. In this work we approach the question of influence by applying topic-modeling tools (Blei & Lafferty, 2006; Gerrish & Blei, 2010) to a dataset of 24941 songs by 9222 artists, from the years 1922 to 2010. We find the models to be significantly correlated with a human-curated influence measure, and to clearly outperform a baseline method. Further using the learned model to study properties of influence, we find that musical influence and musical innovation are not monotonically correlated. However, we do find that the most influential songs were more innovative during two time periods: the early 1970’s and the mid 1990’s.

JMLR Journal 2012 Journal Article

Online Learning in the Embedded Manifold of Low-rank Matrices

  • Uri Shalit
  • Daphna Weinshall
  • Gal Chechik

When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ( (n+m)k ) for a rank- k matrix of dimension m X n, when using an online procedure with rank-one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt LORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. LORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet ) multi-label image classification task. [abs] [ pdf ][ bib ] &copy JMLR 2012. ( edit, beta )

JMLR Journal 2010 Journal Article

Large Scale Online Learning of Image Similarity Through Ranking

  • Gal Chechik
  • Varun Sharma
  • Uri Shalit
  • Samy Bengio

Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large data sets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a data set with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, data sets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale data set, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale data sets that could not be handled before. [abs] [ pdf ][ bib ] &copy JMLR 2010. ( edit, beta )

NeurIPS Conference 2010 Conference Paper

Online Learning in The Manifold of Low-Rank Matrices

  • Uri Shalit
  • Daphna Weinshall
  • Gal Chechik

When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches for minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is hard to compute, and so is the projection operator that approximates it, we describe another second-order retraction that can be computed efficiently, with run time and memory complexity of O((n+m)k) for a rank-k matrix of dimension m x n, given rank one gradients. We use this algorithm, LORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. LORETA improves the mean average precision over a passive- aggressive approach in a factorized model, and also improves over a full model trained over pre-selected features using the same memory requirements. LORETA also showed consistent improvement over standard methods in a large (1600 classes) multi-label image classification task.

NeurIPS Conference 2009 Conference Paper

An Online Algorithm for Large Scale Image Similarity Learning

  • Gal Chechik
  • Uri Shalit
  • Varun Sharma
  • Samy Bengio

Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity may not scale to large datasets with high dimensionality, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by an order of magnitude than was handled before.

JMLR Journal 2008 Journal Article

Max-margin Classification of Data with Absent Features

  • Gal Chechik
  • Geremy Heitz
  • Gal Elidan
  • Pieter Abbeel
  • Daphne Koller

We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. [abs] [ pdf ][ bib ] &copy JMLR 2008. ( edit, beta )

JMLR Journal 2007 Journal Article

Euclidean Embedding of Co-occurrence Data

  • Amir Globerson
  • Gal Chechik
  • Fernando Pereira
  • Naftali Tishby

Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. [abs] [ pdf ][ bib ] &copy JMLR 2007. ( edit, beta )

NeurIPS Conference 2006 Conference Paper

Max-margin classification of incomplete data

  • Gal Chechik
  • Geremy Heitz
  • Gal Elidan
  • Pieter Abbeel
  • Daphne Koller

We consider the problem of learning classifiers for structurally incomplete data, where some ob jects have a subset of features inherently absent due to complex relationships between the features. The common approach for handling missing features is to begin with a preprocessing phase that completes the missing features, and then use a standard classification procedure. In this paper we show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate this task using a geometrically-inspired ob jective function, and discuss two optimization approaches: The linearly separable case is written as a set of convex feasibility problems, and the non-separable case has a non-convex ob jective that we optimize iteratively. By avoiding the pre-processing phase in which the data is completed, these approaches offer considerable computational savings. More importantly, we show that by elegantly handling complex patterns of missing values, our approach is both competitive with other methods when the values are missing at random and outperforms them when the missing values have non-trivial structure. We demonstrate our results on two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images.

NeurIPS Conference 2006 Conference Paper

Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks

  • Alexis Battle
  • Gal Chechik
  • Daphne Koller

We present a probabilistic model applied to the fMRI video rating prediction task of the Pittsburgh Brain Activity Interpretation Competition (PBAIC) [2]. Our goal is to predict a time series of subjective, semantic ratings of a movie given functional MRI data acquired during viewing by three subjects. Our method uses conditionally trained Gaussian Markov random fields, which model both the relationships between the subjects' fMRI voxel measurements and the ratings, as well as the dependencies of the ratings across time steps and between subjects. We also employed non-traditional methods for feature selection and regularization that exploit the spatial structure of voxel activity in the brain. The model displayed good performance in predicting the scored ratings for the three subjects in test data sets, and a variant of this model was the third place entrant to the 2006 PBAIC.

JMLR Journal 2005 Journal Article

Information Bottleneck for Gaussian Variables

  • Gal Chechik
  • Amir Globerson
  • Naftali Tishby
  • Yair Weiss

The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for the this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σ x|y Σ x -1, which is also the basis obtained in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the "information-curve"), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. [abs] [ pdf ][ bib ] &copy JMLR 2005. ( edit, beta )

NeurIPS Conference 2004 Conference Paper

Discrete profile alignment via constrained information bottleneck

  • Sean O'rourke
  • Gal Chechik
  • Robin Friedman
  • Eleazar Eskin

Amino acid profiles, which capture position-specific mutation prob- abilities, are a richer encoding of biological sequences than the in- dividual sequences themselves. However, profile comparisons are much more computationally expensive than discrete symbol com- parisons, making profiles impractical for many large datasets. Fur- thermore, because they are such a rich representation, profiles can be difficult to visualize. To overcome these problems, we propose a discretization for profiles using an expanded alphabet representing not just individual amino acids, but common profiles. By using an extension of information bottleneck (IB) incorporating constraints and priors on the class distributions, we find an informationally optimal alphabet. This discretization yields a concise, informative textual representation for profile sequences. Also alignments be- tween these sequences, while nearly as accurate as the full profile- profile alignments, can be computed almost as quickly as those between individual or consensus sequences. A full pairwise align- ment of SwissProt would take years using profiles, but less than 3 days using a discrete IB encoding, illustrating how discrete en- coding can expand the range of sequence problems to which profile information can be applied. 1 Introduction One of the most powerful techniques in protein analysis is the comparison of a target amino acid sequence with phylogenetically related or homologous proteins. Such comparisons give insight into which portions of the protein are important by revealing the parts that were conserved through natural selection. While mutations in non-functional regions may be harmless, mutations in functional regions are often lethal. For this reason, functional regions of a protein tend to be conserved between organisms while non-functional regions diverge. Department of Computer Science and Engineering, University of California San Diego Department of Computer Science, Stanford University Many of the state-of-the-art protein analysis techniques incorporate homologous sequences by representing a set of homologous sequences as a probabilistic profile, a sequence of the marginal distributions of amino acids at each position in the sequence. For example, Yona et al. [10] uses profiles to align distant homologues from the SCOP database[3]; the resulting alignments are similar to results from structural alignments, and tend to reflect both secondary and tertiary protein structure. The PHD algorithm[5] uses profiles purely for structure prediction. PSIBLAST[6] uses them to refine database searches. Although profiles provide a lot of information about the sequence, the use of pro- files comes at a steep price. While extremely efficient string algorithms exist for aligning protein sequences (Smith-Waterman[8]) and performing database queries (BLAST[6]), these algorithms operate on strings and are not immediately applica- ble to profile alignment or profile database queries. While profile-based methods can be substantially more accurate than sequence-based ones, they can require at least an order of magnitude more computation time, since substitution penalties must be calculated by computing distances between probability distributions. This makes profiles impractical for use with large bioinformatics databases like SwissProt, which recently passed 150, 000 sequences. Another drawback of profile as compared to string representations is that it is much more difficult to visually interpret a sequence of 20 dimensional vectors than a sequence of letters. Discretizing the profiles addresses both of these problems. First, once a profile is rep- resented using a discrete alphabet, alignment and database search can be performed using the efficient string algorithms developed for sequences. For example, when aligning sequences of 1000 elements, runtime decreases from 20 seconds for profiles to 2 for discrete sequences. Second, by representing each class as a letter, discretized profiles can be presented in plain text like the original or consensus sequences, while conveying more information about the underlying profiles. This makes them more accurate than consensus sequences, and more dense than sequence logos (see figure 1). To make this representation intuitive, we want the discretization not only to minimize information loss, but also to reflect biologically meaningful categories by forming a superset of the standard 20-character amino acid alphabet. For example, we use "A" and "a" for strongly- and weakly-conserved Alanine. This formulation demands two types of constraints: similarities of the centroids to predefined values, and specific structural similarities between strongly- and weakly-conserved variants. We show below how these constraints can be added to the original IB formalism. In this paper, we present a new discrete representation of proteins that takes into account information from homologues. The main idea behind our approach is to compress the space of probabilistic profiles in a data-dependent manner by clustering the actual profiles and representing them by a small alphabet of distributions. Since this discretization removes some of the information carried by the full profiles, we cluster the distribution in a way that is directly targeted at minimizing the information loss. This is achieved using a variant of Information Bottleneck (IB)[9], a distributional clustering approach for informationally optimal discretization. We apply our algorithm to a subset of MEROPS[4], a database of peptidases or- ganized structurally by family and clan, and analyze the results in terms of both information loss and alignment quality. We show that multivariate IB in particular preserves much of the information in the original profiles using a small number of classes. Furthermore, optimal alignments for profile sequences encoded with these classes are much closer to the original profile-profile alignments than are alignments between the seed proteins. IB discretization is therefore an attractive way to gain some of the additional sensitivity of profiles with less computational cost. 0. 0 0. 0 0. 0 0. 09 0. 34 0. 23 0. 12 0. 0 0. 0 0. 0 0. 0 0. 0 0. 0 0. 04 0. 01 0. 01 0. 03 0. 0 0. 0 0. 0 0. 0 0. 0 1. 0 0. 01 0. 05 0. 14 0. 09 0. 0 1. 0 0. 0 0. 0 0. 0 0. 0 0. 38 0. 04 0. 00 0. 04 0. 0 0. 0 0. 0 0. 0 0. 0 0. 0 0. 06 0. 00 0. 08 0. 04 0. 0 0. 0 1. 0 0. 0 0. 0 0. 0 0. 00 0. 06 0. 01 0. 03 1. 0 0. 0 0. 0 0. 0 0. 0 0. 0 0. 02 0. 00 0. 04 0. 00 0. 0 0. 0 0. 0 N 0. 0 0. 0 0. 0 0. 00 0. 00 0. 03 0. 00 0. 0 0. 0 0. 0 ND GDF 0. 0 0. 0 0. 0 0. 04 0. 01 0. 01 0. 00 0. 0 0. 0 0. 0 S EAAS S V PT S A A T D F F D N G S L K D Q E T R N H F Y V 0. 0 0. 0 0. 0 0. 01 0. 01 0. 00 0. 09 0. 0 0. 0 0. 0 Q Y E A A A A A A 0. 0 0. 0 0. 0 0. 00 0. 00 0. 03 0. 00 0. 0 0. 0 0. 0 (b) 0. 5 1. 0 0. 0 0. 05 0. 05 0. 01 0. 01 0. 0 0. 0 0. 0 0. 0 0. 0 0. 0 0. 02 0. 00 0. 23 0. 00 0. 0 0. 0 0. 0 P00790 Seq. : ---EAPT--- 0. 0 0. 0 0. 0 0. 04 0. 05 0. 00 0. 00 0. 0 0. 0 0. 0 Consensus Seq. : NNDEAASGDF 0. 0 0. 0 0. 0 0. 04 0. 01 0. 00 0. 00 0. 0 0. 0 0. 0 0. 5 0. 0 0. 0 0. 16 0. 10 0. 06 0. 29 0. 0 0. 0 0. 0 IB Seq. : NNDeaptGDF 0. 0 0. 0 0. 0 0. 02 0. 10 0. 05 0. 20 0. 0 0. 0 0. 0 0. 0 0. 0 0. 0 0. 00 0. 14 0. 03 0. 04 0. 0 0. 0 0. 0 (c) 0. 0 0. 0 0. 0 0. 00 0. 00 0. 00 0. 00 0. 0 0. 0 0. 0 0. 0 0. 0 0. 0 0. 01 0. 00 0. 04 0. 04 0. 0 0. 0 0. 0 (a) Figure 1: (a) Profile, (b) sequence logo[2], and (c) textual representations for part of an alignment of Pepsin A precursor P00790, showing IB's concision compared to profiles and logos, and its precision compared to single sequences. 2 Information Bottleneck Information Bottleneck [9] is an information theoretic approach for distributional clustering. Given a joint distribution p(X, Y ) of two random variables X and Y, the goal is to obtain a compressed representation C of X, while preserving the informa- tion about Y. The two goals of compression and information preservation are quan- tified by the same measure of mutual information I(X; Y ) = p(x, y) log p(x, y) x, y p(x)p(y) and the problem is therefore defined as the constrained optimization problem minp(c|x): I(C; Y )>K I(C; X) where K is a constraint on the level of information preserved about Y, and the problem should also obey the constraints p(y|c) = p(y|x)p(x|c) and p(y) = p(y|x)p(x). This constrained optimization can be x x reformulated using Lagrange multipliers, and turned into a tradeoff optimization function with Lagrange multiplier: min L def = I(C; X) - I(C; Y ) (1) p(c|x) As an unsupervised learning technique, IB aims to characterize the set of solutions for the complete spectrum of constraint values K. This set of solutions is identical to the set of solutions of the tradeoff optimization problem obtained for the spectrum of values. When X is discrete, its natural compression is fuzzy clustering. In this case, the problem is not convex and cannot be guaranteed to contain a single global minimum. Fortunately, its solutions can be characterized analytically by a set of self consistent equations. These self consistent equations can then be used in an iterative algorithm that is guaranteed to converge to a local minimum. While the optimal solutions of the IB functional are in general soft clusters, in practice, hard cluster solutions are sometimes more easily interpreted. A series of algorithms was developed for hard IB, including an algorithm that can be viewed as a one-step look-ahead sequential version of K-Means [7]. To apply IB to the problem of profiles discretization discussed here, X is a given set of probabilistic profiles obtained from a set of aligned sequences and Y is the set of 20 amino acids. 2. 1 Constraints on centroids' semantics The application studied in this paper differs from standard IB applications in that we are interested in obtaining a representation that is both efficient and biologi- cally meaningful. This requires that we add two kinds of constraints on clusters' distributions, discussed below. First, some clusters' meanings are naturally determined by limiting them to corre- spond to the common 20-letter alphabet used to describe amino acids. From the point of view of distributions over amino acids, each of these symbols is used today as the delta function distribution which is fully concentrated on a single amino acid. For the goal of finding an efficient representation, we require the centroids to be close to these delta distributions. More generally, we require the centroids to be close to some predefined values ^ ci, thus adding constraints to the IB target function of the form DKL[p(y|^ ci)||p(y|ci)] min L I(C; X) - I(C; Y ) + (ci)DKL[p(y|^ ci)||p(y|ci)]. (2) p(c|x) ciC We now use the following identity p(x, c)DKL[p(y|x)||p(y|c)] x, c = p(x) p(y|x) log p(y|x) - p(c) log p(y|c) p(y|x)p(x|c) x y c y x = -H(Y |X) + H(Y |C) = I(X; Y ) - I(Y; C) to rewrite the IB functional of Eq. (1) as L = I(C; X) + p(x, c)DKL[p(y|x)||p(y|c)] - I(X; Y ) cC xX When (ci) 1 we can similarly rewrite Eq. (2) as L = I(C; X) + p(x) p(ci|x)DKL[p(y|x)||p(y|ci)] (3) xX ciC + (ci)DKL[p(y|^ ci)||p(y|ci)] - I(X; Y ) ciC = I(C; X) + p(x ) p(ci|x )DKL[p(y|x )||p(y|ci)] - I(X; Y ) x X ciC The optimization problem therefore becomes equivalent to the original IB problem, but with a modified set of samples x X, containing X plus additional "pseudo- counts" or biases. This is similar to the inclusion of priors in Bayesian estimation. Formulated this way, the biases can be easily incorporated in standard IB algorithms by adding additional pseudo-counts x with prior probability p(x ) = i(c). 2. 2 Constraints on relations between centroids We want our discretization to capture correlations between strongly- and weakly- conserved variants of the same symbol. This can be done with standard IB using separate classes for the alternatives. However, since the distributions of other amino acids in these two variants are likely to be related, it is preferable to define a single shared prior for both variants, and to learn a model capturing their correlation. Friedman et al. [1] describe multivariate information bottleneck (mIB), an extension of information bottleneck to joint distributions over several correlated input and cluster variables. For profile discretization, we define two compression variables connected as in Friedman's "parallel IB": an amino acid class C {A, C, .. .} with an associated prior, and a strength S {0, 1}. Since this model correlates strong and weak variants of each category, it requires fewer priors than simple IB. It also has fewer parameters: a multivariate model with ns strengths and nc classes has as many categories as a univariate one with nc = nsnc classes, but has only ns +nc -2 free parameters for each x, instead of nsnc - 1.

NeurIPS Conference 2004 Conference Paper

Euclidean Embedding of Co-Occurrence Data

  • Amir Globerson
  • Gal Chechik
  • Fernando Pereira
  • Naftali Tishby

Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for em- bedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to con- vex optimization over positive semidefinite matrices. The local struc- ture of our embedding corresponds to the statistical correlations via ran- dom walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and signifi- cantly outperforms standard methods of statistical correspondence mod- eling, such as multidimensional scaling and correspondence analysis.

NeurIPS Conference 2003 Conference Paper

Information Bottleneck for Gaussian Variables

  • Gal Chechik
  • Amir Globerson
  • Naftali Tishby
  • Yair Weiss

The problem of extracting the relevant aspects of data was ad- dressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than dis- crete clusters. We give a formal deflnition of the general continuous IB problem and obtain an analytic solution for the optimal repre- sentation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix §xjy§¡1 x, which is also the basis obtained in Canonical Correlation Analysis. How- ever, in Gaussian IB, the compression tradeofi parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of difierent ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeofi - the information curve - in terms of the eigenvalue spec- trum.

NeurIPS Conference 2002 Conference Paper

Extracting Relevant Structures with Side Information

  • Gal Chechik
  • Naftali Tishby

The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extract- ing structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more in- formation than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also mini- mizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categoriza- tion and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.

NeurIPS Conference 2001 Conference Paper

Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

  • Gal Chechik
  • Amir Globerson
  • M. Anderson
  • E. Young
  • Israel Nelken
  • Naftali Tishby

The way groups of auditory neurons interact to code acoustic in(cid: 173) formation is investigated using an information theoretic approach. We develop measures of redundancy among groups of neurons, and apply them to the study of collaborative coding efficiency in two processing stations in the auditory pathway: the inferior colliculus (IC) and the primary auditory cortex (AI). Under two schemes for the coding of the acoustic content, acoustic segments coding and stimulus identity coding, we show differences both in information content and group redundancies between IC and AI neurons. These results provide for the first time a direct evidence for redundancy reduction along the ascending auditory pathway, as has been hy(cid: 173) pothesized for theoretical considerations [Barlow 1959, 2001]. The redundancy effects under the single-spikes coding scheme are signif(cid: 173) icant only for groups larger than ten cells, and cannot be revealed with the redundancy measures that use only pairs of cells. The results suggest that the auditory system transforms low level rep(cid: 173) resentations that contain redundancies due to the statistical struc(cid: 173) ture of natural stimuli, into a representation in which cortical neu(cid: 173) rons extract rare and independent component of complex acoustic signals, that are useful for auditory scene analysis.

NeurIPS Conference 2000 Conference Paper

Temporally Dependent Plasticity: An Information Theoretic Account

  • Gal Chechik
  • Naftali Tishby

The paradigm of Hebbian learning has recently received a novel in(cid: 173) terpretation with the discovery of synaptic plasticity that depends on the relative timing of pre and post synaptic spikes. This paper derives a temporally dependent learning rule from the basic princi(cid: 173) ple of mutual information maximization and studies its relation to the experimentally observed plasticity. We find that a supervised spike-dependent learning rule sharing similar structure with the ex(cid: 173) perimentally observed plasticity increases mutual information to a stable near optimal level. Moreover, the analysis reveals how the temporal structure of time-dependent learning rules is determined by the temporal filter applied by neurons over their inputs. These results suggest experimental prediction as to the dependency of the learning rule on neuronal biophysical parameters

NeurIPS Conference 1999 Conference Paper

Effective Learning Requires Neuronal Remodeling of Hebbian Synapses

  • Gal Chechik
  • Isaac Meilijson
  • Eytan Ruppin

This paper revisits the classical neuroscience paradigm of Hebbian learning. We find that a necessary requirement for effective as(cid: 173) sociative memory learning is that the efficacies of the incoming synapses should be uncorrelated. This requirement is difficult to achieve in a robust manner by Hebbian synaptic learning, since it depends on network level information. Effective learning can yet be obtained by a neuronal process that maintains a zero sum of the in(cid: 173) coming synaptic efficacies. This normalization drastically improves the memory capacity of associative networks, from an essentially bounded capacity to one that linearly scales with the network's size. It also enables the effective storage of patterns with heterogeneous coding levels in a single network. Such neuronal normalization can be successfully carried out by activity-dependent homeostasis of the neuron's synaptic efficacies, which was recently observed in cortical tissue. Thus, our findings strongly suggest that effective associa(cid: 173) tive learning with Hebbian synapses alone is biologically implausi(cid: 173) ble and that Hebbian synapses must be continuously remodeled by neuronally-driven regulatory processes in the brain.

NeurIPS Conference 1998 Conference Paper

Neuronal Regulation Implements Efficient Synaptic Pruning

  • Gal Chechik
  • Isaac Meilijson
  • Eytan Ruppin

Human and animal studies show that mammalian brain undergoes massive synaptic pruning during childhood, removing about half of the synapses until puberty. We have previously shown that main(cid: 173) taining network memory performance while synapses are deleted, requires that synapses are properly modified and pruned, remov(cid: 173) ing the weaker synapses. We now show that neuronal regulation, a mechanism recently observed to maintain the average neuronal in(cid: 173) put field, results in weight-dependent synaptic modification. Under the correct range of the degradation dimension and synaptic up(cid: 173) per bound, neuronal regulation removes the weaker synapses and judiciously modifies the remaining synapses. It implements near optimal synaptic modification, and maintains the memory perfor(cid: 173) mance of a network undergoing massive synaptic pruning. Thus, this paper shows that in addition to the known effects of Hebbian changes, neuronal regulation may play an important role in the self-organization of brain networks during development.