Author name cluster

Alexei Efros

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Fast Data Attribution for Text-to-Image Models

Sheng-Yu Wang
Aaron Hertzmann
Alexei Efros
Richard Zhang
Jun-Yan Zhu

Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2, 500x - 400, 000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.

NeurIPS Conference 2025 Conference Paper

Vision Transformers Don't Need Trained Registers

Nicholas Jiang
Amil Dravid
Alexei Efros
Yossi Gandelsman

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al. , 2024). We observe that in multiple models (e. g. , CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned $\textit{register tokens}$, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered $\textit{register neurons}$ into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models, yielding cleaner attention-based, text-to-image attribution. Finally, we outline a simple mathematical model that reflects the observed behavior of register neurons and high norm tokens. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.

NeurIPS Conference 2025 Conference Paper

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Anand Bhattad
Konpat Preechakul
Alexei Efros

This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both physical and geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.

NeurIPS Conference 2023 Conference Paper

Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives

Tom Monnier
Jake Austin
Angjoo Kanazawa
Alexei Efros
Mathieu Aubry

Given a set of calibrated images of a scene, we present an approach that produces a simple, compact, and actionable 3D world representation by means of 3D primitives. While many approaches focus on recovering high-fidelity 3D scenes, we focus on parsing a scene into mid-level 3D representations made of a small set of textured primitives. Such representations are interpretable, easy to manipulate and suited for physics-based simulations. Moreover, unlike existing primitive decomposition methods that rely on 3D input data, our approach operates directly on images through differentiable rendering. Specifically, we model primitives as textured superquadric meshes and optimize their parameters from scratch with an image rendering loss. We highlight the importance of modeling transparency for each primitive, which is critical for optimization and also enables handling varying numbers of primitives. We show that the resulting textured primitives faithfully reconstruct the input images and accurately model the visible 3D points, while providing amodal shape completions of unseen object regions. We compare our approach to the state of the art on diverse scenes from DTU, and demonstrate its robustness on real-life captures from BlendedMVS and Nerfstudio. We also showcase how our results can be used to effortlessly edit a scene or perform physical simulations. Code and video results are available at https: //www. tmonnier. com/DBW.

NeurIPS Conference 2023 Conference Paper

Diffusion Self-Guidance for Controllable Image Generation

Dave Epstein
Allan Jabri
Ben Poole
Alexei Efros
Aleksander Holynski

Large-scale generative models are capable of producing high-quality images from detailed prompts. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides precise control over properties of the generated image by guiding the internal representations of diffusion models. We demonstrate that the size, location, and appearance of objects can be extracted from these representations, and show how to use them to steer the sampling process. Self-guidance operates similarly to standard classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We demonstrate the flexibility and effectiveness of self-guided generation through a wide range of challenging image manipulations, such as modifying the position or size of a single object (keeping the rest of the image unchanged), merging the appearance of objects in one image with the layout of another, composing objects from multiple images into one, and more. We also propose a new method for reconstruction using self-guidance, which allows extending our approach to editing real images.

NeurIPS Conference 2022 Conference Paper

Generating Long Videos of Dynamic Scenes

Tim Brooks
Janne Hellsten
Miika Aittala
Ting-Chun Wang
Timo Aila
Jaakko Lehtinen
Ming-Yu Liu
Alexei Efros

We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive bias to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. We leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.

NeurIPS Conference 2022 Conference Paper

Test-Time Training with Masked Autoencoders

Yossi Gandelsman
Yu Sun
Xinlei Chen
Alexei Efros

Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem. Empirically, our simple method improves generalization on many visual benchmarks for distribution shifts. Theoretically, we characterize this improvement in terms of the bias-variance trade-off.

NeurIPS Conference 2022 Conference Paper

Visual Prompting via Image Inpainting

Amir Bar
Yossi Gandelsman
Trevor Darrell
Amir Globerson
Alexei Efros

How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting -- literally just filling in a hole in a concatenated visual prompt image -- turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated -- 88k unlabeled figures from academic papers sources on Arxiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc. Project page: https: //yossigandelsman. github. io/visual_prompt

NeurIPS Conference 2021 Conference Paper

MarioNette: Self-Supervised Sprite Learning

Dmitriy Smirnov
MICHAEL GHARBI
Matthew Fisher
Vitor Guizilini
Alexei Efros
Justin M. Solomon

Artists and video game designers often construct 2D animations using libraries of sprites---textured patches of objects and characters. We propose a deep learning approach that decomposes sprite-based video animations into a disentangled representation of recurring graphic elements in a self-supervised manner. By jointly learning a dictionary of possibly transparent patches and training a network that places them onto a canvas, we deconstruct sprite-based content into a sparse, consistent, and explicit representation that can be easily used in downstream tasks, like editing or analysis. Our framework offers a promising approach for discovering recurring visual patterns in image collections without supervision.

NeurIPS Conference 2020 Conference Paper

Space-Time Correspondence as a Contrastive Random Walk

Allan Jabri
Andrew Owens
Alexei Efros

This paper proposes a simple self-supervised approach for learning a representation for visual correspondence from raw video. We cast correspondence as prediction of links in a space-time graph constructed from video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a representation in which pairwise similarity defines transition probability of a random walk, such that prediction of long-range correspondence is computed as a walk along the graph. We optimize the representation to place high probability along paths of similarity. Targets for learning are formed without supervision, by cycle-consistency: the objective is to maximize the likelihood of returning to the initial node when walking along a graph constructed from a palindrome of frames. Thus, a single path-level constraint implicitly supervises chains of intermediate comparisons. When used as a similarity metric without adaptation, the learned representation outperforms the self-supervised state-of-the-art on label propagation tasks involving objects, semantic parts, and pose. Moreover, we demonstrate that a technique we call edge dropout, as well as self-supervised adaptation at test-time, further improve transfer for object-centric correspondence.

NeurIPS Conference 2020 Conference Paper

Swapping Autoencoder for Deep Image Manipulation

Taesung Park
Jun-Yan Zhu
Oliver Wang
Jingwan Lu
Eli Shechtman
Alexei Efros
Richard Zhang

Deep generative models have become increasingly effective at producing realistic images from randomly sampled seeds, but using such models for controllable manipulation of existing images remains challenging. We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation, rather than random sampling. The key idea is to encode an image into two independent components and enforce that any swapped combination maps to a realistic image. In particular, we encourage the components to represent structure and texture, by enforcing one component to encode co-occurrent patch statistics across different parts of the image. As our method is trained with an encoder, finding the latent codes for a new input image becomes trivial, rather than cumbersome. As a result, our method enables us to manipulate real input images in various ways, including texture swapping, local and global editing, and latent code vector arithmetic. Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.

NeurIPS Conference 2019 Conference Paper

Learning to Control Self-Assembling Morphologies: A Study of Generalization via Modularity

Deepak Pathak
Christopher Lu
Trevor Darrell
Phillip Isola
Alexei Efros

Contemporary sensorimotor learning approaches typically start with an existing complex agent (e. g. , a robotic arm), which they learn to control. In contrast, this paper investigates a modular co-evolution strategy: a collection of primitive agents learns to dynamically self-assemble into composite bodies while also learning to coordinate their behavior to control these bodies. Each primitive agent consists of a limb with a motor attached at one end. Limbs may choose to link up to form collectives. When a limb initiates a link-up action and there is another limb nearby, the latter is magnetically connected to the 'parent' limb's motor. This forms a new single agent, which may further link with other agents. In this way, complex morphologies can emerge, controlled by a policy whose architecture is in explicit correspondence with the morphology. We evaluate the performance of these dynamic and modular agents in simulated environments. We demonstrate better generalization to test-time changes both in the environment, as well as in the structure of the agent, compared to static and monolithic baselines. Project videos and source code are provided in the supplementary material.

NeurIPS Conference 2017 Conference Paper

Toward Multimodal Image-to-Image Translation

Jun-Yan Zhu
Richard Zhang
Deepak Pathak
Trevor Darrell
Alexei Efros
Oliver Wang
Eli Shechtman

Many image-to-image translation problems are ambiguous, as a single input image may correspond to multiple possible outputs. In this work, we aim to model a distribution of possible outputs in a conditional generative modeling setting. The ambiguity of the mapping is distilled in a low-dimensional latent vector, which can be randomly sampled at test time. A generator learns to map the given input, combined with this latent code, to the output. We explicitly encourage the connection between output and the latent code to be invertible. This helps prevent a many-to-one mapping from the latent code to the output during training, also known as the problem of mode collapse, and produces more diverse results. We explore several variants of this approach by employing different training objectives, network architectures, and methods of injecting the latent code. Our proposed method encourages bijective consistency between the latent encoding and output modes. We present a systematic comparison of our method and other variants on both perceptual realism and diversity.

NeurIPS Conference 2013 Conference Paper

Mid-level Visual Element Discovery as Discriminative Mode Seeking

Carl Doersch
Abhinav Gupta
Alexei Efros

Recent work on mid-level visual representations aims to capture information at the level of complexity higher than typical visual words", but lower than full-blown semantic objects. Several approaches have been proposed to discover mid-level visual elements, that are both 1) representative, i. e. frequently occurring within a visual dataset, and 2) visually discriminative. However, the current approaches are rather ad hoc and difficult to analyze and evaluate. In this work, we pose visual element discovery as discriminative mode seeking, drawing connections to the the well-known and well-studied mean-shift algorithm. Given a weakly-labeled image collection, our method discovers visually-coherent patch clusters that are maximally discriminative with respect to the labels. One advantage of our formulation is that it requires only a single pass through the data. We also propose the Purity-Coverage plot as a principled way of experimentally analyzing and evaluating different visual discovery approaches, and compare our method against prior work on the Paris Street View dataset. We also evaluate our method on the task of scene classification, demonstrating state-of-the-art performance on the MIT Scene-67 dataset. "

NeurIPS Conference 2004 Conference Paper

Seeing through water

Alexei Efros
Volkan Isler
Jianbo Shi
Mirkó Visontai

We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted im- age patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow com- putations and solves the "leakage" problem inherent in recent manifold embedding techniques. 1 Introduction Consider the following problem. A pool of water is observed by a stationary video camera mounted above the pool and looking straight down. There are waves on the surface of the water and all the camera sees is a series of distorted images of the bottom of the pool, e. g. Figure 1. The aim is to use these images to recover the undistorted image of the pool floor as if the water was perfectly still. Besides obvious applications in ocean optics and underwater imaging [1], variants of this problem also arise in several other fields, including astronomy (overcoming atmospheric distortions) and structure-from-motion (learning the appearance of a deforming object). Most approaches to solve this problem try to model the distortions explicitly. In order to do this, it is critical not only to have a good parametric model of the distortion process, but also to be able to reliably extract features from the data to fit the parameters. As such, this approach is only feasible in well understood, highly controlled domains. On the opposite side of the spectrum is a very simple method used in underwater imaging: simply, average the data temporally. Although this method performs surprisingly well in many situations, it fails when the structure of the target image is too fine with respect to the amplitude of the wave (Figure 2). In this paper we propose to look at this difficult problem from a more statistical angle. We will exploit a very simple observation: if we watch a particular spot on the image plane, most of the time the picture projected there will be distorted. But once in a while, when the water just happens to be locally flat at that point, we will be looking straight down and seeing exactly the right spot on the ground. If we can recognize when this happens Authors in alphabetical order. Figure 1: Fifteen consecutive frames from the video. The experimental setup involved: a transparent bucket of water, the cover of a vision textbook "Computer Vision/A Modern Approach". Figure 2: Ground truth image and reconstruction results using mean and median and snap the right picture at each spatial location, then recovering the desired ground truth image would be simply a matter of stitching these correct observations together. In other words, the question that we will be exploring in this paper is not where to look, but when! 2 Problem setup Let us first examine the physical setup of our problem. There is a "ground truth" image G on the bottom of the pool. Overhead, a stationary camera pointing downwards is recording a video stream V. In the absence of any distortion V (x, y, t) = G(x, y) at any time t. However, the water surface refracts in accordance with Snell's Law. Let us consider what the camera is seeing at a particular point x on the CCD array, as shown in Figure 3(c) (assume 1D for simplicity). If the normal to the water surface directly underneath x is pointing straight up, there is no refraction and V (x) = G(x). However, if the normal is tilted by angle 1, light will bend by the amount 2 = 1 - sin-1 ( 1 sin 1. 33 1 ), so the camera point V (x) will see the light projected from G(x + dx) on the ground plane. It is easy to see that the relationship between the tilt of the normal to the surface 1 and the displacement dx is approximately linear (dx 0. 251h using small angle approximation, where h is the height of the water). This means that, in 2D, what the camera will be seeing over time at point V (x, y, t) are points on the ground plane sampled from a disk centered at G(x, y) and with radius related to the height of the water and the overall roughness of the water surface. A similar relationship holds in the inverse direction as well: a point G(x, y) will be imaged on a disk centered around V (x, y). What about the distribution of these sample points? According to Cox-Munk Law [2], the surface normals of rough water are distributed approximately as a Gaussian centered around the vertical, assuming a large surface area and stationary waves. Our own experiments, conducted by hand-tracking (Figure 3b), confirm that the distribution, though not exactly Gaussian, is definitely unimodal and smooth. Up to now, we only concerned ourselves with infinitesimally small points on the image or the ground plane. However, in practice, we must have something that we can compute with. Therefore, we will make an assumption that the surface of the water can be locally approximated by a planar patch. This means that everything that was true for points is now true for local image patches (up to a small affine distortion). 3 Tracking via embedding From the description outlined above, one possible solution emerges. If the distribution of a particular ground point on the image plane is unimodal, then one could track feature points in the video sequence over time. Computing their mean positions over the entire video will give an estimate of their true positions on the ground plane. Unfortunately, tracking over long periods of time is difficult even under favorable conditions, whereas our data is so fast (undersampled) and noisy that reliable tracking is out of the question (Figure 3(c)). However, since we have a lot of data, we can substitute smoothness in time with smoothness in similarity for a given patch we are more likely to find a patch similar to it somewhere in time, and will have a better chance to track the transition between them. An alternative to tracking the patches directly (which amounts to holding the ground patch G(x, y) fixed and centering the image patches V (x + dxt, y + dyt) on top of it in each frame), is to fix the image patch V (x, y) in space and observe the patches from G(x + dxt, y + dyt) appearing in this window. We know that this set of patches comes from a disk on the ground plane centered around patch G(x, y) our goal. If the disk was small enough compared to the size of the patch, we could just cluster the patches together, e. g. by using translational EM [3]. Unfortunately, the disk can be rather large, containing patches with no overlap at all, thus making only the local similarity comparisons possible. However, notice that our set of patches lies on a low-dimensional manifold; in fact we know precisely which manifold it's the disk on the ground plane centered at G(x, y)! So, if we could use the local patch similarities to find an embedding of the patches in V (x, y, t) on this manifold, the center of the embedding will hold our desired patch G(x, y). The problem of embedding the patches based on local similarity is related to the recent work in manifold learning [4, 5]. Basic ingredients of the embedding algorithms are: defin- ing a distance measure between points, and finding an energy function that optimally places them in the embedding space. The distance can be defined as all-pairs distance matrix, or as distance from a particular reference node. In both cases, we want the distance function to satisfy some constraints to model the underlying physical problem. The local similarity measure for our problem turned out to be particularly unreliable, so none of the previous manifold learning techniques were adequate for our purposes. In the following section we will describe our own, robust method for computing a global distance function and finding the right embedding and eventually the center of it. 1 N Surface h 2 G(x) G(x + dx) (a) (b) (c) Figure 3: (a) Snell's Law (b)-(c) Tracking points of the bottom of the pool: (b) the tracked position forms a distribution close to a Gaussian, (c): a vertical line of the image shown at different time instances (horizontal axis). The discontinuity caused by rapid changes makes the tracking infeasible. 4 What is the right distance function? Let I = {I1, .. ., In} be the set of patches, where It = V (x, y, t) and x = [xmin, xmax], y = [ymin, ymax] are the patch pixel coordinates. Our goal is to find a center patch to represent the set I. To achieve this goal, we need a distance function d: I I IR such that d(Ii, Ij) I = arg min d(Ii, Ij) (1) IiI Ij I Unfortunately, the measurable distance functions, such as Normalized Cross Correlation (N CC) are only local. A common approach is to design a global distance function using the measurable local distances and transitivity [6, 4]. This is equivalent to designing a global distance function of the form: d d(I local(Ii, Ij ), if dlocal(Ii, Ij) i, Ij ) = (2) dtransitive(Ii, Ij), otherwise. where dlocal is a local distance function, is a user-specified threshold and dtransitive is a global, transitive distance function which utilizes dlocal. The underlying assumption here is that the members of I lie on a constraint space (or manifold) S. Hence, a local similarity function such as N CC can be used to measure local distances on the manifold. An important research question in machine learning is to extend the local measurements into global ones, i. e. to design dtransitive above. One method for designing such a transitive distance function is to build a graph G = (V, E) whose vertices correspond to the members of I. The local distance measure is used to place edges which connect only very similar members of I. Afterwards, the length of pairwise shortest paths are used to estimate the true distances on the manifold S. For example, this method forms the basis of the well-known Isomap method [4]. Unfortunately, estimating the distance dtransitive(, ) using shortest path computations is not robust to errors in the local distances which are very common. Consider a patch that contains the letter A and another one that contains the letter B. Since they are different letters, we expect that these patches would be quite distant on the manifold S. However, among the A patches there will inevitably be a very blurry A that would look quite similar to a very blurry B producing an erroneous local distance measurement. When the transitive global distances are computed using shortest paths, a single erroneous edge will single- handedly cause all the A patches to be much closer to all the B patches, short-circuiting the graph and completely distorting all the distances. Such errors lead to the leakage problem in estimating the global distances of patches. This problem is illustrated in Figure 4. In this example, our underlying manifold S is a triangle. Suppose our local distance function erroneously estimates an edge between the corners of the triangle as shown in the figure. After the erroneous edge is inserted, the shortest paths from the top of the triangle leak through this edge. Therefore, the shortest path distances will fail to reflect the true distance on the manifold. 5 Solving the leakage problem Recall that our goal is to find the center of our data set as defined in Equation 1. Note that, in order to compute the center we do not need all pairwise distances. All we need is the quantity dI (Ii) = d(I I i, Ij ) for all Ii. j I The leakage problem occurs when we compute the values dI (Ii) using the shortest path metric. In this case, even a single erroneous edge may reduce the shortest paths from many different patches to Ii changing the value of dI(Ii) drastically. Intuitively, in order to prevent the leakage problem we must prevent edges from getting involved in many shortest path computations to the same node (i. e. leaking edges). We can formalize this notion by casting the computation as a network flow problem. Let G = (V, E) be our graph representation such that for each patch Ii I, there is a vertex vi V. The edge set E is built as follows: there is an edge (vi, vj) if dlocal(Ii, Ij) is less than a threshold. The weight of the edge (vi, vj) is equal to dlocal(Ii, Ij). To compute the value dI (Ii), we build a flow network whose vertex set is also V. All vertices in V - {vi} are sources, pushing unit flow into the network. The vertex vi is a sink with infinite capacity. The arcs of the flow network are chosen using the edge set E. For each edge (vj, vk) E we add the arcs vj vk and vk vj. Both arcs have infinite capacity and the cost of pushing one unit of flow on either arc is equal to the weight of (vj, vk), as shown in Figure 4 left (top and bottom). It can easily be seen that the minimum cost flow in this network is equal to dI (Ii). Let us call this network which is used to compute dI (Ii) as N W (Ii). The crucial factor in designing such a flow network is choosing the right cost and capacity. Computing the minimum cost flow on N W (Ii) not only gives us dI(Ii) but also allows us to compute how many times an edge is involved in the distance computation: the amount of flow through an edge is exactly the number of times that edge is used for the shortest path computations. This is illustrated in Figure 4 (box A) where d1 units of cost is charged for each unit of flow through the edge (u, w). Therefore, if we prevent too much flow going through an edge, we can prevent the leakage problem. d3/ d1/ d Error u 2/c2 w u w d1 v A: Shortest Path B: Convex Flow c d1/c1 1 c1 + c2 u C: Shortest Path with Capacity Error d/ d1/c1 v u w v c1 w Figure 4: The leakage problem. Left: Equivalence of shortest path leakage and uncapacitated flow leakage problem. Bottom-middle: After the erroneous edge is inserted, the shortest paths from the top of the triangle to vertex v go through this edge. Boxes A-C: Alternatives for charging a unit of flow between nodes u and w. The horizontal axis of the plots is the amount of flow and the vertical axis is the cost. Box A: Linear flow. The cost of a unit of flow is d1 Box B: Convex flow. Multiple edges are introduced between two nodes, with fixed capacity, and convexly increasing costs. The cost of a unit of flow increases from d1 to d2 and then to d3 as the amount of flow from u to w increases. Box C: Linear flow with capacity. The cost is d1 until a capacity of c1 is achieved and becomes infinite afterwards. One might think that the leakage problem can simply be avoided by imposing capacity constraints on the arcs of the flow network (Figure 4, box C). Unfortunately, this is not very easy. Observe that in the minimum cost flow solution of the network N W (Ii), the amount of flow on the arcs will increase as the arcs get closer to Ii. Therefore, when we are setting up the network N W (Ii), we must adaptively increase the capacities of arcs "closer" to the sink vi otherwise, there will be no feasible solution. As the structure of the graph G gets complicated, specifying this notion of closeness becomes a subtle issue. Further, the structure of the underlying space S could be such that some arcs in G must indeed carry a lot of flow. Therefore imposing capacities on the arcs requires understanding the underlying structure of the graph G as well as the space S which is in fact the problem we are trying to solve! Our proposed solution to the leakage problem uses the notion of a convex flow. We do not impose a capacity on the arcs. Instead, we impose a convex cost function on the arcs such that the cost of pushing unit flow on arc a increases as the total amount of flow through a increases. See Figure 4, box B. This can be achieved by transforming the network N W (Ii) to a new network N W (Ii). The transformation is achieved by applying the following operation on each arc in N W (Ii): Let a be an arc from u to w in N W (Ii). In N W (Ii), we replace a by k arcs a1, .. ., ak. The costs of these arcs are chosen to be uniformly increasing so that cost(a1) d1x, if 0 x c1 cost(x) = d1c1 + d2(x - c1), if c1 x c2 (3) d1c1 + d2(c2 - c1) + d3(x - c1 - c2), if c2 x The advantage of this convex flow computation is twofold. It does not require putting thresholds on the arcs a-priori. It is always feasible to have as much flow on a single arc as required. However, the minimum cost flow will avoid the leakage problem because it will be costly to use an erroneous edge to carry the flow from many different patches. 5. 1 Fixing the leakage in Isomap As noted earlier, the Isomap method [4] uses the shortest path measurements to estimate a distance matrix M. Afterwards, M is used to find an embedding of the manifold S via MDS. As expected, this method also suffers from the leakage problem as demonstrated in Fig- ure 5. The top-left image in Figure 5 shows our ground truth. In the middle row, we present an embedding of these graphs computed using Isomap which uses the shortest path length as the global distance measure. As illustrated in these figures, even though isomap does a good job in embedding the ground truth when there are no errors, the embedding (or manifold) collapses after we insert the erroneous edges. In contrast, when we use the convex-flow based technique to estimate the distances, we recover the true embedding even in the presence of erroneous edges (Figure 5 bottom row).