Author name cluster

William Freeman

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

16 papers

1 author row

NeurIPS Conference 2007 Conference Paper

Object Recognition by Scene Alignment

Bryan Russell
Antonio Torralba
Ce Liu
Rob Fergus
William Freeman

Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a simple approach: by matching the input im- age, in an appropriate representation, to images in a large training set of labeled images. Due to regularities in object identities across similar scenes, the retrieved matches provide hypotheses for object identities and locations. We build a prob- abilistic model to transfer the labels from the retrieval set to the input image. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database.

NeurIPS Conference 2006 Conference Paper

Analysis of Contour Motions

Ce Liu
William Freeman
Edward Adelson

A reliable motion estimation algorithm must function under a wide range of con- ditions. One regime, which we consider here, is the case of moving objects with contours but no visible texture. Tracking distinctive features such as corners can disambiguate the motion of contours, but spurious features such as T-junctions can be badly misleading. It is difﬁcult to determine the reliability of motion from local measurements, since a full rank covariance matrix can result from both real and spurious features. We propose a novel approach that avoids these points al- together, and derives global motion estimates by utilizing information from three levels of contour analysis: edgelets, boundary fragments and contours. Boundary fragment are chains of orientated edgelets, for which we derive motion estimates from local evidence. The uncertainties of the local estimates are disambiguated after the boundary fragments are properly grouped into contours. The grouping is done by constructing a graphical model and marginalizing it using importance sampling. We propose two equivalent representations in this graphical model, re- versible switch variables attached to the ends of fragments and fragment chains, to capture both local and global statistics of boundaries. Our system is success- fully applied to both synthetic and real video sequences containing high-contrast boundaries and textureless regions. The system produces good motion estimates along with properly grouped and completed contours.

NeurIPS Conference 2005 Conference Paper

Describing Visual Scenes using Transformed Dirichlet Processes

Antonio Torralba
Alan Willsky
Erik Sudderth
William Freeman

Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an objectcentered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP's inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images.

NeurIPS Conference 2004 Conference Paper

Contextual Models for Object Detection Using Boosted Random Fields

Antonio Torralba
Kevin Murphy
William Freeman

We seek to both detect and segment objects in images. To exploit both lo- cal image data as well as contextual information, we introduce Boosted Random Fields (BRFs), which uses Boosting to learn the graph struc- ture and local evidence of a conditional random field (CRF). The graph structure is learned by assembling graph fragments in an additive model. The connections between individual pixels are not very informative, but by using dense graphs, we can pool information from large regions of the image; dense models also support efficient inference. We show how contextual information from other objects can improve detection perfor- mance, both in terms of accuracy and speed, by using a computational cascade. We apply our system to detect stuff and things in office and street scenes. 1 Introduction Our long-term goal is to build a vision system that can examine an image and describe what objects are in it, and where. In many images, such as Fig. 5(a), objects of interest, such as the keyboard or mouse, are so small that they are impossible to detect just by using local features. Seeing a blob next to a keyboard, humans can infer it is likely to be a mouse; we want to give a computer the same abilities. There are several pieces of related work. Murphy et al [9] used global scene context to help object recognition, but did not model relationships between objects. Fink and Perona [4] exploited local dependencies in a boosting framework, but did not allow for multiple rounds of communication between correlated objects. He et al [6] do not model connections between objects directly, but rather they induce such correlations indirectly, via a bank of hidden variables, using a "restricted Boltzmann machine" architecture. In this paper, we exploit contextual correlations between the object classes by introducing Boosted Random Fields (BRFs). Boosted random fields build on both boosting [5, 10] and conditional random fields (CRFs) [8, 7, 6]. Boosting is a simple way of sequentially constructing "strong" classifiers from "weak" components, and has been used for single- class object detection with great success [12]. Dietterich et al [3] combine boosting and 1D CRFs, but they only consider the problem of learning the local evidence potentials; we consider the much harder problem of learning the structure of a 2D CRF. Standard applications of MRFs/ CRFs to images [7] assume a 4-nearest neighbor grid structure. While successful in low-level vision, this structure will fail in capturing im- portant long distance dependencies between whole regions and across classes. We propose a method for learning densely connected random fields with long range connections. The topology of these connections is chosen by a weak learner which has access to a library of graph fragments, derived from patches of labeled training images, which reflect typical spatial arrangments of objects (similar to the segmentation fragments in [2]). At each round of the learning algorithm, we add more connections from other locations in the image and from other classes (detectors). The connections are assumed to be spatially invariant, which means this update can be performed using convolution followed by a sigmoid nonlinearity. The resulting architecture is similar to a convolutional neural network, although we used a stagewise training procedure, which is much faster than back propagation. In addition to recognizing things, such as cars and people, we are also interested in recog- nizing spatially extended "stuff" [1], such as roads and buildings. The traditional sliding window approach to object detection does not work well for detecting "stuff". Instead, we combine object detection and image segmentation (c. f. , [2]) by labeling every pixel in the image. We do not rely on a bottom-up image segmentation algorithm, which can be fragile without top-down guidance. 2 Learning potentials and graph structure A conditional random field (CRF) is a distribution of the form 1 P (S|x) = Z i(Si) i, j (Si, Sj ) i jNi where x is the input (e. g. , image), Ni are the neighbors of node i, and Si are labels. We have assumed pairwise potentials for notational simplicity. Our goal is to learn the local evidence potentials, i, the compatibility potentials, and the set of neighbors Ni. We propose the following simple approximation: use belief propagation (BP) to estimate the marginals, P (Si|x), and then use boosting to maximize the likelihood of each node's training data with respect to i and. In more detail, the algorithm is as follows. At iteration t, the goal is to minimize the negative log-likelihood of the training data. As in [11], we consider the per-label loss (i. e. , we use marginal probabilities), as opposed to requiring that the joint labeling be correct (as in Viterbi decoding). Hence the cost function to be minimized is Jt = Jti = - bti, m(Si, m) = - bti, m(+1)Si, mbti, m(-1)1-Si, m (1) i m i m i where Si, m {-1, +1} is the true label for pixel i in training case m, Si, m = (Si, m + 1)/2 {0, 1} is just a relabeling, and bti, m = [P (Si = -1|xm, t), P (Si = 1|xm, t)] is the belief state at node i given input image xm after t iterations of the algorithm. The belief at node i is given by the following (dropping the dependence on case m) bti(1) ti(1) Mti(1) where Mti is the product of all the messages coming into i from all its neighbors at time t and where the message that k sends to i is given by bt (s M t+1(1) = t+1 (1) t+1 (1) = k k) i (2) ki ki k, i(sk, 1) t (sk) kN ik i sk{-1, +1} where k, i is the compatility between nodes k and i. If we assume that the local potentials have the form t /2 /2 i(si) = [eF t i; e-F ti ], where F ti is some function of the input data, then: bti(+1) = (F ti + Gti), Gti = log Mti(+1) - log Mti(-1) (3) where (u) = 1/(1 + e-u) is the sigmoid function. Hence each term in Eq. 1 simplifies to a cost function similar to that used in boosting: log Jt +Gt ) i, m i = log 1 + e-Si, m(F ti, m. (4) m 1. Input: a set of labeled pairs {xi, m; Si, m}, bound T Output: Local evidence functions f ti(x) and message update functions gti(bN ). i 2. Initialize: bt=0 i, m = 0; F t=0 i, m = 0; Gt=0 i, m = 0 3. For t=1. .T. (a) Fit local potential fi(xi, m) by weighted LS to Y t +Gt ) i, m i, m = Si, m(1 + e-Si, m(F t i ) (b). Fit compatibilities gti(bt-1 ) to Y t N i, m by weighted LS. i, m (c) Compute local potential F t i, m = F t-1 + f t i, m i (xi, m) (d) Compute compatibilities Gti, m = t gn ) n=1 i (bt-1 Ni, m (e) Update the beliefs bti, m = (F ti, m + Gti, m) (f) Update weights wt+1 = bt i, m i, m(-1) bt i, m(+1) Figure 1: BRF training algorithm. We assume that the graph is very densely connected so that the information that one single node sends to another is so small that we can make the approximation t+1 (+1)/ t+1 (-1) 1. (This is a reasonable approximation in the case of images, ki ki where each node represents a single pixel; only when the influence of many pixels is taken into account will the messages become informative. ) Hence bt (s k, m k ) M t+1(+1) s k, i(sk, +1) t (s Gt+1 = log i = log k [-1, +1] i k ) k i (5) M t+1(-1) bt (sk) i k, m k s k, i(sk, -1) k [-1, +1] t (s i k ) k k, i(sk, +1) bt (s k, m k) log sk[-1, +1] (6) k, i(sk, -1) bt (sk) k sk[-1, +1] k, m With this simplification, Gt+1 (bt i is now a non-linear function of the beliefs Gt+1 i m) at iteration t. Therefore, We can write the beliefs at iteration t as a function of the local evidences and the beliefs at time t - 1: bti(+1) = (F ti(xi, m) + Gti(bt-1 m )). The key idea behind BRFs is to use boosting to learn the G functions, which approximately implement message passing in densely connected graphs. We explain this in more detail below. 2. 1 Learning local evidence potentials Defining F ti(xi, m) = F t-1(x i i, m) + f t i (xi, m) as an additive model, where xi, m are the features of training sample m at node i, we can learn this function in a stagewise fashion by optimizing the second order Taylor expansion of Eq. 4 wrt f ti, as in logitBoost [5]: arg min log Jti arg min wti, m(Y ti, m - fti(xi, m))2 (7) f t f t i i m where Y t +Gt ) i, m i, m = Si, m(1+e-Si, m(F t i ). In the case that the weak learner is a "regression stump", fi(x) = ah(x)+b, we can find the optimal a, b by solving a weighted least squares problem, with weights wti, m = bti(-1) bti(+1); we can find the best basis function h(x) by searching over all elements of a dictionary. 2. 2 Learning compatibility potentials and graph structure In this section, we discuss how to learn the compatibility functions ij, and hence the structure of the graph. Instead of learning the compatibility functions ij, we propose to 1. Input: a set of inputs {xi, m} and functions f ti, gti Output: Set of beliefs bi, m and MAP estimates Si, m. 2. Initialize: bt=0 i, m = 0; F t=0 i, m = 0; Gt=0 i, m = 0 3. From t = 1 to T, repeat (a) Update local evidences F t i, m = F t-1 + f t i, m i (xi, m) (b) Update compatibilities Gti, m = t gn ) n=1 i (bt-1 Ni, m (c) Compute current beliefs bti, m = (F ti, m + Gti, m) 4. Output classification is Si, m = bti, m > 0. 5 Figure 2: BRF run-time inference algorithm. learn directly the function Gt+1 i. We propose to use an additive model for Gt+1 i as we did for learning F: Gt+1 = t gn i, m n=1 i (btm), where btm is a vector with the beliefs of all nodes in the graph at iteration t for the training sample m. The weak learners gn i (btm) can be regression stumps with the form gn i (btm) = a(w btm > ) + b, where a, b, are the parameters of the regression stump, and wi is a set of weights selected from a dictionary. In the case of a graph with weak and almost symmetrical connections (which holds if (s1, s2) 1, for all (s1, s2), which implies the messages are not very informative) we can further simplify the function Gt+1 i by approximating it as a linear function of the beliefs: Gt+1 = i, m k, i btk, m(+1) + k, i (8) kNi This step reduces the computational cost. The weak learners gn i (btm) will also be linear functions. Hence the belief update simplifies to bt+1(+1) = ( i, m i btm + i + F t i, m), which is similar to the mean-field update equations. The neighborhood Ni over which we sum incoming messages is determined by the graph structure, which is encoded in the non-zero values of i. Each weak learner gn i will compute a weighted combination of the beliefs of the some subset of the nodes; this subset may change from iteration to iteration, and can be quite large. At iteration t, we choose the weak learner gti so as to minimize t-1 log Jt +gt(bt-1)+ gn(bt-1)) i m i m i (bt-1) = - log 1 + e-Si, m(F ti, m n=1 m which reduces to a weighted least squares problem similar to Eq. 7. See Fig. 1 for the pseudo-code for the complete learning algorithm, and Fig. 2 for the pseudo-code for run- time inference. 3 BRFs for multiclass object detection and segmentation With the BRF training algorithm in hand, we describe our approach for multiclass object detection and region-labeling using densely connected BRFs. 3. 1 Weak learners for detecting stuff and things The square sliding window approach does not provide a natural way of working with irreg- ular objects. Using region labeling as an image representation allows dealing with irregular and extended objects (buildings, bookshelf, road, .. .). Extended stuff [1] may be a very important source of contextual information for other objects. (a) Examples from the dictionary of about 2000 patches and masks, Ux, y, Vx, y. (b) Examples from the dictionary of 30 graphs, Wx, y, c. f t=0 f t=1 f t=2 F S + +. .. = put thu utO Tr (c) Example feedforward segmentation for screens. Figure 3: Examples of patches from the dictionary and an example of the segmentation obtained using boosting trained with patches from (a). The weak learners we use for the local evidence potentials are based on the segmentation fragments proposed in [2]. Specifically, we create a dictionary of about 2000 image patches U, chosen at random (but overlapping each object), plus a corresponding set of binary (in- class/ out-of-class) image masks, V: see Fig. 3(a). At each round t, for each class c, and for each dictionary entry, we construct the following weak learner, whose output is a binary matrix of the same size as the image I: v(I) = ((I U ) > ) V > 0 (9) where represents normalized cross-correlation and represents convolution. The in- tuition behind this is that I U will produce peaks at image locations that contain this patch/template, and then convolving with V will superimpose the segmentation mask on top of the peaks. As a function of the threshold, the feature will behave more as a template detector ( 1) or as a texture descriptor ( car car building car road car Road F b=(F+G) Car car building building building road building Building x G car road building road road road y c) A car out of context a) Incoming messages (outside 3rd floor windows) to a car node. b) Compatibilities (W'). is less of a car. t=1 t=2 t=4 t=20 t=40 Final labeling b(car) S(all) d) Evolution of the beliefs for the car nodes (b) and labeling (S) for road, building, car. Figure 4: Street scene. The BRF is trained to detect cars, buildings and the road. In Fig. 4(a-b), we show the structures of the graph and the weights W defined by GT for a BRF trained to detect cars, buildings and roads in street scenes. 3. 2 Learning and inference For training we used a labeled dataset of office and street scenes with about 100 images in each set. During the training, in the first 5 rounds we only update the local potentials, to allow local evidence to accrue. After the 5th iteration we start updating also the compatibil- ity functions. At each round, we update only the local potential and compatibility function associated with a single object class that reduces the most the multiclass cost. This allows objects that need many features to have more complicated local potentials. The algorithm learns to first detect easy (and large) objects, since these reduce the error of all classes the fastest. The easy-to-detect objects can then pass information to the harder ones. For instance, in office scenes, the system first detects screens, then keyboards, and finally computer mice. Fig. 5 illustrates this behavior on the test set. A similar behavior is obtained for the car detector (Fig. 4(d)). The detection of building and road provides strong constraints for the locations of the car. 3. 3 Cascade of classifiers with BRFs The BRF can be turned into a cascade [12] by thresholding the beliefs. Computations can then be reduced by doing the convolutions (required for computing f and g) only in pixels that are still candidates for the presence of the target. At each round we update a binary rejection mask for each object class, Rtx, y, c, by thresholding the beliefs at round t: Rtx, y, c = Rt-1 x, y, c (btx, y, c > tc). A pixel in the rejection mask is set to zero when we can decide that the object is not present (when btx, y, c is below the threshold tc 0), and it is set to 1 when more processing is required. The threshold tc is chosen so that the percentage of missed detections is below a predefined level (we use 1%). Similarity we can define a detection mask that will indicate pixels in which we decide the object is present. The mask is then used for computing the features v(I) and messages G by applying the convolutions only on the pixels not yet classified. We can denote those operators as R and R. This Input image screen mouse Ground truth Output labeling keyboard t=5 t=10 t=15 t=25 t=50 b (screen) b (screen) b (screen) b (screen) b (screen) F G b (keyboard) b (keyboard) b (keyboard) b (keyboard) b (keyboard) F G b (mouse) b (mouse) b (mouse) b (mouse) b (mouse) F G 1 ROC Screen Boosting BRF Mouse a under Keyboard re Iteration (t) A 0. 5 t=0 t=20 t=50 Figure 5: Top. In this desk scene, it is easy to identify objects like the screen, keyboard and mouse, even though the local information is sometimes insufficient. Middle: the evolution of the beliefs (b and F and G) during detection for a test image. Bottom. The graph bellow shows the average evolution of the area under the ROC for the three objects on 120 test images. results in a more efficient classifier with only a slight decrease of performance. In Fig. 6 we compare the reduction of the search space when implementing a cascade using independent boosting (which reduces to Viola and Jones [12]), and when using BRF's. We see that for objects for which context is the main source of information, like the mouse, the reduction in search space is much more dramatic using BRFs than using boosting alone. 4 Conclusion The proposed BRF algorithm combines boosting and CRF's, providing an algorithm that is easy for both training and inference. We have demonstrated object detection in cluttered scenes by exploiting contextual relationships between objects. The BRF algorithm is com- putationally efficient and provides a natural extension of the cascade of classifiers by inte- grating evidence from other objects in order to quickly reject certain image regions. The BRF's densely connected graphs, which efficiently collect information over large image regions, provide an alternative framework to nearest-neighbor grids for vision problems.

NeurIPS Conference 2004 Conference Paper

Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

Erik Sudderth
Michael Mandel
William Freeman
Alan Willsky

We describe a threedimensional geometric hand model suitable for vi- sual tracking applications. The kinematic constraints implied by the model's joints have a probabilistic structure which is well described by a graphical model. Inference in this model is complicated by the hand's many degrees of freedom, as well as multimodal likelihoods caused by ambiguous image measurements. We use nonparametric belief propaga- tion (NBP) to develop a tracking algorithm which exploits the graph's structure to control complexity, while avoiding costly discretization. While kinematic constraints naturally have a local structure, self occlusions created by the imaging process lead to complex interpenden- cies in color and edgebased likelihood functions. However, we show that local structure may be recovered by introducing binary hidden vari- ables describing the occlusion state of each pixel. We augment the NBP algorithm to infer these occlusion variables in a distributed fashion, and then analytically marginalize over them to produce hand position esti- mates which properly account for occlusion events. We provide simula- tions showing that NBP may be used to refine inaccurate model initializa- tions, as well as track hand motion through extended image sequences. 1 Introduction Accurate visual detection and tracking of threedimensional articulated objects is a chal- lenging problem with applications in humancomputer interfaces, motion capture, and scene understanding [1]. In this paper, we develop a probabilistic method for tracking a geometric hand model from monocular image sequences. Because articulated hand mod- els have many (roughly 26) degrees of freedom, exact representation of the posterior dis- tribution over model configurations is intractable. Trackers based on extended and un- scented Kalman filters [2, 3] have difficulties with the multimodal uncertainties produced by ambiguous image evidence. This has motived many researchers to consider nonparamet- ric representations, including particle filters [4, 5] and deterministic multiscale discretiza- tions [6]. However, the hand's high dimensionality can cause these trackers to suffer catas- trophic failures, requiring the use of models which limit the hand's motion [4] or sophisti- cated prior models of hand configurations and dynamics [5, 6]. An alternative way to address the high dimensionality of articulated tracking problems is to describe the posterior distribution's statistical structure using a graphical model. Graph- Figure 1: Projected edges (left block) and silhouettes (right block) for a configuration of the 3D structural hand model matching the given image. To aid visualization, the model is also projected following rotations by 35 (center) and 70 (right) about the vertical axis. ical models have been used to track viewbased human body representations [7], con- tour models of restricted hand configurations [8], viewbased 2. 5D "cardboard" models of hands and people [9], and a full 3D kinematic human body model [10]. Because the variables in these graphical models are continuous, and discretization is intractable for threedimensional models, most traditional graphical inference algorithms are inapplica- ble. Instead, these trackers are based on recently proposed extensions of particle filters to general graphs: mean field Monte Carlo in [9], and nonparametric belief propagation (NBP) [11, 12] in [10]. In this paper, we show that NBP may be used to track a threedimensional geometric model of the hand. To derive a graphical model for the tracking problem, we consider a redun- dant local representation in which each hand component is described by its own three dimensional position and orientation. We show that the model's kinematic constraints, including selfintersection constraints not captured by joint angle representations, take a simple form in this local representation. We also provide a local decomposition of the likelihood function which properly handles occlusion in a distributed fashion, a significant improvement over our earlier tracking results [13]. We conclude with simulations demon- strating our algorithm's robustness to occlusions. 2 Geometric Hand Modeling Structurally, the hand is composed of sixteen approximately rigid components: three pha- langes or links for each finger and thumb, as well as the palm [1]. As proposed by [2, 3], we model each rigid body by one or more truncated quadrics (ellipsoids, cones, and cylin- ders) of fixed size. These geometric primitives are well matched to the true geometry of the hand, allow tracking from arbitrary orientations (in contrast to 2. 5D "cardboard" mod- els [5, 9]), and permit efficient computation of projected boundaries and silhouettes [3]. Figure 1 shows the edges and silhouettes corresponding to a sample hand model configu- ration. Note that only a coarse model of the hand's geometry is necessary for tracking. 2. 1 Kinematic Representation and Constraints The kinematic constraints between different hand model components are well described by revolute joints [1]. Figure 2(a) shows a graph describing this kinematic structure, in which nodes correspond to rigid bodies and edges to joints. The two joints connecting the phalanges of each finger and thumb have a single rotational degree of freedom, while the joints connecting the base of each finger to the palm have two degrees of freedom (cor- responding to grasping and spreading motions). These twenty angles, combined with the palm's global position and orientation, provide 26 degrees of freedom. Forward kinematic transformations may be used to determine the finger positions corresponding to a given set of joint angles. While most modelbased hand trackers use this joint angle parameteriza- tion, we instead explore a redundant representation in which the ith rigid body is described by its position qi and orientation ri (a unit quaternion). Let xi = (qi, ri) denote this local description of each component, and x = {x1, .. ., x16} the overall hand configuration. Clearly, there are dependencies among the elements of x implied by the kinematic con- (a) (b) (c) (d) Figure 2: Graphs describing the hand model's constraints. (a) Kinematic constraints (EK ) de- rived from revolute joints. (b) Structural constraints (ES) preventing 3D component intersections. (c) Dynamics relating two consecutive time steps. (d) Occlusion consistency constraints (EO). straints. Let EK be the set of all pairs of rigid bodies which are connected by joints, or equivalently the edges in the kinematic graph of Fig. 2(a). For each joint (i, j) EK, define an indicator function K (x i, j i, xj ) which is equal to one if the pair (xi, xj ) are valid rigid body configurations associated with some setting of the angles of joint (i, j), and zero otherwise. Viewing the component configurations xi as random variables, the following prior explicitly enforces all constraints implied by the original joint angle representation: pK(x) K (x i, j i, xj ) (1) (i, j)EK Equation (1) shows that pK (x) is an undirected graphical model, whose Markov structure is described by the graph representing the hand's kinematic structure (Fig. 2(a)). 2. 2 Structural and Temporal Constraints In reality, the hand's joint angles are coupled because different fingers can never occupy the same physical volume. This constraint is complex in a joint angle parameterization, but simple in our local representation: the position and orientation of every pair of rigid bodies must be such that their component quadric surfaces do not intersect. We approximate this ideal constraint in two ways. First, we only explicitly constrain those pairs of rigid bodies which are most likely to intersect, corresponding to the edges ES of the graph in Fig. 2(b). Furthermore, because the relative orientations of each finger's quadrics are implicitly constrained by the kinematic prior pK (x), we may detect most intersections based on the distance between object centroids. The structural prior is then given by 1 ||q p i - qj || > i, j S (x) S (x (x i, j i, xj ) S i, j i, xj ) = (2) 0 otherwise (i, j)ES where i, j is determined from the quadrics composing rigid bodies i and j. Empirically, we find that this constraint helps prevent different fingers from tracking the same image data. In order to track hand motion, we must model the hand's dynamics. Let xt denote the i position and orientation of the ith hand component at time t, and xt = {xt1, .. ., xt16}. For each component at time t, our dynamical model adds a Gaussian potential connecting it to the corresponding component at the previous time step (see Fig. 2(c)): 16 pT xt | xt-1 = N xt - xt-1; 0, i i i (3) i=1 Although this temporal model is factorized, the kinematic constraints at the following time step implicitly couple the corresponding random walks. These dynamics can be justified as the maximum entropy model given observations of the nodes' marginal variances i. 3 Observation Model Skin colored pixels have predictable statistics, which we model using a histogram distribu- tion pskin estimated from training patches [14]. Images without people were used to create a histogram model pbkgd of nonskin pixels. Let (x) denote the silhouette of projected hand configuration x. Then, assuming pixels are independent, an image y has likelihood p p skin(u) C (y | x) = pskin(u) pbkgd(v) (4) pbkgd(u) u(x) v(x) u(x) The final expression neglects the proportionality constant p v bkgd(v), which is inde- pendent of x, and thereby limits computation to the silhouette region [8]. 3. 1 Distributed Occlusion Reasoning In configurations where there is no selfocclusion, pC (y | x) decomposes as a product of local likelihood terms involving the projections (xi) of individual hand components [13]. To allow a similar decomposition (and hence distributed inference) when there is occlu- sion, we augment the configuration xi of each node with a set of binary hidden variables zi = {zi } = 0 if pixel u in the projection of rigid body i is occluded (u) u. Letting zi(u) by any other body, and 1 otherwise, the color likelihood (eq. (4)) may be rewritten as 16 p z 16 i(u) p skin(u) C (y | x, z) = = p p C (y | xi, zi) (5) bkgd(u) i=1 u(xi) i=1 Assuming they are set consistently with the hand configuration x, the hidden occlusion variables z ensure that the likelihood of each pixel in (x) is counted exactly once. We may enforce consistency of the occlusion variables using the following function: 0 if x = 1 (x j occludes xi, u (xj ), and zi(u) j, zi; x (6) (u) i) = 1 otherwise Note that because our rigid bodies are convex and nonintersecting, they can never take mutually occluding configurations. The constraint (xj, zi; x (u) i) is zero precisely when pixel u in the projection of xi should be occluded by xj, but zi is in the unoccluded state. (u) The following potential encodes all of the occlusion relationships between nodes i and j: O (x (x; x; x i, j i, zi, xj, zj ) = j, zi(u) i) (xi, zj(u) j ) (7) u These occlusion constraints exist between all pairs of nodes. As with the structural prior, we enforce only those pairs EO (see Fig. 2(d)) most prone to xj occlusion: pO(x, z) O (x i, j i, zi, xj, zj ) (8) (i, j)EO z y i(u) xi Figure 3 shows a factor graph for the occlusion relationships between xi and its neighbors, as well as the observation potential pC (y | xi, zi). x u k The occlusion potential (xj, zi; x (u) i) has a very Figure 3: Factor graph showing weak dependence on xi, depending only on p(y | xi, zi), and the occlusion con- whether xi is behind xj relative to the camera. straints placed on xi by xj, xk. Dashed lines denote weak dependencies. The 3. 2 Modeling Edge Filter Responses plate is replicated once per pixel. Edges provide another important hand tracking cue. Using boundaries labeled in training images, we estimated a histogram pon of the response of a derivative of Gaussian filter steered to the edge's orientation [8, 10]. A similar histogram poff was estimated for filter outputs at randomly chosen locations. Let (x) denote the oriented edges in the projection of model configuration x. Then, again assuming pixel independence, image y has edge likelihood p 16 p z 16 i(u) p on(u) on(u) E (y | x, z) = = p p E (y | xi, zi) (9) off (u) poff(u) u(x) i=1 u(xi) i=1 where we have used the same occlusion variables z to allow a local decomposition. 4 Nonparametric Belief Propagation Over the previous sections, we have shown that a redundant, local representation of the geometric hand model's configuration xt allows p (xt | yt), the posterior distribution of the hand model at time t given image observations yt, to be written as 16 p xt | yt pK(xt)pS(xt)pO(xt, zt) pC(yt | xt, zt)p, zt) i i E (yt | xti i (10) zt i=1 The summation marginalizes over the hidden occlusion variables zt, which were needed to locally decompose the edge and color likelihoods. When video frames are observed, the overall posterior distribution is given by p (x | y) p xt | yt pT (xt | xt-1) (11) t=1 Excluding the potentials involving occlusion variables, which we discuss in detail in Sec. 4. 2, eq. (11) is an example of a pairwise Markov random field: p (x | y) i, j (xi, xj) i (xi, y) (12) (i, j)E iV Hand tracking can thus be posed as inference in a graphical model, a problem we propose to solve using belief propagation (BP) [15]. At each BP iteration, some node i V calculates a message m (x ij j ) to be sent to a neighbor j (i) {j | (i, j) E}: mn (x mn-1 (x ij j ) j, i (xj, xi) i (xi, y) ki i) dxi (13) xi k(i)\j At any iteration, each node can produce an approximation ^ p(xi | y) to the marginal distri- bution p (xi | y) by combining the incoming messages with the local observation: ^ pn(xi | y) i (xi, yi) mn (x ji i) (14) j(i) For treestructured graphs, the beliefs ^ pn(xi | y) will converge to the true marginals p (xi | y). On graphs with cycles, BP is approximate but often highly accurate [15]. 4. 1 Nonparametric Representations For the hand tracking problem, the rigid body configurations xi are sixdimensional con- tinuous variables, making accurate discretization intractable. Instead, we employ nonpara- metric, particlebased approximations to these messages using the nonparametric belief propagation (NBP) algorithm [11, 12]. In NBP, each message is represented using either a samplebased density estimate (a mixture of Gaussians) or an analytic function. Both types of messages are needed for hand tracking, as we discuss below. Each NBP message update involves two stages: sampling from the estimated marginal, followed by Monte Carlo ap- proximation of the outgoing message. For the general form of these updates, see [11]; the following sections focus on the details of the hand tracking implementation. The hand tracking application is complicated by the fact that the orientation component ri of xi = (qi, ri) is an element of the rotation group SO(3). Following [10], we represent orientations as unit quaternions, and use a linearized approximation when constructing den- sity estimates, projecting samples back to the unit sphere as necessary. This approximation is most appropriate for densities with tightly concentrated rotational components. 4. 2 Marginal Computation BP's estimate of the belief ^ p(xi | y) is equal to the product of the incoming messages from neighboring nodes with the local observation potential (see eq. (14)). NBP approximates this product using importance sampling, as detailed in [13] for cases where there is no selfocclusion. First, M samples are drawn from the product of the incoming kinematic and temporal messages, which are Gaussian mixtures. We use a recently proposed multi- scale Gibbs sampler [16] to efficiently draw accurate (albeit approximate) samples, while avoiding the exponential cost associated with direct sampling (a product of d M Gaussian mixtures contains M d Gaussians). Following normalization of the rotational component, each sample is assigned a weight equal to the product of the color and edge likelihoods with any structural messages. Finally, the computationally efficient "rule of thumb" heuris- tic [17] is used to set the bandwidth of Gaussian kernels placed around each sample. To derive BP updates for the occlusion masks zi, we first cluster (xi, zi) for each hand component so that p (xt, zt | yt) has a pairwise form (as in eq. (12)). In principle, NBP could manage occlusion constraints by sampling candidate occlusion masks zi along with rigid body configurations xi. However, due to the exponentially large number of possible occlusion masks, we employ a more efficient analytic approximation. Consider the BP message sent from xj to (zi, xi), calculated by applying eq. (13) to the occlusion potential (x; x u j, zi(u) i). We assume that ^ p(xj | y) is well separated from any candidate xi, a situation typically ensured by the kinematic and structural constraints. The occlusion constraint's weak dependence on xi (see Fig. 3) then separates the message computation into two cases. If xi lies in front of typical xj configurations, the BP message j, i(u)(zi ) is uninformative. If x (u) i is occluded, the message approximately equals j, i(u)(zi = 0) = 1 = 1) = 1 - Pr [u (x (u) j, i(u)(zi(u) j )] (15) where we have neglected correlations among pixel occlusion states, and where the prob- ability is computed with respect to ^ p(xj | y). By taking the product of these messages k, i(u)(zi ) from all potential occluders x (u) k and normalizing, we may determine an ap- proximation to the marginal occlusion probability i Pr[z = 0]. (u) i(u) Because the color likelihood pC (y | xi, zi) factorizes across pixels u, the BP approximation to pC (y | xi) may be written in terms of these marginal occlusion probabilites: p p skin(u) C (y | xi) i + (1 - ) (16) (u) i(u) pbkgd(u) u(xi) Intuitively, this equation downweights the color evidence at pixel u as the probability of that pixel's occlusion increases. The edge likelihood pE(y | xi) averages over zi similarly. The NBP estimate of ^ p(xi | y) is determined by sampling configurations of xi as before, and reweighting them using these occlusionsensitive likelihood functions. 4. 3 Message Propagation To derive the propagation rule for nonocclusion edges, as suggested by [18] we rewrite the message update equation (13) in terms of the marginal distribution ^ p(xi | y): ^ pn-1(x mn (x i | y) dx ij j ) = j, i (xj, xi) i (17) x mn-1 (x i ji i) Our explicit use of the current marginal estimate ^ pn-1(xi | y) helps focus the Monte Carlo approximation on the most important regions of the state space. Note that messages sent 1 2 1 2 Figure 4: Refinement of a coarse initialization following one and two NBP iterations, both without (left) and with (right) occlusion reasoning. Each plot shows the projection of the five most significant modes of the estimated marginal distributions. Note the difference in middle finger estimates. along kinematic, structural, and temporal edges depend only on the belief ^ p(xi | y) follow- ing marginalization over occlusion variables zi. Details and pseudocode for the message propagation step are provided in [13]. For kine- matic constraints, we sample uniformly among permissable joint angles, and then use forward kinematics to propagate samples from ^ pn-1(xi | y) /mn-1 (x ji i) to hypothesized configurations of xj. Following [12], temporal messages are determined by adjusting the bandwidths of the current marginal estimate ^ p(xi | y) to match the temporal covariance i. Because structural potentials (eq. (2)) equal one for all state configurations outside some ball, the ideal structural messages are not finitely integrable. We therefore approximate the structural message m (x ij j ) as an analytic function equal to the weights of all kernels in ^ p(xi | y) outside a ball centered at qj, the position of xj.

NeurIPS Conference 2003 Conference Paper

Efficient Multiscale Sampling from Products of Gaussian Mixtures

Alexander Ihler
Erik Sudderth
William Freeman
Alan Willsky

The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of prod- uct of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The ﬁrst is a multiscale variant of pre- viously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter (cid: 15) of their true probability. We compare both multi- scale samplers on a set of computational examples motivated by NBP, demonstrating signiﬁcant improvements over existing methods.

NeurIPS Conference 2003 Conference Paper

Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

Kevin Murphy
Antonio Torralba
William Freeman

Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random ﬁeld for jointly solving the tasks of object detection and scene classiﬁcation.

NeurIPS Conference 2002 Conference Paper

Recovering Intrinsic Images from a Single Image

Marshall Tappen
William Freeman
Edward Adelson

We present an algorithm that uses multiple cues to recover shading and reﬂectance intrinsic images from a single image. Using both color in- formation and a classiﬁer trained to recognize gray-scale patterns, each image derivative is classiﬁed as being caused by shading or a change in the surface’s reﬂectance. Generalized Belief Propagation is then used to propagate information from areas where the correct classiﬁcation is clear to areas where it is ambiguous. We also show results on real images.

NeurIPS Conference 2002 Conference Paper

Shape Recipes: Scene Representations that Refer to the Image

William Freeman
Antonio Torralba

The goal of low-level vision is to estimate an underlying scene, given an observed image. Real-world scenes (eg, albedos or shapes) can be very complex, conventionally requiring high dimensional representations which are hard to estimate and store. We propose a low-dimensional rep- resentation, called a scene recipe, that relies on the image itself to de- scribe the complex scene conﬁgurations. Shape recipes are an example: these are the regression coefﬁcients that predict the bandpassed shape from image data. We describe the beneﬁts of this representation, and show two uses illustrating their properties: (1) we improve stereo shape estimates by learning shape recipes at low resolution and applying them at full resolution; (2) Shape recipes implicitly contain information about lighting and materials and we use them for material segmentation.

NeurIPS Conference 2000 Conference Paper

Generalized Belief Propagation

Jonathan Yedidia
William Freeman
Yair Weiss

Belief propagation (BP) was only supposed to work for tree-like networks but works surprisingly well in many applications involving networks with loops, including turbo codes. However, there has been little understanding of the algorithm or the nature of the solutions it finds for general graphs. We show that BP can only converge to a stationary point of an approximate free energy, known as the Bethe free energy in statis(cid: 173) tical physics. This result characterizes BP fixed-points and makes connections with variational approaches to approximate inference. More importantly, our analysis lets us build on the progress made in statistical physics since Bethe's approximation was introduced in 1935. Kikuchi and others have shown how to construct more ac(cid: 173) curate free energy approximations, of which Bethe's approximation is the simplest. Exploiting the insights from our analysis, we de(cid: 173) rive generalized belief propagation (GBP) versions ofthese Kikuchi approximations. These new message passing algorithms can be significantly more accurate than ordinary BP, at an adjustable in(cid: 173) crease in complexity. We illustrate such a new GBP algorithm on a grid Markov network and show that it gives much more accurate marginal probabilities than those found using ordinary BP.

NeurIPS Conference 2000 Conference Paper

Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

John Fisher III
Trevor Darrell
William Freeman
Paul Viola

People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low(cid: 173) level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sam(cid: 173) pling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the com(cid: 173) plex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic rela(cid: 173) tionships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.

NeurIPS Conference 1999 Conference Paper

Bayesian Reconstruction of 3D Human Motion from Single-Camera Video

Nicholas Howe
Michael Leventon
William Freeman

The three-dimensional motion of humans is underdetermined when the observation is limited to a single camera, due to the inherent 3D ambi(cid: 173) guity of 2D video. We present a system that reconstructs the 3D motion of human subjects from single-camera video, relying on prior knowledge about human motion, learned from training data, to resolve those am(cid: 173) biguities. After initialization in 2D, the tracking and 3D reconstruction is automatic; we show results for several video sequences. The results show the power of treating 3D body tracking as an inference problem.

NeurIPS Conference 1999 Conference Paper

Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology

Yair Weiss
William Freeman

Local "belief propagation" rules of the sort proposed by Pearl [15] are guaranteed to converge to the correct posterior probabilities in singly connected graphical models. Recently, a number of researchers have em(cid: 173) pirically demonstrated good performance of "loopy belief propagation"(cid: 173) using these same rules on graphs with loops. Perhaps the most dramatic instance is the near Shannon-limit performance of "Turbo codes", whose decoding algorithm is equivalent to loopy belief propagation. Except for the case of graphs with a single loop, there has been little theo(cid: 173) retical understanding of the performance of loopy propagation. Here we analyze belief propagation in networks with arbitrary topologies when the nodes in the graph describe jointly Gaussian random variables. We give an analytical formula relating the true posterior probabilities with those calculated using loopy propagation. We give sufficient conditions for convergence and show that when belief propagation converges it gives the correct posterior means for all graph topologies, not just networks with a single loop. The related "max-product" belief propagation algorithm finds the max(cid: 173) imum posterior probability estimate for singly connected networks. We show that, even for non-Gaussian probability distributions, the conver(cid: 173) gence points of the max-product algorithm in loopy networks are max(cid: 173) ima over a particular large local neighborhood of the posterior proba(cid: 173) bility. These results help clarify the empirical performance results and motivate using the powerful belief propagation algorithm in a broader class of networks. Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and medical diagnosis. If the graph is singly connected, there exist local message-passing schemes to calculate the posterior probability of an unobserved variable given the observed variables. Pearl [15] derived such a scheme for singly connected Bayesian networks and showed that this "belief propagation" algorithm is guaranteed to converge to the correct posterior probabilities (or "beliefs"). Several groups have recently reported excellent experimental results by running algorithms 674 Y. Weiss and W T. Freeman equivalent to Pearl's algorithm on networks with loops [8, 13, 6]. Perhaps the most dramatic instance of this performance is for "Turbo code" [2] error correcting codes. These codes have been described as "the most exciting and potentially important development in coding theory in many years" [12] and have recently been shown [10, 11] to utilize an algorithm equivalent to belief propagation in a network with loops. Progress in the analysis of loopy belief propagation has been made for the case of networks with a single loop [17, 18, 4, 1]. For these networks, it can be shown that (1) unless all the compatabilities are deterministic, loopy belief propagation will converge. (2) The difference between the loopy beliefs and the true beliefs is related to the convergence rate the faster the convergence the more exact the approximation and (3) If of the messages - the hidden nodes are binary, then the loopy beliefs and the true beliefs are both maximized by the same assignments, although the confidence in that assignment is wrong for the loopy beliefs. In this paper we analyze belief propagation in graphs of arbitrary topology, for nodes de(cid: 173) scribing jointly Gaussian random variables. We give an exact formula relating the correct marginal posterior probabilities with the ones calculated using loopy belief propagation. We show that if belief propagation converges, then it will give the correct posterior means for all graph topologies, not just networks with a single loop. We show that the covari(cid: 173) ance estimates will generally be incorrect but present a relationship between the error in the covariance estimates and the convergence speed. For Gaussian or non-Gaussian vari(cid: 173) ables, we show that the "max-product" algorithm, which calculates the MAP estimate in singly connected networks, only converges to points that are maxima over a particular large neighborhood of the posterior probability of loopy networks.

NeurIPS Conference 1998 Conference Paper

Learning to Estimate Scenes from Images

William Freeman
Egon Pasztor

We seek the scene interpretation that best explains image data. For example, we may want to infer the projected velocities (scene) which best explain two consecutive image frames (image). From synthetic data, we model the relationship between image and scene patches, and between a scene patch and neighboring scene patches. Given' a new image, we propagate likelihoods in a Markov network (ignoring the effect of loops) to infer the underlying scene. This yields an efficient method to form low-level scene interpretations. We demonstrate the technique for motion analysis and estimating high resolution images from low-resolution ones.

NeurIPS Conference 1997 Conference Paper

Bayesian Model of Surface Perception

William Freeman
Paul Viola

Image intensity variations can result from several different object surface effects, including shading from 3-dimensional relief of the object, or paint on the surface itself. An essential problem in vision, which people solve naturally, is to attribute the proper physical cause, e. g. surface relief or paint, to an observed image. We ad(cid: 173) dressed this problem with an approach combining psychophysical and Bayesian computational methods. We assessed human performance on a set of test images, and found that people made fairly consistent judgements of surface properties. Our computational model assigned simple prior probabilities to different relief or paint explanations for an image, and solved for the most probable interpretation in a Bayesian framework. The ratings of the test images by our algorithm compared surprisingly well with the mean ratings of our subjects.

NeurIPS Conference 1996 Conference Paper

Separating Style and Content

Joshua Tenenbaum
William Freeman

We seek to analyze and manipulate two factors, which we call style and content, underlying a set of observations. We fit training data with bilinear models which explicitly represent the two-factor struc(cid: 173) ture. These models can adapt easily during testing to new styles or content, allowing us to solve three general tasks: extrapolation of a new style to unobserved content; classification of content observed in a new style; and translation of new content observed in a new style. For classification, we embed bilinear models in a probabilistic framework, Separable Mixture Models (SMMsj, which generalizes earlier work on factorial mixture models [7, 3]. Significant per(cid: 173) formance improvement on a benchmark speech dataset shows the benefits of our approach.