{"title": "Learning Hyper-Features for Visual Identification", "book": "Advances in Neural Information Processing Systems", "page_first": 425, "page_last": 432, "abstract": null, "full_text": " Learning Hyper-Features for\n Visual Identification\n\n\n\n Andras Ferencz Erik G. Learned-Miller Jitendra Malik\n Computer Science Division, EECS\n University of California at Berkeley\n Berkeley, CA 94720\n\n\n\n Abstract\n\n We address the problem of identifying specific instances of a class (cars)\n from a set of images all belonging to that class. Although we cannot build\n a model for any particular instance (as we may be provided with only one\n \"training\" example of it), we can use information extracted from observ-\n ing other members of the class. We pose this task as a learning problem,\n in which the learner is given image pairs, labeled as matching or not, and\n must discover which image features are most consistent for matching in-\n stances and discriminative for mismatches. We explore a patch based\n representation, where we model the distributions of similarity measure-\n ments defined on the patches. Finally, we describe an algorithm that\n selects the most salient patches based on a mutual information criterion.\n This algorithm performs identification well for our challenging dataset\n of car images, after matching only a few, well chosen patches.\n\n\n1 Introduction\n\nFigure 1 shows six cars: the two leftmost cars were captured by one camera; the right four\ncars were seen later by another camera from a different angle. The goal is to determine\nwhich images, if any, show the same vehicle. We call this task visual identification. Most\nexisting identification systems are aimed at biometric applications such as identifying fin-\ngerprints or faces. 
While object recognition is used loosely for several problems (including this one), we differentiate visual identification, where the challenge is distinguishing between visually similar objects of one category (e.g. faces, cars), and categorization, where

Figure 1: The Identification Problem: Which of these cars are the same? The two cars on the left, photographed from camera 1, also drive past camera 2. Which of the four images on the right, taken by camera 2, match the cars on the left? Solving this problem will enable applications such as wide area tracking of cars with a sparse set of cameras [2, 9].

Figure 2: Detecting and warping car images into alignment: Our identification algorithm assumes that a detection process has found members of the class and approximately aligned them to a canonical view. For our data set, detection is performed by a blob tracker. A projective warp to align the sides is computed by calibrating the pose of the camera to the road and finding the wheels of the vehicle. Note that this is only a rough approximation (the two warped images, center and right, are far from perfectly aligned) that helps to simplify our patch descriptors and positional bookkeeping.

the algorithm must group together objects that belong to the same category but may be visually diverse [1, 5, 10, 13]. Identification is also distinct from "object localization," where the goal is locating a specific object in scenes in which distractors have little similarity to the target object [6].^1

One characteristic of the identification problem is that the algorithm typically only receives one positive example of each query class (e.g. a single image of a specific car) before having to classify other images as the "same" or "different". Given this lack of a class specific training set, we cannot use standard supervised feature selection and classification methods such as [12, 13, 14].
One possible solution to this problem is to try to pick universally good features, such as corners [4, 6], for detecting salient points. However, such features are likely to be suboptimal as they are not category specific. Another possibility is to hand-select good features for the task, such as the distance between the eyes for face identification.

Here we present an identification framework that attempts to be more general. The core idea is to use a training set of other image pairs from the category (in our case cars), labeled as matching or not, to learn what characterizes features that are informative in distinguishing one instance from another (i.e. consistent for matching instances and dissimilar for mismatches). Our algorithm, given a single novel query image, can build a "same" vs. "different" classifier by: (1) examining a set of candidate features (local image patches) on the query image; (2) selecting a small number of them that are likely to be the most informative for this query class; and (3) estimating a function for scoring the match for each selected feature. Note that a different set of features (patches) will be selected for each unique query.

The paper is organized as follows. In Section 2, we describe our decision framework, including the decomposition of an image pair into bi-patches, which give local indications of match or mismatch, and introduce the appearance distance between the two halves as a discriminative statistic of bi-patches. This model is then refined in Section 3 by conditioning the distance distributions on hyper-features such as patch location, contrast, and dominant orientation. A patch saliency measure based on the estimated distance distributions is introduced in Section 3.4. In Section 4, we extend our model to include another comparison statistic, the difference in patch position between images.
Finally, in Section 5, we conclude and show that comparing a small number of well-chosen patches produces performance nearly as good as matching a dense sampling of them.

2 Matching Patches

We seek to determine whether a new query image I^L (the "Left" image) represents the same vehicle as any of our previously seen database images I^R (the "Right" image). We assume that these images are known to contain vehicles, have been brought into rough correspondence (in our data set, through a projective transformation that aligns the sides of the car), and have been scaled to approximately 200 pixels in length (see Figure 2 for details).

^1 There is evidence that this distinction exists in the human visual system. Some findings suggest that the fusiform face area is specialized for identification of instances from familiar categories [11].

Figure 3: Patch Matching: The left (query) image is sampled (red dots) by patches encoded as oriented filter channels (for labeled patch 2, this encoding is shown). Each patch is matched to the best point in the database image of the same car by maximizing the appearance similarity between the patches (the similarity score is indicated by the size and color of the dots, where larger and redder is more similar). Three bi-patches are labeled. Although the classification result for this pair of images should be "same" (C = 1), notice that some bi-patches are better predictors of this result than others (the similarity score of patches 2 & 3 is much better than for patch 1).
Our goal is to be able to predict the distributions P(d | C = 1) and P(d | C = 0) for each patch accurately based on the appearance and position of the patch in the query image (for the 3 patches, our predictions are shown on the right).

2.1 Image Patch Features

Our strategy is to break up the whole image comparison problem into multiple local matching problems, where we encode a small patch F^L_j (1 <= j <= n) of the query image I^L and compare each piece separately [12, 14]. As the exact choice of features, their encoding and comparison metric is not crucial to our technique, we chose a fairly simple representation that was general enough to use in a wide variety of settings, but informative enough to capture the details of objects (given the subtle variation that can distinguish two different cars, features such as [6] were found not to be precise enough for this task).

Specifically, we apply a first derivative Gaussian odd-symmetric filter to the patch at four orientations (horizontal, vertical, and two diagonal), giving four signed numbers per pixel. To compare a query patch F^L_j to an area of the right image F^R_j, we encode both patches as 4 x 25^2 length vectors (4 orientations per pixel) and compute the normalized correlation between these vectors, giving the distance d_j = 1 - CorrCoef(F^L_j, F^R_j). As the two car images are in rough alignment, we need only search a small area of I^R to find the best corresponding patch F^R_j, i.e. the one that minimizes d_j. We will refer to such a matched left and right patch pair F^L_j, F^R_j, together with the derived distance d_j, as a bi-patch F_j.

2.2 The Decision Rule

We pose the task of deciding whether a database image I^R is the same as a query image I^L as a decision rule

    R = P(C = 1 | I^L, I^R) / P(C = 0 | I^L, I^R) = [P(I^L, I^R | C = 1) P(C = 1)] / [P(I^L, I^R | C = 0) P(C = 0)] > tau,    (1)

where tau is chosen to balance the cost of the two types of decision errors.
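As a concrete illustration, the patch distance of Section 2.1 and the thresholded likelihood-ratio decision of Eq. (1) can be sketched as below. This is a minimal sketch, not the paper's implementation: the function names and the toy densities passed to `same_vs_different` are our own, and the filter encoding is assumed to already be flattened into vectors.

```python
import numpy as np

def patch_distance(f_left, f_right):
    """d = 1 - CorrCoef between two filter-encoded patches.

    f_left, f_right: flattened oriented-filter responses of one patch
    (4 orientations per pixel), as described in Section 2.1.
    """
    return 1.0 - np.corrcoef(f_left, f_right)[0, 1]

def best_match_distance(f_left, candidates):
    """Search a small window of candidate right-image patches and
    return the distance d_j of the best match (the minimizer)."""
    return min(patch_distance(f_left, c) for c in candidates)

def same_vs_different(distances, p_d_same, p_d_diff, log_tau=0.0):
    """Decision rule of Eqs. (1)-(2) with equal priors: sum the
    per-patch log-likelihood ratios and compare to a threshold.

    p_d_same, p_d_diff: density estimates of P(d | C=1) and P(d | C=0).
    """
    llr = sum(np.log(p_d_same(d)) - np.log(p_d_diff(d)) for d in distances)
    return llr > log_tau
```

Since d = 1 - CorrCoef, identical patches give d = 0 and perfectly anti-correlated ones give d = 2, so any density model for d only needs support on [0, 2].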
The priors are assumed to be known.^2 Specifically, for the remaining equations in this paper, the priors are assumed to be equal, and hence are dropped from subsequent equations. With our image decomposition into patches, the posteriors from Eq. (1) will be approximated using the bi-patches F_1, ..., F_m as P(C | I^L, I^R) ~ P(C | F_1, ..., F_m) prop. to P(F_1, ..., F_m | C). Furthermore, in this paper, we will assume a naive Bayes model in which, conditioned on C, the bi-patches are assumed to be independent. That is,

    R = P(I^L, I^R | C = 1) / P(I^L, I^R | C = 0) ~ P(F_1, ..., F_m | C = 1) / P(F_1, ..., F_m | C = 0) = Prod_{j=1}^{m} P(F_j | C = 1) / P(F_j | C = 0).    (2)

^2 For our application, dynamic models of traffic flow can supply the prior on P(C).

In practice, we compute the log of this likelihood ratio, where each patch contributes an additive term (denoted LLR_i for patch i). Modeling the likelihoods in this ratio (P(F_j | C)) is the central focus of this paper.

2.3 Uniform Appearance Model

The most straightforward way to estimate P(F_j | C) is to assume that the appearance difference d_j captures all of the information in F_j about the probability of a match (i.e. C and F_j are independent given d_j), and that the d_j's from all patches are identically distributed. Thus the decision rule, Eqn. (1), becomes

    R ~ Prod_{j=1}^{m} P(d_j | C = 1) / P(d_j | C = 0) > tau.    (3)

The two conditional distributions, P(d_j | C in {0, 1}), are estimated as normalized histograms from all bi-patches matched within the training data.^3 For each value of tau, we evaluate Eqn. (3) to classify each test pair as matching or not, producing a precision-recall curve.

Figure 4: Identification using appearance differences: The bottom curve shows the precision vs. recall for non-patch based direct comparison of rectified images. (An ideal precision-recall curve would reach the top right corner.) Notice that all three patch based models outperform this method.
The three top curves show results for various models of d_j from Sections 2.3 (Baseline), 3.1 (Discrete), and 3.2 & 3.3 (Continuous). The regression model outperforms the uniform one significantly: it reduces the error in precision by close to 50% for most values of recall below 90%.

Figure 4 compares this patch-based model to a direct image comparison method.^4 Notice that even this naive patch-based technique significantly outperforms the global matching.

3 Refining the Appearance Distributions with Hyper-Features

The most significant weakness of the above model is the assumption that the d_j's from different bi-patches should be identically distributed (observe the 3 labeled patches in Figure 3). When a training set of "same" (C = 1) and "different" (C = 0) images is available for a specific query image, estimating these distributions directly for each patch is straightforward. How can we estimate a distribution for P(d_j | C = 1), where F^L_j is a patch from a new query image, when we only have that single positive example of F^L_j? The intuitive answer: by finding analogous patches in the training set of labeled (same/different) image pairs. However, since the space of all possible patches (appearance & position, R^(25*25+2)) is very large, the chance of having seen a very similar patch to F^L_j in the training set is small. In the next sections we present two approaches, both of which rely on projecting F^L_j into a much lower dimensional space by extracting meaningful features from its position and appearance (the hyper-features).

3.1 Non-Parametric Model with Discrete Hyper-Features

First we attempted a non-parametric approach, where we model the joint distribution of d_j and a few hyper-features (e.g. the x and y coordinates of the patch F^L_j,

^3 Data consisted of 175 pairs (88 training, 87 test pairs) of matching car images (C = 1) from two cameras located on the same side of the street one block apart.
Within training and testing sets, about 4000 pairs of mismatched cars (C = 0) were formed from non-corresponding images, one from each camera. All comparisons were performed on grayscale (not color) images.

^4 The global image comparison method used here as a baseline technique uses normalized correlation on a combination of intensity and filter channels, and attempts to overcome slight misalignment.

Figure 5: Fitting a GLM to the distribution: we demonstrate our approach by fitting a gamma distribution, through the latent variables theta = (mu, nu), to the y position of the patches. Here we allowed mu and nu to be a 3rd degree polynomial function of y (i.e. Z = [y^3, y^2, y, 1]^T). The center-left square shows, on each row, a distribution of d conditioned on the y position of the left patch (F^L) for each bi-patch, for training data taken from matching vehicles. The center-right square shows the same distributions for mismatched data. The height of the histogram distributions is color-coded, dark red indicating higher density. The central curve shows the polynomial fit to the conditional means, while the outer curves show the range. For reference, we include a partial image of a car whose y-coordinate is aligned with the center images. On the right, we show two histogram plots, each corresponding to one row of the center images (a small range of y corresponding to the black arrows). The resulting gamma distributions are superimposed on the histograms.

i.e. Z = [x, y]). The distribution is modeled "non-parametrically" (similar to Section 2.3) using an N-dimensional normalized histogram where each dimension (d, x, and y) has been quantized into several bins. In this model P(C | F_j) ~ P(C | d_j, y_j, x_j) prop. to P(d_j | y_j, x_j, C) P(y_j, x_j | C) P(C) prop. to P(d_j | y_j, x_j, C), where the last formula follows from the assumption of equal priors (P(C) = 0.5) and the independence of (y_j, x_j) and C.
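The non-parametric model just described can be sketched as a binned joint histogram over (d, y, x), one per class, normalized along the d axis to give P(d | y, x, C). This is a minimal sketch under our own assumptions: the class name, bin counts, value ranges, and the add-one smoothing are illustrative choices, not taken from the paper.

```python
import numpy as np

class DiscreteHyperFeatureModel:
    """Normalized 3-D histogram over (d, y, x), one per class C in {0, 1},
    approximating P(d | y, x, C) as in Section 3.1."""

    def __init__(self, n_d=8, n_y=4, n_x=4):
        self.bins = (n_d, n_y, n_x)
        # Start each cell at 1 (add-one smoothing) so empty bins
        # never yield a zero probability.
        self.counts = {0: np.ones(self.bins), 1: np.ones(self.bins)}

    def _index(self, d, y, x):
        # Assumes d in [0, 2] (since d = 1 - correlation) and y, x
        # normalized to [0, 1); clamp to the top bin at the boundary.
        nd, ny, nx = self.bins
        i = min(int(d / 2.0 * nd), nd - 1)
        j = min(int(y * ny), ny - 1)
        k = min(int(x * nx), nx - 1)
        return i, j, k

    def fit(self, samples):
        """samples: iterable of (d, y, x, c) from matched training bi-patches."""
        for d, y, x, c in samples:
            self.counts[c][self._index(d, y, x)] += 1

    def p_d_given_pos(self, d, y, x, c):
        """P(d | y, x, C=c): normalize the histogram along the d axis
        for the (y, x) cell in question."""
        i, j, k = self._index(d, y, x)
        col = self.counts[c][:, j, k]
        return col[i] / col.sum()
```

The exponential growth mentioned in Section 3.2 is visible here: the table has n_d * n_y * n_x cells per class, so each extra hyper-feature dimension multiplies the amount of training data needed.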
The Discrete Hyper-Features curve in Figure 4 shows the performance gain from conditioning on these positional hyper-features.

3.2 Parametric Model with Continuous Hyper-Features

The drawback of using a non-parametric model for the distributions is that the amount of data needed to populate the histograms grows exponentially with the number of dimensions. In order to add additional appearance-based hyper-features, such as contrast, oriented edge energy, etc., we moved to a smooth parametric representation for both the distribution of d_j and the model by which the hyper-features influence this distribution.

Specifically, we model the distributions P(d_j | C = 1) and P(d_j | C = 0) as gamma distributions (notated Gamma(.)) parameterized by the mean mu and shape parameter nu, theta = {mu, nu} (see the right panel of Figure 5 for examples of the Gamma(.) fitting the empirical distributions). The smooth variation of theta with respect to the hyper-features can be modeled using a generalized linear model (GLM). Ordinary (least-squares) linear models assume that the data is normally distributed with constant variance. GLMs are extensions to ordinary linear models that can fit data which is not normally distributed and where the dispersion parameter also depends on the covariates (see [7] for more information on GLMs).

Our goal is to fit gamma distributions to the distributions of d values for various patches by maximizing the probability density of the data under gamma distributions whose parameters are simple polynomial functions of the hyper-features. Consider a set X_1, ..., X_k of hyper-features such as position, contrast, and brightness of a patch. Let Z = [Z_1, ..., Z_l]^T be a vector of l pre-chosen functions of those hyper-features, like squares, cubes, cross terms, or simply copies of the variables themselves.
Then each bi-patch distance distribution has the form

    P(d | X_1, X_2, ..., X_k, C) = Gamma(d; theta_mu^C . Z, theta_nu^C . Z),    (4)

where the second and third arguments to Gamma(.) are the mean and shape parameters.^5 Each theta (there are four of these: theta_mu^{C=0}, theta_nu^{C=0}, theta_mu^{C=1}, theta_nu^{C=1}) is a vector of parameters of length l

^5 For the GLM, we use the identity link function for both mu and nu. While the identity is not the canonical link function for the gamma distribution, its advantage is that our ML optimization can be initialized by

that weights each hyper-feature monomial Z_i. The theta's are adapted to maximize the joint data likelihood over all patches for C = 0 or C = 1 within the training set. These ideas are illustrated in detail in Figure 5.

3.3 Automatic Selection of Hyper-Features

In this section we describe the automatic determination of Z. Recall that in our GLM model we assumed a linear relationship between Z and mu, nu. This allows us to use standard feature selection techniques, such as Least Angle Regression (LARS) [3], to choose a few (around 10) hyper-features from a large set of candidates,^6 such as: (a) the x and y positions of F^L; (b) the intensity and contrast within F^L and the average intensity of the entire vehicle; (c) the average energy in each of the 8 oriented filter channels; and (d) derived quantities from the above (e.g. square, cubic, and cross terms). LARS was then asked to choose Z from these features. Once Z is set, we proceed as in Section 3.2.

Running an automatic feature selection technique on this large set of possible conditioning features gives us a principled method of reducing the complexity of our model. Reducing the complexity is important not only to speed up computation, but also to mitigate the risk of over-fitting to the training set. The top curve in Figure 4 shows results when Z includes the first 10 features found by LARS.
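A maximum-likelihood fit of Eq. (4) with identity links can be sketched as follows. This is our own minimal sketch, not the paper's implementation: it uses SciPy's gamma density (parameterized by shape a and scale, so scale = mu/nu for a mean/shape parameterization) and a generic simplex optimizer, initialized from least squares as the paper's footnote suggests; the assumption that the last column of Z is a constant term is ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

def fit_gamma_glm(Z, d):
    """ML fit of a gamma GLM with identity links (Section 3.2 sketch):
    mean mu = theta_mu . Z and shape nu = theta_nu . Z.

    Z: (n, l) design matrix of hyper-feature monomials.
    d: (n,) observed bi-patch distances for one class (C=0 or C=1).
    Returns (theta_mu, theta_nu).
    """
    n, l = Z.shape
    # Initialize theta_mu by ordinary least squares (identity link
    # makes this a sensible starting point), theta_nu as a constant.
    theta_mu0, *_ = np.linalg.lstsq(Z, d, rcond=None)
    theta_nu0 = np.zeros(l)
    theta_nu0[-1] = 2.0  # assumes the last column of Z is the constant 1

    def nll(theta):
        mu = Z @ theta[:l]
        nu = Z @ theta[l:]
        if np.any(mu <= 0) or np.any(nu <= 0):
            return np.inf  # gamma mean and shape must stay positive
        # Mean/shape parameterization: shape a = nu, scale = mu / nu.
        return -gamma.logpdf(d, a=nu, scale=mu / nu).sum()

    res = minimize(nll, np.concatenate([theta_mu0, theta_nu0]),
                   method="Nelder-Mead")
    return res.x[:l], res.x[l:]
```

In the full model this fit is run twice, once on the matching (C = 1) bi-patches and once on the mismatched (C = 0) ones, yielding the four theta vectors of Eq. (4).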
Even with such a naive set of features to choose from, the performance of the system improves significantly.

3.4 Estimating the Saliency of a Patch

From the distributions P(d_j | C = 0) and P(d_j | C = 1) computed separately for each patch, it is also possible to estimate the saliency of the patch, i.e. the amount of information about our decision variable C we are likely to gain should we compute the best corresponding F^R_j. Intuitively, if the distribution of d_j is very different for C = 0 and C = 1, then the amount of information gained by matching patch j is likely to be large (see the 3 distributions on the right of Figure 3). To emphasize the fact that the distribution P(d_j | C) is a fixed function of F^L_j, given the learned hyper-feature weights theta, we slightly abuse notation and refer to the random variable from which d_j is sampled as F^L_j.

With this notation, computing the mutual information between F^L_j and C gives us a measure of the expected information gain from a patch with particular hyper-features:

    I(F^L_j; C) = H(F^L_j) - H(F^L_j | C).

Here H(.) is Shannon entropy. The key fact to notice is that this measure can be computed just from the estimated distributions over d_j (which, in turn, were estimated from the hyper-features of F^L_j) before the patch has been matched. This allows us to match only those patches that are likely to be informative, leading to significant computational savings.

4 Modeling Appearance and Position Differences

In the last section, we only considered the similarity of the two matching patches that make up a bi-patch in terms of the appearance of the patches (d_j). Recall that for each left patch F^L_j, a matching right patch F^R_j is found by searching for the most similar patch in some large neighborhood around the expected location for the match.
In this section, we show how to model the change in position, r_j, of the match relative to its expected location, and how this, when combined with the appearance model, improves the matching performance.

solving an ordinary least squares problem. We experimentally compared it to the canonical inverse link (mu = (theta_mu^C . Z)^{-1}), but observed no noticeable change in performance on our data set.

^6 In order to use LARS (or most other feature selection methods) "out of the box", we use regression based on an L2 loss function. While this is not optimal for non-normal data, from experiments we have verified that it is a reasonable approximation for the feature selection step.

Figure 6: Results: The LEFT plot shows precision vs. recall curves for models of r. The results for the x and y components are shown separately (as there are often more horizontal than vertical features on cars, the y component is better). Re-estimating parameters of the global alignment, W (affine fit), significantly improves the curves. Finally, performance is improved by combining position with appearance ("Complete" curve) compared to using appearance alone. The CENTER pair of images shows a correct match, with the patch centers indicated by circles. The color of the circles in the top image indicates MI_j, in the bottom image LLR_j. Our patch selection algorithm chooses the top patches based on MI, where subsequent patches are penalized for overlapping with earlier ones (neighborhood suppression). The top 10 "left" patches chosen are marked with arrows connecting them to the corresponding "right" patches. Notice that these are concentrated in informative regions. The RIGHT plot quantifies this observation: the curves show 3 different methods of choosing the order of patches - random order, MI, and MI with neighborhood suppression. Notice that the top curve with 3 patches does as well as the direct comparison method.
All 3 methods converge above 50 patches.

Let r_j = (dx_j, dy_j) be the difference in position between the coordinates of F^L_j and F^R_j within the standardized coordinate frames. Generally, we expect r_j ~ 0 if the two images portray the same object (C = 1). The estimate for R, incorporating the information from both d and r, becomes

    R ~ Prod_{j=1}^{m} [P(r_j | d_j, Z_j, C = 1) P(d_j | Z_j, C = 1)] / [P(r_j | d_j, Z_j, C = 0) P(d_j | Z_j, C = 0)],    (5)

where Z_j again refers to a set of hyper-features.

Here we focus on the first factor, where the distribution of r_j given C is dependent on the appearance and position of the left patch (F^L_j, through the hyper-features Z_j) and on the similarity in appearance (d_j). The intuition for the dependence on d_j is that for the C = 1 case, we expect r_j to be smaller on average when a good appearance match (small d_j) was found.

Following our approach for d_j, we model the distribution of r_j as a 0 mean normal distribution, N(0, Sigma), where Sigma (we use a diagonal covariance) is a function of Z_j and d_j. The parameterization of Sigma(Z_j, d_j) is found through feature selection, while the weights for the linear function are obtained by maximizing the likelihood of r_j over the training data. To address initial misalignment, we select a small number of patches, match them, and compute a global affine alignment between the images. We subsequently score each match relative to this global alignment.

The bottom four curves of Figure 6 show that fitting an affine model first significantly improves the positional signal. While position seems to be less informative than appearance, the complete model, which combines appearance and position (Eq. 5), outperforms appearance alone.

5 Conclusion

The center and right sides of Figure 6 show our ability to select the most informative patches using the estimated mutual information I(F^L_j; C) of each patch.
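The saliency measure of Section 3.4 and a greedy selection with a spatial-overlap constraint can be sketched as below. This is a minimal sketch under our own assumptions: the mutual information is computed by numerical integration of the two predicted densities of d_j on a grid, and a hard minimum-separation rule stands in for the paper's soft neighborhood-suppression penalty.

```python
import numpy as np

def patch_mutual_information(p_same, p_diff, grid):
    """I(F_j^L; C) = H(F_j^L) - H(F_j^L | C) in bits, for equal priors,
    by numerical integration of the predicted densities of d_j.

    p_same, p_diff: callables giving the densities P(d | C=1), P(d | C=0).
    grid: evenly spaced evaluation points covering the support of d.
    """
    ps, pd = p_same(grid), p_diff(grid)
    mix = 0.5 * (ps + pd)  # marginal density of d under equal priors
    dx = grid[1] - grid[0]
    h = lambda p: -np.sum(np.where(p > 0, p * np.log2(p), 0.0)) * dx
    return h(mix) - 0.5 * h(ps) - 0.5 * h(pd)

def select_patches(mi_scores, positions, k, min_sep):
    """Greedy top-k selection by MI: visit patches in decreasing MI
    order and skip any closer than min_sep to an already chosen one."""
    order = np.argsort(mi_scores)[::-1]
    chosen = []
    for i in order:
        if all(np.linalg.norm(positions[i] - positions[j]) >= min_sep
               for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

If the two predicted densities coincide, the measure is 0 bits (matching the patch tells us nothing about C); if they have essentially disjoint support, it approaches the full 1 bit of the binary decision variable.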
To prevent spatially overlapping patches from being chosen, we added a penalty factor to the mutual information score that penalizes patches that are very close to other chosen patches (MI with neighborhood suppression). To give a numerical indication of the performance, we note that with only 10 patches, given a 1-to-87 forced choice problem, our algorithm chooses the correct matching image 93% of the time.

A different approach to a learning problem that is similar to ours can be found in [5, 8], which describe methods for learning character or object categories from few training examples. These works approach this problem by learning distributions on shared factors [8] or priors on parameters of fixed distributions for a category [5], where the training data consists of images from other categories. We, on the other hand, abandon the notion of building a model with a fixed form for an object from a single example. Instead, we take a discriminative approach and model the statistical properties of image patch differences conditioned on properties of the patch. These learned conditional distributions allow us to evaluate, for each feature, the amount of information potentially gained by matching it to the other image.^7

Acknowledgments

This work was partially funded by DARPA under the Combat Zones That See project.

References

[1] Y. Amit and D. Geman. A computational model for visual selection. Neural Computation, 11(7), 1999.

[2] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik. A real-time computer vision system for measuring traffic parameters. CVPR, 1997.

[3] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.

[4] T. Kadir and M. Brady. Scale, saliency and image description. International Journal of Computer Vision, 45(2):83-105, 2001.

[5] F. Li, R. Fergus, and P. Perona.
A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV, 2003.

[6] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[7] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, 1989.

[8] E. Miller, N. Matsakis, and P. Viola. Learning from one example through shared densities on transforms. In CVPR, 2000.

[9] H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. IJCAI, 1999.

[10] H. Schneiderman and T. Kanade. A statistical approach to 3D object detection applied to faces and cars. CVPR, 2000.

[11] M. Tarr and I. Gauthier. FFA: A flexible fusiform area for subordinate-level visual processing automatized by expertise. Nature Neuroscience, 3(8):764-769, 2000.

[12] M. Vidal-Naquet and S. Ullman. Object recognition with informative features and linear classification. In International Conference on Computer Vision, 2003.

[13] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.

[14] M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. ECCV, 2000.

^7 Answer to Figure 1: top left matches bottom center; bottom left matches bottom right. For our algorithm, matching these images was not a challenge.
", "award": [], "sourceid": 2735, "authors": [{"given_name": "Andras", "family_name": "Ferencz", "institution": null}, {"given_name": "Erik", "family_name": "Learned-miller", "institution": null}, {"given_name": "Jitendra", "family_name": "Malik", "institution": null}]}