{"title": "Phase Transitions and the Perceptual Organization of Video Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 850, "page_last": 856, "abstract": null, "full_text": "Phase transitions and the perceptual \n\norganization of video sequences \n\nYair Weiss \n\nDept. of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \n\nElO-120, Cambridge, MA 02139 \nhttp://www-bcs.mit.edu;-yweiss \n\nAbstract \n\nEstimating motion in scenes containing multiple moving objects \nremains a difficult problem in computer vision. A promising ap(cid:173)\nproach to this problem involves using mixture models, where the \nmotion of each object is a component in the mixture. However, ex(cid:173)\nisting methods typically require specifying in advance the number \nof components in the mixture, i.e. the number of objects in the \nscene. \n\nHere we show that the number of objects can be estimated auto(cid:173)\nmatically in a maximum likelihood framework, given an assumption \nabout the level of noise in the video sequence. We derive analytical \nresults showing the number of models which maximize the likeli(cid:173)\nhood for a given noise level in a given sequence. We illustrate these \nresults on a real video sequence, showing how the phase transitions \ncorrespond to different perceptual organizations of the scene. \n\n\u2022 \n\nFigure la depicts a scene where motion estimation is difficult for many computer \nvision systems. A semi-transparent surface partially occludes a second surface, \nand the camera is translating horizontally. Figure 1 b shows a slice through the \nhorizontal component of the motion generated by the camera - points that are \ncloser to the camera move faster than those further away. 
In practice, the local motion information would be noisy, as shown in figure 1c, and this imposes conflicting demands on a motion analysis system: reliable estimates require pooling together many measurements while avoiding mixing together measurements derived from the two different surfaces.\n\nFigure 1: a: A simple scene that can cause problems for motion estimation. One surface partially occludes another surface. b: A cross section through the horizontal motion field generated when the camera translates horizontally. Points closer to the camera move faster. c: Noisy motion field. In practice each local measurement will be somewhat noisy and pooling of information is required. d: A cross section through the output of a multiple motion analysis system. Points are assigned to surfaces (denoted by different plot symbols) and the motion of each surface is estimated.\n\nFigure 2: The \"correct\" number of surfaces in a given scene is often ambiguous. Was the motion here generated by one or two surfaces? 
\n\nSignificant progress in the analysis of such scenes has been achieved by multiple motion analyzers: systems that simultaneously segment the scene into surfaces and estimate the motion of each surface [9]. Mixture models are a commonly used framework for performing multiple motion estimation [5, 1, 10]. Figure 1d shows a slice through the output of a multiple motion analyzer on this scene: pixels are assigned to one of two surfaces and motion information is only combined for pixels belonging to the same surface.\n\nThe output shown in figure 1d was obtained by assuming the scene contains two surfaces. In general, of course, one does not know the number of surfaces in the scene in advance. Figure 2 shows the difficulty in estimating this number. It is not clear whether this is very noisy data generated by a single surface, or less noisy data generated by two surfaces. There seems to be no reason to prefer one description over the other. Indeed, the description where there are as many surfaces as pixels is also a valid interpretation of this data.\n\nHere we take the approach that there is no single \"correct\" number of surfaces for a given scene in the absence of any additional assumptions. However, given an assumption about the noise in the sequence, there are more likely and less likely interpretations. Intuitively, if we know that the data in figure 2a was taken with a very noisy camera, we would tend to prefer the one-surface solution: adding additional surfaces would cause us to fit the noise rather than the data. However, if we know that there is little noise in the sequence, we would prefer solutions that use many surfaces, since there is far less danger of \"overfitting\". In this paper (1) we show, following [6, 8], that this intuition regarding the dependence of the number of surfaces on the assumed noise level is captured in the maximum likelihood framework. We derive analytical results for the critical values of noise levels where the likelihood function undergoes a \"phase transition\": from being maximized by a single model to being maximized by multiple models. We illustrate these transitions on synthetic and real video data.\n\n(1) A longer version of this paper is available on the author's web page.\n\n1 Theory\n\n1.1 Mixture Models for optical flow\n\nIn mixture models for optical flow (cf. [5, 1]) the scene is modeled as composed of K surfaces, with the velocity of the k-th surface at location (x, y) given by (u^k(x, y), v^k(x, y)). The velocity field is parameterized by a vector θ^k. A typical choice [9] is the affine representation:\n\nu^k(x, y) = θ^k_1 + θ^k_2 x + θ^k_3 y   (1)\nv^k(x, y) = θ^k_4 + θ^k_5 x + θ^k_6 y   (2)\n\nThe affine family of motions includes rotations, translations, scalings and shears. It corresponds to the 2D projection of a plane undergoing rigid motion in depth.\n\nCorresponding pixels in subsequent frames are assumed to have identical intensity values, up to imaging noise which is modeled as a Gaussian with variance σ². The task of multiple motion estimation is to find the most likely motion parameter values given the image data. A standard derivation (see e.g. [1]) gives the following log likelihood function for the parameters Θ:\n\nl(Θ) = Σ_{x,y} log( Σ_{k=1}^{K} exp( -R_k²(x, y) / 2σ² ) )   (3)\n\nwith R_k(x, y) the residual intensity at pixel (x, y) for velocity k:\n\nR_k(x, y) = I_x(x, y) u^k(x, y) + I_y(x, y) v^k(x, y) + I_t(x, y)   (4)\n\nwhere I_x, I_y, I_t denote the spatial and temporal derivatives of the image sequence. Although our notation does not make it explicit, R_k(x, y) is a function of θ^k through equations 1-2. 
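To make equations 1-4 concrete, here is a minimal NumPy sketch that evaluates the affine flow field of equations 1-2 and the mixture log likelihood of equation 3 from precomputed image derivatives. This is illustrative code, not the paper's implementation; the function and variable names (affine_flow, mixture_log_likelihood, thetas) are our own.

```python
import numpy as np

def affine_flow(theta, x, y):
    # Affine velocity field of equations 1-2; theta holds the six parameters
    # (theta_1..theta_6) of one surface.
    u = theta[0] + theta[1] * x + theta[2] * y
    v = theta[3] + theta[4] * x + theta[5] * y
    return u, v

def mixture_log_likelihood(thetas, Ix, Iy, It, x, y, sigma):
    # Log likelihood of equation 3; thetas is a list of K parameter vectors,
    # Ix, Iy, It are spatial and temporal image derivatives on the grid (x, y).
    exps = []
    for theta in thetas:
        u, v = affine_flow(theta, x, y)
        R = Ix * u + Iy * v + It          # residual of equation 4
        exps.append(np.exp(-R ** 2 / (2 * sigma ** 2)))
    return np.sum(np.log(np.sum(exps, axis=0)))
```

With a single component and parameters that exactly satisfy the brightness constraint, every residual is zero and the log likelihood attains its maximum value of zero; any mismatch lowers it.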
As in most mixture estimation applications, equation 3 is not maximized directly; rather, an Expectation-Maximization (EM) algorithm is used to iteratively increase the likelihood [3].\n\n1.2 Maximum likelihood not necessarily with maximum number of models\n\nIt may seem that since K is fixed in the likelihood function (equation 3) there is no way that the number of surfaces can be found by maximizing the likelihood. However, maximizing the likelihood may lead to a solution in which some of the θ parameters are identical [6, 5, 8]. In this case, although the mixture still has K components, the number of distinct surfaces may be any number less than K.\n\nConsider a very simple case where K = 2 and the motion of each surface is restricted to horizontal translation: u(x, y) = u, v(x, y) = 0. The advantage of this simplified case is that the likelihood function is a function of two variables and can be easily visualized. Figure 3 shows the likelihood function for the data in figure 2 as σ is varied. Observe that for small values of σ² the likelihood has two maxima, and at both these maxima the two motions are distinct. For large σ² the likelihood function has a single maximum at the origin, corresponding to the solution where both velocities are equal to zero, i.e. only one unique surface. This is a simple example where the ML solution corresponds to a small number of unique surfaces.\n\nFigure 3: The log likelihood for the data in figure 2 undergoes a phase transition when σ is varied. For small values of σ the likelihood has two maxima, and at both these maxima the two motions are distinct. For large σ² the likelihood function has a single maximum at the origin, corresponding to the solution where both velocities are equal to zero, or only one unique surface. 
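As an illustration of EM in this simplified translation-only case, the following sketch (our own illustrative code, with an assumed two-component setup and known σ) estimates two horizontal translations from scalar velocity measurements:

```python
import numpy as np

def em_two_translations(data, sigma, iters=100):
    # EM for two horizontal translations u1, u2 with known noise sigma
    # (the simplified case u(x, y) = u, v(x, y) = 0 from the text).
    u = np.array([data.min(), data.max()])       # spread-out initial guess
    for _ in range(iters):
        # E-step: soft assignment of each measurement to each translation.
        R = data[None, :] - u[:, None]           # residuals per component
        w = np.exp(-R ** 2 / (2 * sigma ** 2))
        w = w / w.sum(axis=0)
        # M-step: each translation is a responsibility-weighted mean.
        u = (w * data[None, :]).sum(axis=1) / w.sum(axis=1)
    return u
```

For small σ the two estimates stay distinct; once the assumed σ² exceeds the mean squared residual of a single translation, both estimates collapse to the common mean, which is the phase-transition behavior analyzed next.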
\n\nCan we predict the range of values of σ for which the likelihood function has a maximum at the origin? This happens when the gradient of the likelihood at the origin is zero and the Hessian has two negative eigenvalues. It is easy to show that if the data has zero mean, the gradient is zero regardless of σ. As for the Hessian H, direct calculation gives:\n\nH = c [ E/2σ² - 1 , E/2σ² ; E/2σ² , E/2σ² - 1 ]   (5)\n\nwhere E is the mean squared residual of a single motion and c is a positive constant. The two eigenvalues are proportional to -1 and E/σ² - 1, so the likelihood function has a local maximum at the origin if and only if E < σ² (see [6, 4, 8] for a similar analysis in other contexts).\n\nThis result makes intuitive sense. Recall that σ² is the expected noise variance. Thus if the mean squared residual with a single surface is already less than σ², there is no need to add additional surfaces. The result on the Hessian shows that this intuition is captured in the likelihood function. There is no need to introduce additional \"complexity costs\" to avoid overfitting in this case.\n\nMore generally, if we assume the velocity fields are of general parametric form, the Hessian evaluated at the point where both surfaces are identical has the form:\n\nH = c [ E/2σ² - F , E/2σ² ; E/2σ² , E/2σ² - F ]   (6)\n\nwhere E and F are matrices:\n\nE = Σ_{x,y} R²(x, y) d(x, y) d(x, y)^t   (7)\n\nF = Σ_{x,y} d(x, y) d(x, y)^t   (8)\n\nwith d(x, y) = ∂R(x, y)/∂θ, and R(x, y) the residual as before.\n\nFigure 4: a: Data generated by two lines. b: The predicted phase diagram for the likelihood of this dataset in a four component mixture. The phase transitions are at σ = 0.084, 0.112, 0.8088.\n\nA necessary and sufficient condition for the Hessian to have only negative eigenvalues is:\n\nλ_max(F⁻¹E) < σ²   (9)\n\nThus when the maximal eigenvalue of F⁻¹E is less than σ², the fit with a single model is a local maximum of the likelihood. 
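The stability condition can be checked numerically. The sketch below (illustrative code, not from the paper) does this for the 1D line-fitting setting used in the Results section, where the derivative of the residual with respect to the line parameters plays the role of d(x, y) and equals (1, x) up to sign:

```python
import numpy as np

def single_model_is_stable(x, y, sigma):
    # Fit a single line y = a + b x, then test condition 9: the single-model
    # fit is a local maximum iff the maximal eigenvalue of F^-1 E is below
    # sigma^2, with E = sum R^2 d d^t and F = sum d d^t, d = (1, x).
    A = np.stack([np.ones_like(x), x], axis=1)   # rows are d(x)^t
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    R = y - A @ params                           # residuals of the single fit
    E = (A * (R ** 2)[:, None]).T @ A            # sum of R^2 d d^t
    F = A.T @ A                                  # sum of d d^t
    lam_max = np.linalg.eigvals(np.linalg.solve(F, E)).real.max()
    return lam_max < sigma ** 2
```

On data drawn from two distinct lines, the single-line fit is unstable under a small assumed σ (a second model splits off) and stable under a large one, mirroring the phase diagram of figure 4.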
Note that F⁻¹E is very similar to a weighted mean squared error, with every residual weighted by a positive definite matrix (E sums all the residuals times their weights, and F sums all the weights, so F⁻¹E is similar to a weighted average).\n\nThe above analysis predicts the phase transition of a two component mixture likelihood, i.e. the critical value of σ² such that above this critical value the maximum likelihood solution will have identical motion parameters for both surfaces. This analysis can be straightforwardly generalized to finding the first phase transition of a K component mixture, although the subsequent transitions are harder to analyze.\n\n2 Results\n\nThe fact that the likelihood function undergoes a phase transition as σ is varied predicts that an ML technique will converge to a different number of distinct models as σ is varied. We first illustrate these phase transitions on a 1D line fitting problem which shares some of the structure of multiple motion analysis and is easily visualized.\n\nFigure 4a shows data generated by two lines with additive noise, and figure 4b shows a phase diagram calculated using repeated application of equation 9; i.e. by solving equation 9 for all the data, taking the two-line solution obtained after the transition, and repeating the calculation separately for the points assigned to each of the two lines.\n\nFigure 5 shows the output of an EM algorithm on this data set. Initial conditions are identical in all runs, and the algorithm converges to one, two, three or four distinct lines depending on σ.\n\nWe now illustrate the phase transitions on a real video sequence. Figures 6-8 show the output of an EM motion segmentation algorithm with four components on the MPEG flower garden sequence (cf. [9, 10]). The camera is translating in 
The camera is translating in \n\n\fPhase Transitions and the Perceptual Organization of Video Sequences \n\nx\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 \n\n. \n\n855 \n\n..... \n\n. \n\n.078 \n\n.089 \n\n0.1183 \n\n1.0 \n\nFigure 5: The data in figure 1 are fit with one, two, three or four models depending on \na. The results of EM with identical initial conditions are shown, only a is varied. The \ntransitions are consistent with the theoretical predictions . \n\n; \n\n.. ;v~~ \\; \"\" \n\u2022 \u2022 \n.. ,~ \n\na \n\nb \n\nFigure 6: The first phase transition. The algorithm finds two segments corresponding \nto the tree and the rest of the scene. The critical value of a 2 for which this transition \nhappens is consistent with the theoretical prediction. \n\nthe scene, and objects move with different velocities due to parallax. The phase \ntransitions correspond to different perceptual organizations of the scene - first the \ntree is segmented from the background, then branches are split from the tree, and \nfinally the background splits into the flower bed and the house. \n\n3 Discussion \n\nEstimating the number of components in a Gaussian mixture is a well researched \ntopic in statistics and data mining [7]. Most approaches involve some tradeoff \nparameter to balance the benefit of an additional component versus the added \ncomplexity [2]. Here we have shown how this tradeoff parameter can be implicitly \nspecified by the assumed level of noise in the image sequence. \n\nWhile making an assumption regarding a may seem rather arbitrary in the abstract \nGaussian mixture problem, we find it quite reasonable in the context of motion es(cid:173)\ntimation, where the noise is often a property of the imaging system, not of the \nunderlying surfaces. Furthermore, as the phase diagram in figure 4 shows, a wide \nrange of assumed a values will give similar answer, suggesting that an exact speci(cid:173)\nfication of a is not needed. 
In current work we are exploring the use of weak priors on σ as well as comparing our method to those based on cross-validation [7].\n\nFigure 7: The second phase transition. The algorithm finds three segments: branches which are closer to the camera than the rest of the tree are segmented from it. Since the segmentation is based solely on motion, portions of the flower bed that move consistently with the branches are erroneously grouped with them.\n\nFigure 8: The third phase transition. The algorithm finds four segments: the flower bed and the house are segregated.\n\nOur analytical and simulation results show that an assumption of the noise level in the sequence enables automatic determination of the number of moving objects using well understood maximum likelihood techniques. Furthermore, for a given scene, varying the assumed noise level gives rise to different perceptually meaningful segmentations. Thus mixture models may be a first step towards a well founded probabilistic framework for perceptual organization.\n\nAcknowledgments\n\nI thank D. Fleet, E. Adelson, J. Tenenbaum and G. Hinton for stimulating discussions. Supported by a training grant from NIGMS.\n\nReferences\n\n[1] Serge Ayer and Harpreet S. Sawhney. Layered representation of motion video using robust maximum likelihood estimation of mixture models and MDL encoding. In Proc. Int'l Conf. Comput. Vision, pages 777-784, 1995.\n\n[2] J. Buhmann. Data clustering and learning. In M. Arbib, editor, Handbook of Brain Theory and Neural Networks. MIT Press, 1995.\n\n[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:1-38, 1977.\n\n[4] R. Durbin, R. Szeliski, and A. Yuille. An analysis of the elastic net approach to the travelling salesman problem. 
Neural Computation, 1(3):348-358, 1989.\n\n[5] A. Jepson and M. J. Black. Mixture models for optical flow computation. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 760-761, New York, June 1993.\n\n[6] K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65:945-948, 1990.\n\n[7] P. Smyth. Clustering using Monte Carlo cross-validation. In KDD-96, pages 126-133, 1996.\n\n[8] J. B. Tenenbaum and E. V. Todorov. Factorial learning by clustering features. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, 1995.\n\n[9] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing Special Issue: Image Sequence Compression, 3(5):625-638, September 1994.\n\n[10] Y. Weiss and E. H. Adelson. A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 321-326, 1996.\n", "award": [], "sourceid": 1354, "authors": [{"given_name": "Yair", "family_name": "Weiss", "institution": null}]}