{"title": "No-regret Algorithms for Online Convex Programs", "book": "Advances in Neural Information Processing Systems", "page_first": 489, "page_last": 496, "abstract": null, "full_text": "No-regret Algorithms for Online Convex Programs\n\nGeoffrey J. Gordon Department of Machine Learning Carnegie Mellon University Pittsburgh, PA 15213 ggordon@cs.cmu.edu\n\nAbstract\nOnline convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker.\n\n1\n\nIntroduction\n\nIn a sequence of trials we must pick hypotheses y1, y2, . . . ∈ Y.
After we choose yt, the correct answer is revealed as a convex loss function ℓt(yt).¹ Just before seeing the tth example, our total loss is therefore Lt = Σi<t ℓi(yi). If we had predicted using some fixed hypothesis y instead, then our loss would have been Σi<t ℓi(y). Our total regret at time t is the difference between these two losses, with positive regret meaning that we would have preferred y to our actual plays:\n\nρt(y) = Lt − Σi<t ℓi(y)    ρt = sup{ρt(y) : y ∈ Y}\n\nWe assume that Y is a compact convex subset of Rd that has at least two elements. In classical no-regret algorithms such as weighted majority, Y is a simplex: the corners of Y represent pure actions, the interior points of Y represent probability distributions over pure actions, and the number of corners n is the same as the number of dimensions d. In a more general OCP, Y may have many more extreme points than dimensions, n ≫ d. For example, Y could be a convex set like {y | Ay = b, y ≥ 0} for some matrix A and vector b, or it could even be a sphere.\n\n¹Many problems use loss functions of the form ℓt(yt) = ℓ(yt, yt^true), where ℓ is a fixed function such as squared error and yt^true is a target output. The more general notation allows for problems where there may be more than one correct prediction.\n\nThe shape of Y captures the structure in our prediction problem. Each point in Y is a separate hypothesis, but the losses of different hypotheses are related to each other because they are all embedded in the common representation space Rd. While we could run a standard no-regret algorithm such as weighted majority on a structured Y by giving it hypotheses corresponding to the extreme points c1 . . . cn of Y, this transformation would lose the connections among hypotheses (with a corresponding loss in runtime and generalization ability). Our algorithms below are stated in terms of linear loss functions, ℓt(y) = ct · y. If ℓt is nonlinear but convex, we can substitute the gradient at the current prediction, ∇ℓt(yt), for ct, and our regret bounds will still hold (see [1, p. 53]). We will write C for the set of possible gradient vectors ct.\n\n2\n\nRelated Work\n\nA large number of researchers have studied online prediction in general and OCP in particular. The OCP problem dates back to Hannan in 1957 [2]. The name \"online convex programming\" is due to Zinkevich [3], who gave a clever gradient-descent algorithm. A similar algorithm and a weaker bound were presented somewhat earlier in [1]: that paper's GGD algorithm, using potential function Φ0(w) = k‖w‖²/2, is equivalent to Zinkevich's \"lazy projection\" with a fixed learning rate. Another clever algorithm for OCP was presented by Kalai and Vempala [4]. Compared to the above papers, the most important contribution of the current paper is its generality: no previous family of OCP algorithms can use as flexible a class of potential functions. As an illustration of the importance of this generality, consider the problem of learning from expert advice. Well-known regret bounds for this problem are logarithmic in the number of experts (e.g., [5]); no previous bounds for general OCP algorithms are sublinear in the number of experts, but logarithmic bounds follow directly as a special case of our results [6, sec. 8.1.2]. Despite this generality, our core result, Thm. 4 below, takes only half a dozen short equations to prove. From the online prediction literature, the closest related work is that of Cesa-Bianchi and Lugosi [7], which follows in the tradition of an algorithm and proof by Blackwell [8]. Cesa-Bianchi and Lugosi consider choosing predictions from an essentially-arbitrary decision space and receiving outcomes from an essentially-arbitrary outcome space. Together a decision and an outcome determine how a marker Rt ∈ Rd will move.
Given a potential function G, they present algorithms which keep G(Rt) from growing too quickly. This result is similar in flavor to our Thm. 4, and both Thm. 4 and the results of Cesa-Bianchi and Lugosi are based on Blackwell-like conditions. In fact, our Thm. 4 can be thought of as the first generalization of well-known online learning results such as Cesa-Bianchi and Lugosi's to online convex programming. The main differences between the Cesa-Bianchi and Lugosi results and ours are the restrictions on their potential functions. They write their potential function as G(u) = f(Φ(u)); they require Φ to be additive (that is, Φ(u) = Σi φi(ui) for one-dimensional functions φi), nonnegative, and twice differentiable, and they require f : R+ → R+ to be increasing, concave, and twice differentiable. These restrictions rule out many of the potential functions used here, and in fact they rule out most online convex programming problems. The most restrictive requirement is additivity; for example, when defining potentials for OCPs via Eq. (7) below, unless the set Y can be factored as Y1 × Y2 × · · · × YN the potentials are generally not expressible as f(Φ(u)) for additive Φ. During the preparation of this manuscript, we became aware of the recent work of Shalev-Shwartz and Singer [9]. This work generalizes some of the theorems in [6] and provides a very simple and elegant proof technique for algorithms based on convex potential functions. However, it does not consider the problem of defining appropriate potential functions for the feasible regions of OCPs (as discussed in Sec. 5 below and in more detail in [6]); finding such functions is an important requirement for applying potential-based algorithms to OCPs.
In addition to the general papers above, there are many no-regret algorithms for specific OCPs, such as predicting as well as the best pruning of a decision tree [10], reorganizing a binary search tree so that frequent items are near the root [4], and picking paths in a graph with unknown edge costs [11].\n\ns1 ← 0\nfor t ← 1, 2, . . .\n  ŷt ← f(st)    (*)\n  if ŷt · u > 0 then yt ← ŷt/(ŷt · u)\n  else yt ← arbitrary element of Y fi\n  Observe ct, compute st+1 from (1)\nend\n\nFigure 1: A set Y = {y | y1 + y2 = 1, y ≥ 0} (thick dark line) and its safe set S (light shaded region).\n\nFigure 2: The gradient form of the Lagrangian Hedging algorithm.\n\n3\n\nRegret Vectors\n\nLagrangian Hedging algorithms maintain their state in a regret vector, st, defined by the recursion\n\nst+1 = st + (yt · ct)u − ct    (1)\n\nwith the base case s1 = 0. Here u is an arbitrary vector which satisfies y · u = 1 for all y ∈ Y. (If necessary we can append a constant element to each y so that such a u exists.) The regret vector contains information about our actual losses and the gradients of our loss functions: from st we can find our regret versus any y as follows. (This property justifies the name \"regret vector.\")\n\ny · st = Σi<t (yi · ci)(y · u) − Σi<t y · ci = Lt − Σi<t y · ci = ρt(y)    (2)\n\nWe can define a safe set S, in which our regret is guaranteed to be nonpositive. The goal of the Lagrangian Hedging algorithm is to keep its regret vector st near the safe set S. S is a convex cone: it is closed under positive linear combinations of its elements. And, it is polar [12] to the cone Ȳ of unnormalized hypotheses:\n\nȲ = {λy | y ∈ Y, λ ≥ 0}    S = {s | (∀y ∈ Y) y · s ≤ 0}    (3)\n\n4\n\nThe Main Algorithm\n\nWe will present the general LH algorithm first, then (in Sec. 5) a specialization which is often easier to implement. The two versions are called the gradient form and the optimization form.
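The identity (2) is easy to verify numerically. The sketch below is an illustration, not the paper's code: the simplex hypothesis set, the random plays, and the random costs are our own choices, and u = (1, . . . , 1) since every simplex point sums to 1.

```python
import numpy as np

# Check Eq. (2): with y . u = 1 for all y in Y, the regret vector satisfies
# y . s_t = rho_t(y) = L_t - sum_i y . c_i for every comparator y.
rng = np.random.default_rng(1)
d, T = 4, 50
u = np.ones(d)
s = np.zeros(d)
L, Csum = 0.0, np.zeros(d)
for _ in range(T):
    y_t = rng.dirichlet(np.ones(d))     # some sequence of plays in the simplex Y
    c_t = rng.uniform(-1.0, 1.0, size=d)
    s = s + (y_t @ c_t) * u - c_t       # regret-vector recursion, Eq. (1)
    L += y_t @ c_t                      # actual cumulative loss
    Csum += c_t                         # cumulative cost vector
y = rng.dirichlet(np.ones(d))           # arbitrary fixed comparator in Y
assert np.isclose(y @ s, L - y @ Csum)  # Eq. (2) holds
```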
The gradient form is shown in Fig. 2. At each step it chooses its play based on the current regret vector st (Eq. (1)) and a closed convex potential function F(s) : Rd → R with subgradient f(s) : Rd → Rd. This potential function is what distinguishes one instance of the LH algorithm from another. F(s) should be small when s is in the safe set, and large when s is far from the safe set. For example, suppose that Y is the probability simplex in Rd, so that S is the negative orthant in Rd. (This choice of Y would be appropriate for playing a matrix game or predicting from expert advice.) For this Y, two possible potential functions are\n\nF1(s) = (1/η) ln Σi exp(η si) − (1/η) ln d    F2(s) = Σi [si]+²/2\n\nwhere η > 0 is a learning rate and [s]+ = max(s, 0). The potential F1 leads to the Hedge [5] and weighted majority [13] algorithms, while the potential F2 results in external-regret matching [14, Theorem B]. For more examples of useful potential functions, see [6]. To ensure the LH algorithm chooses legal hypotheses yt ∈ Y, we require the following (note the constant 0 is arbitrary; any other k would work as well):\n\nF(s) ≤ 0 for all s ∈ S    (4)\n\nTheorem 1 The LH algorithm is well-defined: define S as in (3) and fix a finite convex potential function F. If F(s) ≤ 0 for all s ∈ S, then the LH algorithm picks hypotheses yt ∈ Y for all t. (Omitted proofs are given in [6].) We can also define a version of the LH algorithm with an adjustable learning rate: replacing F(s) with F(ηs) is equivalent to updating st with learning rate η. Adjustable learning rates will help us obtain regret bounds for some classes of potentials.\n\n5\n\nThe Optimization Form\n\nEven if we have a convenient representation of our hypothesis space Y, it may not be easy to work directly with the safe set S. In particular, it may be difficult to define, evaluate, and differentiate a potential function F which has the necessary properties. To avoid these difficulties, we can work with an alternate form of the LH algorithm.
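Before turning to that alternate form, the gradient form of Fig. 2 with potential F1 can be sketched in a few lines. This is an illustrative reimplementation, not the paper's code: the learning rate η = 0.5 and the cost sequence are arbitrary choices of ours, and subtracting the maximum inside the exponential is a standard numerical-stability trick that leaves the normalized play unchanged.

```python
import numpy as np

# Gradient form of LH on the simplex with potential F1 (Hedge / weighted
# majority): the subgradient f(s_t) is proportional to exp(eta * s), and
# line (*) of Fig. 2 normalizes it into the simplex (u = (1,...,1)).
def hedge_play(s, eta=0.5):
    w = np.exp(eta * (s - s.max()))    # direction of f(s); shift for stability
    return w / w.sum()                 # y_t = f(s_t) / (f(s_t) . u)

d = 3
s = np.zeros(d)
for t in range(200):
    y = hedge_play(s)
    c = np.array([1.0, 0.0, 0.5])      # expert 1 is always best (cost 0)
    s = s + (y @ c) * np.ones(d) - c   # regret-vector update, Eq. (1)
assert hedge_play(s).argmax() == 1     # play concentrates on the best expert
```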
This form, called the optimization form, defines F in terms of a simpler function W which we will call the hedging function. It uses the same pseudocode as the gradient form (Fig. 2), but on each step it computes F and its subgradient by solving an optimization problem involving W and the hypothesis set Y (Eq. (8) below). For example, two possible hedging functions are\n\nW1(y) = Σi yi ln yi + ln d if y ≥ 0 and Σi yi = 1, ∞ otherwise    (5)\nW2(y) = Σi yi²/2    (6)\n\nIf Y is the probability simplex in Rd, it will turn out that W1(y/η) and W2(y) correspond to the potentials F1 and F2 from Section 4 above. So, these hedging functions result in the weighted majority and external-regret matching algorithms. For an example where the hedging function is easy to write analytically but the potential function is much more complicated, see Sec. 8 or [6]. The optimization form of the LH algorithm using hedging function W is defined to be equivalent to the gradient form using\n\nF(s) = sup{s · y − W(y) : y ∈ Ȳ}    (7)\n\nHere Ȳ is defined as in (3).² To implement the LH algorithm using the F of Eq. (7), we need an efficient way to compute its subgradient. As Thm. 2 below shows, there is always a ŷ which satisfies\n\nŷ ∈ arg max{s · y − W(y) : y ∈ Ȳ}    (8)\n\nand any such ŷ is an element of ∂F(s). So, the optimization form of the LH algorithm uses the same pseudocode as the gradient form (Fig. 2), but uses Eq. (8) with s = st to compute ŷt in line (*).\n\nTo gain an intuition for Eqs. (7-8), consider the example of external-regret matching. Since Y is the unit simplex in Rd, Ȳ is the positive orthant in Rd. So, with W2(y) = ‖y‖²/2, the optimization problem (8) is equivalent to\n\nŷ = arg min{(1/2)‖s − y‖² : y ∈ Rd, y ≥ 0}\n\nThat is, ŷ is the projection of s onto the positive orthant by minimum Euclidean distance. It is not hard to verify that this projection replaces the negative elements of s with zeros, ŷ = [s]+.
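The closed form ŷ = [s]+ can be confirmed by brute force. The helper name below is our own; the grid search simply evaluates the objective of Eq. (8) for W2 over the positive quadrant and checks that clipping at zero attains the maximum.

```python
import numpy as np

def hedging_argmax_W2(s):
    # Closed form of argmax_{y >= 0} (s . y - ||y||^2 / 2): clip s at zero.
    return np.maximum(s, 0.0)

# Brute-force check that the clipped vector really is the argmax of (8).
s = np.array([0.8, -0.3])
grid = np.linspace(0.0, 2.0, 201)
best, best_val = None, -np.inf
for y0 in grid:
    for y1 in grid:
        y = np.array([y0, y1])
        val = s @ y - 0.5 * (y @ y)    # objective of Eq. (8) with W2
        if val > best_val:
            best, best_val = y, val
assert np.allclose(best, hedging_argmax_W2(s), atol=0.02)
```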
Substituting this value for ŷ back into (7) and using the fact that s · [s]+ = [s]+ · [s]+, the resulting potential function is\n\nF2(s) = s · [s]+ − Σi [si]+²/2 = Σi [si]+²/2\n\nas claimed above. This potential function is the standard one for analyzing external-regret matching.\n\nTheorem 2 Let W be convex, let dom W ∩ Ȳ be nonempty, and let W(y) ≥ 0 for all y. Suppose the sets {y | W(y) + s · y ≤ k} are compact for all s and k. Define F as in (7). Then F is finite and F(s) ≤ 0 for all s ∈ S. And, the optimization form of the LH algorithm using the hedging function W is equivalent to the gradient form of the LH algorithm with potential function F.\n\n² Eq. (7) is similar to the definition of the convex dual W*, but the supremum is over y ∈ Ȳ instead of over all y. As a result, F and W* can be very different functions. As discussed in [6], F can be expressed as the dual of a function related to W.\n\n6\n\nTheoretical Results\n\nOur main theoretical results are regret bounds for the LH algorithm. The bounds depend on the curvature of our potential F, the size of the hypothesis set Y, and the possible slopes C of our loss functions. Intuitively, F must be neither too curved nor too flat on the scale of the updates to st from Eq. (1): if F is too curved then its subgradient will change too quickly and our hypothesis yt will jump around a lot, while if F is too flat then we will not react quickly enough to changes in regret. We will state our results for the gradient form of the LH algorithm. For the optimization form, essentially the same results hold, but the constants are defined in terms of the hedging function instead. Therefore, we never need to work with (or even be able to write down) the corresponding potential function. For more details, see [6]. One result which is slightly tricky to carry over is tuning learning rates.
The choice of learning rate below and the resulting bound are the same as for the gradient form, but the implementation is slightly different: to set a learning rate η > 0, we replace W(y) with W(y/η). We will need upper and lower bounds on F. We will assume\n\nF(s + δ) ≤ F(s) + δ · f(s) + C‖δ‖²    (9)\n\nfor all regret vectors s and increments δ, and\n\n[F(s) + A]+ ≥ B inf{‖s − s̄‖^p : s̄ ∈ S}    (10)\n\nfor all s. Here ‖·‖ is an arbitrary finite norm, and A ≥ 0, B > 0, C > 0, and 1 ≤ p ≤ 2 are constants. Eq. (9), together with the convexity of F, implies that F is differentiable and f is its gradient; the LH algorithm is still applicable if F is not differentiable, but its regret bounds are weaker. We will bound the size of Y by assuming that\n\n‖y‖* ≤ M    (11)\n\nfor all y in Y. Here, ‖·‖* is the dual of the norm used in Eq. (9) [12]. The size of our update to st (in Eq. (1)) depends on the hypothesis set Y, the cost vector set C, and the vector u. We have already bounded Y; rather than bounding C and u separately, we will assume that there is a constant D so that\n\nE(‖st+1 − st‖² | st) ≤ D    (12)\n\nHere the expectation is taken with respect to our choice of hypothesis, so the inequality must hold for all possible values of ct. (The expectation is only necessary if we randomize our choice of hypothesis, as would happen if Y is the convex hull of some non-convex set. If interior points of Y are valid plays, we need not randomize, so we can drop the expectation in (12) and below.) Our theorem then bounds our regret in terms of the above constants. Since the bounds are sublinear in t, they show that Lagrangian Hedging is a no-regret algorithm when we choose an appropriate potential F.\n\nTheorem 3 Suppose the potential function F is convex and satisfies Eqs. (4), (9) and (10). Suppose that the problem definition is bounded according to (11) and (12). Then the LH algorithm (Fig. 2) achieves expected regret\n\nE(ρt+1(y)) ≤ M((tCD + A)/B)^(1/p) = O(t^(1/p))\n\nversus any hypothesis y ∈ Y.
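As a concrete numerical check of this bound, consider external-regret matching. Assuming the constants work out for F2 as p = 2, A = 0, B = C = 1/2, and M = 1 on the simplex (our own instantiation, easy to verify from the definitions above), the bound simplifies to regret ≤ √(tD). The simulation below is an illustration, not part of the paper's experiments; the random costs are arbitrary.

```python
import numpy as np

# External-regret matching against random costs; verify regret <= sqrt(t*D),
# which is Thm. 3's bound with p = 2, A = 0, B = C = 1/2, M = 1 for F2.
rng = np.random.default_rng(2)
d, T = 5, 2000
s, D = np.zeros(d), 0.0
for t in range(T):
    f = np.maximum(s, 0.0)                        # subgradient of F2
    y = f / f.sum() if f.sum() > 0 else np.full(d, 1.0 / d)
    c = rng.uniform(0.0, 1.0, size=d)
    x = (y @ c) * np.ones(d) - c                  # increment from Eq. (1)
    D = max(D, x @ x)                             # running bound on ||x||^2
    s = s + x
regret = max(np.max(s), 0.0)                      # sup over corners of y . s
assert regret <= np.sqrt(T * D) + 1e-9            # Thm. 3 bound holds
```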
If p = 1 the above bound is only O(t). But, suppose that we know ahead of time the number of trials t we will see. Define G(s) = F(ηs), where\n\nη = (A/(tCD))^(1/2)\n\nThen the LH algorithm with potential G achieves regret E(ρt+1(y)) ≤ (2M/B)√(tACD) = O(√t) for any hypothesis y ∈ Y.\n\nThe full proof of Thm. 3 appears in [6]; here, we sketch the proof of one of the most important intermediate results. Thm. 4 shows that, if we can guarantee E((st+1 − st) · f(st)) ≤ 0, then F(st) cannot grow too quickly. This result is analogous to Blackwell's approachability theorem: since the level sets of F are related to S, we will be able to show that st/t approaches S, implying no regret.\n\nTheorem 4 (Gradient descent) Let F(s) and f(s) satisfy Equation (9) with seminorm ‖·‖ and constant C. Let x0, x1, . . . be a sequence of random vectors. Write st = Σi<t xi, and let D be a constant so that E(‖xt‖² | st) ≤ D. Suppose that, for all t, E(xt · f(st) | st) ≤ 0. Then for all t,\n\nE(F(st+1) | s1) − F(s1) ≤ tCD\n\nPROOF: The proof is by induction: for t ≥ 2, assume E(F(st) | s1) ≤ F(s1) + (t − 1)CD. (It is obvious that the base case holds for t = 1.) Then:\n\nF(st+1) = F(st + xt) ≤ F(st) + xt · f(st) + C‖xt‖²\nE(F(st+1) | st) ≤ F(st) + CD\nE(F(st+1) | s1) ≤ E(F(st) | s1) + CD\nE(F(st+1) | s1) ≤ F(s1) + (t − 1)CD + CD\n\nwhich is the desired result.\n\n7\n\nExamples\n\nThe classical applications of no-regret algorithms are learning from expert advice and learning to play a repeated matrix game. These two tasks are essentially equivalent, since they both use the probability simplex Y = {y | y ≥ 0, Σi yi = 1} for their hypothesis set.
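As a small worked example of the matrix-game case (our own illustration; the loss matrix and the fixed opponent are arbitrary choices), external-regret matching quickly learns the best response in rock-paper-scissors:

```python
import numpy as np

# External-regret matching (the F2 instance of LH on the simplex) answering
# a fixed rock-paper-scissors opponent who always plays rock.
M = np.array([[0, 1, -1],       # loss of rock     vs rock/paper/scissors
              [-1, 0, 1],       # loss of paper
              [1, -1, 0]])      # loss of scissors
q = np.array([1.0, 0.0, 0.0])   # opponent: always rock

d = 3
s = np.zeros(d)
plays = []
for t in range(500):
    f = np.maximum(s, 0.0)
    y = f / f.sum() if f.sum() > 0 else np.full(d, 1.0 / d)
    plays.append(y)
    c = M @ q                          # expected loss of each pure action
    s = s + (y @ c) * np.ones(d) - c   # Eq. (1) with u = (1, 1, 1)
avg = np.mean(plays, axis=0)
assert avg[1] > 0.95                   # paper beats an all-rock opponent
```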
This choice of Y simplifies the required algorithms greatly; with appropriate choices of potential functions, it can be shown that standard no-regret algorithms such as Freund and Schapire's Hedge [5], Littlestone and Warmuth's weighted majority [13], and Hart and Mas-Colell's external-regret matching [14, Theorem B] are all special cases of the LH algorithm. A large variety of other online prediction problems can also be cast in our framework. These problems include path planning when costs are chosen by an adversary [11], planning in a Markov decision process when costs are chosen by an adversary [15], online pruning of a decision tree [16], and online balancing of a binary search tree [4]. More uses of online convex programming are given in [1, 3, 4]. In each case the bounds for the LH algorithm will be polynomial or better in the dimensionality of the appropriate hypothesis set and sublinear in the number of trials.\n\n8\n\nExperiments\n\nTo demonstrate that our theoretical bounds translate to good practical performance, we implemented the LH algorithm with the hedging function W2 from (6) and used it to learn policies for the game of one-card poker. (The hypothesis space for this learning problem is the set of sequence weight vectors, which is convex because one-card poker is an extensive-form game [17].) In one-card poker, two players (called the gambler and the dealer) each ante $1 and receive one card from a 13-card deck. The gambler bets first, adding either $0 or $1 to the pot. Then the dealer gets a chance to bet, again either $0 or $1. Finally, if the gambler bet $0 and the dealer bet $1, the gambler gets a second chance to bring her bet up to $1. If either player bets $0 when the other has already bet $1, that player folds and loses her ante. If neither player folds, the higher card wins the pot, resulting in a net gain of either $1 or $2 (equal to the other player's ante plus the bet of $0 or $1).
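The betting and showdown rules above can be summarized in a small payoff routine. This is a hypothetical helper for illustration only: its name and conventions are ours, and it is not the sequence-form representation actually used in the experiments.

```python
# Net payoff from dealer to gambler for one completed hand of one-card poker,
# following the rules described above. Cards are ranks 1..13 (distinct, so no
# ties occur in practice); bets are each player's final amount beyond the ante.
def one_card_poker_payoff(gambler_card, dealer_card, gambler_bet, dealer_bet):
    """Return the net amount the gambler wins (negative if she loses)."""
    if gambler_bet < dealer_bet:   # gambler declined to match: loses her ante
        return -1
    if dealer_bet < gambler_bet:   # dealer declined to match: loses his ante
        return 1
    pot_share = 1 + gambler_bet    # winner nets loser's ante plus matched bet
    if gambler_card > dealer_card:
        return pot_share
    if dealer_card > gambler_card:
        return -pot_share
    return 0                       # unreachable with distinct cards
```

For example, a showdown with matched $1 bets transfers $2, while a fold always transfers exactly the $1 ante.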
In contrast to the usual practice in poker, we assume that the payoff vector ct is observable after each hand; the partially-observable extension is beyond the scope of this paper. One-card poker is a simple game; nonetheless it has many of the elements of more complicated games, including incomplete information, chance events, and multiple stages. And, optimal play requires behaviors like randomization and bluffing. The biggest strategic difference between one-card poker and larger variants such as draw, stud, or hold'em is the idea of hand potential: while 45679 and 24679 are almost equally strong hands in a showdown (they are both 9-high), holding 45679 early in the game is much more valuable because replacing the 9 with either a 3 or an 8 turns it into a straight.\n\n[Figure 3 here: two line plots of average payoff per hand versus hands played (0-250), each with four curves: Gambler bound, Dealer bound, Avg payoff, and Minimax value.]\n\nFigure 3: Performance in self-play (left) and against a fixed opponent (right).\n\nFig. 3 shows the results of two typical runs: in both panels the dealer is using our no-regret algorithm. In the left panel the gambler is also using our no-regret algorithm, while in the right panel the gambler is playing a fixed policy. The x-axis shows number of hands played; the y-axis shows the average payoff per hand from the dealer to the gambler. The value of the game, -$0.064, is indicated with a dotted line. The middle solid curve shows the actual performance of the dealer (who is trying to minimize the payoff). The upper curve measures the progress of the dealer's learning: after every fifth hand we extracted a strategy yt^avg by taking the average of our algorithm's predictions so far. We then plotted the worst-case value of yt^avg.
That is, we plotted the payoff for playing yt^avg against an opponent which knows yt^avg and is optimized to maximize the dealer's losses. Similarly, the lower curve measures the progress of the gambler's learning. In the right panel, the dealer quickly learns to win against the non-adaptive gambler. The dealer never plays a minimax strategy, as shown by the fact that the upper curve does not approach the value of the game. Instead, she plays to take advantage of the gambler's weaknesses. In the left panel, the gambler adapts and forces the dealer to play more conservatively; in this case, the limiting strategies for both players are minimax. The curves in the left panel of Fig. 3 show an interesting effect: the small, damped oscillations result from the dealer and the gambler \"chasing\" each other around a minimax strategy. One player will learn to exploit a weakness in the other, but in doing so will open up a weakness in her own play; then the second player will adapt to try to take advantage of the first, and the cycle will repeat. Each weakness will be smaller than the last, so the sequence of strategies will converge to a minimax equilibrium. This cycling behavior is a common phenomenon when two learning players play against each other. Many learning algorithms will cycle so strongly that they fail to achieve the value of the game, but our regret bounds eliminate this possibility.\n\n9\n\nDiscussion\n\nWe have presented the Lagrangian Hedging algorithms, a family of no-regret algorithms for OCP based on general potential functions. We have proved regret bounds for LH algorithms and demonstrated experimentally that these bounds lead to good predictive performance in practice. The regret bounds for LH algorithms have low-order dependences on d, the number of dimensions in the hypothesis set Y.
This low-order dependence means that the LH algorithms can learn well in prediction problems with complicated hypothesis sets; these problems would otherwise require an impractical amount of training data and computation time.\n\nOur work builds on previous work in online learning and online convex programming. Our contributions include a new, deterministic algorithm; a simple, general proof; the ability to build algorithms from a more general class of potential functions; and a new way of building good potential functions from simpler hedging functions, which allows us to construct potential functions for arbitrary convex hypothesis sets. Future work includes a no-internal-regret version of the LH algorithm, as well as a bandit-style version. The former will guarantee convergence to a correlated equilibrium in nonzero-sum games, while the latter will allow us to work from incomplete observations of the cost vector (e.g., as might happen in an extensive-form game such as poker). Acknowledgments Thanks to Amy Greenwald, Martin Zinkevich, and Sebastian Thrun, as well as Yoav Shoham and his research group. This work was supported by NSF grant EF-0331657 and DARPA contracts F30602-01-C-0219, NBCH-1020014, and HR0011-06-0023. The opinions and conclusions are the author's and do not reflect those of the US government or its agencies.\n\nReferences\n[1] Geoffrey J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999. [2] James F. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97-139. Princeton University Press, 1957. [3] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning. AAAI Press, 2003. [4] Adam Kalai and Santosh Vempala. Geometric algorithms for online optimization.
Technical Report MIT-LCS-TR-861, Massachusetts Institute of Technology, 2002. [5] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT '95, pages 23-37. Springer-Verlag, 1995. [6] Geoffrey J. Gordon. No-regret algorithms for structured prediction problems. Technical Report CMU-CALD-05-112, Carnegie Mellon University, 2005. [7] Nicolò Cesa-Bianchi and Gábor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51:239-261, 2003. [8] David Blackwell. An analogue of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1-8, 1956. [9] Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and Fenchel duality. In B. Schölkopf, J. C. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems, volume 19, Cambridge, MA, 2007. MIT Press. [10] David P. Helmbold and Robert E. Schapire. Predicting nearly as well as the best pruning of a decision tree. In Proceedings of COLT, pages 61-68, 1995. [11] Eiji Takimoto and Manfred Warmuth. Path kernels and multiplicative updates. In COLT, 2002. [12] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, New Jersey, 1970. [13] Nick Littlestone and Manfred Warmuth. The weighted majority algorithm. Technical Report UCSC-CRL-91-28, University of California Santa Cruz, 1992. [14] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127-1150, 2000. [15] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the Twentieth International Conference on Machine Learning, 2003. [16] David P. Helmbold and Robert E. Schapire. Predicting nearly as well as the best pruning of a decision tree. In COLT, 1995. [17] D. Koller, N. Megiddo, and B. von Stengel. Efficient computation of equilibria for extensive two-person games.
Games and Economic Behavior, 14(2), 1996.\n", "award": [], "sourceid": 3082, "authors": [{"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}