
Edit filter log

Details for log entry 38849785

00:18, 30 September 2024: 2603:7080:a400:8416:4c73:5796:7df1:8c5c (talk) triggered filter 833, performing the action "edit" on Bregman divergence. Actions taken: none; Filter description: Newer user possibly adding unreferenced or improperly referenced material

Changes made in edit



::<math>D_{F}(p, q) = F(p) + F^*(q^*) - \langle p, q^* \rangle </math>

* '''Integral form:''' by the integral remainder form of Taylor's theorem, a Bregman divergence can be written as an integral involving the Hessian of <math>F</math> along the line segment between its two arguments.
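:One standard explicit form of this remainder, assuming <math>F</math> is twice continuously differentiable, is
::<math>D_F(p, q) = \int_0^1 (1-t)\,(p - q)^T \nabla^2 F\bigl(q + t(p - q)\bigr)(p - q)\, dt.</math>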


* '''Mean as minimizer''': A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This result generalizes the textbook result that the mean of a set minimizes total squared error to elements in the set. This result was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). This result is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation.
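:Concretely, a standard statement of this result (assuming the relevant expectations exist) is that for a random vector <math>X</math>,
::<math>\operatorname{E}[X] = \arg\min_{s} \operatorname{E}\bigl[D_F(X, s)\bigr].</math>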
* '''Bregman balls are bounded, and compact if X is closed''': Define the Bregman ball centered at <math>x</math> with radius <math>r</math> by <math>B_f(x, r):= \left\{y\in X: D_f(y, x)\leq r\right\}</math>. When <math>X\subset \R^n</math> is finite dimensional, <math>\forall x\in X</math>, if <math>x</math> is in the relative interior of <math>X</math>, or if <math>X</math> is locally closed at <math>x</math> (that is, there exists a closed ball <math>B(x, r)</math> centered at <math>x</math>, such that <math>B(x,r) \cap X</math> is closed), then <math>B_f(x, r)</math> is bounded for all <math>r</math>. If <math>X</math> is closed, then <math>B_f(x, r)</math> is compact for all <math>r</math>.
* '''Law of cosines''':<ref name=cs.utexas.edu>{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref>
* '''Law of cosines''':<ref name="cs.utexas.edu">{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref>
For any <math>p,q,z</math>
::<math>D_F(p, q) = D_F(p, z) + D_F(z, q) - (p - z)^T(\nabla F(q) - \nabla F(z))</math>
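:For example, when <math>F(x) = \tfrac{1}{2}\|x\|^2</math>, so that <math>D_F(p, q) = \tfrac{1}{2}\|p - q\|^2</math> and <math>\nabla F</math> is the identity, this identity reduces to the usual Euclidean law of cosines <math>\|p - q\|^2 = \|p - z\|^2 + \|z - q\|^2 - 2(p - z)^T(q - z)</math>.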

Action parameters

Variable Value
Edit count of the user (user_editcount)
null
Name of the user account (user_name)
'2603:7080:A400:8416:4C73:5796:7DF1:8C5C'
Type of the user account (user_type)
'ip'
Age of the user account (user_age)
0
Groups (including implicit) the user is in (user_groups)
[ 0 => '*' ]
Rights that the user has (user_rights)
[ 0 => 'createaccount', 1 => 'read', 2 => 'edit', 3 => 'createtalk', 4 => 'viewmyprivateinfo', 5 => 'editmyprivateinfo', 6 => 'editmyoptions', 7 => 'abusefilter-log-detail', 8 => 'urlshortener-create-url', 9 => 'centralauth-merge', 10 => 'abusefilter-view', 11 => 'abusefilter-log', 12 => 'vipsscaler-test' ]
Whether or not a user is editing through the mobile interface (user_mobile)
false
Whether the user is editing from mobile app (user_app)
false
Page ID (page_id)
4491248
Page namespace (page_namespace)
0
Page title without namespace (page_title)
'Bregman divergence'
Full page title (page_prefixedtitle)
'Bregman divergence'
Edit protection level of the page (page_restrictions_edit)
[]
Last ten users to contribute to the page (page_recent_contributors)
[ 0 => 'Nbarth', 1 => '162.157.84.254', 2 => 'David Eppstein', 3 => '77.119.174.223', 4 => 'Thatsme314', 5 => 'TheMathCat', 6 => 'Egeymi', 7 => '193.49.200.225', 8 => '2001:6B0:17:FC08:3869:A302:1D5F:85B5', 9 => '194.199.100.15' ]
Page age in seconds (page_age)
584511291
Action (action)
'edit'
Edit summary/reason (summary)
'Add integral form of Bregman divergence - missing useful property.'
Time since last page edit in seconds (page_last_edit_age)
8504991
Old content model (old_content_model)
'wikitext'
New content model (new_content_model)
'wikitext'
Old page wikitext, before the edit (old_wikitext)
'{{Short description|A measure of difference between two points}} {{Use dmy dates|date=August 2023}} In [[mathematics]], specifically [[statistics]] and [[information geometry]], a '''Bregman divergence''' or '''Bregman distance''' is a measure of difference between two points, defined in terms of a strictly [[convex function]]; they form an important class of [[Divergence (statistics)|divergence]]s. When the points are interpreted as [[probability distribution]]s – notably as either values of the parameter of a [[parametric model]] or as a data set of observed values – the resulting distance is a [[statistical distance]]. The most basic Bregman divergence is the [[squared Euclidean distance]]. Bregman divergences are similar to [[metric (mathematics)|metric]]s, but satisfy neither the [[triangle inequality]] (ever) nor symmetry (in general). However, they satisfy a generalization of the [[Pythagorean theorem]], and in information geometry the corresponding [[statistical manifold]] is interpreted as a (dually) [[flat manifold]]. This allows many techniques of [[optimization theory]] to be generalized to Bregman divergences, geometrically as generalizations of [[least squares]]. Bregman divergences are named after Russian mathematician [[Lev M. Bregman]], who introduced the concept in 1967. == Definition == Let <math>F\colon \Omega \to \mathbb{R} </math> be a continuously-differentiable, strictly [[convex function]] defined on a [[convex set]] <math>\Omega</math>. The Bregman distance associated with ''F'' for points <math>p, q \in \Omega</math> is the difference between the value of ''F'' at point ''p'' and the value of the first-order [[Taylor expansion]] of ''F'' around point ''q'' evaluated at point ''p'': :<math>D_F(p, q) = F(p)-F(q)-\langle \nabla F(q), p-q\rangle. </math> == Properties == * '''Non-negativity''': <math>D_F(p, q) \ge 0</math> for all <math>p</math>, <math>q</math>. This is a consequence of the convexity of <math>F</math>. * '''Positivity''': When <math>F</math> is strictly convex, <math>D_F(p, q) = 0</math> iff <math>p=q</math>. * '''Uniqueness up to affine difference''': <math>D_F = D_G</math> iff <math>F-G</math> is an affine function. * '''Convexity''': <math>D_F(p, q)</math> is convex in its first argument, but not necessarily in the second argument. If F is strictly convex, then <math>D_F(p, q)</math> is strictly convex in its first argument. ** For example, Take f(x) = |x|, smooth it at 0, then take <math>y = 1, x_1 = 0.1, x_2 = -0.9, x_3 = 0.9x_1 + 0.1x_2</math>, then <math>D_f(y, x_3) \approx 1 > 0.9 D_f(y, x_1) + 0.1 D_f(y, x_2) \approx 0.2</math>. * '''Linearity''': If we think of the Bregman distance as an operator on the function ''F'', then it is linear with respect to non-negative coefficients. In other words, for <math>F_1, F_2</math> strictly convex and differentiable, and <math>\lambda \ge 0</math>, ::<math>D_{F_1 + \lambda F_2}(p, q) = D_{F_1}(p, q) + \lambda D_{F_2}(p, q)</math> * '''Duality''': If F is strictly convex, then the function F has a [[convex conjugate]] <math>F^*</math> which is also strictly convex and continuously differentiable on some convex set <math>\Omega^*</math>. The Bregman distance defined with respect to <math>F^*</math> is dual to <math>D_F(p, q)</math> as ::<math>D_{F^*}(p^*, q^*) = D_F(q, p)</math> :Here, <math>p^* = \nabla F(p)</math> and <math>q^* = \nabla F(q)</math> are the dual points corresponding to p and q. 
:Moreover, using the same notations : ::<math>D_{F}(p, q) = F(p) + F^*(q^*) - \langle p, q^* \rangle </math> * '''Mean as minimizer''': A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This result generalizes the textbook result that the mean of a set minimizes total squared error to elements in the set. This result was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). This result is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation. * '''Bregman balls are bounded, and compact if X is closed''': Define Bregman ball centered at x with radius r by <math>B_f(x, r):= \left\{y\in X: D_f(y, x)\leq r\right\}</math>. When <math>X\subset \R^n</math> is finite dimensional, <math>\forall x\in X</math>, if <math>x</math> is in the relative interior of <math>X</math>, or if <math>X</math> is locally closed at <math>x</math> (that is, there exists a closed ball <math>B(x, r)</math> centered at <math>x</math>, such that <math>B(x,r) \cap X</math> is closed), then <math>B_f(x, r)</math> is bounded for all <math>r</math> . If <math>X</math> is closed, then <math>B_f(x, r)</math> is compact for all <math>r</math>. * '''Law of cosines''':<ref name=cs.utexas.edu>{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref> For any <math>p,q,z</math> ::<math>D_F(p, q) = D_F(p, z) + D_F(z, q) - (p - z)^T(\nabla F(q) - \nabla F(z))</math> * [[Parallelogram law]]: for any <math>\theta, \theta_1, \theta_2</math>, <math>B_{F}\left(\theta_{1}: \theta\right)+B_{F}\left(\theta_{2}: \theta\right)=B_{F}\left(\theta_{1}: \frac{\theta_{1}+\theta_{2}}{2}\right)+B_{F}\left(\theta_{2}: \frac{\theta_{1}+\theta_{2}}{2}\right)+2 B_{F}\left(\frac{\theta_{1}+\theta_{2}}{2}: \theta\right)</math>[[File:Bregman_divergence_Pythagorean.png|right|thumb|300x300px|Generalized Pythagorean theorem for Bregman divergence .<ref name="Martin2014">{{cite journal |last1=Adamčík |first1=Martin |year=2014 |title=The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning |journal=Entropy |volume=16 |issue=12 |pages=6338–6381 |bibcode=2014Entrp..16.6338A |doi=10.3390/e16126338 |doi-access=free}}</ref>]] * '''Bregman projection''': For any <math>W\subset \Omega</math>, define the "Bregman projection" of <math>q</math> onto <math>W</math>: <math>P_W(q) = \text{argmin}_{\omega\in W} D_F(\omega, q)</math>. Then ** if <math>W</math> is convex, then the projection is unique if it exists; ** if <math>W</math> is nonempty, closed, and convex and <math>\Omega\subset \R^n</math> is finite dimensional, then the projection exists and is unique.<ref>{{cite journal|last1=Dhillon|first1=Inderjit|author-link=Inderjit Dhillon|last2=Tropp|first2=Joel|title=Matrix Nearness Problems with Bregman Divergence|journal=SIAM Journal on Matrix Analysis and Applications|volume=29|issue=4|year=2008|quote=Supposed D_\varphi is a Bregman divergence, supposed that {C_k} is a finite collection of closed, convex sets whose intersection is nonempty. Given an input matrix Y our goal is to produce a matrix \mathbf{X} in the intersection that diverges the least from \textbf{Y}, i.e. to solve \min_{\mathbf{X} } D_\varphi(\mathbf{X};\mathbf{Y}) subject to \mathbf{X} \in \big\cap_k C_k. 
Under mild conditions, the solution is unique and it has a variational characterization analogous with the characterization of an orthogonal projection onto a convex set" (see s2.4, page 1125 for more)|url=https://www.cs.utexas.edu/users/inderjit/public_papers/bregman-nearness.pdf}}</ref> * '''Generalized Pythagorean Theorem''':<ref name="cs.utexas.edu"/> For any <math>v\in \Omega, a\in W </math>, <math>D_F(a, v) \ge D_F(a, P_W(v)) + D_F(P_W(v), v).</math> This is an equality if <math>P_W(v)</math> is in the [[relative interior]] of <math>W</math>. In particular, this always happens when <math>W</math> is an affine set. * ''Lack'' of triangle inequality: Since the Bregman divergence is essentially a generalization of squared Euclidean distance, there is no triangle inequality. Indeed, <math>D_F(z, x) - D_F(z, y) - D_F(y, x) = \langle\nabla f(y) - \nabla f(x), z-y\rangle</math>, which may be positive or negative. === Proofs === * Non-negativity and positivity: use [[Jensen's inequality]]. * Uniqueness up to affine difference: Fix some <math>x\in \Omega</math>, then for any other <math>y\in \Omega</math>, we have by definition<math>F(y) - G(y) = F(x) - G(x) + \langle\nabla F(x) - \nabla G(x) , y-x \rangle </math>. * Convexity in the first argument: by definition, and use convexity of F. Same for strict convexity. * Linearity in F, law of cosines, parallelogram law: by definition. * Duality: See figure 1 of.<ref>{{Cite journal |last=Nielsen |first=Frank |date=2021-10-28 |title=Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions |journal=Entropy |volume=23 |issue=11 |pages=1417 |doi=10.3390/e23111417 |pmid=34828115 |pmc=8619509 |arxiv=2107.05901 |bibcode=2021Entrp..23.1417N |issn=1099-4300|doi-access=free }}</ref> * Bregman balls are bounded, and compact if X is closed: Fix <math>x\in X</math> . Take affine transform on <math>f</math> , so that <math>\nabla f(x) = 0</math>. Take some <math>\epsilon > 0</math>, such that <math>\partial B(x, \epsilon) \subset X</math>. Then consider the "radial-directional" derivative of <math>f</math> on the Euclidean sphere <math>\partial B(x, \epsilon)</math>. <math>\langle\nabla f(y), (y-x)\rangle</math> for all <math>y\in \partial B(x, \epsilon)</math>. Since <math>\partial B(x, \epsilon)\subset \R^n</math> is compact, it achieves minimal value <math>\delta</math> at some <math>y_0\in \partial B(x, \epsilon)</math>. Since <math>f</math> is strictly convex, <math>\delta > 0</math>. Then <math>B_f(x, r)\subset B(x, r/\delta)\cap X</math>. Since <math>D_f(y, x)</math> is <math>C^1</math> in <math>y</math>, <math>D_f</math> is continuous in <math>y</math>, thus <math>B_f(x, r)</math> is closed if <math>X</math> is. * Projection <math>P_W</math> is well-defined when <math>W</math> is closed and convex. Fix <math>v\in X</math>. Take some <math>w\in W</math> , then let <math>r := D_f(w, v)</math>. Then draw the Bregman ball <math>B_f(v, r)\cap W</math>. It is closed and bounded, thus compact. Since <math>D_f(\cdot, v)</math> is continuous and strictly convex on it, and bounded below by <math>0</math>, it achieves a unique minimum on it. * Pythagorean inequality. By cosine law, <math>D_f(w, v) - D_f(w, P_W(v)) - D_f(P_W(v), v) = \langle \nabla_y D_f(y, v)|_{y = P_W(v)} , w - P_W(v)\rangle</math>, which must be <math>\geq 0</math>, since <math>P_W(v)</math> minimizes <math>D_f(\cdot, v)</math> in <math>W</math>, and <math>W</math> is convex. 
* Pythagorean equality when <math>P_W(v)</math> is in the relative interior of <math>X</math>. If <math>\langle \nabla_y D_f(y, v)|_{y = P_W(v)}, w - P_W(v)\rangle > 0</math>, then since <math>w</math> is in the relative interior, we can move from <math>P_W(v)</math> in the direction opposite of <math>w</math>, to decrease <math>D_f(y, v)</math> , contradiction. Thus <math>\langle \nabla_y D_f(y, v)|_{y = P_W(v)}, w - P_W(v)\rangle = 0</math>. === Classification theorems === * The only symmetric Bregman divergences on <math>X\subset \R^n</math> are squared generalized Euclidean distances ([[Mahalanobis distance]]), that is, <math>D_f(y, x) = (y-x)^T A (y-x)</math> for some [[Positive definiteness|positive definite]] <math>A</math>.<ref>{{Cite journal | last1=Nielsen | first1=Frank | last2=Boissonnat | first2=Jean-Daniel | authorlink2=Jean-Daniel Boissonnat | last3=Nock | first3=Richard | date=September 2010 | title=Bregman Voronoi Diagrams: Properties, Algorithms and Applications | journal=[[Discrete & Computational Geometry]] | volume=44 | issue=2 | pages=281–307 | doi=10.1007/s00454-010-9256-1 | issn=0179-5376 | arxiv=0709.2196 | s2cid=1327029 }}</ref> {{Math proof|drop=hidden|proof= [[File:Bregman_divergence_diagram_used_in_proof_of_squared_generalized_Euclidean_distances.png|thumb|Bregman divergence interpreted as areas.]] For any <math>x \neq y\in X</math> , define <math>r = \|y-x\|, v = (y-x)/r, g(t)= f(x + t v)</math> for <math>t\in [0, r]</math> . Let <math>z(t) = x + tv</math>. Then <math>g'(t) = \langle \nabla f(z(t)), v \rangle</math> for <math>t\in(0, r)</math> , and since <math>\nabla f</math> is continuous, also for <math>t=0, r</math> . Then, from the diagram, we see that for <math>D_f(x; z(t)) = D_f(z(t); x)</math> for all <math>t\in [0, r]</math> , we must have <math>g'(t)</math> linear on <math>t\in [0, r]</math>. Thus we find that <math>\nabla f</math> varies linearly along any direction. By the next lemma, <math>f</math> is quadratic. Since <math>f</math> is also strictly convex, it is of form <math>f(x) + x^TAx + B^T x + C</math> , where <math>A\succ 0</math>. '''Lemma''': If <math>S</math> is an open subset of <math>\R^n</math> , <math>f: S\to \R</math> has continuous derivative, and given any line segment <math>[x, x + v]\subset S</math> , the function <math>h(t):= \langle\nabla f(x + tv), v\rangle</math> is linear in <math>t</math> , then <math>f</math> is a quadratic function. Proof idea: For any quadratic function <math>q: S\to \R</math> , we have <math>f-q</math> still has such derivative-linearity, so we will subtract away a few quadratic functions and show that <math>f</math> becomes zero. The proof idea can be illustrated fully for the case of <math>S=\R^2</math> , so we prove it in this case. By the derivative-linearity, <math>f</math> is a quadratic function on any line segment in <math>\R^2</math>. We subtract away four quadratic functions, such that <math>g:= f-q_0 - q_1 - q_2 - q_3</math> becomes identically zero on the x-axis, y-axis, and the <math>\{x=y\}</math> line. Let <math>q_0(x, y) = f(0, 0) + \nabla f(0, 0)\cdot (x, y), q_1(x, y) = A_1 x^2, q_2(x, y) = A_2 y^2, q_3(x, y) = A_3 xy</math>, for well-chosen <math>A_1, A_2, A_3</math>. Now use <math>q_0</math> to remove the linear term, and use <math>q_1, q_2, q_3</math> respectively to remove the quadratic terms along the three lines. 
<math>\forall (x, y)\in \R^2</math> not on the origin, there exists a line <math>l</math> across <math>(x, y)</math> that intersects the x-axis, y-axis, and the <math>\{x=y\}</math> line at three different points. Since <math>g</math> is quadratic on <math>l</math> , and is zero on three different points, <math>g</math> is identically zero on <math>l</math> , thus <math>g(x, y) = 0</math>. Thus <math>f = q_0 + q_1 + q_2 + q_3</math> is quadratic. }} The following two characterizations are for divergences on <math>\Gamma_n</math>, the set of all probability measures on <math>\{1, 2, ..., n\}</math>, with <math>n \geq 2</math>. Define a divergence on <math>\Gamma_n</math> as any function of type <math>D: \Gamma_n \times \Gamma_n \to [0, \infty]</math>, such that <math>D(x, x) = 0</math> for all <math>x\in\Gamma_n</math>, then: *The only divergence on <math>\Gamma_n</math> that is both a Bregman divergence and an [[f-divergence]] is the [[Kullback–Leibler divergence]].<ref name=":0">{{Cite journal |last1=Jiao |first1=Jiantao |last2=Courtade |first2=Thomas |last3=No |first3=Albert |last4=Venkat |first4=Kartik |last5=Weissman |first5=Tsachy |date=December 2014 |title=Information Measures: the Curious Case of the Binary Alphabet |journal=IEEE Transactions on Information Theory |volume=60 |issue=12 |pages=7616–7626 |doi=10.1109/TIT.2014.2360184 |issn=0018-9448|arxiv=1404.6810 |s2cid=13108908 }}</ref> *If <math>n \geq 3</math>, then any Bregman divergence on <math>\Gamma_n</math> that satisfies the [[data processing inequality]] must be the Kullback–Leibler divergence. (In fact, a weaker assumption of "sufficiency" is enough.) Counterexamples exist when <math>n = 2</math>.<ref name=":0" /> Given a Bregman divergence <math>D_F</math>, its "opposite", defined by <math>D_F^*(v, w) = D_F(w, v)</math>, is generally not a Bregman divergence. For example, the Kullback-Leiber divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence. == Examples == * The squared [[Mahalanobis distance]] <math>D_F(x,y)=\tfrac{1}{2}(x-y)^T Q (x-y)</math> is generated by the convex [[quadratic form]] <math>F(x) = \tfrac{1}{2} x^T Q x</math>. * The canonical example of a Bregman distance is the squared Euclidean distance <math>D_F(x,y) = \|x - y\|^2</math>. It results as the special case of the above, when <math>Q</math> is the identity, i.e. for <math>F(x) = \|x\|^2</math>. As noted, affine differences, i.e. the lower orders added in <math>F</math>, are irrelevant to <math>D_F</math>. * The generalized Kullback–Leibler divergence ::<math>D_F(p, q) = \sum_i p(i) \log \frac{p(i)}{q(i)} - \sum p(i) + \sum q(i)</math> :is generated by the negative [[entropy (information theory)|entropy]] function ::<math>F(p) = \sum_i p(i)\log p(i)</math> :When restricted to the [[simplex]], the last two terms cancel, giving the usual [[Kullback–Leibler divergence]] for distributions. * The [[Itakura–Saito distance]], ::<math>D_F(p, q) = \sum_i \left(\frac {p(i)}{q(i)} - \log \frac{p(i)}{q(i)} - 1 \right)</math> :is generated by the convex function ::<math>F(p) = - \sum_i \log p(i)</math> == Generalizing projective duality == A key tool in [[computational geometry]] is the idea of [[projective duality]], which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. 
There are numerous analytical forms of the projective dual: one common form maps the point <math>p = (p_1, \ldots p_d)</math> to the hyperplane <math>x_{d+1} = \sum_1^d 2p_i x_i</math>. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point p to its dual point <math>p^* = \nabla F(p)</math>, where ''F'' defines the ''d''-dimensional paraboloid <math>x_{d+1} = \sum x_i^2</math>. If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like [[Voronoi diagram]]s and [[Delaunay triangulation]]s retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010) == Generalization of Bregman divergences == Bregman divergences can be interpreted as limit cases of skewed [[Jensen–Shannon divergence|Jensen divergence]]s (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these skewed Jensen divergences generalizations yields generalized Bregman divergence (see Nielsen and Nock, 2017). The Bregman chord divergence<ref>{{cite book | arxiv= 1810.09113| last1= Nielsen| first1= Frank| title= Geometric Science of Information| last2= Nock| first2= Richard| chapter= The Bregman Chord Divergence| series= Lecture Notes in Computer Science| year= 2019| volume= 11712| pages= 299–308| doi= 10.1007/978-3-030-26980-7_31| isbn= 978-3-030-26979-1| s2cid= 53046425}}</ref> is obtained by taking a chord instead of a tangent line. == Bregman divergence on other objects == Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include the [[Stein's loss]] and [[von Neumann entropy]]. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly Bregman divergences have also been defined over sets, through a [[submodular set function]] which is known as the discrete analog of a [[convex function]]. The submodular Bregman divergences subsume a number of discrete distance measures, like the [[Hamming distance]], [[precision and recall]], [[mutual information]] and some other set based distance measures (see Iyer & Bilmes, 2012 for more details and properties of the submodular Bregman.) For a list of common matrix Bregman divergences, see Table 15.1 in.<ref>"Matrix Information Geometry", R. Nock, B. Magdalou, E. Briys and F. Nielsen, [http://www1.univ-ag.fr/~rnock/Articles/Drafts/book12-nmbn.pdf pdf], from this [https://doi.org/10.1007%2F978-3-642-30232-9 book]</ref> == Applications == In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, performing better than the [[softmax function]] with noisy datasets.<ref>Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems. pp. 14987-14996. 
[https://papers.nips.cc/paper/9638-robust-bi-tempered-logistic-loss-based-on-bregman-divergences.pdf pdf]</ref> Bregman divergence is used in the formulation of [[mirror descent]], which includes optimization algorithms used in machine learning such as [[gradient descent]] and the [[hedge algorithm]]. == References == {{reflist}} {{refbegin|2}} *{{cite journal | last1 = Banerjee | first1 = Arindam | last2 = Merugu | first2 = Srujana | last3 = Dhillon | first3 = Inderjit S. | last4 = Ghosh | first4 = Joydeep | journal = [[Journal of Machine Learning Research]] | pages = 1705–1749 | title = Clustering with Bregman divergences | url = http://jmlr.csail.mit.edu/papers/v6/banerjee05b.html | volume = 6 | year = 2005}} *{{cite journal | last = Bregman | first = L. M. | doi = 10.1016/0041-5553(67)90040-7 | journal = USSR Computational Mathematics and Mathematical Physics | pages = 200–217 | title = The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming | volume = 7 | year = 1967 | issue = 3}} *{{cite journal | last1 = Frigyik | first1 = Bela A. | last2 = Srivastava | first2 = Santosh | last3 = Gupta | first3 = Maya R. | doi = 10.1109/TIT.2008.929943 | journal = [[IEEE Transactions on Information Theory]] | pages = 5130–5139 | title = Functional Bregman Divergences and Bayesian Estimation of Distributions | url = http://www.ee.washington.edu/research/guptalab/publications/FrigyikSrivastavaGupta.pdf | volume = 54 | year = 2008 | issue = 11 | url-status = dead | archive-url = https://web.archive.org/web/20100812221422/http://www.ee.washington.edu/research/guptalab/publications/FrigyikSrivastavaGupta.pdf | archive-date = 2010-08-12 | arxiv = cs/0611123 | s2cid = 1254 }} *{{cite conference | last1 = Iyer | first1 = Rishabh | last2 = Bilmes | first2 = Jeff | book-title = [[Conference on Neural Information Processing Systems]] | title = Submodular-Bregman divergences and Lovász-Bregman divergences with Applications | year = 2012}} *{{cite book | last1 = Frigyik | first1 = Bela A. | last2 = Srivastava | first2 = Santosh | last3 = Gupta | first3 = Maya R. | publisher = University of Washington, Dept. of Electrical Engineering | series = UWEE Tech Report 2008-0001 | title = An Introduction to Functional Derivatives | url = https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2008-0001.pdf | year = 2008 | access-date = 2014-03-20 | archive-url = https://web.archive.org/web/20170217025324/https://www2.ee.washington.edu/techsite/papers/documents/UWEETR-2008-0001.pdf | archive-date = 2017-02-17 | url-status = dead }} *{{cite journal | last1 = Harremoës | first1 = Peter | title = Divergence and Sufficiency for Convex Optimization | journal = Entropy | doi = 10.3390/e19050206 | volume = 19 | issue = 5 | pages = 206 | year = 2017| arxiv = 1701.01010| bibcode = 2017Entrp..19..206H| doi-access = free }} *{{cite conference | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | title = The dual Voronoi diagrams with respect to representational Bregman divergences | doi = 10.1109/ISVD.2009.15 | publisher = IEEE | book-title = Proc. 
6th International Symposium on Voronoi Diagrams | url = http://www.lix.polytechnique.fr/~nielsen/ISVD09-GenBregmanVD.pdf | year = 2009}} *{{cite arXiv | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | eprint = 0711.3242 | title = On the Centroids of Symmetrized Bregman Divergences | year = 2007 | class = cs.CG}} *{{cite conference | last1 = Nielsen | first1 = Frank | last2 = Boissonnat | first2 = Jean-Daniel | last3 = Nock | first3 = Richard | title = Visualizing Bregman Voronoi diagrams | book-title = Proc. 23rd ACM Symposium on Computational Geometry (video track) | doi = 10.1145/1247069.1247089 | url = https://franknielsen.github.io/SlidesVideo/BregmanVoronoiVideo-SoCG2007.pdf | year = 2007 }} *{{cite journal | last1 = Boissonnat | first1 = Jean-Daniel | authorlink1=Jean-Daniel Boissonnat | last2 = Nielsen | first2 = Frank | last3 = Nock | first3 = Richard | title = Bregman Voronoi Diagrams | journal = [[Discrete & Computational Geometry]] | volume = 44 | issue = 2 | url = http://hal.archives-ouvertes.fr/hal-00488441/en/ | doi = 10.1007/s00454-010-9256-1 | year = 2010 | pages=281–307| arxiv = 0709.2196| s2cid = 1327029 }} *{{cite conference | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | title = On approximating the smallest enclosing Bregman Balls | doi = 10.1145/1137856.1137931 | pages = 485–486 | book-title = Proc. 22nd ACM Symposium on Computational Geometry | year = 2006}} *{{cite journal | last1 = Nielsen | first1 = Frank | last2 = Boltz | first2 = Sylvain | title = The Burbea-Rao and Bhattacharyya centroids | journal = [[IEEE Transactions on Information Theory]] | volume = 57 | issue = 8 | doi = 10.1109/TIT.2011.2159046 | year = 2011 | pages=5455–5466| arxiv = 1004.5049| s2cid = 14238708 }} *{{cite journal | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | title = Generalizing Skew Jensen Divergences and Bregman Divergences With Comparative Convexity | journal = [[IEEE Signal Processing Letters]] | volume = 24 | issue = 8 | arxiv = 1702.04877 | doi = 10.1109/LSP.2017.2712195 | year = 2017 | pages=1123–1127| bibcode = 2017ISPL...24.1123N| s2cid = 31899023 }} {{refend}} [[Category:Geometric algorithms]] [[Category:Statistical distance]]'
New page wikitext, after the edit (new_wikitext)
'{{Short description|A measure of difference between two points}} {{Use dmy dates|date=August 2023}} In [[mathematics]], specifically [[statistics]] and [[information geometry]], a '''Bregman divergence''' or '''Bregman distance''' is a measure of difference between two points, defined in terms of a strictly [[convex function]]; they form an important class of [[Divergence (statistics)|divergence]]s. When the points are interpreted as [[probability distribution]]s – notably as either values of the parameter of a [[parametric model]] or as a data set of observed values – the resulting distance is a [[statistical distance]]. The most basic Bregman divergence is the [[squared Euclidean distance]]. Bregman divergences are similar to [[metric (mathematics)|metric]]s, but satisfy neither the [[triangle inequality]] (ever) nor symmetry (in general). However, they satisfy a generalization of the [[Pythagorean theorem]], and in information geometry the corresponding [[statistical manifold]] is interpreted as a (dually) [[flat manifold]]. This allows many techniques of [[optimization theory]] to be generalized to Bregman divergences, geometrically as generalizations of [[least squares]]. Bregman divergences are named after Russian mathematician [[Lev M. Bregman]], who introduced the concept in 1967. == Definition == Let <math>F\colon \Omega \to \mathbb{R} </math> be a continuously-differentiable, strictly [[convex function]] defined on a [[convex set]] <math>\Omega</math>. The Bregman distance associated with ''F'' for points <math>p, q \in \Omega</math> is the difference between the value of ''F'' at point ''p'' and the value of the first-order [[Taylor expansion]] of ''F'' around point ''q'' evaluated at point ''p'': :<math>D_F(p, q) = F(p)-F(q)-\langle \nabla F(q), p-q\rangle. </math> == Properties == * '''Non-negativity''': <math>D_F(p, q) \ge 0</math> for all <math>p</math>, <math>q</math>. This is a consequence of the convexity of <math>F</math>. * '''Positivity''': When <math>F</math> is strictly convex, <math>D_F(p, q) = 0</math> iff <math>p=q</math>. * '''Uniqueness up to affine difference''': <math>D_F = D_G</math> iff <math>F-G</math> is an affine function. * '''Convexity''': <math>D_F(p, q)</math> is convex in its first argument, but not necessarily in the second argument. If F is strictly convex, then <math>D_F(p, q)</math> is strictly convex in its first argument. ** For example, Take f(x) = |x|, smooth it at 0, then take <math>y = 1, x_1 = 0.1, x_2 = -0.9, x_3 = 0.9x_1 + 0.1x_2</math>, then <math>D_f(y, x_3) \approx 1 > 0.9 D_f(y, x_1) + 0.1 D_f(y, x_2) \approx 0.2</math>. * '''Linearity''': If we think of the Bregman distance as an operator on the function ''F'', then it is linear with respect to non-negative coefficients. In other words, for <math>F_1, F_2</math> strictly convex and differentiable, and <math>\lambda \ge 0</math>, ::<math>D_{F_1 + \lambda F_2}(p, q) = D_{F_1}(p, q) + \lambda D_{F_2}(p, q)</math> * '''Duality''': If F is strictly convex, then the function F has a [[convex conjugate]] <math>F^*</math> which is also strictly convex and continuously differentiable on some convex set <math>\Omega^*</math>. The Bregman distance defined with respect to <math>F^*</math> is dual to <math>D_F(p, q)</math> as ::<math>D_{F^*}(p^*, q^*) = D_F(q, p)</math> :Here, <math>p^* = \nabla F(p)</math> and <math>q^* = \nabla F(q)</math> are the dual points corresponding to p and q. 
:Moreover, using the same notations : ::<math>D_{F}(p, q) = F(p) + F^*(q^*) - \langle p, q^* \rangle </math> * '''Integral form:''' by the integral remainder form of Taylor's Theorem, a Bregman divergence can be written as the integral of the Hessian of $F$ along the line segment between the Bregman divergence's arguments. * '''Mean as minimizer''': A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This result generalizes the textbook result that the mean of a set minimizes total squared error to elements in the set. This result was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). This result is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation. * '''Bregman balls are bounded, and compact if X is closed''': Define Bregman ball centered at x with radius r by <math>B_f(x, r):= \left\{y\in X: D_f(y, x)\leq r\right\}</math>. When <math>X\subset \R^n</math> is finite dimensional, <math>\forall x\in X</math>, if <math>x</math> is in the relative interior of <math>X</math>, or if <math>X</math> is locally closed at <math>x</math> (that is, there exists a closed ball <math>B(x, r)</math> centered at <math>x</math>, such that <math>B(x,r) \cap X</math> is closed), then <math>B_f(x, r)</math> is bounded for all <math>r</math> . If <math>X</math> is closed, then <math>B_f(x, r)</math> is compact for all <math>r</math>. * '''Law of cosines''':<ref name="cs.utexas.edu">{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref> For any <math>p,q,z</math> ::<math>D_F(p, q) = D_F(p, z) + D_F(z, q) - (p - z)^T(\nabla F(q) - \nabla F(z))</math> * [[Parallelogram law]]: for any <math>\theta, \theta_1, \theta_2</math>, <math>B_{F}\left(\theta_{1}: \theta\right)+B_{F}\left(\theta_{2}: \theta\right)=B_{F}\left(\theta_{1}: \frac{\theta_{1}+\theta_{2}}{2}\right)+B_{F}\left(\theta_{2}: \frac{\theta_{1}+\theta_{2}}{2}\right)+2 B_{F}\left(\frac{\theta_{1}+\theta_{2}}{2}: \theta\right)</math>[[File:Bregman_divergence_Pythagorean.png|right|thumb|300x300px|Generalized Pythagorean theorem for Bregman divergence .<ref name="Martin2014">{{cite journal |last1=Adamčík |first1=Martin |year=2014 |title=The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning |journal=Entropy |volume=16 |issue=12 |pages=6338–6381 |bibcode=2014Entrp..16.6338A |doi=10.3390/e16126338 |doi-access=free}}</ref>]] * '''Bregman projection''': For any <math>W\subset \Omega</math>, define the "Bregman projection" of <math>q</math> onto <math>W</math>: <math>P_W(q) = \text{argmin}_{\omega\in W} D_F(\omega, q)</math>. Then ** if <math>W</math> is convex, then the projection is unique if it exists; ** if <math>W</math> is nonempty, closed, and convex and <math>\Omega\subset \R^n</math> is finite dimensional, then the projection exists and is unique.<ref>{{cite journal|last1=Dhillon|first1=Inderjit|author-link=Inderjit Dhillon|last2=Tropp|first2=Joel|title=Matrix Nearness Problems with Bregman Divergence|journal=SIAM Journal on Matrix Analysis and Applications|volume=29|issue=4|year=2008|quote=Supposed D_\varphi is a Bregman divergence, supposed that {C_k} is a finite collection of closed, convex sets whose intersection is nonempty. 
Given an input matrix Y our goal is to produce a matrix \mathbf{X} in the intersection that diverges the least from \textbf{Y}, i.e. to solve \min_{\mathbf{X} } D_\varphi(\mathbf{X};\mathbf{Y}) subject to \mathbf{X} \in \big\cap_k C_k. Under mild conditions, the solution is unique and it has a variational characterization analogous with the characterization of an orthogonal projection onto a convex set" (see s2.4, page 1125 for more)|url=https://www.cs.utexas.edu/users/inderjit/public_papers/bregman-nearness.pdf}}</ref> * '''Generalized Pythagorean Theorem''':<ref name="cs.utexas.edu"/> For any <math>v\in \Omega, a\in W </math>, <math>D_F(a, v) \ge D_F(a, P_W(v)) + D_F(P_W(v), v).</math> This is an equality if <math>P_W(v)</math> is in the [[relative interior]] of <math>W</math>. In particular, this always happens when <math>W</math> is an affine set. * ''Lack'' of triangle inequality: Since the Bregman divergence is essentially a generalization of squared Euclidean distance, there is no triangle inequality. Indeed, <math>D_F(z, x) - D_F(z, y) - D_F(y, x) = \langle\nabla f(y) - \nabla f(x), z-y\rangle</math>, which may be positive or negative. === Proofs === * Non-negativity and positivity: use [[Jensen's inequality]]. * Uniqueness up to affine difference: Fix some <math>x\in \Omega</math>, then for any other <math>y\in \Omega</math>, we have by definition<math>F(y) - G(y) = F(x) - G(x) + \langle\nabla F(x) - \nabla G(x) , y-x \rangle </math>. * Convexity in the first argument: by definition, and use convexity of F. Same for strict convexity. * Linearity in F, law of cosines, parallelogram law: by definition. * Duality: See figure 1 of.<ref>{{Cite journal |last=Nielsen |first=Frank |date=2021-10-28 |title=Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions |journal=Entropy |volume=23 |issue=11 |pages=1417 |doi=10.3390/e23111417 |pmid=34828115 |pmc=8619509 |arxiv=2107.05901 |bibcode=2021Entrp..23.1417N |issn=1099-4300|doi-access=free }}</ref> * Bregman balls are bounded, and compact if X is closed: Fix <math>x\in X</math> . Take affine transform on <math>f</math> , so that <math>\nabla f(x) = 0</math>. Take some <math>\epsilon > 0</math>, such that <math>\partial B(x, \epsilon) \subset X</math>. Then consider the "radial-directional" derivative of <math>f</math> on the Euclidean sphere <math>\partial B(x, \epsilon)</math>. <math>\langle\nabla f(y), (y-x)\rangle</math> for all <math>y\in \partial B(x, \epsilon)</math>. Since <math>\partial B(x, \epsilon)\subset \R^n</math> is compact, it achieves minimal value <math>\delta</math> at some <math>y_0\in \partial B(x, \epsilon)</math>. Since <math>f</math> is strictly convex, <math>\delta > 0</math>. Then <math>B_f(x, r)\subset B(x, r/\delta)\cap X</math>. Since <math>D_f(y, x)</math> is <math>C^1</math> in <math>y</math>, <math>D_f</math> is continuous in <math>y</math>, thus <math>B_f(x, r)</math> is closed if <math>X</math> is. * Projection <math>P_W</math> is well-defined when <math>W</math> is closed and convex. Fix <math>v\in X</math>. Take some <math>w\in W</math> , then let <math>r := D_f(w, v)</math>. Then draw the Bregman ball <math>B_f(v, r)\cap W</math>. It is closed and bounded, thus compact. Since <math>D_f(\cdot, v)</math> is continuous and strictly convex on it, and bounded below by <math>0</math>, it achieves a unique minimum on it. * Pythagorean inequality. 
By cosine law, <math>D_f(w, v) - D_f(w, P_W(v)) - D_f(P_W(v), v) = \langle \nabla_y D_f(y, v)|_{y = P_W(v)} , w - P_W(v)\rangle</math>, which must be <math>\geq 0</math>, since <math>P_W(v)</math> minimizes <math>D_f(\cdot, v)</math> in <math>W</math>, and <math>W</math> is convex. * Pythagorean equality when <math>P_W(v)</math> is in the relative interior of <math>X</math>. If <math>\langle \nabla_y D_f(y, v)|_{y = P_W(v)}, w - P_W(v)\rangle > 0</math>, then since <math>w</math> is in the relative interior, we can move from <math>P_W(v)</math> in the direction opposite of <math>w</math>, to decrease <math>D_f(y, v)</math> , contradiction. Thus <math>\langle \nabla_y D_f(y, v)|_{y = P_W(v)}, w - P_W(v)\rangle = 0</math>. === Classification theorems === * The only symmetric Bregman divergences on <math>X\subset \R^n</math> are squared generalized Euclidean distances ([[Mahalanobis distance]]), that is, <math>D_f(y, x) = (y-x)^T A (y-x)</math> for some [[Positive definiteness|positive definite]] <math>A</math>.<ref>{{Cite journal | last1=Nielsen | first1=Frank | last2=Boissonnat | first2=Jean-Daniel | authorlink2=Jean-Daniel Boissonnat | last3=Nock | first3=Richard | date=September 2010 | title=Bregman Voronoi Diagrams: Properties, Algorithms and Applications | journal=[[Discrete & Computational Geometry]] | volume=44 | issue=2 | pages=281–307 | doi=10.1007/s00454-010-9256-1 | issn=0179-5376 | arxiv=0709.2196 | s2cid=1327029 }}</ref> {{Math proof|drop=hidden|proof= [[File:Bregman_divergence_diagram_used_in_proof_of_squared_generalized_Euclidean_distances.png|thumb|Bregman divergence interpreted as areas.]] For any <math>x \neq y\in X</math> , define <math>r = \|y-x\|, v = (y-x)/r, g(t)= f(x + t v)</math> for <math>t\in [0, r]</math> . Let <math>z(t) = x + tv</math>. Then <math>g'(t) = \langle \nabla f(z(t)), v \rangle</math> for <math>t\in(0, r)</math> , and since <math>\nabla f</math> is continuous, also for <math>t=0, r</math> . Then, from the diagram, we see that for <math>D_f(x; z(t)) = D_f(z(t); x)</math> for all <math>t\in [0, r]</math> , we must have <math>g'(t)</math> linear on <math>t\in [0, r]</math>. Thus we find that <math>\nabla f</math> varies linearly along any direction. By the next lemma, <math>f</math> is quadratic. Since <math>f</math> is also strictly convex, it is of form <math>f(x) + x^TAx + B^T x + C</math> , where <math>A\succ 0</math>. '''Lemma''': If <math>S</math> is an open subset of <math>\R^n</math> , <math>f: S\to \R</math> has continuous derivative, and given any line segment <math>[x, x + v]\subset S</math> , the function <math>h(t):= \langle\nabla f(x + tv), v\rangle</math> is linear in <math>t</math> , then <math>f</math> is a quadratic function. Proof idea: For any quadratic function <math>q: S\to \R</math> , we have <math>f-q</math> still has such derivative-linearity, so we will subtract away a few quadratic functions and show that <math>f</math> becomes zero. The proof idea can be illustrated fully for the case of <math>S=\R^2</math> , so we prove it in this case. By the derivative-linearity, <math>f</math> is a quadratic function on any line segment in <math>\R^2</math>. We subtract away four quadratic functions, such that <math>g:= f-q_0 - q_1 - q_2 - q_3</math> becomes identically zero on the x-axis, y-axis, and the <math>\{x=y\}</math> line. Let <math>q_0(x, y) = f(0, 0) + \nabla f(0, 0)\cdot (x, y), q_1(x, y) = A_1 x^2, q_2(x, y) = A_2 y^2, q_3(x, y) = A_3 xy</math>, for well-chosen <math>A_1, A_2, A_3</math>. 
Now use <math>q_0</math> to remove the linear term, and use <math>q_1, q_2, q_3</math> respectively to remove the quadratic terms along the three lines. <math>\forall (x, y)\in \R^2</math> not on the origin, there exists a line <math>l</math> across <math>(x, y)</math> that intersects the x-axis, y-axis, and the <math>\{x=y\}</math> line at three different points. Since <math>g</math> is quadratic on <math>l</math> , and is zero on three different points, <math>g</math> is identically zero on <math>l</math> , thus <math>g(x, y) = 0</math>. Thus <math>f = q_0 + q_1 + q_2 + q_3</math> is quadratic. }} The following two characterizations are for divergences on <math>\Gamma_n</math>, the set of all probability measures on <math>\{1, 2, ..., n\}</math>, with <math>n \geq 2</math>. Define a divergence on <math>\Gamma_n</math> as any function of type <math>D: \Gamma_n \times \Gamma_n \to [0, \infty]</math>, such that <math>D(x, x) = 0</math> for all <math>x\in\Gamma_n</math>, then: *The only divergence on <math>\Gamma_n</math> that is both a Bregman divergence and an [[f-divergence]] is the [[Kullback–Leibler divergence]].<ref name=":0">{{Cite journal |last1=Jiao |first1=Jiantao |last2=Courtade |first2=Thomas |last3=No |first3=Albert |last4=Venkat |first4=Kartik |last5=Weissman |first5=Tsachy |date=December 2014 |title=Information Measures: the Curious Case of the Binary Alphabet |journal=IEEE Transactions on Information Theory |volume=60 |issue=12 |pages=7616–7626 |doi=10.1109/TIT.2014.2360184 |issn=0018-9448|arxiv=1404.6810 |s2cid=13108908 }}</ref> *If <math>n \geq 3</math>, then any Bregman divergence on <math>\Gamma_n</math> that satisfies the [[data processing inequality]] must be the Kullback–Leibler divergence. (In fact, a weaker assumption of "sufficiency" is enough.) Counterexamples exist when <math>n = 2</math>.<ref name=":0" /> Given a Bregman divergence <math>D_F</math>, its "opposite", defined by <math>D_F^*(v, w) = D_F(w, v)</math>, is generally not a Bregman divergence. For example, the Kullback-Leiber divergence is both a Bregman divergence and an f-divergence. Its reverse is also an f-divergence, but by the above characterization, the reverse KL divergence cannot be a Bregman divergence. == Examples == * The squared [[Mahalanobis distance]] <math>D_F(x,y)=\tfrac{1}{2}(x-y)^T Q (x-y)</math> is generated by the convex [[quadratic form]] <math>F(x) = \tfrac{1}{2} x^T Q x</math>. * The canonical example of a Bregman distance is the squared Euclidean distance <math>D_F(x,y) = \|x - y\|^2</math>. It results as the special case of the above, when <math>Q</math> is the identity, i.e. for <math>F(x) = \|x\|^2</math>. As noted, affine differences, i.e. the lower orders added in <math>F</math>, are irrelevant to <math>D_F</math>. * The generalized Kullback–Leibler divergence ::<math>D_F(p, q) = \sum_i p(i) \log \frac{p(i)}{q(i)} - \sum p(i) + \sum q(i)</math> :is generated by the negative [[entropy (information theory)|entropy]] function ::<math>F(p) = \sum_i p(i)\log p(i)</math> :When restricted to the [[simplex]], the last two terms cancel, giving the usual [[Kullback–Leibler divergence]] for distributions. 
* The [[Itakura–Saito distance]], ::<math>D_F(p, q) = \sum_i \left(\frac {p(i)}{q(i)} - \log \frac{p(i)}{q(i)} - 1 \right)</math> :is generated by the convex function ::<math>F(p) = - \sum_i \log p(i)</math> == Generalizing projective duality == A key tool in [[computational geometry]] is the idea of [[projective duality]], which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point <math>p = (p_1, \ldots p_d)</math> to the hyperplane <math>x_{d+1} = \sum_1^d 2p_i x_i</math>. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point p to its dual point <math>p^* = \nabla F(p)</math>, where ''F'' defines the ''d''-dimensional paraboloid <math>x_{d+1} = \sum x_i^2</math>. If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like [[Voronoi diagram]]s and [[Delaunay triangulation]]s retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010) == Generalization of Bregman divergences == Bregman divergences can be interpreted as limit cases of skewed [[Jensen–Shannon divergence|Jensen divergence]]s (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and limit cases of these skewed Jensen divergences generalizations yields generalized Bregman divergence (see Nielsen and Nock, 2017). The Bregman chord divergence<ref>{{cite book | arxiv= 1810.09113| last1= Nielsen| first1= Frank| title= Geometric Science of Information| last2= Nock| first2= Richard| chapter= The Bregman Chord Divergence| series= Lecture Notes in Computer Science| year= 2019| volume= 11712| pages= 299–308| doi= 10.1007/978-3-030-26980-7_31| isbn= 978-3-030-26979-1| s2cid= 53046425}}</ref> is obtained by taking a chord instead of a tangent line. == Bregman divergence on other objects == Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include the [[Stein's loss]] and [[von Neumann entropy]]. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly Bregman divergences have also been defined over sets, through a [[submodular set function]] which is known as the discrete analog of a [[convex function]]. The submodular Bregman divergences subsume a number of discrete distance measures, like the [[Hamming distance]], [[precision and recall]], [[mutual information]] and some other set based distance measures (see Iyer & Bilmes, 2012 for more details and properties of the submodular Bregman.) For a list of common matrix Bregman divergences, see Table 15.1 in.<ref>"Matrix Information Geometry", R. Nock, B. Magdalou, E. Briys and F. 
Nielsen, [http://www1.univ-ag.fr/~rnock/Articles/Drafts/book12-nmbn.pdf pdf], from this [https://doi.org/10.1007%2F978-3-642-30232-9 book]</ref> == Applications == In machine learning, Bregman divergences are used to calculate the bi-tempered logistic loss, performing better than the [[softmax function]] with noisy datasets.<ref>Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems. pp. 14987-14996. [https://papers.nips.cc/paper/9638-robust-bi-tempered-logistic-loss-based-on-bregman-divergences.pdf pdf]</ref> Bregman divergence is used in the formulation of [[mirror descent]], which includes optimization algorithms used in machine learning such as [[gradient descent]] and the [[hedge algorithm]]. == References == {{reflist}} {{refbegin|2}} *{{cite journal | last1 = Banerjee | first1 = Arindam | last2 = Merugu | first2 = Srujana | last3 = Dhillon | first3 = Inderjit S. | last4 = Ghosh | first4 = Joydeep | journal = [[Journal of Machine Learning Research]] | pages = 1705–1749 | title = Clustering with Bregman divergences | url = http://jmlr.csail.mit.edu/papers/v6/banerjee05b.html | volume = 6 | year = 2005}} *{{cite journal | last = Bregman | first = L. M. | doi = 10.1016/0041-5553(67)90040-7 | journal = USSR Computational Mathematics and Mathematical Physics | pages = 200–217 | title = The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming | volume = 7 | year = 1967 | issue = 3}} *{{cite journal | last1 = Frigyik | first1 = Bela A. | last2 = Srivastava | first2 = Santosh | last3 = Gupta | first3 = Maya R. | doi = 10.1109/TIT.2008.929943 | journal = [[IEEE Transactions on Information Theory]] | pages = 5130–5139 | title = Functional Bregman Divergences and Bayesian Estimation of Distributions | url = http://www.ee.washington.edu/research/guptalab/publications/FrigyikSrivastavaGupta.pdf | volume = 54 | year = 2008 | issue = 11 | url-status = dead | archive-url = https://web.archive.org/web/20100812221422/http://www.ee.washington.edu/research/guptalab/publications/FrigyikSrivastavaGupta.pdf | archive-date = 2010-08-12 | arxiv = cs/0611123 | s2cid = 1254 }} *{{cite conference | last1 = Iyer | first1 = Rishabh | last2 = Bilmes | first2 = Jeff | book-title = [[Conference on Neural Information Processing Systems]] | title = Submodular-Bregman divergences and Lovász-Bregman divergences with Applications | year = 2012}} *{{cite book | last1 = Frigyik | first1 = Bela A. | last2 = Srivastava | first2 = Santosh | last3 = Gupta | first3 = Maya R. | publisher = University of Washington, Dept. 
of Electrical Engineering | series = UWEE Tech Report 2008-0001 | title = An Introduction to Functional Derivatives | url = https://www.ee.washington.edu/techsite/papers/documents/UWEETR-2008-0001.pdf | year = 2008 | access-date = 2014-03-20 | archive-url = https://web.archive.org/web/20170217025324/https://www2.ee.washington.edu/techsite/papers/documents/UWEETR-2008-0001.pdf | archive-date = 2017-02-17 | url-status = dead }} *{{cite journal | last1 = Harremoës | first1 = Peter | title = Divergence and Sufficiency for Convex Optimization | journal = Entropy | doi = 10.3390/e19050206 | volume = 19 | issue = 5 | pages = 206 | year = 2017| arxiv = 1701.01010| bibcode = 2017Entrp..19..206H| doi-access = free }} *{{cite conference | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | title = The dual Voronoi diagrams with respect to representational Bregman divergences | doi = 10.1109/ISVD.2009.15 | publisher = IEEE | book-title = Proc. 6th International Symposium on Voronoi Diagrams | url = http://www.lix.polytechnique.fr/~nielsen/ISVD09-GenBregmanVD.pdf | year = 2009}} *{{cite arXiv | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | eprint = 0711.3242 | title = On the Centroids of Symmetrized Bregman Divergences | year = 2007 | class = cs.CG}} *{{cite conference | last1 = Nielsen | first1 = Frank | last2 = Boissonnat | first2 = Jean-Daniel | last3 = Nock | first3 = Richard | title = Visualizing Bregman Voronoi diagrams | book-title = Proc. 23rd ACM Symposium on Computational Geometry (video track) | doi = 10.1145/1247069.1247089 | url = https://franknielsen.github.io/SlidesVideo/BregmanVoronoiVideo-SoCG2007.pdf | year = 2007 }} *{{cite journal | last1 = Boissonnat | first1 = Jean-Daniel | authorlink1=Jean-Daniel Boissonnat | last2 = Nielsen | first2 = Frank | last3 = Nock | first3 = Richard | title = Bregman Voronoi Diagrams | journal = [[Discrete & Computational Geometry]] | volume = 44 | issue = 2 | url = http://hal.archives-ouvertes.fr/hal-00488441/en/ | doi = 10.1007/s00454-010-9256-1 | year = 2010 | pages=281–307| arxiv = 0709.2196| s2cid = 1327029 }} *{{cite conference | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | title = On approximating the smallest enclosing Bregman Balls | doi = 10.1145/1137856.1137931 | pages = 485–486 | book-title = Proc. 22nd ACM Symposium on Computational Geometry | year = 2006}} *{{cite journal | last1 = Nielsen | first1 = Frank | last2 = Boltz | first2 = Sylvain | title = The Burbea-Rao and Bhattacharyya centroids | journal = [[IEEE Transactions on Information Theory]] | volume = 57 | issue = 8 | doi = 10.1109/TIT.2011.2159046 | year = 2011 | pages=5455–5466| arxiv = 1004.5049| s2cid = 14238708 }} *{{cite journal | last1 = Nielsen | first1 = Frank | last2 = Nock | first2 = Richard | title = Generalizing Skew Jensen Divergences and Bregman Divergences With Comparative Convexity | journal = [[IEEE Signal Processing Letters]] | volume = 24 | issue = 8 | arxiv = 1702.04877 | doi = 10.1109/LSP.2017.2712195 | year = 2017 | pages=1123–1127| bibcode = 2017ISPL...24.1123N| s2cid = 31899023 }} {{refend}} [[Category:Geometric algorithms]] [[Category:Statistical distance]]'
Unified diff of changes made by edit (edit_diff)
'@@ -30,8 +30,10 @@ ::<math>D_{F}(p, q) = F(p) + F^*(q^*) - \langle p, q^* \rangle </math> + +* '''Integral form:''' by the integral remainder form of Taylor's Theorem, a Bregman divergence can be written as the integral of the Hessian of $F$ along the line segment between the Bregman divergence's arguments. * '''Mean as minimizer''': A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This result generalizes the textbook result that the mean of a set minimizes total squared error to elements in the set. This result was proved for the vector case by (Banerjee et al. 2005), and extended to the case of functions/distributions by (Frigyik et al. 2008). This result is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation. * '''Bregman balls are bounded, and compact if X is closed''': Define Bregman ball centered at x with radius r by <math>B_f(x, r):= \left\{y\in X: D_f(y, x)\leq r\right\}</math>. When <math>X\subset \R^n</math> is finite dimensional, <math>\forall x\in X</math>, if <math>x</math> is in the relative interior of <math>X</math>, or if <math>X</math> is locally closed at <math>x</math> (that is, there exists a closed ball <math>B(x, r)</math> centered at <math>x</math>, such that <math>B(x,r) \cap X</math> is closed), then <math>B_f(x, r)</math> is bounded for all <math>r</math> . If <math>X</math> is closed, then <math>B_f(x, r)</math> is compact for all <math>r</math>. -* '''Law of cosines''':<ref name=cs.utexas.edu>{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref> +* '''Law of cosines''':<ref name="cs.utexas.edu">{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref> For any <math>p,q,z</math> ::<math>D_F(p, q) = D_F(p, z) + D_F(z, q) - (p - z)^T(\nabla F(q) - \nabla F(z))</math> '
New page size (new_size)
26581
Old page size (old_size)
26362
Size change in edit (edit_delta)
219
Lines added in edit (added_lines)
[ 0 => '', 1 => '* '''Integral form:''' by the integral remainder form of Taylor's Theorem, a Bregman divergence can be written as the integral of the Hessian of $F$ along the line segment between the Bregman divergence's arguments.', 2 => '* '''Law of cosines''':<ref name="cs.utexas.edu">{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref>' ]
Lines removed in edit (removed_lines)
[ 0 => '* '''Law of cosines''':<ref name=cs.utexas.edu>{{cite web|url=https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf|title=Learning with Bregman Divergences|website=utexas.edu|access-date=19 August 2023}}</ref>' ]
Whether or not the change was made through a Tor exit node (tor_exit_node)
false
Unix timestamp of change (timestamp)
'1727655508'