<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Algebra &amp; Fire</title>
    <link>/</link>
      <atom:link href="/index.xml" rel="self" type="application/rss+xml" />
    <description>Algebra &amp; Fire</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator>
    <language>en-us</language>
    <lastBuildDate>Mon, 11 May 2020 15:29:33 +0930</lastBuildDate>
    <image>
      <url>/images/icon_huff46a34809cf36f98ff565ff8b0a5f91_4201_512x512_fill_lanczos_center_2.png</url>
      <title>Algebra &amp; Fire</title>
      <link>/</link>
    </image>
    
    <item>
      <title>Optimization Landscape Symmetry, Saddle Points and Beyond</title>
      <link>/talk/2020-05-optimization/</link>
      <pubDate>Mon, 11 May 2020 15:29:33 +0930</pubDate>
      <guid>/talk/2020-05-optimization/</guid>
      <description>&lt;p&gt;
&lt;a href=&#34;../../slides/stochastic-optimization-techniques&#34;&gt;Slides&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Optimization Landscape Symmetry, Saddle Points and Beyond</title>
      <link>/slides/stochastic-optimization-techniques/</link>
      <pubDate>Wed, 06 May 2020 00:00:00 +0000</pubDate>
      <guid>/slides/stochastic-optimization-techniques/</guid>
      <description>&lt;h1 id=&#34;optimization&#34;&gt;Optimization&lt;/h1&gt;
&lt;p&gt;Landscape Symmetry, Saddle Points and Beyond&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;non-convex-optimization&#34;&gt;Non-convex Optimization&lt;/h2&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Probabilistic Models,
&lt;/span&gt;&lt;span class=&#34;fragment &#34; &gt;
Deep Neural Nets
&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Theory&lt;/strong&gt;: 
&lt;a href=&#34;https://en.wikipedia.org/wiki/NP-hardness&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;NP-hard&lt;/a&gt;. Better avoid or use convex relaxation.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Practice&lt;/strong&gt;: Easy! Just run SGD.
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;In practice, these algorithms converge to good solutions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;gradient-descent&#34;&gt;Gradient Descent&lt;/h2&gt;
&lt;p&gt;$$
x_{t+1} = x_t - \eta\nabla f(x_t)
$$&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Converges to a stationary point ($\nabla f(x_t)=0$) 
&lt;a href=&#34;http://rd.springer.com/book/10.1007%2F978-1-4419-8853-9&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Nesterov &amp;lsquo;98]&lt;/a&gt;
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
or &amp;ldquo;local minimum&amp;rdquo; [Ge et al. &amp;lsquo;15]
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;gd-cannot-escape-from-a-local-optimal-solution&#34;&gt;GD cannot escape from a locally optimal solution&lt;/h3&gt;
&lt;hr&gt;
&lt;h2 id=&#34;landscape&#34;&gt;Landscape&lt;/h2&gt;
&lt;style&gt;
.container{
    display: flex;
}
.col{
    flex: 1;
}
&lt;/style&gt;
&lt;div class=&#34;container&#34;&gt;
&lt;div class=&#34;col&#34;&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;p&gt;Convex Functions&lt;/p&gt;
&lt;img height=&#34;200&#34; src=&#34;img/convex.png&#34;&gt;
&lt;ul&gt;
&lt;li&gt;Simple: 0 gradient → Global Minimum&lt;/li&gt;
&lt;li&gt;Can be optimized efficiently&lt;/li&gt;
&lt;/ul&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;div class=&#34;col&#34;&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;p&gt;Non-Convex Functions&lt;/p&gt;
&lt;img height=&#34;200&#34; src=&#34;img/non-convex.png&#34;&gt;
&lt;ul&gt;
&lt;li&gt;Complicated: local minima, saddle points&lt;/li&gt;
&lt;li&gt;GD can only find a local minimum&lt;/li&gt;
&lt;/ul&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;h2 id=&#34;what-special-properties-make-a-non-convex-function-easy&#34;&gt;What special properties make a non-convex function easy?&lt;/h2&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Why are the objectives always non-convex?
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;symmetry--non-convexity&#34;&gt;Symmetry → Non-Convexity&lt;/h3&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Problem asks for multiple components, but the components have &lt;span style=&#34;color:green;&#34;&gt;no ordering&lt;/span&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;img height=&#34;600&#34; src=&#34;img/clustering.png&#34;&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;The neurons in a layer of a neural network can be permuted and still compute the same function.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;div class=&#34;container&#34;&gt;
&lt;div class=&#34;col&#34;&gt;
&lt;span class=&#34;fragment &#34; &gt;
  &lt;figure&gt;
  &lt;img src=&#34;img/solution-1.png&#34; alt=&#34;Solution 1&#34; height=&#34;150&#34;&gt;
  &lt;figcaption&gt;Solution (a)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;div class=&#34;col&#34;&gt;
&lt;span class=&#34;fragment &#34; &gt;
  &lt;figure&gt;
  &lt;img src=&#34;img/solution-2.png&#34; alt=&#34;Solution 2&#34; height=&#34;150&#34;&gt;
  &lt;figcaption&gt;Solution (b)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;span class=&#34;fragment &#34; &gt;
  &lt;figure&gt;
  &lt;img src=&#34;img/solution-3.png&#34; alt=&#34;Solution 3&#34; height=&#34;200&#34;&gt;
  &lt;figcaption&gt;Convex Combination (a+b)/2&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;If the objective is convex, then the convex combination of two optimal solutions is also optimal&lt;/li&gt;
&lt;li&gt;10 min mark&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;optimization-algorithms-need-to-span-stylecolorredbreak-the-symmetryspan-and-converge-to-one-of-the-equivalent-local-minima&#34;&gt;Optimization algorithms need to &lt;span style=&#34;color:red&#34;&gt;break the symmetry&lt;/span&gt; and converge to one of the (equivalent) local minima.&lt;/h3&gt;
&lt;hr&gt;
&lt;h2 id=&#34;saddle-points&#34;&gt;Saddle Points&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&#34;https://academo.org/demos/3d-surface-plotter/?expression=x*x-y*y%2By%5E4%2B0.1*y&amp;amp;xRange=-1%2C%2B1&amp;amp;yRange=-1%2C%2B1&amp;amp;resolution=50&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;$$z = x^2 - y^2 + y^4 + 0.1\cdot y$$&lt;/a&gt;&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Global minimum&lt;/li&gt;
&lt;li&gt;Local minimum&lt;/li&gt;
&lt;li&gt;Saddle points&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
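&lt;p&gt;A minimal sketch (not from the talk) of plain gradient descent on the surface above, $z = x^2 - y^2 + y^4 + 0.1\cdot y$. The step size, iteration count and starting points are illustrative assumptions. Depending on which side of the saddle near $y \approx 0.05$ the start lies, GD settles in the global minimum ($y \approx -0.73$) or in the local minimum ($y \approx 0.68$).&lt;/p&gt;

```python
def grad(x, y):
    """Gradient of z = x^2 - y^2 + y^4 + 0.1*y."""
    return 2.0 * x, -2.0 * y + 4.0 * y ** 3 + 0.1


def gradient_descent(x, y, eta=0.1, steps=500):
    """Plain GD: (x, y) := (x, y) - eta * grad z(x, y)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y


# Start on the negative side of the saddle: reaches the global minimum.
x_a, y_a = gradient_descent(1.0, 0.0)
# Start on the positive side of the saddle: stuck in the local minimum.
x_b, y_b = gradient_descent(1.0, 0.5)
print(y_a, y_b)
```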
&lt;hr&gt;
&lt;h2 id=&#34;symmetry--non-convexity-1&#34;&gt;Symmetry → Non-Convexity&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&#34;https://academo.org/demos/3d-surface-plotter/?expression=x%5E4%2By%5E4-x%5E2-y%5E2&amp;amp;xRange=-1.5%2C%2B1.5&amp;amp;yRange=-1.5%2C%2B1.5&amp;amp;resolution=50&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;$$f(x) = -\|x\|^2 + \|x\|_4^4$$&lt;/a&gt;&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;This construction is used in independent component analysis&lt;/li&gt;
&lt;li&gt;Four local/global minima (symmetric)&lt;/li&gt;
&lt;li&gt;Connect two adjacent local minima&lt;/li&gt;
&lt;li&gt;Can we add constraints to break the symmetry?&lt;/li&gt;
&lt;li&gt;Rotate and restrict in the first quadrant.&lt;/li&gt;
&lt;li&gt;This will add new local minima.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
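&lt;p&gt;A small numerical check (an illustration, not part of the talk) of why the symmetry forces non-convexity for $f(x) = -\|x\|^2 + \|x\|_4^4$ in two dimensions. The four minima sit at $(\pm 1/\sqrt2, \pm 1/\sqrt2)$; the midpoint of two adjacent minima has a strictly larger value than either of them, which a convex function would not allow.&lt;/p&gt;

```python
import math


def f(x, y):
    """f(x) = -||x||^2 + ||x||_4^4 in two dimensions."""
    return x ** 4 + y ** 4 - x ** 2 - y ** 2


r = 1.0 / math.sqrt(2.0)
minimum_a = (r, r)    # one of the four symmetric minima (+-r, +-r)
minimum_b = (-r, r)   # its mirror image across the y-axis
midpoint = (0.0, r)   # (a + b) / 2

# Convexity would require f(midpoint) to be at most the average of f(a), f(b).
print(f(*minimum_a), f(*minimum_b), f(*midpoint))
```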
&lt;hr&gt;
&lt;h2 id=&#34;locally-optimizable-functions&#34;&gt;Locally Optimizable Functions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Local min are symmetric versions of global min
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
No high order saddle points
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;High-order saddle points have zero gradient and a p.s.d. Hessian&lt;/li&gt;
&lt;li&gt;SGD is guaranteed to find a global minimum&lt;/li&gt;
&lt;li&gt;The first condition seems to be very strong&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;SVD/PCA&lt;/li&gt;
&lt;li&gt;Generalized Linear Model [KKKS&amp;rsquo;11] [HLS&amp;rsquo;14]&lt;/li&gt;
&lt;li&gt;Synchronization [BVS&amp;rsquo;16]&lt;/li&gt;
&lt;li&gt;Dictionary Learning [SQW&amp;rsquo;17]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;Matrix Completion [GLM16] [GJZ17]&lt;/li&gt;
&lt;li&gt;Matrix Sensing [BNS&amp;rsquo;16] [PKCS&amp;rsquo;16]&lt;/li&gt;
&lt;li&gt;MAX-CUT [MMMO&amp;rsquo;17]&lt;/li&gt;
&lt;li&gt;Tensor Decomposition [GHJY&amp;rsquo;15] [GM&amp;rsquo;16]&lt;/li&gt;
&lt;li&gt;2-Layer Neural Net [GLM17]&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Matrix completion as a simple example&lt;/li&gt;
&lt;li&gt;Landscape of neural networks&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;matrix-completion&#34;&gt;Matrix Completion&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Low rank matrix $M$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Observations: entries of $M$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Goal: recover remaining entries
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;img/matrix-completion.png&#34; alt=&#34;Incomplete matrix&#34; height=&#34;300&#34;&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;typical application: recommendation system [RennieSrebro05]&lt;/li&gt;
&lt;li&gt;assume the matrix is a product of two $n\times r$ matrices&lt;/li&gt;
&lt;li&gt;Hope to recover using $\tilde O(nr)$ observations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;non-convex-objective&#34;&gt;Non-Convex Objective&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Idea: Try to find the low rank factors directly $$M=U\cdot V^T$$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Variables $X$, $Y$. Hope $X = U, Y = V, M = XY^T$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Uniform observations $(i, j)\in\Omega$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Minimize “loss” on observed entries $$\min f(X, Y) = \sum_{(i,j)\in\Omega}(M_{i, j} - (XY^T)_{i,j})^2$$
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3 id=&#34;symmetry-and-solutions&#34;&gt;Symmetry and Solutions&lt;/h3&gt;
&lt;p&gt;$$M=UV^T\qquad \min(f(X, Y):=\|M-XY^T\|^2_\Omega)$$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Hope: $X=U, Y=V$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Not true: many equivalent solutions: $$UV^T=URR^TV^T$$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Saddle points: e.g. $X=Y=0$
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;The objective behaves similarly to a norm&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Theorem&lt;/span&gt;: When the number of observations is at least $\tilde{\Omega}(nr^6)$, all local minima of $f(X, Y)^*$ are global minima, i.e. they satisfy $XY^T=M$.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;[&lt;a href=&#34;https://arxiv.org/abs/1605.07272&#34; title=&#34;Matrix Completion has No Spurious Local Minimum&#34;&gt;GLM16&lt;/a&gt;]: symmetric case; [&lt;a href=&#34;https://arxiv.org/abs/1704.00708&#34; title=&#34;No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis&#34;&gt;GJZ17&lt;/a&gt;]: asymmetric case&lt;/p&gt;
&lt;hr&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Corollary&lt;/span&gt;: Simple SGD can solve matrix completion from an &lt;span style=&#34;color:blue&#34;&gt;arbitrary&lt;/span&gt; starting point.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Prior work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;convex relaxation&lt;/li&gt;
&lt;li&gt;non-convex optimization with carefully chosen starting point&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;convex relaxation: tight in $r$, $\Omega(nr\,\text{poly}(\log d))$, where $d$ is #observations&lt;/li&gt;
&lt;li&gt;non-convex: $nr^2$&lt;/li&gt;
&lt;li&gt;27 min mark&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;p&gt;$X$ is a local minimum of $f(X)$ → $\nabla f(X)=0, \nabla^2f(X)\succcurlyeq 0$&lt;/p&gt;
&lt;div class=&#34;container&#34;&gt;
&lt;div class=&#34;col&#34;&gt;
&lt;span class=&#34;fragment &#34; &gt;
  &lt;p&gt;$$\nabla f(X)\ne 0$$&lt;/p&gt;
&lt;p&gt;Following the negative gradient reduces $f(X)$.&lt;/p&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;div class=&#34;col&#34;&gt;
&lt;span class=&#34;fragment &#34; &gt;
  &lt;p&gt;$$\lambda_{\min}(\nabla^2f(X))&amp;lt;0$$&lt;/p&gt;
&lt;p&gt;Min eigendirection of Hessian reduces $f(X)$.&lt;/p&gt;
&lt;/span&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;hr&gt;
&lt;h3 id=&#34;direction-of-improvment-exists&#34;&gt;Direction of Improvement Exists&lt;/h3&gt;
&lt;p&gt;If $X$ is not a global minimum,&lt;/p&gt;
&lt;p&gt;there exists $\Delta$ such that $\langle\nabla f(X), \Delta\rangle\ne 0$ or $\nabla^2f(X)[\Delta]&amp;lt;0$&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;matrix-factorization&#34;&gt;Matrix Factorization&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Every entry is observed, want to write $M=UV^T$&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Consider symmetric case: $M=UU^T$ $$g(X):=\|M-XX^T\|_F^2$$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Goal: prove local minima satisfy $XX^T=M$.
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;$\nabla g(X)=0$&lt;span class=&#34;fragment &#34; &gt;
→ $MX=XX^TX$
&lt;/span&gt; &lt;span class=&#34;fragment &#34; &gt;
→ If $\text{span}(X)=\text{span}(M)$, $M=XX^T$
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
$\nabla^2 g(X)\succcurlyeq0$
&lt;/span&gt;&lt;span class=&#34;fragment &#34; &gt;
→ $\text{span}(X)=\text{span}(M)$
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Approach in [&lt;a href=&#34;https://arxiv.org/abs/1605.07272&#34; title=&#34;Matrix Completion has No Spurious Local Minimum&#34;&gt;GLM16&lt;/a&gt;]. Need more cases to work for Matrix Completion.
&lt;/span&gt;&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;$X$ is a full-rank matrix == span(X) = span(M) (span of column vectors)&lt;/li&gt;
&lt;li&gt;for the case X is not full rank, for example X=0, gradient = 0&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
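&lt;p&gt;A minimal sketch of the symmetric factorization objective in the rank-1 case, where $M = uu^T$ and $g(x) = \|M - xx^T\|_F^2$ has gradient $4(xx^T - M)x$. The vector $u$, the starting point, the step size and the iteration count are hypothetical choices for illustration. The sketch shows that $x = 0$ is a stationary point and that GD from a small non-zero start reaches one of the two equivalent solutions $\pm u$ (here $-u$).&lt;/p&gt;

```python
def grad_g(x, u):
    """Gradient of g(x) = ||u u^T - x x^T||_F^2, i.e. 4 (x x^T - u u^T) x."""
    xx = sum(xi * xi for xi in x)               # x^T x
    ux = sum(ui * xi for ui, xi in zip(u, x))   # u^T x
    return [4.0 * (xx * xi - ux * ui) for xi, ui in zip(x, u)]


def descend(x, u, eta=0.05, steps=500):
    """Plain gradient descent on g, starting from x."""
    for _ in range(steps):
        g = grad_g(x, u)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


u = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]   # hypothetical unit-norm ground truth
x_hat = descend([0.1, -0.2, 0.05], u)    # start has negative overlap with u

print(grad_g([0.0, 0.0, 0.0], u))        # zero gradient at the saddle x = 0
print(x_hat)                             # close to -u, and x x^T matches M
```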
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Intuitively, want $X$ to go to the optimal solution $U$.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Recall: many equivalent optimal solutions!
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Idea: find the “closest” among all optimal solutions $$\Delta = X-UR, \qquad R=\arg\min||X-UR||_F$$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Nice property: $||\Delta\Delta^T||_F^2\le2||M-XX^T||_F^2$
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Alternative approach finding direction of improvement&lt;/li&gt;
&lt;li&gt;If $\Delta$ is not zero, $X$ is not globally optimal&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;main-lemma-gjz17&#34;&gt;Main Lemma [&lt;a href=&#34;https://arxiv.org/abs/1704.00708&#34; title=&#34;No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis&#34;&gt;GJZ17&lt;/a&gt;]&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Lemma&lt;/span&gt;: If $\|\Delta\Delta^T\|_\Omega^2&amp;lt;3\|M-XX^T\|_\Omega^2$, then either $\langle\nabla f(X), \Delta\rangle\ne 0$ or $\nabla^2 f(X)[\Delta]&amp;lt;0$. ($\Delta$ is a direction of improvement.)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
$||\Delta\Delta^T||_F^2\le2||M-XX^T||_F^2$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Immediate proof for matrix factorization
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
For completion, proof works as long as $||A||_\Omega\approx ||A||_F$ for $\Delta\Delta^T$ and $M-XX^T$
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;$\Delta\Delta^T$ and $M-XX^T$ are both low rank.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;restricted-isometry-propertyhttpsenwikipediaorgwikirestricted_isometry_property&#34;&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Restricted_isometry_property&#34;&gt;Restricted isometry property&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;characterizes matrices which are nearly orthonormal, at least when operating on sparse vectors.&lt;/p&gt;
&lt;p&gt;Applies to asymmetric cases, matrix sensing and robust PCA.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;locally-optimizable-problems&#34;&gt;Locally optimizable problems&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Low rank matrix
&lt;ul&gt;
&lt;li&gt;SVD/PCA, Matrix Completion, Synchronization, Matrix Sensing, GLM, MAX-CUT&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Low rank tensor
&lt;ul&gt;
&lt;li&gt;Dictionary Learning, Tensor Decomposition, 2-Layer Neural Net&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;optimization-landscape-for-neural-network&#34;&gt;Optimization landscape for neural network&lt;/h2&gt;
&lt;p&gt;$$
g_W(x) = \sigma(W_p\sigma(W_{p-1}\sigma(\cdots \sigma(W_1x)\cdots)))
$$&lt;/p&gt;
&lt;p&gt;$$
f(W) = \mathbb E[\|y - g_W(x)\|^2]
$$&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Fully connected with ReLU&lt;/li&gt;
&lt;li&gt;40 min mark&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;teacherstudent-setting&#34;&gt;Teacher/Student Setting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Goal: Prove something about optimization&lt;/li&gt;
&lt;li&gt;Assume there is already a good network ($W^*$) and enough samples&lt;/li&gt;
&lt;li&gt;Good solution: student mimics teacher&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Make sure that the hypothesis class can recover the solution&lt;/li&gt;
&lt;li&gt;Focus on optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;linear-networks&#34;&gt;Linear Networks&lt;/h3&gt;
&lt;p&gt;$$
g_W(x) = W_pW_{p-1}\cdots W_1x
$$&lt;/p&gt;
&lt;p&gt;[&lt;a href=&#34;https://arxiv.org/abs/1605.07110&#34;&gt;Kawaguchi 16&lt;/a&gt;], [&lt;a href=&#34;https://arxiv.org/abs/1707.02444&#34;&gt;YSJ18&lt;/a&gt;]&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
All local minima of linear neural network (with squared loss) are global*
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
With 2 layers, no higher order saddle points.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
With 3 or more layers, has higher order saddles.
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;All methods rely on different assumptions. Not comparable and not clear whether the assumptions are necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;For a critical point, if the product of all layers has rank $r$: when $r = \min(m, n)$ it is a local and global minimum; when $r &amp;lt; \min(m, n)$ it can also be a normal saddle point.&lt;/li&gt;
&lt;li&gt;Open problem: does local search actually find a global minimum?&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;min(m, n) the maximum rank we can get&lt;/li&gt;
&lt;li&gt;Algorithm can be trapped in higher order saddles&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;two-layer-neural-network&#34;&gt;Two-Layer Neural Network&lt;/h3&gt;
&lt;p&gt;$$
g(x) = a^T\sigma(Bx)
$$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Wlog: rows of $B$ ($b_i$) are unit norm.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Data: $x\sim N(0, I)$, $y$ from a teacher.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Some more technical assumptions on $a^*$, $B^*$
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;most importantly, $B^*$ is full-rank&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;bad-local-minima&#34;&gt;Bad local minima&lt;/h3&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Claim: &lt;/span&gt;The objective function $f(a, B)$ has local minima that are not equivalent to the ground truth.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Observed in [&lt;a href=&#34;https://arxiv.org/abs/1711.00501&#34; title=&#34;Learning One-hidden-layer Neural Networks with Landscape Design&#34;&gt;GLM17&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;Formally verified in [&lt;a href=&#34;https://arxiv.org/abs/1712.08968&#34; title=&#34;Spurious Local Minima are Common in Two-Layer ReLU Neural Networks&#34;&gt;Safran&amp;amp;Shamir17&lt;/a&gt;]&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Over-parametrization appears to drastically reduce such local minima&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;landscape-design&#34;&gt;Landscape Design&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Idea: Design a new objective with no bad local min.&lt;/li&gt;
&lt;li&gt;Implicit in many previous techniques
&lt;ul&gt;
&lt;li&gt;Regularization&lt;/li&gt;
&lt;li&gt;Method of moments instead of MLE&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3 id=&#34;provable-new-objective&#34;&gt;Provable New Objective&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Theorem&lt;/span&gt;[&lt;a href=&#34;https://arxiv.org/abs/1711.00501&#34; title=&#34;Learning One-hidden-layer Neural Networks with Landscape Design&#34;&gt;GLM17&lt;/a&gt;]: Can construct an objective for two-layer neural network such that all local minima are global*&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Objective inspired by tensor decomposition
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Relies on Gaussian distribution
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Extended to symmetric input distribution* by [&lt;a href=&#34;https://arxiv.org/abs/1810.06793&#34; title=&#34;Learning Two-layer Neural Networks with Symmetric Inputs&#34;&gt;GKLW18&lt;/a&gt;]
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Some assumptions are needed for global* to hold&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Theorem&lt;/span&gt;[&lt;a href=&#34;https://arxiv.org/abs/1810.06793&#34; title=&#34;Learning Two-layer Neural Networks with Symmetric Inputs&#34;&gt;GKLW18&lt;/a&gt;]: For a two-layer neural network with more outputs than hidden units, if the input distribution is symmetric, there is a polynomial-time algorithm that learns the neural network.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Theorem&lt;/span&gt;[&lt;a href=&#34;https://arxiv.org/abs/1909.11837&#34; title=&#34;Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently&#34;&gt;GWZ19&lt;/a&gt;]: For a 2/3-layer neural network with quadratic/polynomial activations, if the inputs are in general position, and&lt;/p&gt;
&lt;p&gt;#parameters = O(1) #training samples&lt;/p&gt;
&lt;p&gt;GD can memorize training data.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;spin-glass-model&#34;&gt;Spin Glass Model&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span style=&#34;color:blue&#34;&gt;Claim&lt;/span&gt;[&lt;a href=&#34;https://arxiv.org/abs/1412.0233&#34; title=&#34;The Loss Surfaces of Multilayer Networks&#34;&gt;CHMBL15&lt;/a&gt;]: all the local minima of a neural network have approximately equal function values&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img src=&#34;img/spin-glass.png&#34; alt=&#34;Solution 3&#34; height=&#34;200&#34;&gt;
&lt;hr&gt;
&lt;h3 id=&#34;kac-rice-formula&#34;&gt;Kac-Rice Formula&lt;/h3&gt;
&lt;p&gt;$$
\int_x\mathbb E[|\det(\nabla^2 f)|\cdot \mathbf 1(\nabla^2 f\preceq0)\mathbf 1(x\in Z)|\nabla f(x)=0]p_{\nabla f(x)}
$$&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Proof idea: Count number of local min directly
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Evaluate the formula using random matrix theory
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Formal for spin-glass model (random polynomials), informal connection with NNs.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Formal results for overcomplete tensor and tensor PCA
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Kac-Rice formula can be used to compute #local minima in any region&lt;/li&gt;
&lt;li&gt;In some cases, Hess and grad will be nice random matrices&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;dynamicstrajectory&#34;&gt;Dynamics/Trajectory&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Main idea: instead of analyzing the global landscape, analyze the path from a random initialization.
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Observation: path can be very short
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;Empirical Risk/Training Error
&lt;ul&gt;
&lt;li&gt;[Du, Zhai, Poczos, Singh] Two-layer&lt;/li&gt;
&lt;li&gt;[Allen-Zhu, Li, Song] [Du, Lee, Li, Wang, Zhai] [Zou, Cao, Zhu, Gu] Multi-layer/ResNet&lt;/li&gt;
&lt;li&gt;[Allen-Zhu, Li, Song] Recurrent Neural Network&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Population Risk/Test Error
&lt;ul&gt;
&lt;li&gt;[Li, Liang] Special multiclass classification&lt;/li&gt;
&lt;li&gt;[Allen-Zhu, Li, Liang] 2 or 3 layer neural network. &amp;ldquo;Kernel-like&amp;rdquo; setting, special activation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;If your model is over-parametrized, it can overfit your data&lt;/li&gt;
&lt;li&gt;Considering generalization is harder.&lt;/li&gt;
&lt;li&gt;Kernel-like means the neural net is approximable using low-degree polynomials&lt;/li&gt;
&lt;li&gt;and requires special activation&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;beyond-optimizationhttpwwwoffconvexorg20190603trajectories&#34;&gt;&lt;a href=&#34;http://www.offconvex.org/2019/06/03/trajectories/&#34;&gt;Beyond Optimization&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Accelerated methods are faster but lead to slightly worse generalization.&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;GD has an innate bias towards finding solutions with good generalization
&lt;ul&gt;
&lt;li&gt;Methods to speed up gradient descent (e.g., acceleration or adaptive regularization) can sometimes lead to worse generalization.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;We need to develop a new vocabulary (and mathematics) to reason about trajectories. This goes beyond the usual “landscape view” of stationary points, gradient norms, Hessian norms, smoothness etc&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;finding-a-global-minimum-is-not-always-good-enough-trajectoryimplicit-bias&#34;&gt;Finding a global minimum is not always good enough (trajectory/implicit bias)&lt;/h3&gt;
&lt;hr&gt;
&lt;h2 id=&#34;neural-tangent-kernelshttpwwwoffconvexorg20191003ntk&#34;&gt;&lt;a href=&#34;http://www.offconvex.org/2019/10/03/NTK/&#34;&gt;Neural Tangent Kernels&lt;/a&gt;&lt;/h2&gt;
&lt;p&gt;Optimization is easy (linear) in the highly overparametrized regime&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;how-many-parameters-do-we-need&#34;&gt;How many parameters do we need?&lt;/h3&gt;
&lt;/aside&gt;
&lt;/span&gt;</description>
    </item>
    
    <item>
      <title>The EM Algorithm</title>
      <link>/slides/the-em-algorithm/</link>
      <pubDate>Mon, 27 Apr 2020 00:00:00 +0000</pubDate>
      <guid>/slides/the-em-algorithm/</guid>
      <description>&lt;h1 id=&#34;the-em-algorithm&#34;&gt;The EM Algorithm&lt;/h1&gt;
&lt;hr&gt;
&lt;h2 id=&#34;appearance&#34;&gt;Appearance&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
Gaussian mixture models
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
the Baum-Welch algorithm for HMM
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
mixed regression
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
close to Lloyd&amp;rsquo;s algorithm for k-means clustering
&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;notation&#34;&gt;Notation&lt;/h2&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
&lt;em&gt;distribution&lt;/em&gt; with parameter $\boldsymbol \theta$: $f(\cdot|\boldsymbol \theta)$ or $f_{\boldsymbol\theta}$
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
&lt;em&gt;parametric family&lt;/em&gt;: $\mathcal F=\{f_{\boldsymbol\theta}:\boldsymbol{\theta}\in\Theta\}$
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
&lt;em&gt;random variable&lt;/em&gt; : $X\sim f(\cdot|\boldsymbol\theta)$ or $X\sim f_{\boldsymbol\theta}$
&lt;/span&gt;&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;The random variable is usually a joint of input and label&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;probabilistic-machine-learning&#34;&gt;Probabilistic Machine Learning&lt;/h2&gt;
&lt;p&gt;We observe i.i.d. samples $\mathbf x_1, \mathbf x_2, \dots, \mathbf x_n\sim\mathbb P$.&lt;/p&gt;
&lt;p&gt;Assume $\mathbb P$ belongs to $\mathcal F$ and estimate the optimal $\theta^*$.&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;$\mathbb P$ is a data-generating distribution we don&amp;rsquo;t see&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;likelihood&#34;&gt;Likelihood&lt;/h3&gt;
&lt;p&gt;$$
\begin{aligned}
\mathcal L(\boldsymbol \theta;\mathbf x_1, \mathbf x_2, \dots, \mathbf x_n) &amp;amp; := f(\mathbf x_1, \mathbf x_2, \dots, \mathbf x_n|\boldsymbol \theta)\\&lt;br&gt;
&amp;amp; = \prod_{i=1}^n f(\mathbf x_i\vert\boldsymbol \theta)
\end{aligned}
$$&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;maximum-likelihood-estimate&#34;&gt;Maximum Likelihood Estimate&lt;/h3&gt;
&lt;p&gt;$$
\hat{\boldsymbol \theta}_{MLE}:=\underset{\boldsymbol \theta\in\Theta}{\text{arg max }}\mathcal L(\boldsymbol \theta;\mathbf x_1, \mathbf x_2, \dots, \mathbf x_n)
$$&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Maximum a Posteriori&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;latent-variable&#34;&gt;Latent Variable&lt;/h2&gt;
&lt;p&gt;Some hidden variable $Z$ that affects the observation $Y$&lt;/p&gt;
&lt;p&gt;$(Y,Z)\sim f_{\theta^*} = f(\cdot, \cdot|\mathbf \theta^*)$&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Topics of an article&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;marginalized-likelihood&#34;&gt;Marginalized likelihood&lt;/h3&gt;
&lt;p&gt;$\mathcal L(\theta; y) = \prod_{i=1}^n\sum_{z_i\in\mathcal Z} f(y_i, z_i|\theta)$&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Summing $z$ is usually intractable.
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;am-lvm&#34;&gt;AM-LVM&lt;/h3&gt;
&lt;p&gt;Alternately estimate $\theta^t$ and $z^t$ at time step $t$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
assign a $z^t_i$ to each $y_i$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
$\theta^t$: MLE on $\mathcal L(\theta; \{(y_i, z^t_i)\}^n_{i=1})$
&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;fragment &#34; &gt;
$z^{t+1}_i$: MAP on $f(z|y_i, \theta^t)$
&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
AM-LVM neglects the effects of every $z&#39;$ other than the most likely one.
&lt;/span&gt;&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Topics of an article&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
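&lt;p&gt;The three steps above can be sketched for a toy case. The code assumes a 1-D mixture of unit-variance Gaussians with equal weights, so the MAP label of a point is simply the nearest mean and the MLE of each mean is a cluster average. The function name &lt;code&gt;am_lvm_1d&lt;/code&gt; and the data are hypothetical.&lt;/p&gt;

```python
def am_lvm_1d(ys, means, steps=20):
    """Hard-assignment alternating estimation for a 1-D mixture.

    Assumes unit-variance Gaussian components with equal weights, so the
    MAP label of a point is the index of the nearest mean.
    """
    means = list(means)
    labels = [0] * len(ys)
    for _ in range(steps):
        # Assign each y_i its single most likely latent value z_i.
        labels = [min(range(len(means)), key=lambda k: (y - means[k]) ** 2)
                  for y in ys]
        # MLE of each mean given the hard labels: the cluster average.
        for k in range(len(means)):
            members = [y for y, z in zip(ys, labels) if z == k]
            if members:  # keep the old mean if a cluster is empty
                means[k] = sum(members) / len(members)
    return means, labels


ys = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]  # made-up observations
means, labels = am_lvm_1d(ys, means=[-1.0, 1.0])
```

&lt;p&gt;Each point contributes to exactly one component here, which is the behaviour the slide describes: every other value of $z$ is ignored.&lt;/p&gt;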
&lt;hr&gt;
&lt;h2 id=&#34;em-algorithm&#34;&gt;EM Algorithm&lt;/h2&gt;
&lt;p&gt;&amp;ldquo;assigns&amp;rdquo; $y_i$ to a value $z$ with weight $f(z|y_i, \theta^t)$&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;sampling-z&#34;&gt;Sampling $z$&lt;/h3&gt;
&lt;p&gt;$$
\log\mathcal L(\theta;y) = \log\sum_zf(y, z|\theta)
$$&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
$$ =\log\sum_zf(z|y, \theta^0)\frac{f(y, z|\theta)}{f(z|y, \theta^0)} $$
&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
$$ =\log\mathbb E_{z\sim f(\cdot|y, \theta^0)}\left[\frac{f(y, z|\theta)}{f(z|y, \theta^0)}\right] $$
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;q-function&#34;&gt;$Q$-function&lt;/h3&gt;
&lt;p&gt;\begin{aligned}
&amp;amp; \log\mathbb E_{z\sim f(\cdot|y, \theta^0)}\left[\frac{f(y, z|\theta)}{f(z|y, \theta^0)}\right] \\&lt;br&gt;
\ge&amp;amp; \mathbb E_{z\sim f(\cdot|y, \theta^0)}\left[\log\frac{f(y, z|\theta)}{f(z|y, \theta^0)}\right] \qquad \text{(Jensen&amp;rsquo;s inequality)}\\&lt;br&gt;
=&amp;amp; \mathbb E_{z\sim f(\cdot|y, \theta^0)}[\log f(y, z|\theta)] \qquad \qquad (Q_y(\theta|\theta^0))\\&lt;br&gt;
&amp;amp;- \mathbb E_{z\sim f(\cdot|y, \theta^0)}[\log f(z|y, \theta^0)] \qquad (R_y(\theta^0))
\end{aligned}&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;weighted point-wise likelihood&lt;/li&gt;
&lt;li&gt;Variational distribution&lt;/li&gt;
&lt;li&gt;equality holds when $\theta = \theta^0$&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;choosing-theta-to-improve-q&#34;&gt;Choosing $\theta$ to improve $Q$&lt;/h3&gt;
&lt;p&gt;$$
\log\mathcal L(\theta^0;y) = Q_y(\theta^0|\theta^0) - R_y(\theta^0)
$$&lt;/p&gt;
&lt;p&gt;$$
\log\mathcal L(\theta;y) - \log\mathcal L(\theta^0;y) \ge Q_y(\theta|\theta^0) - Q_y(\theta^0|\theta^0)
$$&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
Any $\theta$ that improves $Q_y(\cdot|\theta^0)$ causes $\log\mathcal L(\theta;y)$ to improve at least as much.
&lt;/span&gt;&lt;/p&gt;
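&lt;p&gt;The inequality above follows in two lines from the bound on the $Q$-function slide, written in the same notation:&lt;/p&gt;
&lt;p&gt;$$
\begin{aligned}
\log\mathcal L(\theta;y) &amp;amp;\ge Q_y(\theta|\theta^0) - R_y(\theta^0) \\
\log\mathcal L(\theta^0;y) &amp;amp;= Q_y(\theta^0|\theta^0) - R_y(\theta^0)
\end{aligned}
$$&lt;/p&gt;
&lt;p&gt;Subtracting the second line from the first cancels $R_y(\theta^0)$, which does not depend on $\theta$.&lt;/p&gt;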
&lt;hr&gt;
&lt;h3 id=&#34;pseudo-code&#34;&gt;Pseudo Code&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;theta &amp;lt;- INITIALIZE()
for t = 1, 2, ... do
  Q &amp;lt;- E(theta)  // E-step
  theta &amp;lt;- M(theta, Q)  // M-step
end for
return theta
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h3 id=&#34;e-step-sampling-construction&#34;&gt;E-step: sampling construction&lt;/h3&gt;
&lt;p&gt;$$
Q_t(\theta|\theta^t) = \frac1n \sum_{i=1}^n\sum_{z\in\mathcal Z} f(z|y_i, \theta^t)\cdot \log f(y_i, z|\theta)
$$&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;this expression has $n\cdot |\mathcal Z|$ terms instead of $|\mathcal Z|^n$ terms.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;m-step-gradient-descent&#34;&gt;M-step: gradient ascent&lt;/h3&gt;
&lt;p&gt;$$
M(\theta^t, Q_{Y_t}) = \theta^t + \alpha_t\cdot\nabla Q_{Y_t}(\theta^t|\theta^t)
$$&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;if M-step improves $Q$, then EM iteration improves likelihood&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;motivating-applications&#34;&gt;Motivating Applications&lt;/h2&gt;
&lt;p&gt;Gaussian Mixture Models:&lt;/p&gt;
&lt;p&gt;$$
f(\cdot|{\boldsymbol {\theta }})=\sum _{i=1}^{K}\phi _{i}{\mathcal {N}}({\boldsymbol {\mu _{i},\Sigma _{i}}})
$$&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  EM is of particular appeal for finite normal mixtures where closed-form expressions are possible such as in the following iterative algorithm by Dempster et al. (1977)
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;plate-notation&#34;&gt;Plate notation&lt;/h3&gt;


&lt;figure id=&#34;figure-non-bayesian-gaussian-mixture-model&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;/slides/the-em-algorithm/img/gaussian_mixture_hufdcb3d7e176a515abb9c2f1e6f0a0e69_16873_2000x2000_fit_lanczos_2.png&#34; data-caption=&#34;Non-Bayesian Gaussian mixture model&#34;&gt;


  &lt;img data-src=&#34;/slides/the-em-algorithm/img/gaussian_mixture_hufdcb3d7e176a515abb9c2f1e6f0a0e69_16873_2000x2000_fit_lanczos_2.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;580&#34; height=&#34;480&#34;&gt;
&lt;/a&gt;


  &lt;figcaption&gt;
    Non-Bayesian Gaussian mixture model
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Smaller squares indicate fixed parameters;&lt;/li&gt;
&lt;li&gt;larger circles indicate random variables.&lt;/li&gt;
&lt;li&gt;Filled-in shapes indicate known values.&lt;/li&gt;
&lt;li&gt;The indication [K] means a vector of size K.&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;e-step&#34;&gt;E-step&lt;/h3&gt;
&lt;p&gt;$$
h_{s}^{(j)}(t)={\frac {w_{s}^{(j)}p_{s}(x^{(t)};\mu _{s}^{(j)},\Sigma _{s}^{(j)})}{\sum _{i=1}^{K}w_{i}^{(j)}p_{i}(x^{(t)};\mu _{i}^{(j)},\Sigma _{i}^{(j)})}}.
$$&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;m-step&#34;&gt;M-step&lt;/h3&gt;
&lt;p&gt;$$
w_{s}^{(j+1)}={\frac {1}{N}}\sum _{t=1}^{N}h_{s}^{(j)}(t)
$$
$$
{\displaystyle \mu _{s}^{(j+1)}={\frac {\sum _{t=1}^{N}h_{s}^{(j)}(t)x^{(t)}}{\sum _{t=1}^{N}h_{s}^{(j)}(t)}}}
$$
$$
{\displaystyle \Sigma _{s}^{(j+1)}={\frac {\sum _{t=1}^{N}h_{s}^{(j)}(t)[x^{(t)}-\mu _{s}^{(j+1)}][x^{(t)}-\mu _{s}^{(j+1)}]^{\top }}{\sum _{t=1}^{N}h_{s}^{(j)}(t)}}}
$$&lt;/p&gt;
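&lt;p&gt;The E-step and M-step updates can be sketched for the scalar case, where $\Sigma_s$ reduces to a variance. This is a minimal illustration under assumptions: the function names, the starting values, the data and the small variance floor are all hypothetical and not part of the slides.&lt;/p&gt;

```python
import math


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, weights, means, variances, iters=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, k = len(xs), len(weights)
    for _ in range(iters):
        # E-step: responsibility h[t][s] of component s for point x_t.
        h = []
        for x in xs:
            joint = [weights[s] * normal_pdf(x, means[s], variances[s])
                     for s in range(k)]
            total = sum(joint)
            h.append([j / total for j in joint])
        # M-step: closed-form updates weighted by the responsibilities.
        for s in range(k):
            mass = sum(h[t][s] for t in range(n))
            weights[s] = mass / n
            means[s] = sum(h[t][s] * xs[t] for t in range(n)) / mass
            spread = sum(h[t][s] * (xs[t] - means[s]) ** 2 for t in range(n))
            variances[s] = max(spread / mass, 1e-6)  # floor avoids collapse
    return weights, means, variances


xs = [-2.2, -2.0, -1.8, 1.8, 2.0, 2.2]  # made-up observations
weights, means, variances = em_gmm_1d(xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

&lt;p&gt;Unlike the hard assignment of AM-LVM, every point contributes to every component, weighted by its responsibility.&lt;/p&gt;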
</description>
    </item>
    
    <item>
      <title>Universal Differential Equations for Scientific Machine Learning</title>
      <link>/slides/universal-differential-equations-for-scientific-machine-learning/</link>
      <pubDate>Wed, 01 Apr 2020 00:00:00 +0000</pubDate>
      <guid>/slides/universal-differential-equations-for-scientific-machine-learning/</guid>
      <description>&lt;h2 id=&#34;universal-differential-equations-for-scientific-machine-learning&#34;&gt;Universal Differential Equations for Scientific Machine Learning&lt;/h2&gt;
&lt;p&gt;Chris Rackauckas&lt;br/&gt;
Massachusetts Institute of Technology, Department of Mathematics&lt;br/&gt;
University of Maryland, Baltimore, School of Pharmacy, Center for Translational Medicine&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;the-major-advances-in-machine-learning-were-due-to-encoding-more-structure-into-the-model&#34;&gt;The major advances in machine learning were due to encoding more structure into the model&lt;/h3&gt;
&lt;p&gt;More structure = Faster and better fits from less data&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;convolutional-neural-networks-are-structure-assumptions&#34;&gt;Convolutional Neural Networks Are Structure Assumptions&lt;/h3&gt;


&lt;figure &gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;/slides/universal-differential-equations-for-scientific-machine-learning/img/cnn_hu2802cfbd5c89b654c13105493c7e4938_300251_2000x2000_fit_lanczos_2.png&#34; &gt;


  &lt;img data-src=&#34;/slides/universal-differential-equations-for-scientific-machine-learning/img/cnn_hu2802cfbd5c89b654c13105493c7e4938_300251_2000x2000_fit_lanczos_2.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;1206&#34; height=&#34;435&#34;&gt;
&lt;/a&gt;


&lt;/figure&gt;

&lt;hr&gt;
&lt;h3 id=&#34;what-is-the-structure-of-science&#34;&gt;What is the structure of science?&lt;/h3&gt;
&lt;hr&gt;
&lt;h3 id=&#34;ecology-example-lotka-volterra-equation&#34;&gt;Ecology Example: Lotka-Volterra Equation&lt;/h3&gt;
&lt;p&gt;$$
\frac{d🐇}{dt} = \alpha 🐇 - \beta 🐇 🐺
$$&lt;/p&gt;
&lt;p&gt;$$
\frac{d🐺}{dt} = \delta 🐇 🐺 - \gamma 🐺
$$&lt;/p&gt;
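&lt;p&gt;To make the two equations concrete, here is a small simulation sketch. The parameter values, the initial populations and the choice of a fixed-step fourth-order Runge-Kutta integrator are assumptions for illustration only. The &lt;code&gt;invariant&lt;/code&gt; helper computes a quantity that exact Lotka-Volterra trajectories conserve, which gives a cheap check on the integrator.&lt;/p&gt;

```python
import math


def lotka_volterra(state, alpha, beta, delta, gamma):
    """Right-hand side of the two equations: (d rabbits/dt, d wolves/dt)."""
    rabbits, wolves = state
    return (alpha * rabbits - beta * rabbits * wolves,
            delta * rabbits * wolves - gamma * wolves)


def rk4_step(f, state, dt, *params):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, scale):
        return tuple(s + scale * dt * ki for s, ki in zip(state, k))
    k1 = f(state, *params)
    k2 = f(shifted(k1, 0.5), *params)
    k3 = f(shifted(k2, 0.5), *params)
    k4 = f(shifted(k3, 1.0), *params)
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def invariant(state, alpha, beta, delta, gamma):
    """Quantity conserved along exact Lotka-Volterra trajectories."""
    rabbits, wolves = state
    return (delta * rabbits - gamma * math.log(rabbits)
            + beta * wolves - alpha * math.log(wolves))


params = (1.5, 1.0, 1.0, 3.0)  # hypothetical alpha, beta, delta, gamma
state = (1.0, 1.0)             # hypothetical initial rabbits and wolves
v_start = invariant(state, *params)
for _ in range(10000):  # integrate to t = 10 with dt = 0.001
    state = rk4_step(lotka_volterra, state, 0.001, *params)
v_end = invariant(state, *params)
```

&lt;p&gt;If the step size is small enough, &lt;code&gt;v_start&lt;/code&gt; and &lt;code&gt;v_end&lt;/code&gt; agree closely and both populations stay positive.&lt;/p&gt;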
&lt;hr&gt;
&lt;h3 id=&#34;scientific-mlal-is-domain-models-with-integrated-machine-learning&#34;&gt;Scientific ML/AI is Domain Models with Integrated Machine Learning&lt;/h3&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Models are these almost correct differential equations&lt;/li&gt;
&lt;li&gt;We have to augment the models with the data we have&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h3 id=&#34;mechanistic-vs-non-mechanistic-models&#34;&gt;Mechanistic vs Non-Mechanistic Models&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Differential equations&lt;/strong&gt; describe mechanisms/structure and let the equations naturally evolve from this description
&lt;ul&gt;
&lt;li&gt;$🐇&amp;rsquo;(t) = \alpha 🐇(t)$ encodes &amp;ldquo;the rate at which the population is growing depends on the current number of rabbits&amp;rdquo;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Machine learning models&lt;/strong&gt; specify a learnable black box, where with the right parameters they can fit any nonlinear function.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3 id=&#34;universal-approximation-theorem&#34;&gt;Universal Approximation Theorem&lt;/h3&gt;
&lt;p&gt;Neural networks can get $\epsilon$ close to any continuous $R^n\rightarrow R^m$ function on a compact domain&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Neural networks are just function expansions, fancy Taylor Series like things which are good for computing and bad for analysis&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;neural-networks--nonlinear-function-approximation&#34;&gt;Neural Networks = Nonlinear Function Approximation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Polynomial: $e^x = a_1 + a_2x + a_3x^2 + \cdots$&lt;/li&gt;
&lt;li&gt;Nonlinear: $e^x = 1 + \frac{a_1\tanh(a_2)}{a_3x-\tanh(a_4x)}$&lt;/li&gt;
&lt;li&gt;Neural Network: $e^x\approx W_3\sigma(W_2\sigma(W_1x+b_1) + b_2) + b_3$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Others: Fourier/Chebyshev Series, Tensor product spaces, sparse grid, RBFs, etc.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;neural-networks-are-universal-approximators-which-work-well-in-high-dimensions&#34;&gt;Neural Networks are universal approximators which work well in high dimensions&lt;/h3&gt;
&lt;p&gt;Neural networks overcome &amp;ldquo;the curse of dimensionality&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;universal-differential-equations&#34;&gt;Universal Differential Equations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Replace the user-defined structure with a neural network, and learn the nonlinear function for the structure&lt;/li&gt;
&lt;li&gt;Neural ordinary differential equation: $u&amp;rsquo; = f(u, p, t)$. Let $f$ be a neural network.&lt;/li&gt;
&lt;/ul&gt;
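&lt;p&gt;A minimal sketch of the neural ODE idea, assuming a one-hidden-layer tanh network for $f$, an autonomous system (no explicit dependence on $t$) and forward Euler integration. The names &lt;code&gt;mlp&lt;/code&gt; and &lt;code&gt;solve_euler&lt;/code&gt; and the weights are hypothetical. Training the parameters $p$ is not shown.&lt;/p&gt;

```python
import math


def mlp(u, p):
    """One-hidden-layer tanh network mapping a state vector to its derivative."""
    w1, b1, w2, b2 = p
    hidden = [math.tanh(sum(w * x for w, x in zip(row, u)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def solve_euler(f, u0, p, t_end, steps):
    """Integrate u' = f(u, p) from t = 0 to t_end with forward Euler."""
    u, dt = list(u0), t_end / steps
    for _ in range(steps):
        du = f(u, p)
        u = [ui + dt * dui for ui, dui in zip(u, du)]
    return u


# Hypothetical weights giving u' = -tanh(u), a state that decays toward zero.
p = ([[1.0]], [0.0], [[-1.0]], [0.0])
u_end = solve_euler(mlp, [1.0], p, t_end=5.0, steps=100)
```

&lt;p&gt;In practice the Euler loop would be replaced by an adaptive solver and $p$ would be fitted to data by differentiating through the solve.&lt;/p&gt;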
&lt;hr&gt;
&lt;h2 id=&#34;fragments&#34;&gt;Fragments&lt;/h2&gt;
&lt;p&gt;Make content appear incrementally&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press &lt;code&gt;Space&lt;/code&gt; to play!&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;fragment &#34; &gt;
One
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
&lt;strong&gt;Two&lt;/strong&gt;
&lt;/span&gt;
&lt;span class=&#34;fragment &#34; &gt;
Three
&lt;/span&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A fragment can accept two optional parameters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;class&lt;/code&gt;: use a custom style (requires definition in custom CSS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;weight&lt;/code&gt;: sets the order in which a fragment appears&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id=&#34;speaker-notes&#34;&gt;Speaker Notes&lt;/h2&gt;
&lt;p&gt;Add speaker notes to your presentation&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Press the &lt;code&gt;S&lt;/code&gt; key to view the speaker notes!&lt;/p&gt;
&lt;aside class=&#34;notes&#34;&gt;
  &lt;ul&gt;
&lt;li&gt;Only the speaker can read these notes&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;S&lt;/code&gt; key to view&lt;/li&gt;
&lt;/ul&gt;
&lt;/aside&gt;
&lt;hr&gt;
&lt;h2 id=&#34;themes&#34;&gt;Themes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;black: Black background, white text, blue links (default)&lt;/li&gt;
&lt;li&gt;white: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;league: Gray background, white text, blue links&lt;/li&gt;
&lt;li&gt;beige: Beige background, dark text, brown links&lt;/li&gt;
&lt;li&gt;sky: Blue background, thin dark text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;night: Black background, thick white text, orange links&lt;/li&gt;
&lt;li&gt;serif: Cappuccino background, gray text, brown links&lt;/li&gt;
&lt;li&gt;simple: White background, black text, blue links&lt;/li&gt;
&lt;li&gt;solarized: Cream-colored background, dark green text, blue links&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;

&lt;section data-noprocess data-shortcode-slide
  
      
      data-background-image=&#34;/img/boards.jpg&#34;
  &gt;

&lt;h2 id=&#34;custom-slide&#34;&gt;Custom Slide&lt;/h2&gt;
&lt;p&gt;Customize the slide style and background&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-markdown&#34;&gt;{{&amp;lt; slide background-image=&amp;quot;/img/boards.jpg&amp;quot; &amp;gt;}}
{{&amp;lt; slide background-color=&amp;quot;#0000FF&amp;quot; &amp;gt;}}
{{&amp;lt; slide class=&amp;quot;my-style&amp;quot; &amp;gt;}}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h2 id=&#34;custom-css-example&#34;&gt;Custom CSS Example&lt;/h2&gt;
&lt;p&gt;Let&amp;rsquo;s make headers navy colored.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;assets/css/reveal_custom.css&lt;/code&gt; with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-css&#34;&gt;.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;hr&gt;
&lt;h1 id=&#34;questions&#34;&gt;Questions?&lt;/h1&gt;
&lt;p&gt;
&lt;a href=&#34;https://spectrum.chat/academic&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ask&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://sourcethemes.com/academic/docs/managing-content/#create-slides&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Documentation&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Blendmask</title>
      <link>/talk/2020-03-blendmask/</link>
      <pubDate>Fri, 20 Mar 2020 15:18:19 +0930</pubDate>
      <guid>/talk/2020-03-blendmask/</guid>
      <description>&lt;p&gt;I will give a talk on BlendMask 
&lt;a href=&#34;https://live.bilibili.com/3344545&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt; at 20:00 Beijing Time (UTC+8) 24/03/2020. You can download the slides 
&lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/mSgeji3PQiD84OG&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;iframe src=&#34;//player.bilibili.com/player.html?aid=100226765&amp;bvid=BV1G7411D7j5&amp;cid=171070726&amp;page=1&#34; scrolling=&#34;no&#34; border=&#34;0&#34; frameborder=&#34;no&#34; framespacing=&#34;0&#34; allowfullscreen=&#34;true&#34; style=&#34;height:100%&#34;&gt; &lt;/iframe&gt;</description>
    </item>
    
    <item>
      <title>Four Papers Got Accepted at CVPR 2020</title>
      <link>/post/2020-03-01-cvpr-2020/</link>
      <pubDate>Fri, 28 Feb 2020 17:36:41 +1030</pubDate>
      <guid>/post/2020-03-01-cvpr-2020/</guid>
      <description>&lt;p&gt;Here are the accepted papers with all authors listed (* means equal contribution). Two papers on instance-level perception tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/2001.00309&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation&lt;/a&gt; by Hao Chen*, Kunyang Sun*, Zhi Tian, Chunhua Shen, Yongming Huang, Youliang Yan (oral)&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/2002.10200&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network&lt;/a&gt; by Yuliang Liu*, Hao Chen*, Chunhua Shen, Tong He, Lianwen Jin, Liangwei Wang (oral)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Two papers on Neural Architecture Search:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/1906.04423&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;NAS-FCOS: Fast Neural Architecture Search for Object Detection&lt;/a&gt; by Ning Wang*, Yang Gao*, Hao Chen*, Peng Wang, Zhi Tian, Chunhua Shen, Yanning Zhang&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/1909.08228&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Memory-Efficient Hierarchical Neural Architecture Search for Image Denoising&lt;/a&gt; by Haokui Zhang, Ying Li, Hao Chen, Chunhua Shen&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Bayesian Networks for Perception</title>
      <link>/drafts/2020-02-28-bayesian-networks-for-perception/</link>
      <pubDate>Fri, 28 Feb 2020 17:31:22 +1030</pubDate>
      <guid>/drafts/2020-02-28-bayesian-networks-for-perception/</guid>
      <description>&lt;p&gt;I finally have a chance to talk about Bayesian deep learning. Although it is an important topic and a classic task, it has not received enough attention following the breakthroughs in deep learning. Probably because of its shortcomings in inference and difficulties in explanation, many people are still standing by. I am very happy to see the field now moving in a healthy direction, and here are some very useful papers on the subject.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://openreview.net/forum?id=BJxI5gHKDr&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/2001.10995&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;The Case for Bayesian Deep Learning&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/2002.02405&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;How Good is the Bayes Posterior in Deep Neural Networks Really?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post I would like to first give a quick recap of Bayesian Deep Learning in general and then discuss what can be done in object detection.&lt;/p&gt;
&lt;h2 id=&#34;introduction&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;
&lt;a href=&#34;https://papers.nips.cc/paper/7141-what-uncertainties-do-we-need-in-bayesian-deep-learning-for-computer-vision.pdf&#34; title=&#34;What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Kendall&lt;/a&gt; introduced a general practice for uncertainty estimation in computer vision. Model uncertainty comes from two sources: &lt;em&gt;aleatoric&lt;/em&gt; uncertainty captures noise inherent in the observations, and &lt;em&gt;epistemic&lt;/em&gt; uncertainty accounts for uncertainty in the model.&lt;/p&gt;
&lt;p&gt;We can estimate epistemic uncertainty by placing a prior on the model weights, or aleatoric uncertainty by placing a distribution over the model outputs. Bayesian Neural Networks replace the network weights with a posterior distribution over them.&lt;/p&gt;
&lt;h2 id=&#34;epistemic-uncertainty-with-bayesian-neural-networks&#34;&gt;Epistemic Uncertainty with Bayesian Neural Networks&lt;/h2&gt;
&lt;h3 id=&#34;mcmc-approaches&#34;&gt;MCMC Approaches&lt;/h3&gt;
&lt;p&gt;Performing this inference often involves a Monte Carlo approximation. These methods sample the model weights from the estimated distribution and ensemble the predictions of multiple samples. Thus, evaluating the posterior predictive requires evaluating an ensemble of models:&lt;/p&gt;
&lt;p&gt;$$
\hat p(y_i|x_i)\approx\int p(y_i|x_i,\omega)q_m(\omega)\, d\omega\simeq \frac1K\sum_{k=1}^K p(y_i|x_i, \omega_k),\qquad \omega_k\sim q_m(\omega).
$$&lt;/p&gt;
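&lt;p&gt;The Monte Carlo average above can be sketched in a few lines. The one-parameter logistic likelihood and the Gaussian sampling distribution below are toy stand-ins chosen for illustration, and the function names are hypothetical. Any of the sampling distributions $q_m$ could be plugged in through &lt;code&gt;sample_weight&lt;/code&gt;.&lt;/p&gt;

```python
import math
import random


def predictive_mean(x, sample_weight, likelihood, k=100):
    """Average p(y | x, w_k) over k weights drawn from the sampling distribution."""
    return sum(likelihood(x, sample_weight()) for _ in range(k)) / k


def sigmoid_likelihood(x, w):
    """p(y = 1 | x, w) for a one-parameter logistic model (a toy stand-in)."""
    return 1.0 / (1.0 + math.exp(-w * x))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
p_hat = predictive_mean(2.0, lambda: rng.gauss(1.0, 0.5), sigmoid_likelihood)
```

&lt;p&gt;The cost is $K$ forward passes per input, which is the main practical burden of these methods.&lt;/p&gt;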
&lt;p&gt;We can also use a variational distribution $q(\omega; \theta)$ to approximate the posterior $p(\omega|x, y)$. Introducing a prior $p(\omega)$ and applying Bayes&amp;rsquo; rule, we get the objective known as the evidence lower bound (ELBO):&lt;/p&gt;
&lt;p&gt;$$
\theta^* = \underset{\theta}{\mathrm{arg\,max}} \left\{\mathbb E_{\omega\sim q}[\log p(y|x, \omega)] - D_{\text{KL}}[q(\omega;\theta)\,\|\,p(\omega)]\right\}.
$$&lt;/p&gt;
&lt;p&gt;The first term is the reconstruction term and the second is a regularizer. The KL term can be written in closed form if we choose a simple form for the variational distribution, and we can use an MC approximation for the reconstruction term:&lt;/p&gt;
&lt;p&gt;$$
\mathbb E_{\omega\sim q}[\log p(y|x, \omega)]\simeq \frac1K\sum_{k=1}^K \log p(y|x, \omega_k), \qquad \omega_k\sim q(\omega; \theta)
$$&lt;/p&gt;
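&lt;p&gt;As a worked instance of the explicit KL term, assume (an illustrative choice, not one made in this post) a mean-field Gaussian $q(\omega;\theta)=\mathcal N(\omega\vert\mu, \text{diag}(\sigma^2))$ over $d$ weights and a standard normal prior $p(\omega)=\mathcal N(0, I)$. The divergence then has the well-known closed form:&lt;/p&gt;
&lt;p&gt;$$
D_{\text{KL}}[q(\omega;\theta)\,\|\,p(\omega)] = \frac12\sum_{j=1}^d\left(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\right)
$$&lt;/p&gt;
&lt;p&gt;Under these assumptions only the reconstruction term needs sampling.&lt;/p&gt;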
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Sampling distribution $q_m$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deep Ensemble&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\frac1S\sum_{s=1}^S\delta(\omega - \omega_s)$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean-Field VI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\mathcal N(\omega\vert\mu, \text{diag}(\sigma^2))$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MC dropout&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Dropout distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SWAG&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FGE&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cSGLD&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TDA&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-SWAG&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;SWAG + Deep Ensemble&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Examples of such methods include 
&lt;a href=&#34;https://arxiv.org/abs/1506.02142&#34; title=&#34;Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;MC dropout&lt;/a&gt; and 
&lt;a href=&#34;https://arxiv.org/abs/1902.02476&#34; title=&#34;A Simple Baseline for Bayesian Uncertainty in Deep Learning&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;SWA-Gaussian&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Approximations and accelerations of ensembling have also been studied.&lt;/p&gt;
&lt;h3 id=&#34;deterministic-vi-approaches&#34;&gt;Deterministic VI Approaches&lt;/h3&gt;
&lt;p&gt;SVI, as in 
&lt;a href=&#34;https://arxiv.org/abs/1505.05424&#34; title=&#34;Weight Uncertainty in Neural Networks&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Blundell 2015]&lt;/a&gt;, is difficult to get working on large datasets such as ImageNet and on complex models.&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://arxiv.org/abs/1810.03958&#34; title=&#34;Deterministic Variational Inference for Robust Bayesian Neural Networks&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Wu et al. 2018]&lt;/a&gt; assume the pre-ReLU activations are Gaussian and deduce a closed-form posterior of the network output when the nonlinearity is Heaviside or ReLU. Even though we can sometimes only afford to compute the diagonal entries $\operatorname{Cov}(h_j, h_j)$, the empirical results are acceptable.&lt;/p&gt;
&lt;h3 id=&#34;yes-but-did-it-work&#34;&gt;Yes, but Did It Work?&lt;/h3&gt;
&lt;p&gt;
&lt;a href=&#34;https://arxiv.org/abs/2002.02405&#34; title=&#34;How Good is the Bayes Posterior in Deep Neural Networks Really?&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Wenzel et al. 2020]&lt;/a&gt; show that the Bayes posterior can give worse predictions than MAP or VI. 
&lt;a href=&#34;https://statmodeling.stat.columbia.edu/2020/02/13/how-good-is-the-bayes-posterior-for-prediction-really/&#34; title=&#34;How good is the Bayes posterior for prediction really?&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Yao]&lt;/a&gt; suggests that such models could benefit from informative priors instead of treating networks as a black box.&lt;/p&gt;
&lt;p&gt;Exact sampling from the posterior of a deep neural network is infeasible. Current sampling methods can be inaccurate and hard to defend. Even HMC cannot serve as a gold standard because it suffers from multimodality and non-log-concavity 
&lt;a href=&#34;https://statmodeling.stat.columbia.edu/2020/02/13/how-good-is-the-bayes-posterior-for-prediction-really/&#34; title=&#34;How good is the Bayes posterior for prediction really?&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Yao]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also, it is hard to say whether functional diversity can be captured by sampling around one mode 
&lt;a href=&#34;https://arxiv.org/abs/2002.08791&#34; title=&#34;Bayesian Deep Learning and a Probabilistic Perspective of Generalization&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Wilson 2020]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Generally, I don&amp;rsquo;t consider these methods scalable enough for more challenging perception tasks on large-scale datasets. The consistency of their uncertainty estimates can also degrade across datasets [[]].&lt;/p&gt;
&lt;h2 id=&#34;aleatoric-uncertainty-for-panoptic-segmentation&#34;&gt;Aleatoric Uncertainty for Panoptic Segmentation&lt;/h2&gt;
&lt;p&gt;We consider an object detection problem with a dataset&lt;/p&gt;
&lt;p&gt;For real-time applications, we cannot afford expensive Monte Carlo estimations or covariance estimation.&lt;/p&gt;
&lt;p&gt;The only thing we can afford for epistemic uncertainty is probably making only the last layer Bayesian (
&lt;a href=&#34;https://openreview.net/forum?id=SyYe6k-CW&#34; title=&#34;An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Riquelme 2018]&lt;/a&gt;). So I would prefer not to touch Bayesian Neural Networks for object detection.&lt;/p&gt;
&lt;p&gt;Instead I will just cover a simple model for aleatoric uncertainty.&lt;/p&gt;
&lt;p&gt;I choose to adopt heteroscedastic aleatoric uncertainty introduced in 
&lt;a href=&#34;https://papers.nips.cc/paper/7141-what-uncertainties-do-we-need-in-bayesian-deep-learning-for-computer-vision.pdf&#34; title=&#34;What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;[Kendall 2017]&lt;/a&gt;. For a regression problem given input pair $\mathbf X$, $\mathbf Y$.&lt;/p&gt;
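&lt;p&gt;One common way to write a heteroscedastic regression objective is the Gaussian negative log-likelihood with a per-input variance. The sketch below assumes that form and a log-variance parameterization. It is an illustration with a hypothetical function name, not necessarily the exact loss this draft goes on to use.&lt;/p&gt;

```python
import math


def heteroscedastic_nll(targets, means, log_vars):
    """Mean Gaussian negative log-likelihood, dropping the constant term.

    Each prediction carries its own log-variance, so noisy targets can be
    down-weighted at the price of the log-variance penalty.
    """
    terms = [0.5 * math.exp(-s) * (y - m) ** 2 + 0.5 * s
             for y, m, s in zip(targets, means, log_vars)]
    return sum(terms) / len(terms)


# With unit variance everywhere the loss is half the mean squared error.
unit = heteroscedastic_nll([1.0, 3.0], [0.0, 1.0], [0.0, 0.0])
# Raising the predicted variance on the worse-fitted point lowers the loss.
relaxed = heteroscedastic_nll([1.0, 3.0], [0.0, 1.0], [0.0, math.log(4.0)])
```

&lt;p&gt;Predicting the log-variance rather than the variance keeps the variance positive without any constraint on the network output.&lt;/p&gt;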
</description>
    </item>
    
    <item>
      <title>AdelaiDet</title>
      <link>/project/adet/</link>
      <pubDate>Fri, 28 Feb 2020 10:43:16 +1030</pubDate>
      <guid>/project/adet/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Faster and Finer Instance Segmentation With Blendmask</title>
      <link>/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/</link>
      <pubDate>Sat, 04 Jan 2020 20:14:56 +1030</pubDate>
      <guid>/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/</guid>
      <description>&lt;p&gt;Update 01/05/2020:&lt;/p&gt;
&lt;p&gt;I have uploaded the CVPR Spotlight video to YouTube.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/MfbbQkFAkHA&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;hr&gt;
&lt;p&gt;Update 20/03/2020:&lt;/p&gt;
&lt;p&gt;I will give a talk on BlendMask 
&lt;a href=&#34;https://live.bilibili.com/3344545&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt; at 20:00 Beijing Time (UTC+8) 24/03/2020. You can download the slides 
&lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/mSgeji3PQiD84OG&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;I want to briefly highlight our recent paper on instance segmentation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hao Chen, Kunyang Sun, Zhi Tian, Chunhua Shen, Yongming Huang, Youliang Yan (2020) 
&lt;a href=&#34;https://arxiv.org/abs/2001.00309&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The motivation behind this paper is to propose a general framework for instance-level tasks that reduces the per-instance computation of two-stage methods, which can slow down inference in complex scenarios.&lt;/p&gt;
&lt;h2 id=&#34;background&#34;&gt;Background&lt;/h2&gt;
&lt;p&gt;Instance-level tasks such as instance segmentation, keypoint detection and tracking all share a similar procedure, detect-then-segment. That is, first use an object detection network to generate instance proposals and then, for each instance, use a sub-network to predict the instance-level results. The advantage of this method over naive dense prediction is that the features for the second stage are aligned across instances of different sizes (see 
&lt;a href=&#34;https://arxiv.org/abs/1909.00169&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;this review by Oksuz et al.&lt;/a&gt;). Furthermore, the second stage computes only likely foreground features, which is more efficient and somewhat mitigates the sample imbalance problem (see 
&lt;a href=&#34;https://arxiv.org/abs/1708.02002&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Lin et al.&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;But the second-stage computation can be costly if we need highly detailed predictions (such as 
&lt;a href=&#34;http://densepose.org/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;DensePose&lt;/a&gt; and high resolution instance segmentation like 
&lt;a href=&#34;https://arxiv.org/abs/1912.08193&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;PointRend&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In BlendMask, we simplify the instance segmentation head of Mask R-CNN from a four-layer ConvNet to a tensor-product operation (called Blend) by reusing a densely predicted global segmentation mask. The framework resembles 
&lt;a href=&#34;https://arxiv.org/abs/1904.02689&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;YOLACT&lt;/a&gt; with a redesigned top module (called attention). We achieve a speedup of more than 10 ms while improving the mask AP for instance segmentation. One advantage of BlendMask is that &lt;em&gt;we can increase the instance output resolution almost for free&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id=&#34;top-down-meets-bottom-up-middle-out&#34;&gt;Top-down Meets Bottom-up (Middle-Out?)&lt;/h2&gt;
&lt;p&gt;Without loss of generality, we build BlendMask upon 
&lt;a href=&#34;https://arxiv.org/abs/1904.01355&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;FCOS&lt;/a&gt;, a widely adopted one-stage object detection framework, which by the way has a very supportive and active 
&lt;a href=&#34;https://github.com/tianzhi0549/FCOS&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;GitHub repo&lt;/a&gt;. For instance segmentation, we add two modules, namely bottom and top, to FCOS. These two modules are lightweight and flexible, allowing BlendMask to be incorporated into most object detection models.&lt;/p&gt;
&lt;p&gt;The nomenclature of BlendMask top and bottom modules is adopted from the top-down and bottom-up methodologies in instance detection. Top-down approaches rely on high-level features to predict the entire instance, for example predicting bounding box offsets with final prediction layers of one-stage object detectors (
&lt;a href=&#34;https://pjreddie.com/darknet/yolo/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;YOLO&lt;/a&gt;, FCOS etc.). Bottom-up approaches ensemble local predictions, grouping local pixels or keypoints into instances (
&lt;a href=&#34;https://arxiv.org/abs/1708.02551&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;embedding based instance segmentation&lt;/a&gt;, 
&lt;a href=&#34;https://arxiv.org/abs/1812.08008&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;OpenPose&lt;/a&gt; etc.)&lt;/p&gt;
&lt;p&gt;The key trade-off here is the receptive field size. With a large receptive field, top-down approaches excel at identifying instances, but fine-grained details are often lost. In contrast, bottom-up approaches retain high-resolution local information but usually have trouble grouping. (Bottom-up instance segmentation methods typically fall behind two-stage ones, except the recent 
&lt;a href=&#34;https://arxiv.org/abs/1912.04488&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;SOLO&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;It is natural for us to consider merging these two approaches. YOLACT does exactly that. It uses a vector of mixture coefficients as the top module to linearly combine the channels of the bottom module, a group of prototypes.&lt;/p&gt;
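&lt;p&gt;As a rough illustration of this linear combination, here is a minimal pure-Python sketch. The function name, the tiny 2x2 prototypes and the coefficient values are hypothetical and chosen only for readability; this is not the YOLACT implementation, which works on batched tensors.&lt;/p&gt;

```python
def combine_prototypes(prototypes, coefficients):
    """Linearly combine K prototype maps (each H x W) with K coefficients."""
    height = len(prototypes[0])
    width = len(prototypes[0][0])
    mask = [[0.0] * width for _ in range(height)]
    for proto, coeff in zip(prototypes, coefficients):
        for i in range(height):
            for j in range(width):
                # Each prototype contributes proportionally to its coefficient.
                mask[i][j] += coeff * proto[i][j]
    return mask


# Two hypothetical 2x2 prototypes (bottom module output) and the mixture
# coefficients predicted for one instance (top module output).
prototypes = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
coefficients = [2.0, -1.0]
mask_logits = combine_prototypes(prototypes, coefficients)
```

&lt;p&gt;A negative coefficient lets one prototype subtract from another, which is how a single vector can carve an instance out of shared prototypes.&lt;/p&gt;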
&lt;p&gt;Can we go one step further? To separate overlapping instances, it is important for the local features to encode relative positions. The YOLACT training procedure does not handle this part explicitly, and the top module is too simple to provide enough instance-level information.&lt;/p&gt;
&lt;p&gt;We make the top module more expressive by encoding the instance pose information. The idea is remotely related to 
&lt;a href=&#34;https://arxiv.org/abs/1603.08678&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;InstanceFCN&lt;/a&gt; and 
&lt;a href=&#34;https://arxiv.org/abs/1611.07709&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;FCIS&lt;/a&gt;, which encode relative position information by splitting each instance into $K\times K$ tiles. The final segmentation is cropped from $K\times K$ feature maps and combined.&lt;/p&gt;


&lt;figure id=&#34;figure-instancefcn&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/instancefcn_hua7129555c89ee04bf4c8118d373c2fb5_688506_2000x2000_fit_lanczos_2.png&#34; data-caption=&#34;InstanceFCN&#34;&gt;


  &lt;img data-src=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/instancefcn_hua7129555c89ee04bf4c8118d373c2fb5_688506_2000x2000_fit_lanczos_2.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;1806&#34; height=&#34;558&#34;&gt;
&lt;/a&gt;


  &lt;figcaption&gt;
    InstanceFCN
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;We make this process parametric by using self-attention instead of hard one-hot weights, and continuous by using bilinear upsampling for the attention.&lt;/p&gt;


&lt;figure id=&#34;figure-blender-module&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/blender_hu8e5eb99c5fd200f364ae0cf32f362fee_307437_2000x2000_fit_lanczos_2.png&#34; data-caption=&#34;Blender module&#34;&gt;


  &lt;img data-src=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/blender_hu8e5eb99c5fd200f364ae0cf32f362fee_307437_2000x2000_fit_lanczos_2.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;1238&#34; height=&#34;617&#34;&gt;
&lt;/a&gt;


  &lt;figcaption&gt;
    Blender module
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;The blender module effectively reduces the channel size of YOLACT protonet, from 32 to 4, and produces better masks.&lt;/p&gt;
&lt;p&gt;Here is a live view of the blending process:&lt;/p&gt;
&lt;img src=&#34;images/teaser.gif&#34; style=&#34;width: 400px;&#34;/&gt;
&lt;h2 id=&#34;qualitative-and-quantitative-results&#34;&gt;Qualitative and Quantitative Results&lt;/h2&gt;
&lt;p&gt;Our model produces higher quality masks than Mask R-CNN, especially in the following cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Large objects with complex shapes (horse ears, human poses). Mask R-CNN fails to provide sharp borders.&lt;/li&gt;
&lt;li&gt;Objects in separated parts (tennis players occluded by nets, trains divided by poles). Mask R-CNN tends to include occlusions as false positives or segment targets into separate objects.&lt;/li&gt;
&lt;li&gt;Overlapping objects (riders, crowds, drivers). Mask R-CNN gets uncertain on the borders and leaves larger false negative regions. Sometimes, it assigns parts to the wrong objects, such as the last example in the first row.&lt;/li&gt;
&lt;/ul&gt;


&lt;figure id=&#34;figure-qualitative-results&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/qualitative_hue239cd320bbde3680d3001896f2665c6_3298788_2000x2000_fit_lanczos_2.png&#34; data-caption=&#34;Qualitative results&#34;&gt;


  &lt;img data-src=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/qualitative_hue239cd320bbde3680d3001896f2665c6_3298788_2000x2000_fit_lanczos_2.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;2668&#34; height=&#34;1262&#34;&gt;
&lt;/a&gt;


  &lt;figcaption&gt;
    Qualitative results
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;Our model surpasses Mask R-CNN in AP while being more efficient. Furthermore, it is very natural to generalize our model to other instance-level tasks such as panoptic segmentation and tracking.&lt;/p&gt;


&lt;figure id=&#34;figure-quantative-results&#34;&gt;


  &lt;a data-fancybox=&#34;&#34; href=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/quantitative_hu26defce38408bc8784c0b3e7a96dac7d_652005_2000x2000_fit_lanczos_2.png&#34; data-caption=&#34;Quantitative results&#34;&gt;


  &lt;img data-src=&#34;/post/2020-01-04-faster-and-finer-instance-segmentation-with-blendmask/images/quantitative_hu26defce38408bc8784c0b3e7a96dac7d_652005_2000x2000_fit_lanczos_2.png&#34; class=&#34;lazyload&#34; alt=&#34;&#34; width=&#34;2516&#34; height=&#34;1110&#34;&gt;
&lt;/a&gt;


  &lt;figcaption&gt;
    Quantitative results
  &lt;/figcaption&gt;


&lt;/figure&gt;

&lt;p&gt;Similar to Mask R-CNN, we use RoIPooler to locate instances and extract features. We reduce the running time by moving the computation of R-CNN heads before the RoI sampling to generate position-sensitive feature maps. Repeated mask representation and computation for overlapping proposals are avoided.&lt;/p&gt;
&lt;p&gt;Another advantage of BlendMask is that it can produce higher quality masks, since our output resolution is not restricted by the top-level sampling. Increasing the RoIPooler resolution of Mask R-CNN introduces the following problems. The head computation increases quadratically with respect to the RoI size. Larger RoIs require deeper head structures. Different from dense pixel predictions, the RoI foreground predictor has to be aware of whole instance-level information to distinguish foreground from other overlapping instances. Thus, the larger the feature sizes are, the deeper the sub-networks need to be.&lt;/p&gt;
&lt;p&gt;Here is a demo video with BlendMask.&lt;/p&gt;

&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;https://www.youtube.com/embed/E-gXL-eIPCw&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; allowfullscreen title=&#34;YouTube Video&#34;&gt;&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;For more results, please see 
&lt;a href=&#34;https://arxiv.org/abs/2001.00309&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;our paper&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges</title>
      <link>/publication/le-2020-deep/</link>
      <pubDate>Thu, 02 Jan 2020 00:00:00 +0000</pubDate>
      <guid>/publication/le-2020-deep/</guid>
      <description></description>
    </item>
    
    <item>
      <title>BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation</title>
      <link>/publication/chen-2020-blendmask/</link>
      <pubDate>Wed, 01 Jan 2020 00:00:00 +0000</pubDate>
      <guid>/publication/chen-2020-blendmask/</guid>
      <description></description>
    </item>
    
    <item>
      <title>NAS - Where Are We Now</title>
      <link>/post/2019-12-04-nas-where-are-we-now/</link>
      <pubDate>Wed, 04 Dec 2019 20:06:00 +1030</pubDate>
      <guid>/post/2019-12-04-nas-where-are-we-now/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;First off this ain&amp;rsquo;t no diss record&lt;br/&gt;
This for some of my homies that were misrepresented&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;cite&gt; Nas, Where Are They Now. Hip Hop is Dead, 2006. &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;For the past year and a half, I have been working on Neural Architecture Search (NAS). The idea of automatically designing neural networks for specific tasks is enticing for both practitioners and theorists. In production, NAS extends the scope of network pruning/compression and can benefit on-chip energy-saving modeling, etc. In research, NAS has raised new questions and challenges for convergence and generalization analysis, since it requires rapid and accurate structure evaluation.&lt;/p&gt;
&lt;p&gt;To quickly recap what&amp;rsquo;s going on with NAS, I suggest reading 
&lt;a href=&#34;https://drsleep.github.io/NAS-at-CVPR-2019/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Vladimir&amp;rsquo;s post&lt;/a&gt;. A curated list of literature on NAS is maintained 
&lt;a href=&#34;https://www.automl.org/automl/literature-on-neural-architecture-search/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this post, I will cast NAS as a bi-level optimization problem. We want to minimize some function $f$, to achieve optimal accuracy or some complex objective considering the speed-accuracy trade-off, with respect to some hyperparameter $h$, in our case, the network structure. To simplify the analysis, we assume $h$ takes the form of a sequence with length $L$ and vocabulary size $K$.&lt;/p&gt;
&lt;p&gt;$$
\min_{h, z} f(z;h)\qquad s.t. \quad z = \operatorname{argmin}_{\theta_h} f(\theta_h;h).
$$
Two major problems NAS deals with are&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Inner loop is slow. We have to train a network with structure $h$.&lt;/li&gt;
&lt;li&gt;Since there is no explicit derivative, we cannot optimize $f(h)$ directly.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id=&#34;nas-with-variational-optimization&#34;&gt;NAS with Variational Optimization&lt;/h2&gt;
&lt;p&gt;Straightforwardly, we can solve these two problems one by one. First, we minimize an upper bound of our objective:&lt;/p&gt;
&lt;p&gt;$$
\min_h f(h)\le \min_\alpha \mathbb E_{h\sim p_{\alpha}(h)}[f(h)],
$$&lt;/p&gt;
&lt;p&gt;where $p_\alpha(h)$ can be parametrized by a sequential network, for which the gradient becomes tractable:
$$
\nabla_\alpha \mathbb E_{p_\alpha(h)}[f(h)] = \mathbb E_{p_\alpha (h)}[f(h)\nabla_\alpha \log {p_\alpha}(h)].
$$&lt;/p&gt;
&lt;p&gt;This is the REINFORCE algorithm used by 
&lt;a href=&#34;https://arxiv.org/abs/1611.01578&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Zoph and Le&lt;/a&gt;. The gradient estimation can be made more efficient with PPO as in 
&lt;a href=&#34;https://arxiv.org/abs/1707.07012&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;their later work&lt;/a&gt;.&lt;/p&gt;
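&lt;p&gt;To make the score-function estimator concrete, here is a minimal sketch for a single categorical choice (so $L=1$), assuming $p_\alpha$ is a softmax over the logits $\alpha$. The function names and the toy objective values are hypothetical; a real controller would be a sequential network and $f(h)$ would come from training a child model.&lt;/p&gt;

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, f_values):
    """Gradient of E[f(h)] w.r.t. the logits via the score-function identity.

    For p_alpha = softmax(alpha), d log p(h) / d alpha_k = 1[h == k] - p(k),
    so with few choices the expectation can be enumerated exactly.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for h, (p_h, f_h) in enumerate(zip(probs, f_values)):
        for k in range(len(logits)):
            score = (1.0 if h == k else 0.0) - probs[k]
            grad[k] += p_h * f_h * score
    return grad


def reinforce_estimate(logits, f_values, num_samples, rng):
    """Monte Carlo version: average f(h) * grad log p(h) over sampled h."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        h = rng.choices(range(len(logits)), weights=probs)[0]
        for k in range(len(logits)):
            score = (1.0 if h == k else 0.0) - probs[k]
            grad[k] += f_values[h] * score / num_samples
    return grad


alpha = [0.0, 0.0, 0.0]       # uniform distribution over three choices
f_values = [1.0, 2.0, 3.0]    # toy objective value of each choice
exact = exact_gradient(alpha, f_values)
estimate = reinforce_estimate(alpha, f_values, 5000, random.Random(0))
```

&lt;p&gt;In this toy case every sample is free, so the Monte Carlo estimate can use thousands of draws. In NAS each draw costs one network training run, which is why the variance of the estimator matters so much.&lt;/p&gt;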
&lt;p&gt;In NAS, sample efficiency is a bigger issue than in normal reinforcement learning tasks, because evaluating a single action means training a network, which is about as costly as it gets. In other words, we prefer lower-variance search algorithms over lower-bias ones. This is the reason I don&amp;rsquo;t consider using evolutionary strategies or random search (such as Hyperband) for NAS, which usually require more samples. In my experience, finding a good architecture with length $L=20$ and $K=7$ takes about 3,000 samples with REINFORCE and 1,500 with PPO.&lt;/p&gt;
&lt;p&gt;Speeding up sample evaluation is definitely important. Typically, a proxy task is designed, which includes training a smaller model with smaller input resolution and fewer iterations. Some other tricks are analyzed by 
&lt;a href=&#34;https://arxiv.org/abs/1810.10804&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Nekrasov et al.&lt;/a&gt; However, all these tricks introduce biases to the evaluation. &lt;em&gt;It is a good practice to analyse how well results on the proxy tasks generalize to the target task.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;nas-with-discrete-structure-learning&#34;&gt;NAS with Discrete Structure Learning&lt;/h2&gt;
&lt;p&gt;Another solution to the two problems is to consider them as one and solve them in one shot. The idea is to consider the structure parameters $h$ as a part of the network and one-shot the search by performing a network optimization, usually with SGD.&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://arxiv.org/abs/1806.09055&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;DARTS&lt;/a&gt; uses a continuous relaxation $h\approx \sigma(\alpha)$ on the operations,
$$
\nabla_\alpha \mathbb E_{p_\alpha(h)}[f(h)]\approx\nabla_\alpha f(\sigma(\alpha))
$$
where $\sigma$ is the softmax activation. Although biased, this is reasonable considering the popular 
&lt;a href=&#34;https://arxiv.org/abs/1803.03635&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Lottery Ticket Hypothesis&lt;/a&gt;. (I will come back to this part later.) However, I consider the connection learning part to be ad hoc: it simply selects the two highest activations, to follow the cell-based search space in [
&lt;a href=&#34;https://arxiv.org/abs/1611.01578&#34;&gt;Zoph and Le&lt;/a&gt;].&lt;/p&gt;
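&lt;p&gt;A minimal sketch of the relaxation on one edge may help: every candidate operation is evaluated and the outputs are mixed with softmax weights, so the architecture parameters receive ordinary gradients. The scalar operations and the helper names below are hypothetical stand-ins, not the DARTS code.&lt;/p&gt;

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations acting on a scalar input.
CANDIDATE_OPS = [
    lambda x: x,                  # identity / skip connection
    lambda x: 0.0 * x,            # the "zero" operation
    lambda x: 2.0 * max(x, 0.0),  # stand-in for a learned layer
]


def mixed_op(x, alpha):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    weights = softmax(alpha)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS))


def discretize(alpha):
    """After the search, keep only the operation with the largest weight."""
    return max(range(len(alpha)), key=lambda k: alpha[k])


relaxed_output = mixed_op(3.0, [0.0, 0.0, 0.0])
chosen_index = discretize([0.1, -2.0, 1.5])
```

&lt;p&gt;The gap between the output of the mixture and the output of the single operation kept after discretization is one source of the bias discussed here.&lt;/p&gt;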
&lt;p&gt;There are still a lot of unanswered questions. Is this approximation error bounded? How can we avoid overfitting? We don&amp;rsquo;t even bother developing more accurate gradient computation, including the inverse Hessian for the second-order optimization, probably because a more accurate gradient does not lead to better results in the presence of this bias.&lt;/p&gt;
&lt;p&gt;These challenging questions require a better understanding of the optimization mechanisms and properties, e.g. when to stop early, and how training affects generalization.&lt;/p&gt;
&lt;p&gt;Another possible fix to this biased estimation is discrete latent structure learning. [
&lt;a href=&#34;https://openreview.net/forum?id=rylqooRqK7&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Xie et al.&lt;/a&gt;] use the Gumbel-softmax trick to reduce this bias.
$$
\nabla_\alpha \mathbb E_{p_\alpha(h)}[f(h)]\approx \mathbb E_{p(u)}\nabla_\alpha f(\sigma(z/t));\quad z:=\log\frac{\alpha}{1-\alpha} + \log\frac{u}{1-u};\quad u\sim\operatorname{Uniform}(0, 1).
$$
A problem with this trick is that the variance goes to infinity as the bias, which is controlled by the temperature $t$, gets closer to $0$. I am interested to see someone combine this trick with control variates, such as in 
&lt;a href=&#34;https://github.com/duvenaud/relax&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;relax&lt;/a&gt;.&lt;/p&gt;
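&lt;p&gt;The sampling step in the formula above can be sketched for a single binary choice as follows. This assumes $\sigma$ is read as the logistic sigmoid (the formula uses $\log\frac{\alpha}{1-\alpha}$, so $\alpha$ is a probability here); the function name is hypothetical, and $u$ is drawn from a slightly clipped interval only to avoid taking the logarithm of zero.&lt;/p&gt;

```python
import math
import random


def relaxed_bernoulli_sample(alpha, temperature, u):
    """Relaxed sample sigma(z / t) with z = logit(alpha) + logit(u)."""
    z = math.log(alpha / (1.0 - alpha)) + math.log(u / (1.0 - u))
    return 1.0 / (1.0 + math.exp(-z / temperature))


rng = random.Random(0)
alpha = 0.7
for temperature in (1.0, 0.1):
    samples = [
        # Clipped uniform noise (an assumption of this sketch) avoids log(0).
        relaxed_bernoulli_sample(alpha, temperature, rng.uniform(0.001, 0.999))
        for _ in range(5)
    ]
    print(temperature, [round(s, 3) for s in samples])
```

&lt;p&gt;At a low temperature the samples are pushed towards 0 or 1, so they look more like discrete choices, but the sigmoid becomes very steep around $z=0$, which is where the large gradient variance comes from.&lt;/p&gt;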
</description>
    </item>
    
    <item>
      <title>ABCNet</title>
      <link>/project/becon/</link>
      <pubDate>Mon, 28 Oct 2019 10:45:41 +1030</pubDate>
      <guid>/project/becon/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Memory-Efficient Hierarchical Neural Architecture Search for Image Denoising</title>
      <link>/publication/zhang-2019-ir/</link>
      <pubDate>Sun, 01 Sep 2019 00:00:00 +0000</pubDate>
      <guid>/publication/zhang-2019-ir/</guid>
      <description></description>
    </item>
    
    <item>
      <title>NAS-FCOS: Fast neural architecture search for object detection</title>
      <link>/publication/wang-2019-fcos/</link>
      <pubDate>Wed, 01 May 2019 00:00:00 +0000</pubDate>
      <guid>/publication/wang-2019-fcos/</guid>
      <description></description>
    </item>
    
    <item>
      <title>NAS-FCOS</title>
      <link>/project/nas-fcos/</link>
      <pubDate>Sun, 28 Apr 2019 10:55:43 +1030</pubDate>
      <guid>/project/nas-fcos/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Architecture Search of Dynamic Cells for Semantic Video Segmentation</title>
      <link>/publication/nekrasov-2019-architecture/</link>
      <pubDate>Mon, 01 Apr 2019 00:00:00 +0000</pubDate>
      <guid>/publication/nekrasov-2019-architecture/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Adversarial learning of structure-aware fully convolutional networks for landmark localization</title>
      <link>/publication/chen-2019-adversarial/</link>
      <pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate>
      <guid>/publication/chen-2019-adversarial/</guid>
      <description></description>
    </item>
    
    <item>
      <title>FCOS: Fully convolutional one-stage object detection</title>
      <link>/publication/tian-2019-fcos/</link>
      <pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate>
      <guid>/publication/tian-2019-fcos/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Light-weight hybrid convolutional network for liver tumour segmentation</title>
      <link>/publication/zhang-2019-light/</link>
      <pubDate>Tue, 01 Jan 2019 00:00:00 +0000</pubDate>
      <guid>/publication/zhang-2019-light/</guid>
      <description></description>
    </item>
    
    <item>
      <title>nas-segm-pytorch</title>
      <link>/project/nas-segm/</link>
      <pubDate>Sun, 28 Oct 2018 11:02:45 +1030</pubDate>
      <guid>/project/nas-segm/</guid>
      <description></description>
    </item>
    
    <item>
      <title>Fast neural architecture search of compact semantic segmentation models via auxiliary cells</title>
      <link>/publication/nekrasov-2019-fast/</link>
      <pubDate>Mon, 01 Oct 2018 00:00:00 +0000</pubDate>
      <guid>/publication/nekrasov-2019-fast/</guid>
      <description></description>
    </item>
    
    <item>
      <title>On Optimization in Deep Learning</title>
      <link>/post/2016-09-07-on-optimization-in-deep-learning/</link>
      <pubDate>Wed, 07 Sep 2016 19:58:13 +1030</pubDate>
      <guid>/post/2016-09-07-on-optimization-in-deep-learning/</guid>
      <description>&lt;p&gt;This is an old post which may not fit the modern view. Some recent findings, such as the lottery ticket hypothesis, are not covered in this post.&lt;/p&gt;
&lt;p&gt;There are at least exponentially many global minima for a neural net, since permuting the nodes in one layer does not change the loss. Finding such points is not easy. Before certain techniques such as momentum came out, those nets were considered impossible to learn.&lt;/p&gt;
&lt;p&gt;Thanks to constantly evolving hardware and libraries, we do not have to worry about training time &lt;em&gt;that much&lt;/em&gt;, at least for convnets. Empirically, the non-convexity of neural nets seems not to be an issue. In practice, SGD works pretty well in optimizing very large networks even though the problem is proven to be NP-hard. However, researchers never stop studying the loss surface of deep neural nets and searching for better optimization strategies.&lt;/p&gt;
&lt;p&gt;
&lt;a href=&#34;https://arxiv.org/abs/1605.07110&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;This paper&lt;/a&gt; has been updated on arXiv recently, which led me to 
&lt;a href=&#34;https://news.ycombinator.com/item?id=11765111&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;this discussion&lt;/a&gt;. The following is what I find interesting.&lt;/p&gt;
&lt;h2 id=&#34;why-sgd-works&#34;&gt;Why SGD works?&lt;/h2&gt;
&lt;p&gt;[Choromanska et al, AISTATS&amp;rsquo;15] (also [Dauphin et al, ICML&amp;rsquo;15]) use tools from statistical physics to explain the behavior of stochastic gradient methods when training deep neural networks. This offers a macroscopic explanation of why SGD &amp;ldquo;works&amp;rdquo;, and gives a characterization of the network depth. The model is strongly simplified, and convolution is not considered.&lt;/p&gt;
&lt;h3 id=&#34;saddle-points&#34;&gt;Saddle points&lt;/h3&gt;
&lt;p&gt;We start by discussing saddle points, which make up the vast majority of critical points on the error surfaces of neural networks.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Here we argue, &amp;hellip; that a deeper and more profound difficulty originates from the proliferation of saddle points, not local minima, especially in high dimensional problems of practical interest. Such saddle points are surrounded by high error plateaus that can dramatically slow down learning, and give the illusory impression of the existence of a local minimum.&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;cite&gt; Dauphin et al, 
&lt;a href=&#34;http://arxiv.org/abs/1406.2572&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Identifying and attacking the saddle point problem in high-dimensional non-convex optimization&lt;/a&gt; &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The authors introduce the saddle-free Newton method, which requires estimating the Hessian. They connect the loss function of a deep net to a high-dimensional Gaussian random field. They show that critical points with high training error are exponentially likely to be saddle points with many negative directions, and all local minima are likely to have error that is very close to that of the global minimum. (Described in 
&lt;a href=&#34;https://arxiv.org/abs/1611.01838&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Entropy-SGD: Biasing Gradient Descent Into Wide Valleys&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The convergence of gradient descent is affected by the proliferation of saddle points surrounded by high error plateaus &amp;mdash; as opposed to multiple local minima.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The time spent by diffusion is inversely proportional to the smallest negative eigenvalue of the Hessian at a saddle point&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;cite&gt;Kramers&amp;rsquo; law&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;It is believed that for many problems including learning deep nets, almost all local minimum have very similar function value to the global optimum, and hence finding a local minimum is good enough.&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;cite&gt; Rong Ge, 
&lt;a href=&#34;http://www.offconvex.org/2016/03/22/saddlepoints/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Escaping from Saddle Points&lt;/a&gt; &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As the model grows deeper, local minima have loss closer to that of the global minimum. On the other hand, we do not care about the global minimum because it often leads to overfitting.&lt;/p&gt;
&lt;p&gt;Saddle points exist along the paths between local minima, and most objective functions have exponentially many of them. First-order optimization algorithms may get stuck at saddle points. However, strict saddle points can be escaped and a local minimum can be reached in polynomial time (
&lt;a href=&#34;http://arxiv.org/abs/1503.02101&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Ge et al., 2015&lt;/a&gt;). The stochastic gradient introduces noise, which helps push the current point away from saddle points.&lt;/p&gt;
&lt;p&gt;Non-convex problems can have &amp;ldquo;degenerate saddle points&amp;rdquo;, whose Hessian is p.s.d. and has zero eigenvalues. The performance of SGD on this kind of task is still not well studied.&lt;/p&gt;
&lt;p&gt;To conclude this part, AFAIK, we should care more about escaping from saddle points, and first-order gradient-based methods can do a better job than second-order methods in practice.&lt;/p&gt;
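&lt;p&gt;A toy experiment illustrates the role of gradient noise near a strict saddle. The sketch below runs gradient descent on the hypothetical function $f(x, y) = x^2 - y^2$, starting exactly on the stable direction of the saddle at the origin. The function, step size and noise scale are assumptions picked for illustration; $f$ is unbounded below, so only the escape behavior is meaningful.&lt;/p&gt;

```python
import random


def descend(noise_scale, steps=60, lr=0.1, seed=0):
    """Run (noisy) gradient descent on f(x, y) = x**2 - y**2 from (1, 0)."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        # Isotropic Gaussian noise stands in for minibatch gradient noise.
        x -= lr * (grad_x + noise_scale * rng.gauss(0.0, 1.0))
        y -= lr * (grad_y + noise_scale * rng.gauss(0.0, 1.0))
    return x, y


x_gd, y_gd = descend(noise_scale=0.0)     # converges to the saddle (0, 0)
x_sgd, y_sgd = descend(noise_scale=0.01)  # noise kicks y off the saddle
```

&lt;p&gt;Without noise, $y$ stays at exactly zero and the iterate converges to the saddle. With noise, any small perturbation along $y$ is amplified at every step, so the iterate leaves the saddle along the negative-curvature direction.&lt;/p&gt;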
&lt;h3 id=&#34;spin-glass-hamiltonian&#34;&gt;Spin-glass Hamiltonian&lt;/h3&gt;
&lt;p&gt;See 
&lt;a href=&#34;https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Charles Martin: Why Does Deep Learning Work?&lt;/a&gt; Both papers mentioned above use ideas from statistical physics and spin-glass models.&lt;/p&gt;
&lt;p&gt;Statistical physicists refer to $H_x(y)\equiv-\ln p(y|x)$ as the &lt;strong&gt;Hamiltonian&lt;/strong&gt;, quantifying the energy of $y$ given the parameter $x$, and to $\mu\equiv -\ln p$ as the &lt;strong&gt;self-information&lt;/strong&gt;. We can rewrite Bayes&amp;rsquo; formula as:&lt;/p&gt;
&lt;p&gt;$$
p(y) = \sigma(-H(y)-\mu)
$$&lt;/p&gt;
&lt;p&gt;We can view the features produced by a neural net as the Hamiltonian, with the softmax computing the classification probability.&lt;/p&gt;
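&lt;p&gt;One way to read this, sketched below under my own assumptions, is to take the energy of each class to be the negative log-likelihood of the observation under that class and $\mu$ to be the negative log-prior. A softmax over $-(H+\mu)$ then reproduces the Bayes posterior. The function name and the toy probabilities are hypothetical.&lt;/p&gt;

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def posterior_from_energies(likelihoods, priors):
    """Bayes posterior over classes, written as softmax(-(H + mu))."""
    # Assumed reading: per-class energy = negative log-likelihood.
    hamiltonian = [-math.log(l) for l in likelihoods]
    # Self-information of each class = negative log-prior.
    self_information = [-math.log(p) for p in priors]
    return softmax([-(h + m) for h, m in zip(hamiltonian, self_information)])


# Toy numbers: two classes with equal priors and likelihoods 0.2 and 0.6.
posterior = posterior_from_energies([0.2, 0.6], [0.5, 0.5])
```

&lt;p&gt;Normalizing the products of likelihood and prior directly gives 0.1 / 0.4 and 0.3 / 0.4, the same values the softmax form produces.&lt;/p&gt;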
&lt;blockquote&gt;
&lt;p&gt;The long-term behavior of certain neural network models are governed by the statistical mechanism of infinite-range Ising spin-glass Hamiltonians&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;cite&gt; Choromanska et al., 
&lt;a href=&#34;https://arxiv.org/abs/1412.0233&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;The Loss Surfaces of Multilayer Networks, 2015&lt;/a&gt; &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This paper tries to explain the optimization paradigm with spin-glass theory.&lt;/p&gt;
&lt;h3 id=&#34;implicit-bias-in-sgd&#34;&gt;Implicit Bias in SGD&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/1611.01838&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Chaudhari&lt;/a&gt; proposed a surrogate loss that explicitly biases SGD dynamics towards flat local minima. The corresponding algorithm relates closely to stochastic gradient Langevin dynamics.&lt;/li&gt;
&lt;li&gt;Another interpretation is that SGD performs Variational Inference (VI).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;what-does-the-minima-look-like&#34;&gt;What do the minima look like?&lt;/h2&gt;
&lt;p&gt;Take for example the concept of mode connectivity (
&lt;a href=&#34;https://arxiv.org/abs/1802.10026&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Garipov et al, 2018&lt;/a&gt;): it seems that the modes found by SGD using different random seeds are not just isolated basins, but they are connected by smooth valleys along which the training and test error are low.&lt;/p&gt;
&lt;h3 id=&#34;no-poor-local-minima&#34;&gt;No poor local minima&lt;/h3&gt;
&lt;p&gt;
&lt;a href=&#34;https://arxiv.org/abs/1412.6544&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Research at Google and Stanford&lt;/a&gt; confirms that deep learning energy landscapes appear to be roughly convex. A bolder hypothesis is that deep networks are spin funnels, and as the net gets larger, the funnel gets sharper. If this is true, our major concern should be to avoid over-training rather than the convexity of the network.&lt;/p&gt;
&lt;p&gt;Finally we arrive at the paper itself. Nets are optimized well by local gradient methods and seem not to be affected by local minima. The author claims that every local minimum is a global minimum and that &amp;ldquo;bad&amp;rdquo; saddle points (degenerate ones) exist for deeper nets. Thm 2.3 gives a clear result on linear networks.&lt;/p&gt;
&lt;p&gt;The main result, Thm 3.2, generalizes the idea of 
&lt;a href=&#34;https://arxiv.org/abs/1412.0233&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Choromanska et al, 2015&lt;/a&gt; for nonlinear networks, which relies on 4 (seemingly strong) assumptions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The dimensionality of the output is smaller than the input.&lt;/li&gt;
&lt;li&gt;The inputs are random and decorrelated.&lt;/li&gt;
&lt;li&gt;Whether a connection in the network is activated is random, with the same probability of success across the network. (ReLU thresholding happens randomly.)&lt;/li&gt;
&lt;li&gt;The network activations are independent of the input, the weights and each other.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;They relax the majority of the assumptions, which is very promising, but leave the weaker conditions A1u-m and A5u-m (
&lt;a href=&#34;https://www.reddit.com/r/MachineLearning/comments/4ktqeu/160507110_deep_learning_without_poor_local_minima/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;from reddit post&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Recently DeepMind came up with 
&lt;a href=&#34;https://arxiv.org/abs/1611.06310&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;another paper&lt;/a&gt; claiming the assumptions are too strong for real data, and devised counterexamples with finite datasets for rectified MLPs. For finite-sized models/datasets, one does not have a globally good behavior of learning regardless of the model size.&lt;/p&gt;
&lt;p&gt;Even though deep learning energy landscapes appear to be roughly convex, or as this post puts it, free of poor local minima, a deep model still needs more engineering details to aid its convergence. Problems such as covariate shift and overfitting still have to be handled by engineering techniques.&lt;/p&gt;
&lt;h3 id=&#34;arriving-on-flatter-minima&#34;&gt;Arriving on flatter minima&lt;/h3&gt;
&lt;blockquote&gt;
&lt;p&gt;large-batch methods tend to converge to sharp minimizers of the training and testing functions &amp;ndash; and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.&lt;/p&gt;
&lt;p&gt;&amp;ndash; &lt;cite&gt; 
&lt;a href=&#34;https://stanstarks.github.io/tw5/#On%20Large-Batch%20Training%20for%20Deep%20Learning%3A%20Generalization%20Gap%20and%20Sharp%20Minima&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima&lt;/a&gt; &lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/1802.06175&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;An Alternative View: When Does SGD Escape Local Minima?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;should-2-nd-order-methods-ever-work&#34;&gt;Should 2nd-order methods ever work?&lt;/h2&gt;
&lt;p&gt;Basically no, because the Hessian-vector product requires very low-variance estimation, which leads to batch sizes larger than 1000. But there are 
&lt;a href=&#34;https://www.reddit.com/r/MachineLearning/comments/599wbr/project_i_accidentally_wrote_a_quasinewton_lbfgs/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;some rare cases&lt;/a&gt; where 2nd-order methods work with a small batch size.&lt;/p&gt;
&lt;h2 id=&#34;gradient-starvation&#34;&gt;Gradient Starvation&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;a href=&#34;https://arxiv.org/abs/1809.06848&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;On the Learning Dynamics of Deep Neural Networks&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;Some features will dominate the gradient, starving other equally important features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>