Ohad was nice enough to provide the basic background, which I reproduce here.

A **neuron** is a function $f:\R^d\to\R$ such that
$f(x)=\sigma(\sum_{i\in[d]}w_ix_i+b)$,
where $\sigma:\R\to\R$ is a non-linear **activation** function.
Examples include the threshold $\sigma(z)=1$ if and only if $z > 0$,
the sigmoid $\sigma(z)=1/(1+\exp(-z))$, and the rectified linear unit (ReLU) $\sigma(z)=\max(0,z)$.
The activation function is selected based on a variety of considerations,
which extend beyond the expressibility of networks that use it.
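To make the definition concrete, here is a minimal sketch (in Python, not from the text; the function names are mine) of a single neuron under each of the three activation functions above:

```python
import numpy as np

def neuron(w, b, sigma):
    """The neuron x -> sigma(<w, x> + b)."""
    return lambda x: sigma(np.dot(w, x) + b)

# The three activation functions mentioned above.
def threshold(z): return float(z > 0)
def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def relu(z):      return max(0.0, z)

# Example: f(x) = max(0, x1 - x2 + 0.5).
f = neuron(np.array([1.0, -1.0]), 0.5, relu)
```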

A **layer** (in a neural network)
is a mapping of the form $x\mapsto \sigma(Wx+b)$,
where $W$ is a matrix, $b$ a vector, and $\sigma$ is extended to vectors.
A **network** is a sequence of layers, followed by an arbitrary linear
function.
Such a network is said to use the activation function $\sigma$;
its width is the maximal length of the vectors computed in its layers,
and its depth is the number of layers.
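A minimal sketch of these definitions (again in Python, with names of my own choosing):

```python
import numpy as np

def relu(z): return np.maximum(0.0, z)

def layer(W, b, sigma=relu):
    """One layer: x -> sigma(Wx + b), with sigma applied coordinate-wise."""
    return lambda x: sigma(W @ x + b)

def network(layers, v):
    """A sequence of layers followed by the linear function x -> <v, x>."""
    def N(x):
        for lyr in layers:
            x = lyr(x)
        return float(v @ x)
    return N

# A width-two ReLU network (one sigma-layer followed by the linear output)
# computing |x1 - x2|, since relu(x1-x2) + relu(x2-x1) = |x1-x2|.
N = network([layer(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.zeros(2))],
            np.array([1.0, 1.0]))
```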

A known result from the late 1980s states that if $\sigma$ satisfies some mild conditions (e.g., non-constant, bounded, and continuous), then a depth-two network (using it) can approximate every continuous function $f$ over $[0,1]^d$, in the sense that some such network $N$ of width $w(d,\eps)$ satisfies $$\sup_{x\in[0,1]^d}\{|N(x)-f(x)|\} < \eps.$$ However, $w$ grows exponentially with $d$. (It seems that $w$ is polynomial in $1/\eps$.)
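To illustrate the flavor of such approximations in the simplest case ($d=1$ with ReLU activation, which is not literally covered by the bounded-activation statement but behaves similarly), the following sketch (mine, not from the text) builds a one-hidden-layer ReLU network that computes the piecewise-linear interpolant of $f$ at a grid of knots, so its sup-error on $[0,1]$ vanishes as the width grows:

```python
import numpy as np

def relu(z): return np.maximum(0.0, z)

def pl_relu_net(f, knots):
    """A one-hidden-layer ReLU network N(x) = f(t0) + sum_j c_j relu(x - t_j),
    one hidden unit per interval, which equals the piecewise-linear
    interpolant of f at the knots (valid on [t0, tm])."""
    t = np.asarray(knots, dtype=float)
    y = np.array([f(ti) for ti in t])
    slopes = np.diff(y) / np.diff(t)   # slope of the interpolant per interval
    c = np.diff(slopes, prepend=0.0)   # change of slope at each left knot
    return lambda x: y[0] + np.sum(c * relu(x - t[:-1]))

# Width ~ m gives sup-error O(1/m^2) for a smooth f on [0, 1].
N = pl_relu_net(np.sin, np.linspace(0.0, 1.0, 101))
```

(For a smooth univariate $f$, width polynomial in $1/\eps$ indeed suffices; the exponential blow-up in the classical result comes from the dimension $d$.)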

The current result shows that depth-three networks can do better, provided that the activation function satisfies some mild conditions (which refer to its expressibility for $d=1$). Specifically, they show a function $g:\R^d\to\R$ that is "radial" (i.e., $g(x)=g'(\|x\|)$ for some $g':\R\to\R$) and a distribution $\mu$ supported on the ball of radius $O(\sqrt{d})$, such that $g$ is expressible by depth-three networks of $\poly(d)$ size but cannot be approximated (with respect to the distribution $\mu$) by depth-two networks of size $\exp(o(d))$. That is, for every such network $N$, $$\Exp_{x\sim\mu}[(g(x)-N(x))^2] = \Omega(1).$$

On input $x$, the depth-three network first approximates $\|x\|$ (by a depth-two network), and then computes $g'$ (by the hypothesis regarding $d=1$). The lower bound for depth-two networks uses Fourier analysis and is based on a structural result regarding the Fourier transform of functions computed by depth-two networks.
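The following schematic sketch (my own simplification, not the paper's construction, which approximates $\|x\|$ itself and controls errors carefully) shows the two-stage structure: univariate ReLU networks handle each coordinate of $\|x\|^2$, and another univariate network applies the outer function:

```python
import numpy as np

def relu(z): return np.maximum(0.0, z)

def pl_relu_net(f, knots):
    """One-hidden-layer ReLU interpolant of a univariate f (as in the
    classical d=1 case): N(u) = f(t0) + sum_j c_j relu(u - t_j)."""
    t = np.asarray(knots, dtype=float)
    y = np.array([f(ti) for ti in t])
    c = np.diff(np.diff(y) / np.diff(t), prepend=0.0)
    return lambda u: y[0] + np.sum(c * relu(u - t[:-1]))

def radial_net(gprime, R, m=400):
    """Schematic two-stage computation of g(x) = g'(||x||) on the ball of
    radius R: stage 1 approximates ||x||^2 coordinate-wise via univariate
    ReLU networks for u -> u^2; stage 2 applies a univariate network for
    s -> g'(sqrt(s))."""
    sq  = pl_relu_net(lambda u: u * u, np.linspace(-R, R, m))
    out = pl_relu_net(lambda s: gprime(np.sqrt(s)), np.linspace(0.0, R * R, m))
    return lambda x: out(sum(sq(xi) for xi in x))

# Approximately computes cos(||x||) on the ball of radius 2.
g = radial_net(np.cos, R=2.0)
```

Note that each stage has width polynomial in $d$ and $1/\eps$, in line with the $\poly(d)$-size upper bound for depth three.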

The original abstract reads as follows. We show that there are simple functions on $\R^d$, expressible by small 3-layer feedforward neural networks, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for most continuous activation functions, including rectified linear units and sigmoids, and is a formal demonstration that depth -- even if increased by 1 -- can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

See arXiv 1512.03965.

Back to list of Oded's choices.