Approximation Theory of Neural Networks

1.Families of Neural Network Functions

Set

w^{[l]}\in R^{n_l \times n_{l-1}},\quad l=1,2,...,L
b^{[l]}\in R^{n_l},\quad l=1,2,...,L

which are called the weight matrix and the bias vector of layer $l$, respectively.

A neural network is then defined by the following iterative system:

\begin{cases}
y^{[1]}=x \\
z^{[l]}=w^{[l]}y^{[l-1]}+b^{[l]} \\
y^{[l]}=\sigma (z^{[l]}) \\
f(x)=y^{[L]}
\end{cases}

where $\sigma$ is the activation function (applied componentwise) and $f$ is called a neural network function.
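As a concrete illustration of this iterative system, here is a minimal NumPy sketch of the forward pass. The layer sizes, the choice $\sigma=\tanh$, and the random parameters are assumptions made only for the example, not part of the definition above.

```python
import numpy as np

def forward(x, weights, biases, sigma=np.tanh):
    """Evaluate f(x) = y^{[L]} via the iterative system:
    y = x, then repeatedly z = W y + b and y = sigma(z)."""
    y = x                                   # input layer
    for W, b in zip(weights, biases):
        z = W @ y + b                       # z^{[l]} = w^{[l]} y^{[l-1]} + b^{[l]}
        y = sigma(z)                        # y^{[l]} = sigma(z^{[l]})
    return y                                # f(x) = y^{[L]}

# Illustrative layer widths n_0 = 3, n_1 = 5, n_2 = 1 (assumed, not from the text)
rng = np.random.default_rng(0)
sizes = [3, 5, 1]
weights = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[l + 1]) for l in range(len(sizes) - 1)]
print(forward(np.array([0.1, -0.2, 0.3]), weights, biases))
```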

Definition. $S^{[\vec{m}]}_{L}$ consists of all neural network functions defined on $\Omega$, where $L$ is the number of layers and $\vec{m}=(n_{1},...,n_{L})$ records the number of neurons in each layer.

2.Universal Approximation Theorems (UATs)

The Weierstrass approximation theorem is fundamental in mathematical analysis, so we ask whether an analogous theorem holds for ANNs, i.e. whether neural network functions can approximate other functions.

Fix Depth and Adjust Width

First, we consider an ANN with only one hidden layer.

Theorem.2.1(Cybenko.1989)

Let $\sigma$ be any continuous sigmoidal function. Then finite sums of the form

G(x)=\sum_{j=1}^{n}\alpha_{j} \sigma(y_{j}^T x+\theta_{j})

are dense in $C(I_{n})$, the space of continuous functions on the $n$-dimensional unit cube $I_{n}=[0,1]^{n}$.

Remark.1

$G(x)$ is a single hidden-layer NN function, where $x\in I_{n}$, $y_{j}\in R^{n}$, and $\alpha_{j},\theta_{j}\in R$ are fixed parameters.
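A quick numerical illustration of the sum form $G(x)=\sum_{j}\alpha_{j}\sigma(y_{j}^{T}x+\theta_{j})$ in one dimension: fix random inner parameters $(y_{j},\theta_{j})$ and fit the outer coefficients $\alpha_{j}$ by least squares against a target function. The target $f(x)=\sin(2\pi x)$, the logistic $\sigma$, and the distribution of the inner parameters are assumptions for this sketch only; Theorem 2.1 itself is an existence statement and does not prescribe this procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                         # number of terms in the sum
x = np.linspace(0.0, 1.0, 200)                 # grid on I_1 = [0, 1]
f = np.sin(2 * np.pi * x)                      # target continuous function (assumed)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))     # continuous sigmoidal function
y = rng.uniform(-20, 20, size=n)               # inner weights y_j (assumed distribution)
theta = rng.uniform(-20, 20, size=n)           # biases theta_j

Phi = sigma(np.outer(x, y) + theta)            # Phi[i, j] = sigma(y_j * x_i + theta_j)
alpha, *_ = np.linalg.lstsq(Phi, f, rcond=None)  # fit outer coefficients alpha_j

G = Phi @ alpha
print("max |f - G| on the grid:", np.max(np.abs(f - G)))
```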

Remark.2

Theorem 2.1 shows that any continuous function can be uniformly approximated by a NN with only one internal hidden layer and an arbitrary continuous sigmoidal nonlinearity.

Remark.3

The approximation property established by Theorem 2.1 is powerful, but the proof only shows existence. T.P. Chen gave a constructive proof (1990).

Remark.4

The conditions in Theorem 2.1 can be relaxed, and the conclusion can be generalized:

'continuous sigmoidal function' can be weakened to 'bounded sigmoidal function' (T.P. Chen, 1990); further generalizations were also given in the same work (T.P. Chen, 1990).

Corollary.2.1

is dense in

is dense in ( dense in and )

So far we have used sigmoidal functions as activation functions; we now study what characterizes an admissible activation function, to prepare for approximation with other activation functions.

Definition

If a function $g$ satisfies that all linear combinations $\sum_{i=1}^{N}c_{i}g(\lambda_{i}x+\theta_{i})$ are dense in every $C[a,b]$, then $g$ is called a Tauber-Wiener (TW) function (which can be regarded as an extension of the class of continuous sigmoidal activation functions).

$(TW)$: the set of all Tauber-Wiener functions.

$\mathcal{S}$: the Schwartz functions in tempered distribution theory, i.e. rapidly decreasing and infinitely differentiable functions.

$\mathcal{S}'$: the tempered distributions, i.e. continuous linear functionals defined on $\mathcal{S}$.

Property of TW functions

If $\sigma$ is a bounded sigmoidal function, then $\sigma\in(TW)$ (T.P. Chen, 1995).

Theorem.2.2(T.P.Chen.1995)

Suppose that $g$ is a continuous function and $g\in\mathcal{S}'$; then $g\in(TW)$ if and only if $g$ is not a polynomial.
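A short reasoning step for the 'only if' direction: if $g$ is a polynomial of degree $k$, then every $g(\lambda x+\theta)$ is again a polynomial of degree at most $k$, so every linear combination satisfies

\sum_{i=1}^{N}c_{i}g(\lambda_{i}x+\theta_{i})\in P_{k}=\{\text{polynomials of degree}\le k\}

and the finite-dimensional space $P_{k}$ is closed and not dense in any $C[a,b]$; hence no polynomial can be a TW function.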

Theorem.2.3(T.P.Chen.1995)

Suppose that $K$ is a compact set in $R^{n}$, $U$ is a compact set in $C(K)$, and $g\in(TW)$; then for any $\epsilon>0$, there exist a positive integer $N$, real numbers $\theta_{i}$, vectors $\omega_{i}\in R^{n}$, and constants $c_{i}(f)$ depending on $f$, s.t.

|f(x)-\sum _{i=1}^{N}c_{i}(f)g(\omega _{i}x+\theta _{i})|<\epsilon

holds for all $x\in K$ and $f\in U$. Moreover, each $c_{i}(f)$ is a continuous linear functional defined on $U$.

Remark.1

This shows that any TW function qualifies as an activation function; the theorem thus characterizes admissible activation functions.

Remark.2

Combined with Theorem 2.2, we have that any non-polynomial continuous function in $\mathcal{S}'$ is an admissible activation function.

All the works above are concerned with approximation of continuous functions defined on a compact set in $R^{n}$; we now discuss the problem of approximating functionals by neural networks.

Theorem.2.4(T.P.Chen.1995)

Suppose that $g\in(TW)$, $X$ is a Banach space, $K\subset X$ is a compact set, $V$ is a compact set in $C(K)$, and $f$ is a continuous functional defined on $V$; then for any $\epsilon>0$, there are a positive integer $N$, points $x_{1},...,x_{m}\in K$, and real constants $c_{i},\xi_{i,j},\theta_{i}$ ($i=1,...,N$, $j=1,...,m$) s.t.

\left|f(u)-\sum_{i=1}^{N} c_{i}g\left(\sum_{j=1}^{m}\xi _{i,j}u(x_{j})+\theta _{i}\right)\right|<\epsilon

holds for all $u\in V$.

Remark.1

The sum $\sum_{i=1}^{N}c_{i}g\left(\sum_{j=1}^{m}\xi_{i,j}u(x_{j})+\theta_{i}\right)$ can be called a neural network functional on $V$.

Remark.2

Nonlinear continuous operators can also be approximated in the same spirit (T.P. Chen, 1995).
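As an illustration of the form in Theorem 2.4 (not the constructive proof itself), the sketch below approximates the continuous functional $f(u)=\int_{0}^{1}u(x)^{2}dx$ on a family of random smooth functions by sampling each $u$ at $m$ grid points $x_{j}$ and fitting the outer coefficients $c_{i}$ by least squares over random inner parameters $\xi_{i,j},\theta_{i}$. The target functional, the function family, and $g=\tanh$ are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, n_train = 20, 200, 500
xj = np.linspace(0.0, 1.0, m)                 # sampling points x_j in K = [0, 1]
grid = np.linspace(0.0, 1.0, 400)             # fine grid for evaluating the target functional

def sample_u(size):
    """Random smooth functions u(x) = A sin(B x) + C (an assumed compact family)."""
    A = rng.uniform(-1, 1, size); B = rng.uniform(0, 5, size); C = rng.uniform(-1, 1, size)
    U = A[:, None] * np.sin(B[:, None] * xj) + C[:, None]        # rows: (u(x_1), ..., u(x_m))
    F = np.mean((A[:, None] * np.sin(B[:, None] * grid) + C[:, None]) ** 2, axis=1)  # f(u) = int_0^1 u^2 dx
    return U, F

U, F = sample_u(n_train)
xi = rng.standard_normal((N, m)) / np.sqrt(m)     # inner weights xi_{i,j} (assumed distribution)
theta = rng.standard_normal(N)                    # biases theta_i
Phi = np.tanh(U @ xi.T + theta)                   # Phi[k, i] = g(sum_j xi_{i,j} u_k(x_j) + theta_i)
c, *_ = np.linalg.lstsq(Phi, F, rcond=None)       # fit the outer coefficients c_i

U_test, F_test = sample_u(100)
pred = np.tanh(U_test @ xi.T + theta) @ c
print("max error over test functions:", np.max(np.abs(pred - F_test)))
```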

Fix Width and Adjust Depth

The universal approximation theorems above fix the depth and let the width grow. The next theorem can be regarded as a dual version of the classical universal approximation theorem: the width is bounded and the depth is allowed to grow.

Definition

A fully-connected neural network is a layered neural network in which there is a connection between every pair of nodes in adjacent layers.

Definition

$\mathcal{N}^{\sigma}_{n,m,k}$ represents the class of functions described by feedforward neural networks with $n$ neurons in the input layer, $m$ neurons in the output layer, and an arbitrary number of hidden layers, each with $k$ neurons and activation function $\sigma$.

Theorem.2.5(Zhou Lu.2017)

For any Lebesgue-integrable function $f:R^{n}\rightarrow R$ and any $\epsilon>0$, there exists a fully-connected ReLU network $\mathcal{A}$ with width $d_{m}\le n+4$ s.t. the function $F_{\mathcal{A}}$ represented by this network satisfies

\int _{R^{n}}|f(x)-F_{\mathcal{A}}(x)|dx<\epsilon

Remark.1

The classical depth-bounded theorem considers continuous functions on a compact domain and uses the $L^{\infty}$ distance; this width-bounded theorem deals with Lebesgue-integrable functions on the whole Euclidean space and therefore uses the $L^{1}$ distance.

Remark.2

This theorem implies that there is a phase transition in the expressive power of ReLU networks as the width of the network varies across $n$, the input dimension.

Remark.3

This theorem can be improved (Kidger, 2020):

For a fixed width depending only on the input and output dimensions, the class $\mathcal{N}^{\sigma}_{n,m,k}$ is dense in $C(K;R^{m})$ with respect to the uniform norm, where the activation $\sigma$ may even be nowhere differentiable;

If $\sigma$ is the ReLU, then $\mathcal{N}^{\sigma}_{n,m,k}$ is dense in $L^{p}(R^{n};R^{m})$ with respect to the usual $L^{p}$ norm;

$\sigma$ can also be a (nonaffine) polynomial, in contrast to the shallow case.

The theoretical understanding of the strength of depth starts from analyzing depth efficiency, by proving the existence of deep neural networks that cannot be realized by any shallow network unless its size is exponentially larger. However, for a comprehensive understanding of depth itself, one also needs to study the dual problem of width efficiency.

Theorem.2.6(Zhou Lu.2017)

Let $n$ be the input dimension. For any integer $k\ge n+4$, there exists $F_{\mathcal{A}}:R^{n}\rightarrow R$ represented by a ReLU neural network $\mathcal{A}$ with width $2k^{2}$ and depth $3$, s.t. for any constant $b>0$, there exists $\epsilon>0$ such that for any function $F_{\mathcal{B}}$ represented by a ReLU neural network $\mathcal{B}$ with width at most $bk^{3/2}$ and depth at most $b+2$, the following inequality holds:

\int _{R^{n}}|F_{\mathcal{A}}-F_{\mathcal{B}}|dx\ge \epsilon

Remark.1

This theorem states that there are networks for which reducing the width requires an increase in size to compensate, which is qualitatively similar to the situation for depth. At the quantitative level, however, depth efficiency enjoys an exponential lower bound, while Theorem 2.6 gives only a polynomial lower bound for width. If a corresponding polynomial upper bound could be proven, we could say that depth plays a more important role in efficiency; but even then, such a polynomial lower bound means that depth is not strictly stronger than width in efficiency.

3.Rate of Convergence

How is the approximation error related to the number of nodes in the network?

We first analyze the convergence rate of conventional approximation methods.

Theorem.3.1

Suppose $f_{m}$ is an approximation of $f\in H^{\alpha}(D)$, $D\subset R^{d}$, with $m$ degrees of freedom (by polynomials, piecewise polynomials, or a truncated Fourier series); then

\left \|f-f_{m}\right \|_{L^{2}(D)}\le c_{0}m^{-\frac{\alpha}{d}}\left \|f\right \|_{H^{\alpha}(D)}

Remark.1

If we require $\left\|f-f_{m}\right\|_{L^{2}(D)}\le\epsilon$, then $m\gtrsim\epsilon^{-d/\alpha}$. Letting $d\rightarrow\infty$, the required number of degrees of freedom grows exponentially in the dimension $d$; that is the curse of dimensionality.
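A small numerical illustration of this scaling (the values of $\epsilon$ and $\alpha$ are assumed only for the example):

```python
# Degrees of freedom m ~ eps^(-d/alpha) needed by a conventional method
# to reach L2 error eps for a function of smoothness alpha in dimension d.
eps, alpha = 1e-2, 2.0
for d in (1, 2, 4, 8, 16):
    m = eps ** (-d / alpha)
    print(f"d = {d:2d}:  m ~ {m:.1e}")
```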

Now consider approximation by neural network functions.

Definition

$\Gamma$ is the set of functions $f$ s.t. $C_{f}=\int_{R^{d}}|\omega|\,|\tilde{f}(\omega)|d\omega<\infty$, where $f$ is defined on $R^{d}$ and has the Fourier representation $f(x)=\int_{R^{d}}e^{i\omega\cdot x}\tilde{f}(\omega)d\omega$ for some complex-valued function $\tilde{f}$ for which $|\omega||\tilde{f}(\omega)|$ is integrable.

Definition

$\Gamma_{B}$ is the set of functions $f$ on the bounded set $B\subset R^{d}$ for which the representation $f(x)=\int_{R^{d}}e^{i\omega\cdot x}\tilde{F}(d\omega)$ holds for $x\in B$ for some complex-valued measure $\tilde{F}$ for which $\int_{R^{d}}|\omega|_{B}|\tilde{F}|(d\omega)$ is finite, where $|\tilde{F}|(d\omega)$ is the magnitude distribution corresponding to $\tilde{F}$.

Definition

$\Gamma_{B,C}$ is the set of all functions $f$ in $\Gamma_{B}$ s.t. for some $\tilde{F}$ representing $f$ on $B$: $\int_{R^{d}}|\omega|_{B}|\tilde{F}|(d\omega)\le C$, where $|\omega|_{B}=\sup_{x\in B}|\omega\cdot x|$.

Theorem.3.2(Barron.1993)

For every function $f\in\Gamma_{B,C}$, every sigmoidal function $\sigma$, every probability measure $\mu$ on $B$, and every $n\ge 1$, there exists a linear combination of sigmoidal functions of the form $f_{n}(x)=\sum_{k=1}^{n}c_{k}\sigma(a_{k}\cdot x+b_{k})+c_{0}$, s.t.

\int _{B}(f(x)-f_{n}(x))^{2}\mu (dx)\le \frac{(2C)^{2}}{n}

Moreover, the coefficients of the linear combination may be chosen to satisfy $\sum_{k=1}^{n}|c_{k}|\le 2C$ and $c_{0}=f(0)$.

Remark.1

Equivalently, $\left\|f-f_{n}\right\|_{L^{2}(B,\mu)}\le\frac{2C}{\sqrt{n}}$.

The order of convergence $\mathcal{O}(n^{-1/2})$ has no explicit dependence on the dimension.
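For comparison, a short calculation combining Theorems 3.1 and 3.2: to reach $L^{2}$ error $\epsilon$, the conventional methods of Theorem 3.1 need $m\gtrsim\epsilon^{-d/\alpha}$ degrees of freedom, while Theorem 3.2 needs only

n\ge\left(\frac{2C}{\epsilon}\right)^{2}

nodes, a number that does not grow with the dimension $d$; the dimension enters only through the constant $C=C_{f}$.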

Definition

For a sigmoidal $\sigma$ and a scaling parameter $\tau>0$, let $\delta_{\tau}$ denote the distance between the scaled sigmoid $\sigma(\tau z)$ and the unit step function; for any sigmoidal $\sigma$, $\delta_{\tau}\rightarrow 0$ as $\tau\rightarrow\infty$.

Theorem.3.3(Barron.1993)

Suppose $B$ is a bounded set, $\sigma$ is a sigmoidal function with scaling parameter $\tau>0$, and $f\in\Gamma_{B,C}$; then for every probability measure $\mu$ on $B$ and every $n\ge 1$ there exists a linear combination $f_{n}$ of $n$ scaled sigmoids s.t.

\left \|\bar{f}-f_{n} \right \|_{L^{2}(B,d\mu)}\le 2C(\frac{1}{\sqrt{n}}+\delta _{\tau})

where $\bar{f}(x)=f(x)-f(0)$ and $\delta_{\tau}$ is the quantity from the definition above.

Remark.1

If $\sigma$ is the unit step function, then $\delta_{\tau}=0$ for all $\tau>0$, and Theorem 3.3 reduces to Theorem 3.2.

If $\sigma$ is the logistic function $\sigma(z)=1/(1+e^{-z})$, then $\delta_{\tau}\rightarrow 0$ as $\tau\rightarrow\infty$ at rate $\mathcal{O}(\log\tau/\tau)$. Setting $\tau\sim\sqrt{n}$ yields $\delta_{\tau}=\mathcal{O}(\log n/\sqrt{n})$. Therefore, from Theorem 3.3 we still obtain a bound of order $C\log n/\sqrt{n}$, i.e. the rate $n^{-1/2}$ up to a logarithmic factor.

4.Barron Norms and Spaces

Sobolev spaces are not well suited for describing neural network approximation, so we introduce a new function space.

Suppose

u:\Omega \subset R^{d}\rightarrow R,(n_{L}=1)
u_{E}:R^{d}\rightarrow R,\quad u_{E}|_{\Omega}=u
\hat{u}_{E}(w)=\mathcal{F}(u_{E})(w)=(2\pi)^{-\frac{d}{2}}\int _{R^{d}}e^{-ix\cdot w}u_{E}(x)dx

We define Barron norm and space:

\left \|u \right \| _{B^{s}(\Omega)}=inf_{u_{E}|_{\Omega}=u}\int _{R^{d}}(1+|w|)^{s}|\hat{u}_{E}(w)|dw
B^{s}(\Omega)=\{v\in L^{1}_{loc}(\Omega);\left \|v \right \| _{B^{s}(\Omega)}<\infty\}

Remark

$B^{s}(\Omega)$ is a Banach space.

For $s=1$, set $B(\Omega)=B^{1}(\Omega)$, which is called the Barron space.
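A simple example of a function with finite Barron norm of every order (stated here only as an illustration): for the Gaussian $u(x)=e^{-|x|^{2}/2}$ on $\Omega\subset R^{d}$, we may take the extension $u_{E}=u$, whose Fourier transform is $\hat{u}_{E}(w)=e^{-|w|^{2}/2}$, so

\left\|u\right\|_{B^{s}(\Omega)}\le\int_{R^{d}}(1+|w|)^{s}e^{-|w|^{2}/2}dw<\infty \quad \text{for every } s\ge 0,

hence $u\in B^{s}(\Omega)$ for all $s$.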

5.Properties of Barron space and functions

Barron functions can be approximated by single hidden-layer neural network functions.

Theorem.5.1

Suppose $u\in B(\Omega)$; then

inf_{v\in S^{[\vec{m}]}_{3}}\left\|u-v\right\|_{L^{2}(\Omega)}\lesssim \frac{\left\|u\right\|_{B(\Omega)}}{\sqrt{n}}

holds for every given non-polynomial activation function.

Theorem.5.2

Suppose $u\in B(\Omega)$; then

inf_{v\in S^{[\vec{m}]}_{3}}\left\|u-v\right\|_{L^{\infty}(\Omega)}\lesssim \left\|u\right\|_{B(\Omega)}\sqrt{\frac{d+1}{n}}

and

inf_{v\in S^{[\vec{m}]}_{3}}\left\|u-v\right\|_{L^{\infty}(\Omega)}\lesssim \left\|u\right\|_{B(\Omega)}\frac{\sqrt{\log{n}}}{n^{1/2+1/2d}}

holds for every given non-polynomial activation function.

Remark

There exists $u\in B(\Omega)$ for which these rates are attained, which means the approximation rates cannot be improved in general.

Here $d$ is the dimension of the input vector.

Then we consider inverse approximation.

Theorem.5.3

If $u_{n}\in S^{[\vec{m}]}_{3}$ with $\left\|u_{n}\right\|_{B(\Omega)}\le C$, and $u_{n}\rightarrow u$ in $L^{2}(\Omega)$, then $u\in B(\Omega)$ and $\left\|u\right\|_{B(\Omega)}\le C$.

Remark

This theorem implies the closedness of the Barron space.

Theorem.5.4

Suppose $u\in B^{k+1}(\Omega)$; then

inf_{v\in S^{[\vec{m}]}_{3}}\left\|u-v\right\|_{H^{k}(\Omega)}\le \frac{\left\|u\right\|_{B^{k+1}(\Omega)}}{\sqrt{n}}

Remark

When , the proof is constructive, and can be replaced by .

Theorem.5.5(Barron)

Suppose $k$ is a non-negative integer and $\Omega$ is bounded; then

\left\|v\right\|_{H^{k}(\Omega)}\lesssim \left\|v\right\|_{B^{k}(\Omega)}\lesssim \left\|v\right\|_{H^{k+d/2+\epsilon}(\Omega)}

Remark

This theorem implies that $H^{k+d/2+\epsilon}(\Omega)$ embeds in $B^{k}(\Omega)$ and $B^{k}(\Omega)$ embeds in $H^{k}(\Omega)$, which means $H^{k+d/2+\epsilon}(\Omega)\subset B^{k}(\Omega)\subset H^{k}(\Omega)$.

Theorem.5.6(W.N.E.2021)

Every Barron function is at least Lipschitz-continuous in $\Omega$, and its Lipschitz constant is controlled by its Barron norm.

Next we give some simple facts(W.N.E.2021):

The growth of a Barron function at infinity is at most linear.

3-layers neural network function with finite neurons has compact support set in .

is Barron function..

nonzero Barron function and s.t. .

when and is an algebra.

s.t. .

6.Measure-theoretic definition of Barron function

Let $(\Omega,\mathcal{F},\mu)$ be a measure space; if $\mu(\Omega)=1$, then it is called a probability space.

On a probability space, every measurable function $X:\Omega\rightarrow R$ is called a random variable.

$E[X]=\int_{\Omega}X\,d\mu$, which is the expectation.

.

$F_{X}(t)=\mu(X\le t)$: the distribution function of $X$.

$\mu_{X}=\mu\circ X^{-1}$: the distribution of $X$, a probability measure on $(R,\mathcal{B}(R))$.

Recall the single hidden-layer neural network with $n$ neurons in the hidden layer:

f_{n}(x)=\frac{1}{n}\sum_{j=1}^{n}a_{j}\sigma(\vec{b}_{j}^{T}x+c_{j}),

where $(a_{j},\vec{b}_{j},c_{j})\in R\times R^{d}\times R$.

Observation: the right-hand side is analogous to a Riemann sum or Monte Carlo sum, so we suspect that, as $n\rightarrow\infty$, if the limit exists it should take the integral form

f_{\rho}(x)=\int _{\Omega}a\sigma(\vec{b}^{T}\cdot x+c)\rho(da,db,dc),x\in S_{1}^{d} or R^{d}

where $\rho$ is a Radon probability measure on the parameter space $\Omega=R\times R^{d}\times R$.

A Radon measure is a Borel measure $\mu$ satisfying:

Local finiteness: $\mu(K)$ is finite for every compact set $K$.

Regularity: $\mu$ is a regular measure.

Remark

The correspondence between $f_{\rho}$ and $\rho$ is not one-to-one: different measures $\rho$ can represent the same function.
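The following sketch illustrates the observation numerically: draw i.i.d. parameters $(a_{j},\vec{b}_{j},c_{j})\sim\rho$ and compare the finite-network average $\frac{1}{n}\sum_{j}a_{j}\sigma(\vec{b}_{j}^{T}x+c_{j})$ with a large-sample estimate of the integral $f_{\rho}(x)$. The particular choice of $\rho$ and the logistic $\sigma$ are assumptions for the illustration.

```python
import numpy as np

d = 3
x = np.array([0.2, -0.1, 0.5])                    # a fixed input point
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))        # logistic activation (assumed)

def f_n(n, seed):
    """Finite network (1/n) * sum_j a_j sigma(b_j . x + c_j) with (a_j, b_j, c_j) ~ rho i.i.d."""
    r = np.random.default_rng(seed)
    a = r.uniform(0.5, 1.5, n)                    # a_j ~ U(0.5, 1.5)   (assumed rho)
    b = 1.0 + r.standard_normal((n, d))           # b_j ~ N(1, I_d)
    c = r.standard_normal(n)                      # c_j ~ N(0, 1)
    return np.mean(a * sigma(b @ x + c))

# Reference value of the integral f_rho(x) = E_rho[a sigma(b.x + c)], via a very large sample
f_rho = f_n(1_000_000, seed=0)

for n in (10, 100, 1000, 10000):
    errs = [abs(f_n(n, seed=s) - f_rho) for s in range(1, 21)]
    print(f"n = {n:5d}:  mean |f_n(x) - f_rho(x)| ~ {np.mean(errs):.4f}")
```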

Definition

\left\|f\right\|_{B_{2}(\mathcal{P})}=inf_{\rho \in S_{f}}\sqrt{E_{\rho}[a^{2}(\left\|\vec{b}\right\|_{1}+|c|)^{2}]}=inf_{\rho \in S_{f}}\left[\int _{R\times R^{d}\times R}a^{2}(\left\|\vec{b}\right\|_{1}+|c|)^{2}d\rho\right]^{\frac{1}{2}}

where $S_{f}$ is the set of probability measures $\rho$ on $R\times R^{d}\times R$ that satisfy $f_{\rho}=f$. Then we have

B_{2}=B_{2}(\mathcal{P})=\{f\in C^{0}(R^{d})\ \mathrm{or}\ L^{1}(R^{d}):\left\|f\right\|_{B_{2}(\mathcal{P})}<\infty\}

which is called Barron-2 space.

7.Index of Barron Space

Let $1\le p\le\infty$ and define

\left\|f\right\|_{B_{p}}=inf_{\rho \in S_{f}}\left(E_{\rho}[|a|^{p}(\left\|\vec{b}\right\|_{1}+|c|)^{p}]\right)^{\frac{1}{p}}=inf_{\rho \in S_{f}}\left[\int _{R\times R^{d}\times R}|a|^{p}(\left\|\vec{b}\right\|_{1}+|c|)^{p}d\rho\right]^{\frac{1}{p}}

and $B_{p}=\{f:\left\|f\right\|_{B_{p}}<\infty\}$.

Lemma.7.1(W.N.E.2021)

Theorem.7.2(W.N.E.2021)

$\left\|f\right\|_{B_{1}}=\left\|f\right\|_{B_{p}}=\left\|f\right\|_{B_{\infty}}$ for every $1\le p\le\infty$, and therefore $B_{1}=B_{p}=B_{\infty}$.

Remark

We have defined $B^{s}(D)$ above, which is called the Barron space of order $s$; for these spaces $B^{t}(D)\ne B^{s}(D)$ when $t>s$. In contrast, the index $p$ here does not matter; in fact $B^{1}(D)=B_{1}(D)=...=B_{\infty}(D)$.

8.Nonlinear Approximation Theory

Set " aria-hidden="true">(V,\left|\cdot\right|_{V}):{\phi {j}}{j=1}^{N}\subset V<span data-formula=" are linearly independent.

" aria-hidden="true">V_{N}=span{\phi {j};j=1,2,...,N}\Rightarrow dim(V{N})=N and V_{N}\subset V<span data-formula=".

Linear Approximation

a. (optimal approximation) For $f\in V$, does there exist $f^{*}_{N}\in V_{N}$ s.t.

\left\|f-f^{*}_{N}\right\|_{V} \lesssim \left\|f-f_{N}\right\|_{V}, \quad\forall f_{N}\in V_{N}.

b. (convergence and order of convergence) We want to know: is $\lim_{N\rightarrow\infty}\inf_{f_{N}\in V_{N}}\left\|f-f_{N}\right\|_{V}=0$? Is $\inf_{f_{N}\in V_{N}}\left\|f-f_{N}\right\|_{V}\lesssim N^{-\gamma}$ for some $\gamma>0$?

c. For which $f\in V$ (e.g. $f$ in a subspace $V_{1}\subset V$) is the result in b. correct?

d. If $\lim_{N\rightarrow\infty}\inf_{f_{N}\in V_{N}}\left\|f-f_{N}\right\|_{V}=0$, what properties must $f$ possess?

Examples

spectral method: $V_{N}$ = {polynomials (or trigonometric polynomials) of order $N$ on $D\subset R^{d}$}.

finite element method: $V_{N}$ = {piecewise polynomials of order $r$ on a fixed mesh of $D\subset R^{d}$}.

drive element: $V_{N}$ = {elements of order $N$ on $D\subset R^{d}$}.

Remark

Advantage: these methods can approximate any function. Disadvantage: the rate of convergence is slow, and the curse of dimensionality appears.

Nonlinear Approximation

Set " aria-hidden="true">{V_{j}}{j\ge 0}V{0}\subset V_{1}\subset ... \subset V_{n}\subset ...cV_{j}=V_{j},\forall c\in RV_{n}+V_{n}={x+y;x,y\in V_{n}}\subset V_{rn},r\in N<span data-formula=". Then consider questions a.b.c.d..

Examples

N-term approximation: given a dictionary $S=\{\psi_{k}\}_{k\ge 1}$, set $V_{N}=\{g_{N}=\sum_{k\in\Gamma}c_{k}\psi_{k};\ c_{k}\in R,\ card(\Gamma)\le N,\ \Gamma\subset\{1,2,3,...\}\}$ and approximate $f\approx f_{N}\in V_{N}$. Then $V_{N}\subset V_{N+1}$ and $V_{N}+V_{N}\subset V_{2N}$.

free spline approximation: $S_{k}(\mathcal{T}_{h})$ denotes the $k$th-order splines on a partition $\mathcal{T}_{h}$ of $[0,1]$, i.e. $s|_{[x_{j},x_{j+1}]}\in P_{k}$ and $s\in C^{k-1}([0,1])$.

Define " aria-hidden="true">X_{n}=S(n,k)={s;\exists \mathcal{T}{h} s.t. s\in S{k}(\mathcal{T}{h}).#(\mathcal{T}{h})\le n+1}X_{n}\subset X_{n+1},X_{n}+X_{n}\subset X_{2n}<span data-formula=".

The nodes are not fixed, only their number is, so we can choose a more suitable approximation.

Example

" aria-hidden="true">f(x)=x^{\alpha},0<x<1,0<\alpha<1<span data-formula=".

" aria-hidden="true">\alpha {th}-orderx{i}=\frac{i}{n}<span data-formula=".

" aria-hidden="true">\alpha {th}-ordert{i}=(\frac{i}{n})^{\frac{1}{\alpha}}<span data-formula="

" aria-hidden="true">E_{n}{L}(f_{i})_{\infty}=\mathcal{O}(\frac{1}{n{\alpha}}),E_{n}^{NL}(f_{i})_{\infty}=\mathcal{O}(\frac{1}{n})<span data-formula=".

Dictionary Learning

Set " aria-hidden="true">(V,\left|\cdot\right|)D\subset VV_{N}={f_{N}(x)=\sum_{k=1}^{N}c_{k}\phi_{k}(x);c_{k}\in R,\phi_{k}\in D},(\vec{c},\vec{\phi})=argmin_{\vec{c},\vec{\phi}}\left|f-f_{N}\right|{V}=\epsilon{L,f,N}<span data-formula=".

Example

Consider a 3-layer ANN. There we take $D_{3}=\{\phi(x)=\tau(a\cdot x+b);\ a\in R^{d},b\in R\}$ and $V_{N}=\{f_{N}(x)=\vec{c}\cdot\vec{\phi}(x)=\sum_{k=1}^{N}c_{k}\phi_{k}(x);\ c_{k}\in R,\ \phi_{k}(x)=\tau(a_{k}\cdot x+b_{k})\in D_{3}\}$.

Therefore, neural network approximation can be regarded as dictionary learning.

Theorem.8.1.(Z.W.S.2019)

a. If " aria-hidden="true">L=3 \Rightarrow \epsilon_{L,f,N}=\mathcal{O}(N^{-\eta})L=4 \Rightarrow \epsilon_{L,f,N}=\mathcal{O}(N^{-2\eta})<span data-formula=".

b. " aria-hidden="true">L=5 \Rightarrow \epsilon_{L,f,N}=\mathcal{O}(N^{-\frac{2\alpha}{d}})<span data-formula=".

c. When " aria-hidden="true">L\ge 6\epsilon_{L,f,N}<span data-formula=" can not improve.

d. Deep narrow neural networks is ineffective than shallow wide neural networks.

" aria-hidden="true"> can not improve.

d. Deep narrow neural networks is ineffective than shallow wide neural networks.

