This article provides a short proof of the global convergence of GD in training deep ReLU networks. For arbitrary labels, it is shown that a linear, quadratic or cubic width suffices for the result to hold (depending on the initialization).
Earlier works showed that gradient descent (GD) can find a global solution when all the hidden layers are polynomially wide. However, this condition makes neural networks operate in a kernel regime. This article shows that global convergence can be proved for deep pyramidal nets -- a much more empirically relevant architecture in which only the first hidden layer needs to be wide and the remaining layers have constant, non-increasing widths. For this pyramidal network of constant widths, GD provably moves the feature representations of the network by at least Ω(1), and hence training goes beyond the NTK/lazy regime, where this change is typically o(1).
We cast the optimization problem of a polynomial network as a nonlinear eigenvalue problem. We study the uniqueness and global optimality of the solution, and propose a generalized power method to solve it.
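As a rough illustration of the starting point, the classical power iteration that a generalized power method extends can be sketched as follows. This is a minimal NumPy sketch of the textbook algorithm for a symmetric matrix, not the paper's method for nonlinear eigenvalue problems:

```python
import numpy as np

def power_method(A, num_iters=200):
    """Classical power iteration: repeatedly apply A and renormalize.

    For a symmetric matrix A, this converges to the eigenvector of the
    eigenvalue with largest magnitude (assuming a spectral gap).
    """
    x = np.ones(A.shape[0]) / np.sqrt(A.shape[0])  # arbitrary unit start
    for _ in range(num_iters):
        x = A @ x
        x /= np.linalg.norm(x)
    # Return the eigenvector estimate and its Rayleigh quotient.
    return x, x @ A @ x

# Example: eigenvalues of this matrix are 3 and 1.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
v, lam = power_method(A)
```

A generalized power method replaces the linear map x → Ax with a problem-specific nonlinear update while keeping the same iterate-and-renormalize structure.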
For shallow networks, it is shown that having N+1 hidden neurons is both necessary and sufficient for the training loss function of neural networks to have connected sublevel sets. For deeper architectures, this condition is shown to be sufficient. Whether it is also necessary for multilayer networks remains an open problem.
This article gives a condition for which certain points in parameter space can be connected by a continuous path along which there are no barriers or jumps in the loss landscape. This sounds similar to results on connected sublevel sets, but it is weaker in the sense that the connectivity is only shown for a subset of solutions. At the same time, the requirement on over-parameterization is also weaker. Empirically, it is found that the provided condition can capture well the mode connectivity phenomenon concerning SGD solutions in deep learning.
This article proves the connectivity of sublevel sets of the loss function of deep pyramidal networks.
Considering neural networks as directed acyclic graphs, this article shows that the loss function has no spurious valleys as long as there are enough skip-connections from lower layers to the output layer. Empirically, it is shown that adding random skip-connections from lower layers to the output can remove not only spurious valleys but also vanishing-gradient issues, which makes the training of very deep networks much more stable and efficient.
In order for neural networks to learn disconnected decision regions in the input space, at least one of the hidden layers should have more neurons than the input dimension.
This article shows that a standard convolutional layer suffices to memorize any N samples as long as the number of parameters exceeds N. It also provides a condition for global optimality of critical points in deep CNNs.
This article studies the global optimality of local minima for deep nonlinear networks. The proof exploits the Implicit Function Theorem to characterize the optimality of local minima in terms of their non-degeneracy conditions.
The spectrum of the NTK has found applications in proving memorization capacity, convergence of GD and generalization bounds in certain regimes. This article provides tight lower bounds on the smallest eigenvalue of the NTK matrix for Gaussian weights, both in the limit of infinitely wide networks, and for finite-width networks.
He's and LeCun's initializations are very popular methods for initializing neural network weights in deep learning. However, the formulas of these initializations in the original papers were derived only under the assumption that all the hidden neurons are somewhat independent -- a condition known to be satisfied only for infinitely wide networks. This article provides a rigorous derivation for the case of networks with finite albeit large widths.
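For reference, He's formula sets the weight variance of each layer to 2/fan_in, so that ReLU pre-activation variances are (heuristically) preserved across layers; LeCun's uses 1/fan_in. A minimal sketch of the standard formula (not the finite-width correction derived in the article):

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: i.i.d. Gaussian weights with variance 2 / fan_in.

    The factor 2 compensates for ReLU zeroing out roughly half of the
    pre-activations; LeCun initialization would use variance 1 / fan_in.
    """
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

W = he_init(fan_in=256, fan_out=128)
```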
This extends our NIPS'16 paper on polynomial networks to more general optimization problems.
This is an extension of our CVPR paper on hypergraph matching.
This article studies hypergraph matching through the lens of multilinear optimization.
This article proposes a model for learning the compatibility of data and class embeddings for zero-shot learning.
Convex Optimization
Advanced Topics in Machine Learning (Seminar)
Loss surface of deep and wide neural networks
Math Machine Learning seminar at MPI-MIS and UCLA (virtual), 2020
Optimization landscape of deep neural networks
Simons Institute for the Theory of Computing, Berkeley, California. 2019
Optimization landscape of deep CNNs
Microsoft Research Redmond (MSR) 2018