Title: On the Characterization of the Global Landscape of Neural Networks
Abstract: Understanding why deep neural networks perform well has attracted much attention recently. The non-convexity of the associated loss functions, which may induce a bad landscape, is one of the major concerns for neural network training, yet the recent success of neural networks suggests that their loss landscape is not too bad. Nevertheless, a systematic characterization of the landscape has yet to be done. In this thesis, we aim at a more complete understanding of the global landscape of neural networks.
In the first part, we study the existence of sub-optimal local minima for multi-layer networks. In particular, we prove that for neural networks with generic input data and smooth nonlinear activation functions, sub-optimal local minima can exist no matter how wide the network is (as long as the last hidden layer has at least two neurons). This result overturns a classical result claiming that "there exists no sub-optimal local minimum for 1-hidden-layer wide neural nets with sigmoid activation function". Moreover, it indicates that sub-optimal local minima are common for wide neural nets.
Given that sub-optimal local minima cannot be eliminated, a natural question is: what does the landscape of neural networks actually look like? In particular, does width affect the landscape? In the second part, we prove two results. On the positive side, for any continuous activation function, the loss surface of a class of wide networks has no sub-optimal basin, where a "basin" is defined as a set-wise strict local minimum. On the negative side, for a large class of networks with width below a certain threshold, we construct strict local minima that are not globally optimal. Together, these two results reveal a phase transition in the landscape from narrow to wide networks and indicate a benefit of width.
In the last part, we explore how this phase transition occurs via the "generative mechanism" of stationary points. We study a transformation called "neuron splitting", which maps a stationary point of a narrower network to stationary points of wider networks. We provide sufficient conditions under which the stationary points of the wider networks are local minima or saddle points: under certain conditions, a local minimum is mapped to a high-dimensional plateau containing both local minima and saddle points of an arbitrarily wide network, while a saddle point can only be mapped to saddle points of wider networks by neuron splitting.
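To make the "neuron splitting" transformation concrete, the following sketch (an illustrative assumption about the construction, not the thesis's exact definition) duplicates one hidden neuron of a one-hidden-layer network, copying its incoming weights and dividing its outgoing weight between the two copies; the wider network then computes exactly the same function, so a stationary point of the narrow network yields a stationary point of the wide one.

```python
import numpy as np

def forward(W, a, X):
    # one-hidden-layer net: f(x) = a^T tanh(W x)
    return np.tanh(X @ W.T) @ a

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 generic input points in R^3
W = rng.normal(size=(4, 3))    # incoming weights of 4 hidden neurons
a = rng.normal(size=4)         # outgoing weights

# split hidden neuron 0: copy its incoming weight row, and share
# its outgoing weight between the two copies (alpha and 1 - alpha)
alpha = 0.3
W_split = np.vstack([W, W[0:1]])
a_split = np.concatenate([a, [0.0]])
a_split[0], a_split[-1] = alpha * a[0], (1 - alpha) * a[0]

# the wider (5-neuron) network computes exactly the same function
assert np.allclose(forward(W, a, X), forward(W_split, a_split, X))
```

Because the split leaves the input-output map unchanged for every choice of `alpha`, it traces out a plateau of equal-loss points in the wider network's parameter space, matching the plateau picture described above.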
Altogether, these results characterize the properties of stationary points in neural networks: their existence in different settings, their location and shape, and their evolution as the network is restructured. They not only deepen our understanding of the success of today's wide neural networks, but also suggest potential methods for tackling the difficulties of training smaller ones.