Maximisers vs. Samplers

Finding a good fit for a neural network requires combatting two issues: multimodality and overfitting. Multimodality occurs when there is more than one “good” fit, and is particularly important when there are several equally good “best” fits.

Model fitting is performed by a variety of approaches which fall into two categories: Maximisers and Samplers.

2.1 Maximisers

Maximisers fit models by finding a single set of ‘best fit’ parameters for the model. For example:
• Gradient methods
• Stochastic optimisation
• Genetic algorithms

Fig. 1: Example of gradient ascent on a non-convex objective function (such as that of a neural network), with two parameters 𝜃0 and 𝜃1. Source: Andrew Ng.

In general, such methods are prone to getting stuck in local peaks and may miss the global maximum. Moreover, in some situations there may be multiple peaks of equal quality. In such cases, choosing a single peak can amount to confining your fit to a single mode of operation. More advanced methods use multiple peaks simultaneously to find the optimal fit.
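The sensitivity to the starting point can be illustrated with a minimal sketch. The one-dimensional objective `f`, the step size, and the starting points below are hypothetical choices made for illustration only; they are not taken from any particular neural network or library.

```python
import math

def f(x):
    """Toy objective: a local peak near x = -2 and a higher peak near x = 2."""
    return math.exp(-(x + 2.0) ** 2) + 2.0 * math.exp(-(x - 2.0) ** 2)

def grad_f(x):
    """Analytic derivative of f."""
    return (-2.0 * (x + 2.0) * math.exp(-(x + 2.0) ** 2)
            - 4.0 * (x - 2.0) * math.exp(-(x - 2.0) ** 2))

def gradient_ascent(x0, step=0.1, n_steps=2000):
    """Plain fixed-step gradient ascent starting from x0."""
    x = x0
    for _ in range(n_steps):
        x += step * grad_f(x)
    return x

# Two runs that differ only in their starting point.
left = gradient_ascent(-3.0)
right = gradient_ascent(3.0)

print(f"start -3.0 -> x = {left:.3f}, f(x) = {f(left):.3f}")
print(f"start  3.0 -> x = {right:.3f}, f(x) = {f(right):.3f}")
```

The run started at -3 settles on the lower peak and never sees the higher one, because the gradient there carries essentially no information about the other mode.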

2.2 Samplers

Samplers fit models by finding a collection of ‘most typical’ parameters of the model, and typically do so using a Markov chain Monte Carlo (MCMC) approach. For example:
• Metropolis-Hastings
• Nested Sampling
• Simulated Annealing

Fig. 2: Example of samples drawn from a bimodal posterior distribution. Source: Alex Rogozhnikov.

Sampling is advantageous for two reasons. First, a perfect sampler will explore multimodal distributions correctly. Second, sampling a function naturally combats overfitting and allows you to quantify both the errors in your analysis and the fidelity of your fit.
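Both points can be sketched with a simple random-walk Metropolis-Hastings sampler. This is an illustrative assumption, not the nested sampling algorithm used by PolyChord; the bimodal target `log_target`, the proposal width, the burn-in length, and the seed are all hypothetical choices.

```python
import math
import random

def log_target(x):
    """Log-density (up to a constant) of an equal mixture of N(-2, 1) and N(2, 1)."""
    a = -0.5 * (x + 2.0) ** 2
    b = -0.5 * (x - 2.0) ** 2
    m = max(a, b)  # log-sum-exp trick to avoid underflow far from the modes
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def metropolis_hastings(n_samples, x0=0.0, proposal_sd=2.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    chain = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, proposal_sd)
        logp_new = log_target(x_new)
        # Accept with probability min(1, p(x_new) / p(x)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = x_new, logp_new
        chain.append(x)
    return chain

# Discard the first 1000 steps as burn-in (an arbitrary illustrative choice).
samples = metropolis_hastings(50_000)[1000:]

mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
frac_right = sum(s > 0.0 for s in samples) / len(samples)

print(f"mean = {mean:.2f}, std = {std:.2f}, fraction in right mode = {frac_right:.2f}")
```

Unlike a maximiser, the chain visits both modes in roughly equal proportion, and the spread of the samples gives a direct estimate of the uncertainty in the parameter rather than a single point value.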

Maximisers attempt to combat overfitting via methods such as regularisation and dropout. However, the reasons and intuition behind why such methods work are obscure, dubious, and a little ad hoc. Moreover, optimising maximiser-based neural networks requires trial and error in selecting and adjusting hyperparameters, which is neither the most effective nor the most principled approach to machine learning.

For the reasons above, the use of PolyChord represents a preference for sampling over maximising.