.. _NeuralNetworkEnsemble:

Neural Network Ensemble
-----------------------

Ensembles of fully connected deep neural networks with an arbitrary number of hidden layers and neurons, different activation functions, and weight initialization schemes are used to estimate the uncertainty of the model. In this way we are able to approximate the Bayesian posterior and use it in the context of active learning, as well as effectively reduce the chances of overfitting by averaging over the predictions of the ensemble members. Deep ensemble surrogates are expected to perform better than Gaussian processes in higher-dimensional parameter spaces due to the large number of degrees of freedom associated with parametric models. The training time of the deep ensemble surrogate scales linearly with the number of observations, in contrast to the quadratic or cubic complexity characteristic of Gaussian processes, which renders deep ensembles more suitable for large datasets.

.. _ActiveLearning.NN.optimization_step_min:

optimization_step_min (int)
"""""""""""""""""""""""""""

Minimum number of observations of the objective before the hyperparameters are optimized.

Default: Automatic choice according to the number of dimensions.

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.NN.optimization_step_max:

optimization_step_max (int)
"""""""""""""""""""""""""""

Maximum number of observations of the objective after which no more hyperparameter optimization is performed.

Default: ``300``

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.NN.min_optimization_interval:

min_optimization_interval (int)
"""""""""""""""""""""""""""""""

Minimum number of observations of the objective between two consecutive hyperparameter optimizations.

Default: ``2``

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.NN.max_optimization_interval:

max_optimization_interval (int)
"""""""""""""""""""""""""""""""

Maximum number of observations of the objective between two consecutive hyperparameter optimizations.

Default: ``20``

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.NN.optimization_level:

optimization_level (float)
""""""""""""""""""""""""""

Controls how often the hyperparameters are optimized. Small values (e.g. 0.01) lead to more frequent optimizations, while large values (e.g. 1) lead to less frequent optimizations.

Default: ``0.2``
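Taken together, the five scheduling parameters above bound when and how often hyperparameter optimization happens: the two ``optimization_step`` values delimit the phase of the run in which optimization is performed at all, the two interval values clamp the number of new observations between optimizations, and ``optimization_level`` tunes the frequency within those clamps. A minimal sketch of how these settings might be collected follows; the key names are exactly the parameters documented above, but the enclosing dict and how it would be passed to a study are assumptions, not a confirmed API:

.. code-block:: python

   # Illustrative only: key names follow this section's documentation,
   # but the surrounding configuration object is hypothetical.
   surrogate_settings = {
       "optimization_step_min": 10,      # no optimization before 10 observations
       "optimization_step_max": 300,     # no optimization after 300 observations
       "min_optimization_interval": 2,   # at least 2 new observations between optimizations
       "max_optimization_interval": 20,  # at most 20 new observations between optimizations
       "optimization_level": 0.2,        # smaller values -> more frequent optimization
   }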
.. _ActiveLearning.NN.num_samples_hyperparameters:

num_samples_hyperparameters (int)
"""""""""""""""""""""""""""""""""

Number of local searches for optimal hyperparameters.

Default: Automatic choice ``min(15, max(5, 2 * num_dim))`` according to the number of dimensions ``num_dim``.

.. _ActiveLearning.NN.min_dist:

min_dist (float)
""""""""""""""""

To speed up the hyperparameter optimization, the surrogate model can be sparsified such that data points with a distance (in terms of the length scales of the surrogate) below ``min_dist`` are neglected.

Default: ``0.0``

.. _ActiveLearning.NN.name:

name (str)
""""""""""

The name of the surrogate model.

Default: ``'loss'``

.. _ActiveLearning.NN.output_dim:

output_dim (int)
""""""""""""""""

Dimensionality of the output of the model, i.e. the dimensionality of the data added to observations.

Default: ``1``

.. _ActiveLearning.NN.output_names:

output_names (list[str])
""""""""""""""""""""""""

Assigns names to each of the ``output_dim`` outputs of the surrogate. The length of ``output_names`` must be equal to ``output_dim``. By specifying output names, the outputs can be accessed in variables as ``{output_names[0]}, {output_names[1]}``, etc.

Default: The variables can be accessed as ``{name}0, {name}1``, etc., where ``{name}`` is the name of the surrogate.

.. _ActiveLearning.NN.horizon:

horizon (int)
"""""""""""""

Specifies the number `N` of the latest observations that are used for surrogate model training. This decreases the computational effort of training at the cost of a decreased prediction accuracy.

Default: Infinite horizon.

.. _ActiveLearning.NN.transformation:

transformation (str)
""""""""""""""""""""

Specifies the name of the transformation applied to the parameter space.

Default: No transformation.

.. _ActiveLearning.NN.detect_noise:

detect_noise (bool)
"""""""""""""""""""

If true, the noise of the function values and function value derivatives is estimated by means of two hyperparameters.

Default: ``False``

.. _ActiveLearning.NN.mean_function:

mean_function (str)
"""""""""""""""""""

The name of the method used to parameterize the mean function :math:`\mu(x^*)` of the surrogate model. The ``constant`` method models :math:`\mu(x^*)` by a constant, which can be determined by an analytic expression. The ``linear`` method models :math:`\mu(x^*)` by a linear function with :math:`1+D` coefficients. The expansion coefficients are determined by a maximum likelihood estimate. Likewise, the ``quadratic`` method expands the mean function up to quadratic order with :math:`1+D+D^2` coefficients. In order to scale better with the number of dimensions :math:`D`, the ``semiquadratic`` method neglects mixed terms in the quadratic expansion, leading to only :math:`1+2D` expansion coefficients. Using a mean function other than ``constant`` increases the computational effort of hyperparameter optimization but can also lead to better predictions of the surrogate model.

Default: ``'constant'``

Choices: ``'constant'``, ``'linear'``, ``'semiquadratic'``, ``'quadratic'``.
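As a quick illustration of the coefficient counts stated above (pure arithmetic, no library API involved), the following sketch shows how the number of mean-function hyperparameters grows with the dimensionality :math:`D`:

.. code-block:: python

   def num_mean_coefficients(method: str, D: int) -> int:
       """Coefficient counts for the mean-function choices documented above."""
       return {
           "constant": 1,
           "linear": 1 + D,
           "semiquadratic": 1 + 2 * D,
           "quadratic": 1 + D + D**2,
       }[method]

   # For D = 10: constant -> 1, linear -> 11, semiquadratic -> 21, quadratic -> 111.

The quadratic growth of the full ``quadratic`` expansion is the reason ``semiquadratic`` is offered as a better-scaling compromise.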
.. _ActiveLearning.NN.warping_function:

warping_function (str)
""""""""""""""""""""""

The name of the warping function :math:`w`. A warping function performs a transformation :math:`y \to w(y,\mathbf{b})` of any function value :math:`y = f(x)` by means of a strictly monotonically increasing function :math:`w(y,\mathbf{b})` that depends on a set of hyperparameters :math:`\mathbf{b}`. The hyperparameters are chosen automatically by a maximum likelihood estimate. The choice ``identity`` leads to no warping of the function values, while ``sinh`` uses a hyperbolic sine function for warping. Using ``sinh`` can result in better predictions at the cost of increased computational effort. It should therefore only be applied to more expensive black-box functions.

Default: ``'identity'``

Choices: ``'identity'``, ``'sinh'``.

.. _ActiveLearning.NN.joint_hyperparameters:

joint_hyperparameters (bool)
""""""""""""""""""""""""""""

If false, mean functions and warping functions learn their hyperparameters for each output of the surrogate model separately. This increases the overall predictive power of the surrogate model, but it also requires more input data and a larger computational effort to determine all hyperparameters accurately.

Default: ``True``

.. _ActiveLearning.NN.correlate_outputs:

correlate_outputs (bool)
""""""""""""""""""""""""

If true, a global covariance matrix between all outputs of a multi-output surrogate model is determined and used for the predictions. In cases where some outputs are linearly dependent, this approach fails since the covariance matrix becomes ill-conditioned.

Default: ``False``

.. _ActiveLearning.NN.tolerance:

tolerance (float)
"""""""""""""""""

Numerical tolerance parameter. Used as an additional term in the diagonal entries of covariance matrices to ensure positive definiteness.

Default: ``1e-20``

.. _ActiveLearning.NN.tolerance_hypercov:

tolerance_hypercov (float)
""""""""""""""""""""""""""

Numerical tolerance parameter for the covariance-matrix hyperparameter between multiple outputs of the Gaussian process.

Default: ``1e-08``

.. _ActiveLearning.NN.covariance_matrix:

covariance_matrix (dict)
""""""""""""""""""""""""

Default: ``{'resize_step': 1000, 'iterative_start_size': 500, 'cuda_start_size': 2000, 'num_iterative_solver_steps': 0, 'online_start_size': 500, 'max_data_hyper_derivs': 10000}``

The object holds and processes information about the covariance :math:`\left(\Sigma\right)_{ij}` between all data points :math:`\mathbf{x}_1, \ldots, \mathbf{x}_N`. One important purpose is to determine the Cholesky decomposition :math:`\Sigma = L L^T` of the covariance matrix, which is used to make predictions or optimize hyperparameters. The decomposition usually takes :math:`\mathcal{O}(N^3)` steps. By using the Cholesky decomposition of a previous iteration, an iterative update can be computed in only :math:`\mathcal{O}(N^2)` steps. See :ref:`matrix configuration ` for details.

.. toctree::
   :maxdepth: 100
   :hidden:

   CovarianceMatrix1
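The :math:`\mathcal{O}(N^2)` iterative update mentioned above corresponds to a standard block-Cholesky extension. The following sketch is not the library's internal routine; it only illustrates why appending :math:`m` new data points to an already factorized :math:`n \times n` covariance block costs :math:`\mathcal{O}(n^2 m)` instead of a full :math:`\mathcal{O}((n+m)^3)` re-factorization:

.. code-block:: python

   import numpy as np
   from scipy.linalg import cholesky, solve_triangular

   def extend_cholesky(L, B, C):
       """Extend a lower Cholesky factor when new points are added.

       L: (n, n) lower Cholesky factor of the existing covariance block A
       B: (n, m) cross-covariance between old and new points
       C: (m, m) covariance among the new points
       """
       X = solve_triangular(L, B, lower=True)   # X = L^{-1} B, costs O(n^2 m)
       S = cholesky(C - X.T @ X, lower=True)    # factor of the Schur complement
       n, m = B.shape
       L_new = np.zeros((n + m, n + m))
       L_new[:n, :n] = L
       L_new[n:, :n] = X.T
       L_new[n:, n:] = S
       return L_new  # satisfies L_new @ L_new.T == [[A, B], [B.T, C]]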
.. _ActiveLearning.NN.num_NNs:

num_NNs (int)
"""""""""""""

Number of neural networks in the deep ensemble. The larger the number, the better the approximation of the underlying probability distribution related to the final objective.

Default: ``10``

.. _ActiveLearning.NN.hidden_layers_arch:

hidden_layers_arch (list[int])
""""""""""""""""""""""""""""""

The architecture of the hidden layers of each ensemble member. The number of neurons in each hidden layer is specified by an entry in the corresponding list. The architecture has to be the same for all ensemble members.

Default: ``[200, 200]``

.. _ActiveLearning.NN.activation_fun:

activation_fun (str)
""""""""""""""""""""

Nonlinear activation for all neurons in the ensemble members, except for the last-layer neurons, which by default have no activation since we are concerned with regression problems.

Default: ``'sine'``

Choices: ``'sine'``, ``'tanh'``, ``'sigmoid'``, ``'relu'``.

.. _ActiveLearning.NN.seed:

seed (bool)
"""""""""""

If true, the same initial weight distribution is enforced over the whole deep ensemble. Used for the purpose of reproducibility.

Default: ``False``

.. _ActiveLearning.NN.trainer:

trainer (dict)
""""""""""""""

The training of the network can either be performed on all available data points (``full_data_trainer``) or on a subset of points, while the best configuration is determined based on a set of validation points (``validation_trainer``).

Default: ``{'type': 'full_data_trainer', 'num_epochs': 800, 'num_expel_NNs': 0, 'learning_rate': 0.006, 'scale_grad_loss': 0.5, 'optimizer': 'Adam', 'weight_decay': 0.0, 'loss_function': 'MSE', 'train_mean_grad': True, 'batch_size_train': None, 'save_history_path': None, 'num_epochs_warm': 200, 'learning_rate_warm': 0.004, 'warm_start': False, 'shrink': 0.9, 'perturb': 0.1, 'cold_start_interval': 20}``

The dict entry ``'type'`` specifies the type of the module. The remaining entries specify its properties.

**Trainer module** (type ``'full_data_trainer'``): In this module, the ensemble is trained on the whole available dataset. This is convenient for active learning purposes since the points are expensive to obtain and the datasets are usually very small. The ensemble weights yielding the smallest loss are saved, which helps to reduce oversampling of the existing training points. This can be further refined by expelling the worst-performing ensemble members according to the training loss by adjusting the ``num_expel_NNs`` parameter. In the context of active learning, the training of the ensemble can be significantly accelerated in each iteration by invoking the warm-starting process. See :ref:`full_data_trainer configuration ` for details.

**Trainer and validation module** (type ``'validation_trainer'``): In this module, the fixed dataset provided (usually by ``study.add_many``) is split into training and testing subsets according to the ``split_ratio`` parameter. The ensemble is trained on the training subset by minimizing the corresponding loss, while the generalization performance is assessed according to the loss induced on the testing subset. The ensemble weights yielding the smallest loss on the testing subset are saved, effectively reducing the chances of overfitting. The worst-performing ensemble members according to the loss on the testing subset can be expelled by adjusting the ``num_expel_NNs`` parameter, delivering a reduced ensemble with better generalization qualities. See :ref:`validation_trainer configuration ` for details.

.. toctree::
   :maxdepth: 100
   :hidden:

   FullDataTrainer
   ValidationTrainer

.. _ActiveLearning.NN.kernel:

kernel (dict)
"""""""""""""

Default: ``{'min_ls_blocks_per_dim': 2.0, 'max_ls_blocks': 100000000, 'target_ls_blocks': 100, 'joint_ls_prior': True, 'matern_order': 5, 'weight_prior': 'StudentT', 'scale_hidden_weights': False, 'num_NNs': 300}``

Gaussian processes model the correlation (or covariance) between two function values :math:`f(x), f(x')` by means of a covariance function :math:`k(x,x') = k(||x-x'||)`, also called a kernel. The kernel is monotonically decreasing for increasing distance :math:`d = ||x-x'||` such that far-apart function values are uncorrelated while close-by function values are strongly correlated. The distance :math:`d` between function values is defined as the scaled Euclidean distance

.. math:: d = ||x-x'|| = \sqrt{\sum_{i=1}^{D}\frac{(x_i - x_i')^{2}}{l_i^{2}}},

where the hyperparameters :math:`l_1,\dots,l_D` determine the characteristic length scales at which the covariance between separated function values becomes negligible. The Matérn covariance function is defined as

.. math:: k_{\nu}(x, x') = \frac{\sigma_0^2}{\Gamma(\nu)2^{\nu-1}} \left[\sqrt{2\nu}\, d \right]^\nu K_\nu\left( \sqrt{2\nu}\, d\right),

where :math:`\sigma_0` is a hyperparameter that determines the standard deviation of possible function values, :math:`K_\nu(\cdot)` is a modified Bessel function, and :math:`\Gamma(\cdot)` is the gamma function. See :ref:`matern configuration ` for details.

.. toctree::
   :maxdepth: 100
   :hidden:

   NNMaternKernel
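For reference, the two formulas above transcribe directly into standard numerical Python (independent of the library; the mapping of ``matern_order`` to :math:`\nu` is not specified here, so :math:`\nu = 5/2` below is only an illustrative choice):

.. code-block:: python

   import numpy as np
   from scipy.special import gamma, kv  # gamma function and modified Bessel K_nu

   def matern_covariance(x, x_prime, length_scales, sigma0=1.0, nu=2.5):
       """Evaluate the Matérn covariance k_nu(x, x') as defined above."""
       x, x_prime = np.asarray(x, float), np.asarray(x_prime, float)
       # Scaled Euclidean distance with per-dimension length scales l_i.
       d = np.sqrt(np.sum(((x - x_prime) / np.asarray(length_scales)) ** 2))
       if d == 0.0:
           return sigma0**2  # limiting value k_nu(0) = sigma_0^2
       arg = np.sqrt(2.0 * nu) * d
       return sigma0**2 / (gamma(nu) * 2.0 ** (nu - 1.0)) * arg**nu * kv(nu, arg)

   # Example: the covariance decays as the points move apart.
   print(matern_covariance([0.0, 0.0], [1.0, 1.0], length_scales=[2.0, 2.0]))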