.. _GaussianProcess:

Gaussian process
----------------

A Gaussian process is a surrogate model that can predict a Gaussian distribution of model values based on the training data :math:`f(x_1), f(x_2), \dots` of the model for sampled points :math:`x_1, x_2, \dots \in \mathbb{R}^D`. In the simplest case, the prediction of a noiseless, scalar function value :math:`f(x^*)` is given by a normally distributed random variable :math:`y\sim \mathcal{N}( \overline{y}(x^*),\sigma^2(x^*))` with mean and variance

.. math::

    \overline{y}(x^*) = \mu(x^*) + \sum_{ij} k(x^*,x_i) \mathbf{\Sigma}_{ij}^{-1}\left[f(x_j)-\mu(x_j)\right]

    \sigma^2(x^*) = k(x^*,x^*) - \sum_{ij} k(x^*,x_i) \mathbf{\Sigma}_{ij}^{-1} k(x_j,x^*),

where :math:`\mathbf{\Sigma}` is the covariance matrix with entries :math:`\mathbf{\Sigma}_{ij} = k(x_i, x_j)`. The functional behavior of the mean :math:`\mu(x)` and the covariance function (also called kernel) :math:`k(x, x')` depends on a set of hyperparameters :math:`\mathbf{h}` whose values are chosen to maximize the likelihood of the observed function values. Depending on the configuration, the Gaussian process can involve additional hyperparameters, which are optimized in the same manner.
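The following is a minimal, illustrative MATLAB sketch (not part of the toolkit's interface) that evaluates the mean and variance formulas above for a single test point. The squared-exponential kernel and the constant zero mean used here are example choices, not the defaults of the surrogate model.

.. code-block:: matlab

    % Illustrative sketch only: evaluate the posterior mean and variance formulas
    % for one test point x_star; the kernel and mean below are example choices.
    kernel = @(x, xp) exp(-0.5 * sum((x - xp).^2));  % assumed squared-exponential kernel
    mu     = @(x) 0;                                 % assumed constant (zero) mean function
    X = [0; 0.5; 1];                                 % sampled points x_1, ..., x_N (N x D)
    f = [1; 0.2; -0.5];                              % observations f(x_1), ..., f(x_N)
    x_star = 0.25;                                   % test point x*

    N = size(X, 1);
    Sigma = zeros(N); k_star = zeros(N, 1); m = zeros(N, 1);
    for i = 1:N
        k_star(i) = kernel(x_star, X(i, :));         % k(x*, x_i)
        m(i) = mu(X(i, :));                          % mu(x_i)
        for j = 1:N
            Sigma(i, j) = kernel(X(i, :), X(j, :));  % Sigma_ij = k(x_i, x_j)
        end
    end
    y_bar  = mu(x_star) + k_star' * (Sigma \ (f - m))            % predictive mean
    sigma2 = kernel(x_star, x_star) - k_star' * (Sigma \ k_star) % predictive variance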
.. _ActiveLearning.GP.optimization_step_min:

optimization_step_min (int)
"""""""""""""""""""""""""""

Minimum number of observations of the objective before the hyperparameters are optimized.

Default: Automatic choice according to the number of dimensions.

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.GP.optimization_step_max:

optimization_step_max (int)
"""""""""""""""""""""""""""

Maximum number of observations of the objective after which no more hyperparameter optimization is performed.

Default: ``300``

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.GP.min_optimization_interval:

min_optimization_interval (int)
"""""""""""""""""""""""""""""""

Minimum number of new observations of the objective after which the hyperparameters are optimized again.

Default: ``2``

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.GP.max_optimization_interval:

max_optimization_interval (int)
"""""""""""""""""""""""""""""""

Maximum number of new observations of the objective after which the hyperparameters are optimized again.

Default: ``20``

.. note:: Each derivative observation counts as an independent observation.

.. _ActiveLearning.GP.optimization_level:

optimization_level (float)
""""""""""""""""""""""""""

Controls how often the hyperparameters are optimized. Small values (e.g. 0.01) lead to more frequent optimizations; large values (e.g. 1) lead to less frequent optimizations.

Default: ``0.2``

.. _ActiveLearning.GP.num_samples_hyperparameters:

num_samples_hyperparameters (int)
"""""""""""""""""""""""""""""""""

Number of local searches for optimal hyperparameters.

Default: Automatic choice ``min(15, max(5, 2 * num_dim))`` according to the number of dimensions ``num_dim``.

.. _ActiveLearning.GP.min_dist:

min_dist (float)
""""""""""""""""

To speed up the hyperparameter optimization, the surrogate model can be sparsified such that data points with a distance (in terms of the length scales of the surrogate) below ``min_dist`` are neglected.

Default: ``0.0``

.. _ActiveLearning.GP.name:

name (str)
""""""""""

The name of the surrogate model.

Default: ``"loss"``

.. _ActiveLearning.GP.output_dim:

output_dim (int)
""""""""""""""""

Dimensionality of the output of the model, i.e. the dimensionality of the data added to the observations.

Default: ``1``

.. _ActiveLearning.GP.output_names:

output_names (cell{str})
""""""""""""""""""""""""

Allows names to be assigned to each of the ``output_dim`` outputs of the surrogate. The length of ``output_names`` must be equal to ``output_dim``. By specifying output names, the outputs can be accessed in variables as ``{output_names[0]}, {output_names[1]}``, etc.

Default: By default, the outputs can be accessed as ``{name}0, {name}1``, etc., where ``{name}`` is the name of the surrogate.

.. _ActiveLearning.GP.horizon:

horizon (int)
"""""""""""""

Specifies the number :math:`N` of latest observations which are used for Gaussian process regression. This decreases the computational effort of determining new sampling positions at the cost of decreased prediction accuracy.

Default: Infinite horizon.

.. _ActiveLearning.GP.transformation:

transformation (str)
""""""""""""""""""""

Specifies the name of the transformation applied to the parameter space.

Default: No transformation.

.. _ActiveLearning.GP.detect_noise:

detect_noise (bool)
"""""""""""""""""""

If true, the noise of the function values and function value derivatives is estimated by means of two hyperparameters.

Default: ``false``

.. _ActiveLearning.GP.mean_function:

mean_function (str)
"""""""""""""""""""

The name of the method to parameterize the mean function :math:`\mu(x^*)` of the surrogate model. The ``constant`` method models :math:`\mu(x^*)` by a constant which can be determined by an analytic expression. The ``linear`` method models :math:`\mu(x^*)` by a linear function with :math:`1+D` coefficients. The expansion coefficients are determined by a maximum likelihood estimate. Likewise, the ``quadratic`` method expands the mean function up to quadratic order with :math:`1+D+D^2` coefficients. In order to scale better with the number of dimensions :math:`D`, the ``semiquadratic`` method neglects mixed terms in the quadratic expansion, leading to only :math:`1+2D` expansion coefficients (see the example below). Using a mean function other than ``constant`` increases the computational effort of hyperparameter optimization but can also lead to better predictions of the surrogate model.

Default: ``"constant"``

Choices: ``"constant"``, ``"linear"``, ``"semiquadratic"``, ``"quadratic"``.
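As an illustration of the coefficient counts above, neglecting the mixed terms leaves a ``semiquadratic`` mean function of the form

.. math::

    \mu(x) = c_0 + \sum_{i=1}^{D} a_i x_i + \sum_{i=1}^{D} b_i x_i^2,

with :math:`1+2D` coefficients :math:`c_0, a_1, \dots, a_D, b_1, \dots, b_D`, whereas the full ``quadratic`` expansion additionally includes the mixed terms :math:`x_i x_j` for :math:`i \neq j`. The coefficient symbols are chosen here only for illustration.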
.. _ActiveLearning.GP.warping_function:

warping_function (str)
""""""""""""""""""""""

The name of the warping function :math:`w`. A warping function performs a transformation :math:`y \to w(y,\mathbf{b})` of any function value :math:`y = f(x)` by means of a strictly monotonically increasing function :math:`w(y,\mathbf{b})` that depends on a set of hyperparameters :math:`\mathbf{b}`. The hyperparameters are chosen automatically by a maximum likelihood estimate. The choice ``identity`` leads to no warping of the function values, while ``sinh`` uses a hyperbolic sine function for warping. Using ``sinh`` can result in better predictions at the cost of increased computational effort. It should therefore only be applied to more expensive black-box functions.

Default: ``"identity"``

Choices: ``"identity"``, ``"sinh"``.

.. _ActiveLearning.GP.joint_hyperparameters:

joint_hyperparameters (bool)
""""""""""""""""""""""""""""

If false, mean functions and warping functions learn their hyperparameters for each output of the surrogate model separately. This increases the overall predictive power of the surrogate model, but it also requires more input data and larger computational effort to determine all hyperparameters accurately.

Default: ``true``

.. _ActiveLearning.GP.correlate_outputs:

correlate_outputs (bool)
""""""""""""""""""""""""

If true, a global covariance matrix between all outputs of a multi-output surrogate model is determined and used for the predictions. In cases where some outputs are linearly dependent, this approach fails since the covariance matrix becomes ill-conditioned.

Default: ``false``

.. _ActiveLearning.GP.tolerance:

tolerance (float)
"""""""""""""""""

Numerical tolerance parameter. Used as an additional term in the diagonal entries of covariance matrices to ensure positive definiteness.

Default: ``1e-20``

.. _ActiveLearning.GP.tolerance_hypercov:

tolerance_hypercov (float)
""""""""""""""""""""""""""

Numerical tolerance parameter for the covariance matrix hyperparameters between multiple outputs of the Gaussian process.

Default: ``1e-08``

.. _ActiveLearning.GP.covariance_matrix:

covariance_matrix (struct)
""""""""""""""""""""""""""

Default: ``struct('resize_step', 1000, 'iterative_start_size', 500, 'cuda_start_size', 2000, 'num_iterative_solver_steps', 0, 'online_start_size', 500, 'max_data_hyper_derivs', 10000)``

The object holds and processes information about the covariance :math:`\left(\Sigma\right)_{ij}` between all data points :math:`\mathbf{x}_1, \ldots, \mathbf{x}_N`. One important purpose is to determine the Cholesky decomposition :math:`\Sigma = L L^T` of the covariance matrix, which is used to make predictions or optimize hyperparameters. The decomposition usually takes :math:`\mathcal{O}(N^3)` steps. By using the Cholesky decomposition of a previous iteration, an iterative update can be computed in only :math:`\mathcal{O}(N^2)` steps. See :ref:`matrix configuration ` for details.

.. toctree::
   :maxdepth: 100
   :hidden:

   CovarianceMatrix

.. _ActiveLearning.GP.num_fantasies:

num_fantasies (int)
"""""""""""""""""""

In the case of noisy training data, the predictions of the Gaussian process have some uncertainty even at the sampling positions. Some acquisition functions, like noisy expected improvement, require the Gaussian process to provide a set of noiseless predictions instead. The size of this set can be configured by ``num_fantasies``. Higher values lead to more stable values of the acquisition function at the cost of increased computational complexity.

Default: ``16``

.. _ActiveLearning.GP.kernel:

kernel (struct)
"""""""""""""""

Default: ``struct('min_ls_blocks_per_dim', 0.25, 'max_ls_blocks', 1000000.0, 'target_ls_blocks', 100, 'joint32ls_prior', true, 'matern_order', 5)``

Gaussian processes model the correlation (or covariance) between two function values :math:`f(x), f(x')` by means of a covariance function :math:`k(x,x') = k(||x-x'||)`, also called kernel. The kernel is monotonically decreasing for increasing distance :math:`d = ||x-x'||` such that far-apart function values are uncorrelated while close-by function values are strongly correlated. The distance :math:`d` between function values is defined as the scaled Euclidean distance

.. math::

    d = ||x-x'|| = \sqrt{\sum_{i=1}^{D}\frac{(x_i-x_i')^{2}}{l_i^{2}}},

where the hyperparameters :math:`l_1,\dots,l_D` determine the characteristic length scales at which the covariance between separated function values becomes negligible. The Matérn covariance function is defined as

.. math::

    k_{\nu}(x, x') = \frac{\sigma_0^2}{\Gamma(\nu)2^{\nu-1}} \left[\sqrt{2\nu} d \right]^\nu K_\nu\left( \sqrt{2\nu} d\right),

where :math:`\sigma_0` is a hyperparameter that determines the standard deviation of possible function values, :math:`K_\nu(\cdot)` is a modified Bessel function, and :math:`\Gamma(\cdot)` is the gamma function.
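As a rough illustration (not the toolkit's implementation), the scaled distance and the Matérn covariance above can be evaluated as follows; the length scales, :math:`\sigma_0`, and the order :math:`\nu = 5/2` are example values.

.. code-block:: matlab

    % Illustrative evaluation of the scaled distance and Matérn covariance;
    % all hyperparameter values below are example choices.
    l      = [1, 2];                     % length scales l_1, ..., l_D
    sigma0 = 1;                          % standard deviation hyperparameter sigma_0
    nu     = 5/2;                        % Matérn order nu (example value)

    x  = [0, 0];
    xp = [1, 0];
    d  = sqrt(sum(((x - xp) ./ l).^2));  % scaled Euclidean distance

    if d == 0
        k = sigma0^2;                    % limit d -> 0 gives k(x, x) = sigma_0^2
    else
        a = sqrt(2 * nu) * d;
        k = sigma0^2 / (gamma(nu) * 2^(nu - 1)) * a^nu * besselk(nu, a);
    end
    k                                    % Matérn covariance k_nu(x, x')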
See :ref:`matern configuration ` for details.

.. toctree::
   :maxdepth: 100
   :hidden:

   GPMaternKernel

.. _ActiveLearning.GP.outlier_detector:

outlier_detector (struct)
"""""""""""""""""""""""""

Default: ``None``

The parameter is optional and may also have the value ``None``. Gaussian process regression can cope with noisy input data that is subject to Gaussian noise. Sometimes, however, input data can also be corrupted and fall outside the probability distribution. Corrupted data can heavily decrease the prediction accuracy of Gaussian processes. The outlier detector assumes that the input data :math:`\mathbf{y}` follows a Student-t distribution rather than a Gaussian distribution. The Student-t distribution is parameterized by the degrees of freedom :math:`\text{dof}` and a scale parameter :math:`\sigma`. The most probable vector of function values :math:`\hat{\mathbf{f}}` (i.e. the mode) given the observed data :math:`\mathbf{y}` is determined by an expectation maximization (EM) algorithm. An observation :math:`y_i` is assumed to be an outlier if :math:`(\hat{f}_i - y_i)^2 > t\cdot \text{dof}\cdot \sigma^2`, where :math:`t` is the tolerance of the outlier detection. See :ref:`detector configuration ` for details.

.. toctree::
   :maxdepth: 100
   :hidden:

   OutlierDetector
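The outlier criterion itself amounts to a simple elementwise check. In this illustrative sketch, ``f_hat``, ``dof``, ``sigma``, and the tolerance ``t`` are assumed to be provided by the EM fit and the detector configuration:

.. code-block:: matlab

    % Illustrative check of the outlier criterion; all values are example data.
    y     = [0.1, 0.2, 5.0, 0.3];        % observed values (one corrupted entry)
    f_hat = [0.1, 0.2, 0.25, 0.3];       % mode of the Student-t model (EM result)
    dof   = 4; sigma = 0.1; t = 10;      % assumed degrees of freedom, scale, tolerance
    is_outlier = (f_hat - y).^2 > t * dof * sigma^2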