.. _ActiveLearning:

================================================
ActiveLearning
================================================

Contents
=========

:`Purpose`_: The purpose of the driver.
:`Tutorials`_: Tutorials demonstrating the application of this driver.
:`Driver Interface`_: Driver-specific methods of the Python interface.
:`Configuration`_: Configuration of the driver.                   


Purpose
=======


The driver defines a general active learning process that
can be configured for many different purposes such as optimization,
integration, or learning the behavior of one or more expensive black box functions.

.. tip::
   The driver is very versatile but also requires a detailed configuration. For
   some use cases there exist specialized drivers that are based on
   this driver.
   For example, for the Bayesian optimization of a scalar function the
   :ref:`BayesianOptimization` driver is easier to set up.

Each black box function
:math:`\mathbf{f}_{\rm bb}(\mathbf{p}_{\rm design}, \mathbf{p}_{\rm env})`
can depend on a vector of design parameters :math:`\mathbf{p}_{\rm design}\in\mathcal{X}` 
and possibly environment parameters :math:`\mathbf{p}_{\rm env}\in\mathcal{E}`. The
design space :math:`\mathcal{X}` and environment :math:`\mathcal{E}` can be
configured when creating the study (see :func:`~jcmoptimizer.Client.create_study`).
   
.. tip::
                
   The output of a black-box function can also be vector-valued.
   For example, the black box function
   :math:`\mathbf{f}_{\rm bb}: \mathbf{p}_{\rm design} \mapsto (a,b,c)\in\mathbb{R}^3`
   could map a design value to three output values.
   The goal of the study could be to find design parameters that
   minimize :math:`a^2 - b^2` and fulfill the constraint :math:`c \leq 1`.

The most important configuration of the driver concerns three parameters:

:Surrogates: The dependence of the output values on the parameters is learned via
  one or more surrogate models (see `ActiveLearning.surrogates`_ parameter).

:Variables: The target values of the study are defined by variables that are based on the
  outputs of surrogate models or other variables (see `ActiveLearning.variables`_
  parameter).

:Objectives: Finally, one defines objectives of the active learning, such as minimizing
  a variable or constraining a variable to a specified interval
  (see `ActiveLearning.objectives`_ parameter).
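
As a rough sketch of how the three parameters interact, the example from the tip
above (minimize :math:`a^2 - b^2` subject to :math:`c \leq 1`) might be configured
as follows. The dict keys are taken from the parameter descriptions below. The use
of three scalar surrogates named ``a``, ``b`` and ``c``, the variable names
``target`` and ``c_value``, and the ``name`` key of the expression variable are
assumptions made for illustration only.

.. code-block:: python

   # Hypothetical configuration sketch; all names are illustrative assumptions.
   config = dict(
       surrogates=[
           dict(type="GP", output_dim=1, name="a"),
           dict(type="GP", output_dim=1, name="b"),
           dict(type="GP", output_dim=1, name="c"),
       ],
       variables=[
           # Target value a^2 - b^2 built from the surrogate outputs
           dict(type="Expression", name="target", expression="a^2 - b^2"),
           # Scalar variable for the constrained output (mirrors the default variable)
           dict(type="SingleSelector", name="c_value", input="c", select_by_name="c"),
       ],
       objectives=[
           dict(type="Minimizer", variable="target", strategy="EI"),
           dict(type="Constrainer", variable="c_value", upper_bound=1.0),
       ],
   )
   # With an existing study object one would then call: study.configure(**config)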
          
 


Tutorials
=========
.. toctree::

   ../tutorials/changing_environment

   ../tutorials/harmonic_oscillator_fit

   ../tutorials/benchmark

   ../tutorials/sensitivity_analysis

   ../tutorials/multi_objective_optimization

   ../tutorials/active_surrogate_training

   ../tutorials/physics_informed_bayesian_optimization



Driver Interface
================

The driver instance can be obtained by :attr:`.Study.driver`.

.. currentmodule:: jcmoptimizer
.. autoclass:: ActiveLearning
   :members:
   :inherited-members:
      

   



Configuration
=============
The configuration parameters can be set by calling, e.g.

.. code-block:: python

   study.configure(example_parameter1 = [1,2,3], example_parameter2 = True)



.. _ActiveLearning.max_iter:


max_iter (int)
""""""""""""""
   Maximum number of evaluations of the studied system. 

   Default: Infinite number of evaluations.

.. _ActiveLearning.max_time:


max_time (float)
""""""""""""""""
   Maximum run time of the study in seconds. The time is counted from the moment that the parameter is set or reset.

   Default: ``inf``

.. _ActiveLearning.max_memory:


max_memory (float)
""""""""""""""""""
   Maximum amount of RAM in percent that the driver may use. 

   Default: ``80.0``

.. _ActiveLearning.num_parallel:


num_parallel (int)
""""""""""""""""""
   Number of parallel evaluations of the studied system. 

   Default: ``1``

.. _ActiveLearning.transformations:


transformations (list[dict])
""""""""""""""""""""""""""""
   A list of transformations of the parameter space. A surrogate can learn the mapping from the original or transformed parameter space to observation data for the model. 

   Default: ``[]``

   .. admonition:: Example

      A transformation that maps the absolute parameters :math:`x_1, x_2` to relative parameters :math:`r = x_2-x_1, R=0.5(x_1 + x_2)`.

      .. code-block:: python

         [{'type': 'general', 'name': 'trans1', 'parameters': [{'type': 'expression', 'name': 'r', 'expression': 'x2 - x1', 'bounds': [-2,2]},
          {'type': 'expression', 'name': 'R', 'expression': '0.5*(x1 + x2)', 'bounds': [-2,2]}]}]

   Each element of the list must be a dict. In the following, the list elements are described:

   An input transformation maps a parameter space into another parameter space. The transformation is defined by a list of parameters spanning the new parameter space.
   
   See :ref:`trans configuration <GeneralTransformedSpace>` for details.
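
   The mapping in the example above can be checked with a few lines of plain
   Python. This is only a sketch of the arithmetic, not of the driver's internals;
   the function names ``to_relative`` and ``to_absolute`` are hypothetical.

   .. code-block:: python

      def to_relative(x1, x2):
          """Map absolute parameters (x1, x2) to relative ones (r, R)."""
          return x2 - x1, 0.5 * (x1 + x2)

      def to_absolute(r, R):
          """Inverse map from (r, R) back to (x1, x2)."""
          return R - 0.5 * r, R + 0.5 * r

      r, R = to_relative(1.0, 2.0)   # r = 1.0, R = 1.5
      x1, x2 = to_absolute(r, R)     # recovers (1.0, 2.0)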

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   GeneralTransformedSpace



.. _ActiveLearning.surrogates:


surrogates (list[dict])
"""""""""""""""""""""""
   A list of surrogate models used to learn the behavior  of the black box function. 

   Default: ``[{'type': 'GP', 'optimization_step_max': 300, 'min_optimization_interval': 2, 'max_optimization_interval': 20, 'optimization_level': 0.2, 'min_dist': 0.0, 'name': 'loss', 'output_dim': 1, 'detect_noise': False, 'mean_function': 'constant', 'warping_function': 'identity', 'joint_hyperparameters': True, 'correlate_outputs': False, 'tolerance': 1e-20, 'tolerance_hypercov': 1e-08, 'covariance_matrix': {'resize_step': 200, 'iterative_start_size': 500, 'cuda_start_size': 2000, 'num_iterative_solver_steps': 0, 'online_start_size': 500, 'max_data_hyper_derivs': 1000}, 'num_fantasies': 16, 'kernel': {'min_ls_blocks_per_dim': 0.25, 'max_ls_blocks': 1000000.0, 'target_ls_blocks': 100, 'joint32ls_prior': True, 'matern_order': 5}, 'outlier_detector': None}]``

   .. admonition:: Example

      Two Gaussian process surrogates that learn the dependence of altogether 4 black-box outputs on the design parameters.

      .. code-block:: python

         
         study.configure(surrogates=[
             dict(type="GP", output_dim=3, name="intensities"),
             dict(type="GP", output_dim=1, name="loss"),
         ])
         # Add one observation with data for both surrogate models
         obs = study.new_observation()
         obs.add([1.0, 2.0, 3.0], model_name="intensities")
         obs.add([6.0], model_name="loss")
         study.add_observation(obs, design_value=[7.4, 3.1])
                             

   Each element of the list must be a dict. The dict entry ``type`` specifies the type of the element. The remaining entries  specify its properties. In the following, all possible  list element types are described:

      **Gaussian process** (type ``'GP'``): A Gaussian process is a surrogate model that can predict a Gaussian distribution of model values based on the training data of the model :math:`f(x_1), f(x_2), \dots` for sampled points :math:`x_1, x_2, \dots \in \mathbb{R}^D`.  
      
      In the simplest case, the prediction of a noiseless, scalar function value :math:`f(x^*)` is given by a normally distributed random variable :math:`y\sim \mathcal{N}(\overline{y}(x^*),\sigma^2(x^*))` with mean and variance
      
      .. math::   \overline{y}(x^*) = \mu(x^*) + \sum_{ij} k(x^*,x_i) \mathbf{\Sigma}_{i j}^{-1}[f(x_j)-\mu(x_j)]  
      
        \sigma^2(x^*) = k(x^*,x^*) - \sum_{ij} k(x^*,x_i) \mathbf{\Sigma}_{i j}^{-1} k(x_j,x^*),  
      
      where :math:`\mathbf{\Sigma}` is the covariance matrix with entries  :math:`\mathbf{\Sigma}_{ij} = k(x_i, x_j)`.  
      
      The functional behavior of the mean :math:`\mu(x)` and the covariance function (also called kernel) :math:`k(x, x')` depends on a set of hyperparameters :math:`\mathbf{h}` whose values are chosen to maximize the likelihood of the observed function values. Depending on the configuration, the Gaussian process can depend on further hyperparameters that are optimized in the same manner.
      
      See :ref:`GP configuration <GaussianProcess>` for details.
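
      The two formulas above can be evaluated directly for a small one-dimensional
      data set. The following is a minimal sketch in plain Python, assuming a
      constant mean :math:`\mu`, a Matérn-5/2 kernel with fixed unit length scale,
      and no hyperparameter optimization. The helper names ``matern52``, ``solve``
      and ``gp_predict`` are hypothetical and not part of the driver interface.

      .. code-block:: python

         import math

         def matern52(x, y, length_scale=1.0):
             """Matern-5/2 kernel k(x, y) with unit prior variance."""
             s = math.sqrt(5.0) * abs(x - y) / length_scale
             return (1.0 + s + s * s / 3.0) * math.exp(-s)

         def solve(A, b):
             """Solve A x = b by Gaussian elimination with partial pivoting."""
             n = len(b)
             M = [row[:] + [b[i]] for i, row in enumerate(A)]
             for col in range(n):
                 piv = max(range(col, n), key=lambda r: abs(M[r][col]))
                 M[col], M[piv] = M[piv], M[col]
                 for r in range(col + 1, n):
                     f = M[r][col] / M[col][col]
                     for c in range(col, n + 1):
                         M[r][c] -= f * M[col][c]
             x = [0.0] * n
             for i in reversed(range(n)):
                 s = sum(M[i][j] * x[j] for j in range(i + 1, n))
                 x[i] = (M[i][n] - s) / M[i][i]
             return x

         def gp_predict(x_star, xs, fs, mu=0.0):
             """Posterior mean and variance at x_star for noiseless data (xs, fs)."""
             Sigma = [[matern52(a, b) for b in xs] for a in xs]
             k_star = [matern52(x_star, a) for a in xs]
             alpha = solve(Sigma, [f - mu for f in fs])   # Sigma^{-1} (f - mu)
             beta = solve(Sigma, k_star)                  # Sigma^{-1} k_star
             mean = mu + sum(k * a for k, a in zip(k_star, alpha))
             var = matern52(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
             return mean, max(var, 0.0)                   # clip round-off below zero

         xs, fs = [0.0, 1.0, 2.0], [1.0, 0.5, -0.3]
         mean, var = gp_predict(1.0, xs, fs)   # at a training point: mean ~ 0.5, var ~ 0

      At a training point the prediction reproduces the observed value with
      vanishing variance. Far away from all data the prediction falls back to the
      prior mean :math:`\mu` and prior variance :math:`k(x^*,x^*)`.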

      **Neural Network Ensemble** (type ``'NN'``): An ensemble of neural networks provides one prediction per ensemble member, which allows computing statistical measures such as the mean and standard deviation of the predictions. In this way, the ensemble can approximate Bayesian posteriors like those obtained from Gaussian processes.
      
      For a medium number of data points, training the ensemble is computationally more demanding than training a Gaussian process. For large data sets, however, the linear scaling of the training effort becomes beneficial, since the effort for Gaussian processes scales at least quadratically with the number of data points.
      
      Another advantage of neural networks over Gaussian processes is that they can learn more complex functions with non-stationary behaviour, i.e. different correlation lengths in different regions of the parameter space.      
      
      See :ref:`NN configuration <NeuralNetworkEnsemble>` for details.
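
      The ensemble statistics can be illustrated without any neural network
      library. In the sketch below, three hypothetical toy functions stand in for
      trained ensemble members. The names ``members`` and ``ensemble_predict`` are
      assumptions for illustration.

      .. code-block:: python

         import math
         import statistics

         # Hypothetical ensemble: each member is a slightly different fit of sin(p)
         members = [lambda p, s=s: math.sin(p) + s * p for s in (-0.1, 0.0, 0.1)]

         def ensemble_predict(p):
             """Mean and (sample) standard deviation of the member predictions."""
             preds = [member(p) for member in members]
             return statistics.mean(preds), statistics.stdev(preds)

         mean, std = ensemble_predict(2.0)   # mean ~ sin(2.0), std ~ 0.2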

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   GaussianProcess
   NeuralNetworkEnsemble



.. _ActiveLearning.variables:


variables (list[dict])
""""""""""""""""""""""
   A list of variables that can depend on the outputs  of surrogates or other variables that are further up in the list. 

   Default: ``[{'type': 'SingleSelector', 'name': 'v', 'input': 'loss', 'select_by_name': 'loss'}]``

   .. admonition:: Example

      An expression based on the outputs ``variable1``, ``variable2``.

      .. code-block:: python

         [{'type': 'Expression', 'expression': 'variable1*sin(2*PI*variable2)^2'}]

   Each element of the list must be a dict. The dict entry ``type`` specifies the type of the element. The remaining entries  specify its properties. In the following, all possible  list element types are described:

      **Single-selector variable** (type ``'SingleSelector'``):  
      
      This variable selects one output of a surrogate model or a multi-output variable.
      
      See :ref:`SingleSelector configuration <SingleSelectorVariable>` for details.

      **Multi-selector variable** (type ``'MultiSelector'``):  
      
      This variable selects multiple outputs of a surrogate model or a multi-output variable.      
      
      See :ref:`MultiSelector configuration <MultiSelectorVariable>` for details.

      **Collector variable** (type ``'Collector'``):  
      
      This variable collects outputs of multiple surrogate models or variables.      
      
      See :ref:`Collector configuration <CollectorVariable>` for details.

      **Linear combination variable** (type ``'LinearCombination'``):  
      
      This variable describes a linear combination :math:`c_0 + c_1 \cdot y_1 + \cdots + c_N \cdot y_N` of :math:`N` inputs :math:`y_1, y_2, \cdots, y_N` stemming either from a surrogate with :math:`N` outputs or from any set of :math:`N` variables.
      
      See :ref:`LinearCombination configuration <LinearCombinationVariable>` for details.
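
      For Gaussian inputs, the distribution of a linear combination is again
      Gaussian. The sketch below assumes *independent* inputs with given means and
      standard deviations; correlated inputs would additionally require their
      covariances. The function name ``linear_combination`` is hypothetical.

      .. code-block:: python

         import math

         def linear_combination(c0, coeffs, means, stds):
             """Mean and std of c0 + sum_i c_i * y_i for independent Gaussian y_i."""
             mean = c0 + sum(c * m for c, m in zip(coeffs, means))
             var = sum((c * s) ** 2 for c, s in zip(coeffs, stds))
             return mean, math.sqrt(var)

         # 1 + 2*y1 - y2 with y1 ~ N(0.5, 0.3^2) and y2 ~ N(3.0, 0.4^2)
         mean, std = linear_combination(1.0, [2.0, -1.0], [0.5, 3.0], [0.3, 0.4])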

      **Chi-squared variable** (type ``'ChiSquaredValue'``):  
      
      This variable describes the chi-squared deviation   
      
      .. math::   \chi^2 = \sum_{i=1}^K \frac{\left(t_i - y_i\right)^2}{\eta_i^2}

      between :math:`K` outputs :math:`y_i` of a surrogate and :math:`K` targets :math:`t_i` scaled by the target uncertainties :math:`\eta_i`.
      
      If the target uncertainties are not independent but correlated, it is also possible to specify the covariance matrix :math:`G` of the targets. In this case, the chi-squared deviation is given as  
      
      .. math::   \chi^2 = \sum_{i,j=1}^K (t_i - y_i) G_{i j}^{-1} (t_j - y_j).  
      
      If the predictions of the surrogate, :math:`y_i`,  follow a Gaussian distribution, the variable follows a  generalized chi-squared distribution.      
      
      See :ref:`ChiSquaredValue configuration <ChiSquaredVariable>` for details.
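
      Both forms of the chi-squared deviation can be compared numerically. The
      following sketch restricts the correlated case to :math:`K=2` so that the
      inverse of :math:`G` can be written out explicitly. The function names are
      hypothetical.

      .. code-block:: python

         def chi_squared(targets, outputs, uncertainties):
             """Chi-squared deviation for independent target uncertainties eta_i."""
             return sum(((t - y) / eta) ** 2
                        for t, y, eta in zip(targets, outputs, uncertainties))

         def chi_squared_cov_2x2(targets, outputs, G):
             """Chi-squared deviation d^T G^{-1} d for a 2x2 covariance matrix G."""
             d0, d1 = targets[0] - outputs[0], targets[1] - outputs[1]
             (g00, g01), (g10, g11) = G
             det = g00 * g11 - g01 * g10
             # G^{-1} = 1/det * [[g11, -g01], [-g10, g00]]
             return (d0 * (g11 * d0 - g01 * d1) + d1 * (-g10 * d0 + g00 * d1)) / det

         t, y = [1.0, 2.0], [0.5, 2.5]
         chi2 = chi_squared(t, y, [0.5, 0.25])                            # 1 + 4 = 5
         chi2_cov = chi_squared_cov_2x2(t, y, [[0.25, 0.0], [0.0, 0.0625]])

      For a diagonal covariance matrix with :math:`G_{ii} = \eta_i^2`, both
      functions return the same value.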

      **Negative log-probability variable** (type ``'NegLogProbability'``): This variable describes the negative log-probability of model parameters :math:`\mathbf{p}` given a set of :math:`K` measurement targets :math:`t_i`, :math:`i=1,\ldots,K`.  
      
      We assume that the measurement process can be accurately modeled by a function :math:`\mathbf{f}(\mathbf{p}) \in \mathbb{R}^K`. That is, the vector of measurements :math:`\mathbf{t}` is a random vector with  
      
      .. math::   \mathbf{t} = \mathbf{f}(\mathbf{p}) + \mathbf{w}  
      
      with :math:`\mathbf{w} \sim \mathcal{N}(0, \mathbf{G})`, where :math:`\mathbf{G}` is a covariance matrix of the measurement errors. In most cases the covariance matrix is diagonal (i.e. the measurement errors are uncorrelated) with diagonal entries :math:`G_{ii} = \eta_i^2`.  
      
      Sometimes, the measurement noise is known. In general, however, one has to find a parameterized error model :math:`\eta_i(\mathbf{d})` for the measurement uncertainty itself. A common choice is to assume that the error is composed of a background term :math:`b` and a noise contribution that scales linearly with :math:`f_i(\mathbf{p})`, i.e. :math:`\mathbf{d} = (a, b)`:
      
      .. math::  
      
         \eta_i^2(a,b,\mathbf{p}) = b^2 + \left[a    f_i(\mathbf{p})\right]^2  
      
      Since every entry of the measurement vector follows a normal distribution :math:`t_i \sim \mathcal{N}(f_i(\mathbf{p}),\eta_i^2(\mathbf{d}))`, the joint *likelihood* of measuring the vector :math:`\mathbf{t}` is given as
      
      .. math::    P(\mathbf{t} | \mathbf{p}, \mathbf{d}) = \prod_{i=1}^K    \frac{1}{\sqrt{2\pi}\eta_i(\mathbf{d})}\exp\left[-\frac{1}{2}\left(\frac{t_i -    f_i(\mathbf{p})}{\eta_i(\mathbf{d})}\right)^2\right].  
      
      Sometimes, non-uniform *prior distributions* for the design parameter vector :math:`P_\text{prior}(\mathbf{p})` and the error model parameters :math:`P_\text{prior}(\mathbf{d})` are available. The *posterior distribution* is then proportional to  
      
      .. math::  
      
         P(\mathbf{p}, \mathbf{d} | \mathbf{t}) \propto P(\mathbf{t}    | \mathbf{p}, \mathbf{d})    P_\text{prior}(\mathbf{p}) P_\text{prior}(\mathbf{d})  
      
      .. warning::    If a :ref:`parameter distribution<ActiveLearning.parameter_distribution>`    for the design space is defined, make sure    that it has a non-vanishing probability distribution within the    boundaries of the design space. Otherwise the negative log-probability can    be infinite. In this case the sample computation is numerically unstable.  
      
      Altogether, the target of finding the parameters with maximum posterior probability density is equivalent to minimizing the negative log-probability

      .. math::    \begin{split}    -\log\left(P(\mathbf{p}, \mathbf{d}| \mathbf{t})\right) = &    \frac{1}{2} K\log(2\pi)    +\sum_{i=1}^K\log\left(\eta_i(\mathbf{d})\right)     +\frac{1}{2}\sum_{i=1}^K \left(       \frac{t_i - f_i(\mathbf{p})}{\eta_i(\mathbf{d})}    \right)^2 \\    &-\log\left(P_\text{prior}(\mathbf{d})\right)    -\log\left(P_\text{prior}(\mathbf{p})\right).    \end{split}
      
      See :ref:`NegLogProbability configuration <NegLogProbVariable>` for details.
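
      The likelihood part of this expression can be evaluated with a few lines of
      plain Python. The sketch below assumes uniform priors, so the two prior
      terms are dropped, and uses the error model
      :math:`\eta_i^2 = b^2 + [a f_i(\mathbf{p})]^2` from above. The function name
      ``neg_log_likelihood`` is hypothetical.

      .. code-block:: python

         import math

         def neg_log_likelihood(targets, model_values, a, b):
             """Negative log-likelihood with eta_i^2 = b^2 + (a * f_i)^2."""
             nll = 0.5 * len(targets) * math.log(2.0 * math.pi)
             for t, f in zip(targets, model_values):
                 eta = math.sqrt(b**2 + (a * f) ** 2)
                 nll += math.log(eta) + 0.5 * ((t - f) / eta) ** 2
             return nll

         # Perfect fit with unit background noise: only the constant term remains
         nll_exact = neg_log_likelihood([1.0, 2.0], [1.0, 2.0], a=0.0, b=1.0)
         # A mismatch between targets and model values increases the value
         nll_off = neg_log_likelihood([1.0, 2.0], [1.5, 2.5], a=0.0, b=1.0)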

      **Expression variable** (type ``'Expression'``):  
      
      This variable describes a general mathematical expression of known parameters and predicted variables stemming either from surrogates or from any previously defined variable. This variable type is the most flexible. However, its distribution is not determined analytically, but only through sampling from the posterior distribution(s) of the input variables and evaluating the expression of each of the samples. For linear or quadratic expressions it is generally advisable to use the more  specialized variables instead.  
      
      .. caution::       
      
         Avoid expressions that can evaluate to ``NaN``, such as ``"sqrt(x)"``. Even if all observations of ``"x"`` are positive, values drawn from some posterior might still be negative. It is safer to use, e.g. ``"sqrt(max(0, x))"``.
      
      See :ref:`Expression configuration <ExpressionVariable>` for details.
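
      The sampling-based evaluation and the reason for the caution above can be
      mimicked in plain Python. The sketch assumes a hypothetical Gaussian
      posterior for ``x`` with mean 0.2 and standard deviation 0.3; it does not
      use the driver's expression engine.

      .. code-block:: python

         import math
         import random

         rng = random.Random(0)
         # Hypothetical posterior of x: the mean is positive, yet samples can be negative
         samples = [rng.gauss(0.2, 0.3) for _ in range(10_000)]
         num_negative = sum(s < 0.0 for s in samples)

         # Guarded expression sqrt(max(0, x)) evaluated for each posterior sample
         values = [math.sqrt(max(0.0, s)) for s in samples]
         mean = sum(values) / len(values)
         std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

      Without the ``max(0, x)`` guard, every negative sample would make the
      square root undefined.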

      **Interpolation variable** (type ``'Interpolation'``):  
      
      This variable up-samples the output of a surrogate model  by interpolating between the output values. An interpolation  can reduce the cost of obtaining observations at high resolution.      
      
      See :ref:`Interpolation configuration <InterpolationVariable>` for details.

      **Scan variable** (type ``'Scan'``):  
      
      This variable scans the output of a surrogate model  over design or environment parameters. A typical use case is to determine the system behavior based on an interpolation between data points for different environment parameter values. An interpolation can reduce the cost of obtaining  observations at small steps between environment parameters.      
      
      See :ref:`Scan configuration <ScanVariable>` for details.

      **Least-squares fit variable** (type ``'Fit'``):

      This variable fits a parameter vector :math:`\mathbf{p}` consisting of :math:`M` parameters :math:`p_1, p_2,\dots, p_M` to a model function :math:`f(\mathbf{x}, \mathbf{p})` that depends on the model variables :math:`x_1, x_2,\dots, x_D`. The output of the variable is a least-squares estimate of the fit parameters to data points :math:`(\mathbf{x}_i, y_i), i=1,\dots,N`, where :math:`y_1,\dots,y_N` are the mean values of the input. That is, the variable locally minimizes
      
      .. math::   \chi^2(\mathbf{p}) = \sum_{i=1}^N \frac{\left(f\left(\mathbf{x}_i, \mathbf{p}\right) - y_i\right)^2}{\eta_i^2},  
      
      where :math:`\eta_i^2` is the variance of input :math:`i` to the variable.      
      
      See :ref:`Fit configuration <FitVariable>` for details.
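
      For a model that is linear in the fit parameters, the minimum of
      :math:`\chi^2(\mathbf{p})` has a closed form. The sketch below treats the
      special case :math:`f(x, \mathbf{p}) = p_1 + p_2 x` as an illustration; the
      general case requires an iterative local minimization. The function name
      ``fit_line`` is hypothetical.

      .. code-block:: python

         def fit_line(xs, ys, etas):
             """Weighted least-squares estimate of (p1, p2) for f(x, p) = p1 + p2*x."""
             w = [1.0 / eta**2 for eta in etas]
             sw = sum(w)
             sx = sum(wi * x for wi, x in zip(w, xs))
             sy = sum(wi * y for wi, y in zip(w, ys))
             sxx = sum(wi * x * x for wi, x in zip(w, xs))
             sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
             det = sw * sxx - sx * sx
             p2 = (sw * sxy - sx * sy) / det
             p1 = (sy - p2 * sx) / sw
             return p1, p2

         # Data generated from y = 2 + 3*x are recovered irrespective of the weights
         p1, p2 = fit_line([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0],
                           [0.1, 0.2, 0.1, 0.3])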

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   SingleSelectorVariable
   MultiSelectorVariable
   CollectorVariable
   LinearCombinationVariable
   ChiSquaredVariable
   NegLogProbVariable
   ExpressionVariable
   InterpolationVariable
   ScanVariable
   FitVariable



.. _ActiveLearning.objectives:


objectives (list[dict])
"""""""""""""""""""""""
   A list of objectives that define which system  parameters are optimal. 

   Default: ``[{'type': 'Minimizer', 'name': 'objective', 'strategy': 'EI', 'localize': False, 'min_val': -inf, 'min_PoI': 1e-16, 'min_uncertainty': 0.0, 'min_acq_val': -inf}]``

   .. admonition:: Example

      A minimization objective together  with an outcome constraint.

      .. code-block:: python

         [{'type': 'Minimizer', 'variable': 'variable1', 'strategy': 'EI'},
          {'type': 'Constrainer', 'variable': 'variable2', 'lower_bound': 0.0, 'upper_bound': 1.5}]

   Each element of the list must be a dict. The dict entry ``type`` specifies the type of the element. The remaining entries  specify its properties. In the following, all possible  list element types are described:

      **Minimization objective** (type ``'Minimizer'``): The objective is to minimize a target variable.      
      
      See :ref:`Minimizer configuration <MinimizationObjective>` for details.

      **Multi-objective minimization** (type ``'MultiMinimizer'``): The objective is to learn the *Pareto front* of multiple minimization objectives. A vector of objective values :math:`\mathbf{y}\in \mathbb{R}^q` lies on the Pareto front if it is not dominated by any other possible vector of outcomes. A vector :math:`\mathbf{y}_1` is dominated by a vector :math:`\mathbf{y}_2` if all entries of :math:`\mathbf{y}_2` are smaller than or equal to the corresponding entries of :math:`\mathbf{y}_1`.
      
      The strategy to find samples close to or on the Pareto front is to maximize the hypervolume enclosed by all non-dominated sampling points and the upper reference point :math:`\mathbf{y}_{\rm upper}`. The progress is measured in terms of the decreasing hypervolume between the lower reference point :math:`\mathbf{y}_{\rm lower}` and the non-dominated sampling points.
      
      See :ref:`MultiMinimizer configuration <MultiMinimizationObjective>` for details.
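
      Dominance and the enclosed hypervolume can be sketched for two objectives
      in plain Python. The sketch assumes the common convention that a dominating
      vector must be strictly smaller in at least one entry, and that all points
      lie below the upper reference point. All function names are hypothetical.

      .. code-block:: python

         def dominates(y2, y1):
             """True if y2 dominates y1 (all entries <=, at least one strictly <)."""
             return (all(b <= a for a, b in zip(y1, y2))
                     and any(b < a for a, b in zip(y1, y2)))

         def pareto_front(points):
             """All points that are not dominated by any other point."""
             return [p for p in points if not any(dominates(q, p) for q in points)]

         def hypervolume_2d(front, upper):
             """Area enclosed by a 2D front and the upper reference point."""
             area, prev_y = 0.0, upper[1]
             for x, y in sorted(front):      # ascending in the first objective
                 area += (upper[0] - x) * (prev_y - y)
                 prev_y = y
             return area

         points = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
         front = pareto_front(points)                 # (2.5, 2.5) is dominated by (2.0, 2.0)
         volume = hypervolume_2d(front, (4.0, 4.0))   # 3 + 2 + 1 = 6.0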

      **Exploration objective** (type ``'Explorer'``): The objective is to learn the global behaviour of a  target variable by exploring regions with large information gain. It implements two exploration strategies:  
      
      :uncertainty: Samples are targeted that have a large prediction uncertainty.   That is, the posterior of the target variable has a large variance.  
      
      :surprise: This strategy only works in conjunction with neural network   ensembles. It considers the gradient of the weights of the output layer   (feature layer) with respect to the loss :math:`L(\mathbf{p}) = \sum_i   \left(nn_i(\mathbf{p}) - \overline{y}(\mathbf{p})\right)^2`, where   :math:`nn_i(\mathbf{p})` is the prediction of the :math:`i`-th ensemble   member and :math:`\overline{y}(\mathbf{p})` is the fantasy of the function   value (i.e. the average of the ensemble members). The strategy targets   parameters with maximal gradient magnitude, :math:`\Vert \nabla_\theta   L(\mathbf{p}) \Vert_2`, where :math:`\theta` denotes the weights of the   output layer.  
      
      If a neural network ensemble is used, the ``'surprise'`` strategy often offers more informed sampling, leading to slightly more accurate predictions.
      
      See :ref:`Explorer configuration <ExplorationObjective>` for details.

      **Outcome constraint** (type ``'Constrainer'``): The objective is to constrain the value of a variable to a one-sided or two-sided interval. The sampling strategy tries to maximize the probability that sampled values meet the constraint. This cannot be guaranteed, however.
      
      See :ref:`Constrainer configuration <OutcomeConstraint>` for details.
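
      For a Gaussian prediction of the constrained variable, the probability of
      meeting the constraint follows from the normal cumulative distribution
      function. The sketch below illustrates this relation only; it is not
      claimed to be the driver's implementation, and the function names are
      hypothetical.

      .. code-block:: python

         import math

         def normal_cdf(z):
             """Cumulative distribution function of the standard normal distribution."""
             return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

         def prob_in_interval(mean, std, lower=-math.inf, upper=math.inf):
             """Probability that y ~ N(mean, std^2) lies in [lower, upper]."""
             return normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)

         p_one_sided = prob_in_interval(0.0, 1.0, upper=0.0)              # 0.5
         p_two_sided = prob_in_interval(0.75, 0.5, lower=0.0, upper=1.5)  # ~ 0.866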

      **Observation penalty** (type ``'ObservationPenalizer'``): The objective penalizes candidate points close to already observed samples. It is intended as an auxiliary objective that multiplies the main acquisition value and can therefore be combined with minimization and exploration objectives.      
      
      See :ref:`ObservationPenalizer configuration <ObservationPenaltyObjective>` for details.

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   MinimizationObjective
   MultiMinimizationObjective
   ExplorationObjective
   OutcomeConstraint
   ObservationPenaltyObjective



.. _ActiveLearning.acquisition_function:


acquisition_function (dict)
"""""""""""""""""""""""""""
   A module that defines additional parameters  of the acquisition function. 

   Default: ``{'parameter_uncertainties': [], 'num_samples': 10000, 'smooth_range': 0.2}``
   

   The acquisition function defines the sampling strategy. That is, new samples for the evaluation of the black-box function are generated by maximizing the acquisition function. The acquisition strategy itself is defined in the ``objectives``. This module allows configuring additional properties of the acquisition function.
   
   See :ref:`acquisition_function configuration <AcquisitionFunction>` for details.

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   AcquisitionFunction



.. _ActiveLearning.acquisition_optimizer:


acquisition_optimizer (dict)
""""""""""""""""""""""""""""
   A module that maximizes the acquisition function to determine new suggestions.

   Default: ``{'adaptive_local_search': True, 'compute_suggestion_in_advance': True}``
   

   The module optimizes the acquisition function of the main objective of the study by combining a heuristic global optimization with a subsequent local convergence of the best result.
   
   See :ref:`acquisition_optimizer configuration <AcquisitionOptimizer>` for details.

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   AcquisitionOptimizer



.. _ActiveLearning.scaling:


scaling (float)
"""""""""""""""
   Scaling parameter of the model uncertainty.  For scaling :math:`\gg 1.0` (e.g. ``scaling=10.0``) the search is more explorative. For scaling :math:`\ll 1.0` (e.g. ``scaling=0.1``) the search becomes more greedy (e.g. any local minimum is intensively exploited). 

   Default: ``1.0``
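
   The effect of the scaling can be illustrated with a textbook expected
   improvement criterion. The sketch assumes that the strategy ``'EI'`` denotes
   expected improvement and that the scaling multiplies the predicted standard
   deviation; both are assumptions about the driver's internals, and the function
   name ``expected_improvement`` is hypothetical.

   .. code-block:: python

      import math

      def expected_improvement(mean, std, best, scaling=1.0):
          """Expected improvement over ``best`` for a prediction N(mean, (scaling*std)^2)."""
          s = scaling * std
          z = (best - mean) / s
          pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
          cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
          return (best - mean) * cdf + s * pdf

      # Candidate A: slightly better mean, small uncertainty (exploitation).
      # Candidate B: worse mean, large uncertainty (exploration).
      results = {}
      for scaling in (0.1, 10.0):
          ei_a = expected_improvement(mean=0.9, std=0.05, best=1.0, scaling=scaling)
          ei_b = expected_improvement(mean=1.5, std=1.0, best=1.0, scaling=scaling)
          results[scaling] = "A" if ei_a > ei_b else "B"

   Under these assumptions, the small scaling prefers candidate A close to the
   current minimum, while the large scaling prefers the uncertain candidate B.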

.. _ActiveLearning.vary_scaling:


vary_scaling (bool)
"""""""""""""""""""
   If true, the scaling parameter is randomly varied  between 0.1 and 10. 

   Default: ``True``

.. _ActiveLearning.parameter_distribution:


parameter_distribution (dict)
"""""""""""""""""""""""""""""
   Probability distribution of design and environment parameters. 

   Default: ``{'include_study_constraints': False, 'distributions': [], 'constraints': []}``
   

   Probability distribution of design and environment parameters defined by distribution functions and constraints. The definition of the parameter distribution can have several effects:  
   
   * In a call to the method ``get_statistics`` of the driver interface     the value of interest is averaged over samples drawn from the     space distribution.  
   
   * In a call to the method ``run_mcmc`` of the driver interface     the space distribution acts as a prior distribution.  
   
   * In a call to the method ``get_sobol_indices`` of the driver interface     the space distribution acts as a weighting factor for determining     expectation values.  
   
   * In an :ref:`ActiveLearning` driver, one can access the value of the     log-probability density (up to an additive constant) by the name     ``'log_prob'`` in any expression, e.g. in     :ref:`ExpressionVariable`,     :ref:`LinearCombinationVariable`.   
   
   See :ref:`parameter_distribution configuration <SpaceDistribution>` for details.

.. toctree::
   :maxdepth: 100 
   :hidden:
   
   SpaceDistribution