Physics-informed Bayesian optimization

Driver:

ActiveLearning

Download script:

physics_informed_bayesian_optimization.py

In this tutorial we discuss physics-informed Bayesian optimization (PIBO) [SEK2025]. In contrast to standard Bayesian optimization (BO), the surrogate model is not trained on scalar objective values \(f(\mathbf{x}) \in \mathbb{R}\) but on the physical response \(\mathbf{r}(\mathbf{x}) \in \mathbb{R}^K\) of a system (simulation or experiment). We assume that while obtaining \(\mathbf{r}(\mathbf{x}^*)\) for some \(\mathbf{x}^*\) is expensive, the mapping \(f(\mathbf{x}) = \Phi(\mathbf{r}(\mathbf{x}))\) is an analytic function that can be efficiently evaluated.

The mapping \(\Phi: \mathbb{R}^K \rightarrow \mathbb{R}\) leads intuitively to an information loss. If moreover \(\Phi\) is a non-linear function that adds some complexity, it can be easier to learn the behaviour of the channels \(\mathbf{r}(\mathbf{x})\) than to learn the behavior of \(f(\mathbf{x})\).

As an example we consider a spectral filter consisting of alternating layers of a dielectric layer \(S\) made of silicon oxide (SiO2) and a metallic layer \(A\) made of silver (Ag). The device has the general structure \(S(AS)^N\) which is surrounded by air (refractive index \(n=1\)). We consider \(N=4\) layer pairs which leads to a \(D = 2N + 1 = 9\) dimensional optimization problem.

By varying the thicknesses, the objective is to have a uniform transmittance \(t(\lambda)\) in a short-wavelength regime between \(\lambda_{\rm short} = 800\) nm and \(\lambda_{\rm gap} = 1300\) nm of at least 40% on average and to minimize the transmittance in a long-wavelength regime between \(\lambda_{\rm gap}\) and \(\lambda_{\rm long}=1800\) nm:

\[ \begin{align}\begin{aligned}\text{minimize}\,\, \text{STD}_{\rm short}[t] + \mathbb{E}_{\rm long}[t] &= \sqrt{ \int_{\lambda_{\rm short}}^{\lambda_{\rm gap}} t^2(\lambda){\rm d}\lambda - \left( \int_{\lambda_{\rm short}}^{\lambda_{\rm gap}} t(\lambda){\rm d}\lambda \right)^2 } + \int_{\lambda_{\rm gap}}^{\lambda_{\rm long}} t(\lambda){\rm d}\lambda\\\text{such that}\,\, \mathbb{E}_{\rm short}[t] &= \int_{\lambda_{\rm short}}^{\lambda_{\rm gap}} t(\lambda){\rm d}\lambda > 0.4\,.\end{aligned}\end{align} \]

Luckily, the transmission of light through the material stack can be simulated efficiently using the transfer-matrix method. We employ the Python package tmm-fast which also allows to compute derivatives with respect to the \(D=9\) thicknesses using back propagation. We learn \(K=50\) samples \(\mathbf{t} = [\mathbf{t}_{\rm short}, \mathbf{t}_{\rm long}] \in \mathbb{R}^K\) of the transmission spectrum \(t(\lambda)\) between \(\lambda_{\rm short}\) and \(\lambda_{\rm long}\) using a multi-output GP.

To showcase the advantage of PIBO, we compare its convergence behaviour to that of a global heuristic optimization method, differential evolution (DE). Like other heuristic methods DE cannot make use of derivative information. This would explain a speedup of a factor of \(\leq D + 1 = 10\) since 10 samples are required to estimate the gradient of the function. However, we observe a speedup factor of more than 100. The reason is that even after the very first evaluation, PIBO learns already a local linear model of the transmission spectrum and hence a local quadratic model of the loss function. After some more iterations it has a very good global model of the loss function and can identify local minima very efficiently.

[SEK2025]

Sekulic, I, et al. “Physics-informed Bayesian optimization of expensive-to-evaluate black-box functions”. Mach. Learn.: Sci. Technol. 6 040503 (2025)

  1import sys,os
  2import numpy as np
  3import time
  4import torch
  5import numpy as np
  6import scipy
  7import requests
  8from dispersion import Spectrum, Material # run pip install dispersion if not installed
  9import tmm_fast # run pip install tmm-fast if not installed
 10import matplotlib.pyplot as plt
 11from matplotlib import patches
 12
 13torch.set_default_dtype(torch.float64)
 14
 15
 16from jcmoptimizer import Client, Study, Observation
 17client = Client()
 18
 19
 20K = 50 # number spectral sampling points
 21K_short = K_long = int(K/2) #samples in short and long wavelength range
 22wl = torch.linspace(800, 1800, K) * 1e-9 #Wavelength scan
 23N = 4 # number N in layer stack S(AS)^N
 24D = 2*N + 1 #dimensionality of search space
 25theta = torch.tensor([0.0]) #angle of incidence
 26num_layers = 2*N + 1 + 2 #total number of layers including air
 27num_stacks = 1 #only one stack is simulated in each iteration
 28
 29#Load material data refractiveindex.info
 30if not os.path.exists("SiO2.yml"):
 31    r = requests.get("https://refractiveindex.info/database/data/main/SiO2/nk/Malitson.yml")
 32    open("SiO2.yml", 'wb').write(r.content)
 33if not os.path.exists("Ag.yml"):
 34    r = requests.get("https://refractiveindex.info/database/data/main/Ag/nk/Johnson.yml")
 35    open("Ag.yml", 'wb').write(r.content)
 36    
 37#create stack of refractive indices S(AS)^N
 38M = torch.full((num_stacks, num_layers, wl.shape[0]), 1+0j)
 39M[0,0,:] = M[0,-1,:] = 1.0 #first and last layer is air
 40SiO2 = Material(file_path="SiO2.yml", unit="um")
 41Ag = Material(file_path="Ag.yml", unit="um")
 42spm = Spectrum(wl.numpy(), spectrum_type='Wavelength', unit='meter')
 43#Set n-k data for each layer and wavelength 
 44M[0,1:-1:2,:] = torch.tensor(SiO2.get_nk_data(spm))
 45M[0,2:-1:2,:] = torch.tensor(Ag.get_nk_data(spm))
 46
 47#design space (units nm)
 48design_space_SiO2 = [{'name': f'sio2_{i+1}', 'domain': (100, 500)} for i in range(N+1)]
 49design_space_Ag = [{'name': f'ag_{i+1}', 'domain': (2, 8)} for i in range(N)]
 50design_space = design_space_SiO2 + design_space_Ag
 51param_names = [d['name'] for d in design_space]
 52# Creation of the study object with study_id 'physics_informed_bayesian_optimization'
 53study = client.create_study(
 54    design_space=design_space,
 55    driver="ActiveLearning",
 56    study_name="Physics-informed Bayesian optimization",
 57    study_id="physics_informed_bayesian_optimization"
 58)
 59study.configure(
 60    max_iter=30,
 61    surrogates=[
 62        dict(type='GP', name="model", output_dim=K, 
 63             # To speed things up, we run a single hyperparameter optimization after 
 64             # 5 iterations (= 5*(D+1) observations of function values and derivatives) 
 65             max_optimization_interval=5*(D+1),
 66             optimization_step_max=5*(D+1),
 67             covariance_matrix=dict(max_data_hyper_derivs=5*(D+1))
 68        )
 69    ],
 70    variables=[
 71        # Selector of transmissions in short wangelength range
 72        dict(type="MultiSelector", name="trans_short", input="model", 
 73             select_range=(0, K_short-1), output_dim=K_short),
 74        # Selector of transmissions in long wangelength range
 75        dict(type="MultiSelector", name="trans_long", input="model", 
 76             select_range=(K_short, K-1), output_dim=K_long),
 77        # Mean transmission in short wangelength range
 78        dict(type="LinearCombination", name="avg_short", inputs=["trans_short"]),
 79        # Mean transmission in long wangelength range
 80        dict(type="LinearCombination", name="avg_long", inputs=["trans_long"]),
 81        # Chi^2 = sum of squared transmission sum(t_i^2) in short wavelength range
 82        dict(type="ChiSquaredValue", name="chi_squared_short", input="trans_short"),
 83        # Loss: variance of short wavelengths + squared average of long wavelengths
 84        dict(type="Expression", name="loss", 
 85             expression=f"sqrt(max(0, chi_squared_short/{K_short} - avg_short^2)) + avg_long"
 86        )
 87    ],
 88    objectives=[
 89        dict(type="Minimizer", variable="loss", name="objective"),
 90        dict(type="Constrainer", variable="avg_short", lower_bound=0.4,
 91             name="min_transmission_short"),
 92    ],
 93    acquisition_optimizer=dict(
 94        num_training_samples=5, compute_suggestion_in_advance=False
 95    )
 96)
 97# Transmission spectrum for thickness tensor D
 98def transmissions(D: torch.Tensor) -> torch.Tensor:
 99    T = torch.zeros((1,num_layers))    
100    T[0, 1:-1:2] = D[:N+1] # SiO2 thicknesses
101    T[0, 2:-1:2] = D[N+1:] # Ag thicknesses
102    T[0, 0] = T[0, -1] = np.inf
103    # Get solution using transfer-matrix method
104    output = tmm_fast.coh_tmm('s', M, T * 1e-9, theta, wl, device='cpu')    
105    return output["T"][0,0,:]
106    
107# Evaluation of the black-box function for specified design parameters
108def evaluate(study: Study, **kwargs: float) -> Observation:
109    observation = study.new_observation()
110    D = torch.tensor([kwargs[param_names[i]] for i in range(2*N + 1)])
111    T = transmissions(D)
112    Jac = torch.autograd.functional.jacobian(transmissions, D)
113    observation.add(T.tolist())
114    for idx, param in enumerate(param_names):
115        observation.add(Jac[:,idx].tolist(), derivative=param)    
116    return observation
117
118# Run the minimization
119study.set_evaluator(evaluate)
120study.run()
121kwargs = study.get_state("driver.best_sample")
122D = torch.tensor([kwargs[param_names[i]] for i in range(2*N + 1)])
123T = transmissions(D)
124fig, ax = plt.subplots(nrows=2, height_ratios=[3,1], figsize=(8, 4))
125
126ax[0].vlines(1300, 0, 0.8, colors="black", linestyles="dashed",
127             label=r"$\lambda_{\rm gap}=1300$ nm")
128ax[0].plot(wl*1e9, T, ".-", label="spectral response")
129ax[0].set_xlabel(r"$\lambda [nm]$")
130ax[0].legend()
131ax[0].grid()
132
133T = [0]*(2*N + 1)    
134D = [kwargs[param_names[i]] for i in range(2*N + 1)]
135T[0::2] = D[:N+1]
136T[1::2] = D[N+1:]
137for idx, t in enumerate(T):
138    rect = patches.Rectangle(
139        xy=(wl[0] + sum(T[:idx]),0), width=T[idx], height=1, 
140        facecolor="lightblue" if idx % 2==0 else "white"
141    )
142    ax[1].add_patch(rect)
143    ax[1].set_xlim(0, sum(T))
144    ax[1].text(y=0.5, x=sum(T[:idx]) + 0.5*T[idx], 
145               s="SiO$_2$" if idx % 2==0 else "Ag", 
146               horizontalalignment='center', verticalalignment='center')
147    ax[1].set_xlabel("Layer Stack Thicknesses [nm]")
148    ax[1].get_yaxis().set_visible(False)
149plt.tight_layout()
150plt.show()
151fig.savefig("optimized_filter.svg", transparent=True)
152
153# As a comparison, we run a global differential evolution (DE)
154# and save evaluated objective values.
155Y_DE: list[float] = []
156
157def objective(x: np.ndarray) -> float:
158    T = transmissions(torch.tensor(x))
159    T_short = T[:K_short]
160    T_long = T[K_short:]    
161    loss = T_short.std() + T_long.mean()
162    # Penalized loss is zero if constraint is not fulfilled, otherwise it's negative.
163    penalized_loss = (loss - 1) * torch.heaviside(
164        T_short.mean() - 0.4, values = torch.tensor(0.0)
165    )
166    Y_DE.append(penalized_loss.item() + 1)
167    return penalized_loss.item()
168
169scipy.optimize.differential_evolution(
170    objective,
171    bounds=[info["domain"] for info in design_space],
172    polish=False, # Run standard DE without L-BFGS-B
173    popsize=6, # Generation size = 6 * D = 54
174    maxiter=100 # max 100 generations -> 100 * 54 = 5400 evaluations
175)
176
177Y_BO = study.driver.get_observed_values("variable", "loss")
178fig, ax = plt.subplots(figsize=(8, 4))
179ax.plot(Y_DE, ".", color="steelblue", label="Differential evloution")
180ax.plot(np.minimum.accumulate(Y_DE), "-", color="steelblue")
181ax.plot(Y_BO["means"], ".", color="darkorange", label="Physics-informed BO")
182ax.plot(np.minimum.accumulate(Y_BO["means"] + [min(Y_BO["means"])]*3500),
183        "-", color="darkorange")
184ax.set_xscale("log")
185ax.set_yscale("log")
186ax.set_xlabel("# Iterations")
187ax.set_ylabel("loss")
188ax.legend()
189ax.grid()
190plt.show()
191fig.savefig("comparison_PIBO_DE.svg", transparent=True)
192
193
194client.shutdown_server()
Optimized filter

Top: Spectral response of optimized filter design. Bottom: Geometry of optimized filter design.

Comparison between physics-informed BO and differential evolution

Comparison of the convergence behaviour of physics-informed Bayesian optimization (PIBO), which uses gradient information, and the heuristic global optimization method differential evolution (DE), which cannot exploit gradient information. PIBO converges two orders of magnitude faster to the optimal design than DE.