BEACON: A Bayesian Optimization Strategy for Novelty Search in Expensive Black-Box Systems (2024)

Wei-Ting Tang
The Ohio State University
tang.1856@osu.edu
Ankush Chakrabarty
Mitsubishi Electric Research Laboratories
achakrabarty@ieee.org
Joel A. Paulson
The Ohio State University
paulson.82@osu.edu

Abstract

Novelty search (NS) refers to a class of exploration algorithms that automatically uncover diverse system behaviors through simulations or experiments. Systematically obtaining diverse outcomes is a key component in many real-world design problems such as material and drug discovery, neural architecture search, reinforcement learning, and robot navigation. Since the relationship between the inputs and outputs (i.e., behaviors) of these complex systems is typically not available in closed form, NS requires a black-box perspective. Consequently, popular NS algorithms rely on evolutionary optimization and other meta-heuristics that require intensive sampling of the input space, which is impractical when the system is expensive to evaluate. We propose a Bayesian optimization inspired algorithm for sample-efficient NS that is specifically designed for such expensive black-box systems. Our approach models the input-to-behavior mapping with multi-output Gaussian processes (MOGP) and selects the next point to evaluate by maximizing a novelty metric that depends on a posterior sample drawn from the MOGP, which promotes both exploration and exploitation. By leveraging advances in efficient posterior sampling and high-dimensional Gaussian process modeling, we discuss how our approach can be made scalable with respect to both the amount of data and the number of inputs. We test our approach on ten synthetic benchmark problems and eight real-world problems (with up to 2133 inputs) including new applications such as discovery of diverse metal organic frameworks for use in clean energy technology. We show that our approach greatly outperforms existing NS algorithms by finding substantially larger sets of diverse behaviors under limited sample budgets.

1 Introduction

Search (or optimization) algorithms are used within virtually every field of science and engineering to automatically find a set of high-quality solutions from a (potentially infinite) set of candidates. Most optimization algorithms measure quality as determined by the user, who must define one or more so-called “objective functions” that can be used to rank the candidates [1]. A particularly hard class of problems is when the objective functions are defined as the output of a black-box system (unknown structure, without gradient access) that is noisy and expensive to evaluate. Such problems can be tackled with Bayesian optimization (BO), which has been shown to empirically outperform other derivative-free global optimization methods when the objectives exhibit these challenging characteristics [2, 3]. BO, however, requires an appropriate choice of the objective functions in advance and, unfortunately, the best choice of objectives is not always obvious, especially when simultaneously optimizing several properties (using, e.g., multi-objective BO [4]). Furthermore, the choice of objectives in BO can strongly bias the selection of points. If these objectives are not properly chosen, they may result in missing key (hidden) states or behaviors in the system, which are crucial for unexpected discoveries to occur during the sampling process.

Novelty search (NS) is an idea for overcoming the need to select pre-defined objective functions [5, 6]; it suggests abandoning the goal of improving specific performance metrics and instead intentionally searching for diverse system outcomes. NS-like methods have been shown to be useful in several important real-world application areas including scientific discovery [7], material design [8], and reinforcement learning [9]. The NS perspective has also been found to be very useful in so-called “deceptive problems” wherein the optimization path taken toward a specific objective can easily get stuck in suboptimal solutions. Since the mapping between inputs and system outcomes is generally a black box, established NS methods rely on meta-heuristics, such as evolutionary algorithms, to select new evaluation points. However, these methods are known to be sample inefficient, which limits their use on expensive-to-evaluate systems. Since many NS applications of interest involve expensive evaluations (e.g., experimental discovery of new out-of-trend drug compounds), we need to develop more efficient alternatives for NS. We seek an algorithm capable of achieving high efficiency by preserving the essence of BO while avoiding the need for an explicit goal-oriented objective function.

This paper proposes a new BO-inspired active learning approach for NS. Just like standard BO, we construct Gaussian process (GP) surrogate models [10] for the unknown functions. In our case, however, the unknown functions are not objectives but instead represent outcomes that a user wishes to explore. The outcome space can be freely defined by the user, meaning it can be a vector of as many elements as desired (e.g., the efficacy, size, synthesizability, and stability of a drug compound).

Our main contributions are summarized as follows:

  • We introduce a new active learning approach, referred to as BEACON (Bayesian Exploration Algorithm for outCOme Novelty), for NS over noisy expensive black-box systems that aims to discover unseen behaviors of the system using a minimal number of evaluations.

  • We propose a novel Thompson sampling-based acquisition function for NS applications that, to our knowledge, is the first to address the exploration-exploitation tradeoff, handle stochastic observation noise, and remain amenable to gradient-based optimization.

  • We discuss two different strategies for extending BEACON to high-dimensional problems, including a general approach that uses fully Bayesian sparsity-inducing function priors and a specialized approach for chemistry applications.

  • We conduct extensive experiments on several synthetic and real-world problems that demonstrate the substantial benefits BEACON can achieve, in terms of identified behaviors, over relevant existing NS works. These include a first-time application of NS to the discovery of metal organic frameworks, which are useful materials in clean energy applications. We also consider challenging molecular discovery benchmark problems with up to 2133 dimensions.

2 Related Work

Our work is inspired by the BO framework for global optimization of expensive and noisy black-box functions that originated from [11, 12]. It has regained popularity in recent years due to its outstanding performance on tasks such as hyperparameter tuning in machine learning methods [13]. Readers are referred to [2] and [3] for a recent review and tutorial introduction to BO. BEACON is similar to multi-objective BO [4] in the sense that both develop multi-output GP models for the different functions; however, they differ substantially in their choice of acquisition functions. Multi-objective BO aims to learn the Pareto front between competing objectives while BEACON aims to fully explore unseen regions of the outcome space. We argue, similarly to [7], that the latter is more appropriate in discovery applications. There has been recent work on improving solution diversity in BO through the use of a user-defined diversity constraint [14]. Although similar in spirit to BEACON, this method still uses a specific objective function to guide the search and focuses on a local version of BO. BEACON is an objective-free approach aimed at global search of the design space.

Our work directly builds upon the NS method proposed in [6] (which we refer to as NS-EA), which optimizes a distance-based novelty metric using a traditional evolutionary algorithm through random evolution of a maintained population of points. Since this initial work, there have been many research efforts to deploy NS to solve previously unsolved deceptive problems [15, 16, 17]. There have also been extensions of NS-EA that simultaneously assess novelty and fitness [18]. NS is also closely related to so-called “illumination algorithms” such as MAP-Elites [19], which maintains an archive of diverse solutions with respect to different behaviors of interest. The major downside of all of these methods is that they (i) rely on sample-inefficient strategies to search the input space, so they are not applicable to expensive systems, and (ii) assume the observations are noise-free, so they cannot handle intrinsic noise in the system. BEACON addresses both of these limitations in NS (for the first time, to our knowledge) by leveraging ideas from the BO literature.

3 Problem Setup

We consider the behavior of a system to be characterized by a vector-valued black-box function $\boldsymbol{f}:\mathcal{X}\to\mathcal{O}$ with $\boldsymbol{f}=(f^{(1)},\ldots,f^{(n)})$ that maps an input space $\mathcal{X}$ to a possibly multi-output outcome space $\mathcal{O}$, which are, respectively, compact subsets of $\mathbb{R}^{d}$ and $\mathbb{R}^{n}$. Although $\mathcal{O}$ can be a continuous set, we assume that neighboring values in outcome space share similar behaviors. For example, robot morphologies with similar height, weight, and energy consumption per distance moved are treated as being from the same group. To express this mathematically, we define a discrete (finite) behavior space $\mathcal{B}$ that is an $\epsilon$-cover of $\mathcal{O}$, i.e., for some $\epsilon>0$, $\forall\,\boldsymbol{y}\in\mathcal{O},\ \exists\,\boldsymbol{y}'\in\mathcal{B}$ such that $\|\boldsymbol{y}-\boldsymbol{y}'\|\leq\epsilon$.¹

¹ Since our proposed algorithm works directly in outcome space, the exact choice of $\epsilon$ does not impact the search process (but does impact how performance is measured). We examine how $\epsilon$ impacts performance in Appendix B.1, where we find that our method is fairly robust to the neighborhood size.

We are interested in identifying inputs that cover/spread as much of the behavior space $\mathcal{B}$ as possible. Since $\boldsymbol{f}$ is assumed to be expensive to evaluate and unmodeled, we look to accomplish this task by sequentially evaluating the outcome function (through simulations or experiments) over a finite and discrete number of samples indexed by $t=1,\ldots,T$. For every query point $\boldsymbol{x}_t$, we receive a noisy measurement of the outcome function $\boldsymbol{y}_t=\boldsymbol{f}(\boldsymbol{x}_t)+\boldsymbol{\eta}_t$, where $\boldsymbol{\eta}_t\sim\mathcal{N}(0,\sigma^{2}\mathbf{I}_n)$ is Gaussian noise. Letting $\boldsymbol{\varphi}:\mathcal{O}\to\mathcal{B}$ be the function that maps the outcome of the system to a particular behavior, we introduce the concept of the behavior gap $\text{BG}_t = 1 - |\{\boldsymbol{\varphi}(\boldsymbol{f}(\boldsymbol{x}_i))\}_{i=1}^{t}|/|\mathcal{B}|$, which measures the fraction of unobserved behaviors in the system. We seek to minimize the cumulative behavior gap $\sum_{t=1}^{T}\text{BG}_t$ or, equivalently, maximize the fraction of unique behaviors observed at every iteration.
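To make the behavior-gap bookkeeping concrete, the following minimal Python sketch bins a one-dimensional outcome space into $\epsilon$-sized cells and computes $\text{BG}_t$ from a set of observed outcomes; the bounds, bin count, and function names are illustrative rather than taken from our implementation.

```python
import numpy as np

def behavior_ids(outcomes, lo, hi, n_bins):
    """Map scalar outcomes to discrete behavior indices by binning [lo, hi] into n_bins cells."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # np.digitize returns 1..n_bins for values inside [lo, hi]; clip guards the endpoints.
    return np.clip(np.digitize(outcomes, edges), 1, n_bins)

def behavior_gap(outcomes, lo, hi, n_bins):
    """BG_t = 1 - (# unique behaviors observed) / |B|."""
    return 1.0 - len(set(behavior_ids(outcomes, lo, hi, n_bins))) / n_bins

# Toy usage: 5 noisy outcomes in [0, 10] with 25 behavior cells.
y_observed = np.array([0.3, 0.4, 2.2, 7.9, 8.1])
print(behavior_gap(y_observed, lo=0.0, hi=10.0, n_bins=25))  # fraction of behaviors still unseen
```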

As commonly done in the BO literature, we will model the black-box function $\boldsymbol{f}$ as being drawn from a multi-output Gaussian process (MOGP) prior [20]. This is a valid assumption as long as $\boldsymbol{f}$ satisfies certain smoothness properties such as belonging to some Reproducing Kernel Hilbert Space (RKHS) [21]. The MOGP can be used to derive a posterior distribution $\mathbb{P}(\boldsymbol{f}|\mathcal{D}_t)$ over the true function $\boldsymbol{f}$ given observed data $\mathcal{D}_t=\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^{t}$. The posterior mean and covariance function can still be calculated analytically in the multi-output case, as discussed further in the next section.

4 Proposed Method

This section describes our approach. We first present the statistical model of the outcome function that accounts for its multi-output structure. We then introduce our acquisition function, defined in terms of a Thompson sample, which leverages the inherent randomness in the statistical model to push the query points toward discovering new system behaviors. After summarizing our proposed algorithm, we describe how our acquisition function can be efficiently maximized and end this section by extending the method to high-dimensional problems through the incorporation of strong priors.

4.1 Multi-Output Gaussian Processes

Gaussian process (GP) models are one of the most popular non-parametric regression approaches in machine learning [10]. The learning process operates by assuming the function values at any collection of inputs are multivariate Gaussian random variables. A GP prior is fully specified by a prior mean function $\mu$ and covariance (or kernel) function $\kappa$. A general way to express a MOGP for $\boldsymbol{f}$ is to define a function $h(\boldsymbol{x},j)=f^{(j)}(\boldsymbol{x})$ that takes an additional input $j\in\mathcal{J}=\{1,\ldots,n\}$, which merely returns the corresponding outcome function. We can now equivalently represent the MOGP prior as a single-output GP prior $h\sim\mathcal{GP}(\mu,\kappa)$ over the extended input space $\mathcal{X}\times\mathcal{J}$, e.g., [22]. Since the noise is Gaussian, the posterior $h|\mathcal{A}\sim\mathcal{GP}(\mu_{\mathcal{A}},\kappa_{\mathcal{A}})$ for any collection of $N$ observations $\mathcal{A}=\{(\boldsymbol{x}_i,j_i,y_i)\}_{i=1}^{N}$ from this GP, where $\boldsymbol{x}_i\in\mathcal{X}$, $j_i\in\mathcal{J}$, and $y_i=h(\boldsymbol{x}_i,j_i)+\eta_i^{(j)}\in\mathbb{R}$, remains a GP with mean $\mu_{\mathcal{A}}$ and covariance $\kappa_{\mathcal{A}}$ given by

$$\mu_{\mathcal{A}}(\boldsymbol{x},j) = \mu(\boldsymbol{x},j) + \boldsymbol{\kappa}_{\mathcal{A}}^{\top}(\boldsymbol{x},j)\left(\boldsymbol{K}_{\mathcal{A}}+\sigma^{2}\mathbf{I}_{N}\right)^{-1}\left(\boldsymbol{y}-\boldsymbol{\mu}_{\mathcal{A}}\right), \tag{1a}$$
$$\kappa_{\mathcal{A}}((\boldsymbol{x},j),(\boldsymbol{x}',j')) = \kappa((\boldsymbol{x},j),(\boldsymbol{x}',j')) - \boldsymbol{\kappa}_{\mathcal{A}}^{\top}(\boldsymbol{x},j)\left(\boldsymbol{K}_{\mathcal{A}}+\sigma^{2}\mathbf{I}_{N}\right)^{-1}\boldsymbol{\kappa}_{\mathcal{A}}(\boldsymbol{x}',j'), \tag{1b}$$

where $\boldsymbol{y}=[y_1,\ldots,y_N]^{\top}$, $\boldsymbol{\mu}_{\mathcal{A}}=[\mu(\boldsymbol{x}_1,j_1),\ldots,\mu(\boldsymbol{x}_N,j_N)]^{\top}$, $\boldsymbol{\kappa}_{\mathcal{A}}(\boldsymbol{x},j)\in\mathbb{R}^{N}$ is the vector of covariance values between the test input $(\boldsymbol{x},j)$ and the observed inputs in $\mathcal{A}$, and $\boldsymbol{K}_{\mathcal{A}}\in\mathbb{R}^{N\times N}$ is the covariance matrix between all observed inputs in $\mathcal{A}$.
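As a concrete illustration of Eq. (1), the sketch below implements the extended-input posterior mean and covariance in plain NumPy with a zero prior mean; the squared-exponential input kernel and the constant cross-task correlation are placeholder modeling choices, not the kernels used in our experiments.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel on the continuous part of the extended input (x, j)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def extended_kernel(Xa, ja, Xb, jb, task_corr=0.5):
    """kappa((x, j), (x', j')): product of an input kernel and a simple task correlation."""
    task = np.where(ja[:, None] == jb[None, :], 1.0, task_corr)
    return rbf(Xa, Xb) * task

def posterior(X, j, y, Xs, js, noise=1e-2):
    """Posterior mean/covariance from Eq. (1) with a zero prior mean."""
    K = extended_kernel(X, j, X, j) + noise * np.eye(len(y))
    Ks = extended_kernel(Xs, js, X, j)            # kappa_A(x*, j*) stacked row-wise
    Kss = extended_kernel(Xs, js, Xs, js)
    sol = np.linalg.solve(K, y)
    mean = Ks @ sol                               # Eq. (1a) with mu = 0
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)     # Eq. (1b)
    return mean, cov

# Toy usage: two outputs (n = 2) observed at a handful of inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(6, 2)); j = np.array([1, 1, 1, 2, 2, 2])
y = np.where(j == 1, np.sin(4 * X[:, 0]), np.cos(4 * X[:, 1])) + 0.05 * rng.standard_normal(6)
m, C = posterior(X, j, y, Xs=rng.uniform(0, 1, size=(3, 2)), js=np.array([1, 2, 1]))
```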

4.2 A Thompson Sampling-based Acquisition Function for Novelty Search

We are interested in discovering new behaviors by exploring unseen regions of the outcome space $\mathcal{O}$. An established approach for tackling this problem is to attempt to maximize a novelty metric that biases the search process toward observing new things. Although there are many different ways to measure novelty, a popular approach is to evaluate the average distance to the $k$-nearest neighbors of observed outcomes given any set of $N$ datapoints $\mathcal{D}=\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^{N}$:

$$\rho(\boldsymbol{x}|\mathcal{D}) = \frac{1}{k}\sum_{i=1}^{k}\text{dist}(\boldsymbol{f}(\boldsymbol{x}),\boldsymbol{y}_i^{\star}), \tag{2}$$

where $\{\boldsymbol{y}_1^{\star},\ldots,\boldsymbol{y}_k^{\star}\}\subset\{\boldsymbol{y}_1,\ldots,\boldsymbol{y}_N\}$ denotes the set of $k$ closest outcomes to $\boldsymbol{f}(\boldsymbol{x})$ in the outcome space and $\text{dist}(\cdot)$ is any valid distance metric defined over the outcome space $\mathcal{O}$. Although $\rho(\boldsymbol{x}|\mathcal{D})$ is intuitively a good metric, it exhibits two major challenges in our problem setting. First, $\boldsymbol{f}(\boldsymbol{x})$ is a black-box function, so we cannot actually compute $\rho(\boldsymbol{x}|\mathcal{D})$ until after an outcome function evaluation has taken place. Standard NS methods rely on evolutionary algorithms to attempt to maximize $\rho(\boldsymbol{x}|\mathcal{D})$, which require too many function evaluations for expensive $\boldsymbol{f}$. Second, this metric can degrade in the presence of noisy evaluations, since the true behavior of any observed outcome might be miscategorized, i.e., there may exist realizations of $\boldsymbol{\eta}$ such that $\boldsymbol{\varphi}(\boldsymbol{f}(\boldsymbol{x})+\boldsymbol{\eta})\neq\boldsymbol{\varphi}(\boldsymbol{f}(\boldsymbol{x}))$. We analyze how noise degrades NS performance in Appendix B.3.
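For intuition, the novelty metric in Eq. (2) can be computed in a few lines once outcomes have been observed; the sketch below assumes a Euclidean distance and is purely illustrative.

```python
import numpy as np

def novelty(y_candidate, Y_observed, k=10):
    """Average Euclidean distance from a candidate outcome to its k nearest observed outcomes (Eq. 2)."""
    d = np.linalg.norm(Y_observed - y_candidate, axis=1)
    k = min(k, len(d))
    return np.sort(d)[:k].mean()

Y = np.random.rand(50, 2)                  # previously observed 2-D outcomes
print(novelty(np.array([2.0, 2.0]), Y))    # far from the observed cloud -> large novelty
print(novelty(np.array([0.5, 0.5]), Y))    # inside the observed cloud  -> small novelty
```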

We can overcome these challenges with a MOGP model $\boldsymbol{f}|\mathcal{D}\sim\mathcal{MOGP}(\boldsymbol{\mu}_{\mathcal{D}},\boldsymbol{\kappa}_{\mathcal{D}})$, where $\boldsymbol{\mu}_{\mathcal{D}}(\cdot)=(\mu_{\mathcal{D}}(\cdot,1),\ldots,\mu_{\mathcal{D}}(\cdot,n))$ is a vector function and $\boldsymbol{\kappa}_{\mathcal{D}}(\cdot,\cdot)$ is a matrix function with elements $[\boldsymbol{\kappa}_{\mathcal{D}}(\cdot,\cdot)]_{i,j}=\kappa_{\mathcal{D}}((\cdot,i),(\cdot,j))$ for all $i,j\in\{1,\ldots,n\}$. The MOGP serves two key purposes: (i) it enables us to filter observation noise by replacing measured data with a surrogate prediction and (ii) it provides a way for us to fantasize future outcome realizations while accounting for uncertainty. The latter step is accomplished through Thompson sampling (TS), which is a classical strategy for decision-making under uncertainty [23]. Our acquisition function can thus be expressed as follows:

$$\alpha_{\text{NS}}(\boldsymbol{x}|\boldsymbol{g},\mathcal{D}) = \frac{1}{k}\sum_{i=1}^{k}\text{dist}(\boldsymbol{g}(\boldsymbol{x}),\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_i^{\star})), \tag{3}$$

where $\boldsymbol{g}(\boldsymbol{x})\sim\boldsymbol{f}|\mathcal{D}$ is a sample function realization from the MOGP posterior and $\{\boldsymbol{x}_1^{\star},\ldots,\boldsymbol{x}_k^{\star}\}\subset\{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N\}$ is the set of $k$ observed inputs whose posterior mean predictions $\{\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_1),\ldots,\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_N)\}$ are closest to $\boldsymbol{g}(\boldsymbol{x})$. Note that we use the method from [24] to efficiently generate posterior samples of the MOGP, which combines weight- and function-space views to perform decoupled sampling. This directly provides differentiable, high-accuracy approximations of $\boldsymbol{g}$; we discuss a differentiable reformulation of the $k$-nearest-neighbor computation in Section 4.4.
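The following single-output, grid-based sketch illustrates the mechanics of Eq. (3): draw one Thompson sample from the GP posterior on a candidate grid, score each candidate by its average distance to the $k$ nearest posterior-mean outcomes, and pick the maximizer. Our actual implementation relies on decoupled pathwise sampling [24] and continuous optimization rather than a grid, so all numerical choices below are illustrative.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

# Toy 1-D, single-output setting: observed data and a dense candidate grid.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 8); y = np.sin(6 * X) + 0.05 * rng.standard_normal(8)
Xc = np.linspace(0, 1, 200)
noise, k = 1e-2, 3

K = rbf(X, X) + noise * np.eye(len(X))
sol = np.linalg.solve(K, y)
mu_obs = rbf(X, X) @ sol                                  # denoised outcomes at observed inputs
Kc = rbf(Xc, X)
mu_c = Kc @ sol                                           # posterior mean on the candidate grid
cov_c = rbf(Xc, Xc) - Kc @ np.linalg.solve(K, Kc.T)

# Thompson sample g ~ f | D on the grid, then score each candidate via Eq. (3).
g = rng.multivariate_normal(mu_c, cov_c + 1e-8 * np.eye(len(Xc)))
alpha = np.array([np.sort(np.abs(mu_obs - gx))[:k].mean() for gx in g])
x_next = Xc[np.argmax(alpha)]                             # next query point
print(x_next)
```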

4.3 The BEACON Algorithm

Algorithm 1, which we dub the ‘Bayesian Exploration Algorithm for outCOme Novelty’ (BEACON), sequentially combines the ideas from the previous section. A visual illustration of BEACON applied to a simple test problem is shown in Figure 1. Inspired by [25], we present an asynchronous form of BEACON that takes advantage of $M\geq 1$ parallel workers wherein, once a worker finishes a job, it becomes available to begin a new evaluation. This form is useful in practice since we are motivated by problems, such as material discovery, that can exhibit asynchronicity.

Algorithm 1: Asynchronous BEACON

1: Input: Outcome function $\boldsymbol{f}$, input domain $\mathcal{X}$, initial data $\mathcal{D}$, budget $T$, number of workers $M$, number of nearest neighbors $k$, outcome distance metric $\text{dist}(\cdot)$, MOGP prior $\mathcal{MOGP}(\boldsymbol{\mu},\boldsymbol{\kappa})$.
2: Initialize: Data $\mathcal{D}_0\leftarrow\mathcal{D}$ and surrogate model $\mathcal{MOGP}_0\leftarrow\mathcal{MOGP}(\boldsymbol{\mu}_{\mathcal{D}_0},\boldsymbol{\kappa}_{\mathcal{D}_0})$.
3: for $t=1,2,\ldots,T$ do
4:   if $t>M$ then
5:     Wait for a worker to finish its evaluation.
6:     $\mathcal{D}_t\leftarrow\mathcal{D}_{t-1}\cup\{(\boldsymbol{x}',\boldsymbol{y}')\}$, where $\boldsymbol{x}'$ is the worker's previous query point and $\boldsymbol{y}'=\boldsymbol{f}(\boldsymbol{x}')+\boldsymbol{\eta}$ is its corresponding potentially noisy observation.
7:     Update surrogate model with new data: $\mathcal{MOGP}_t\leftarrow\mathcal{MOGP}(\boldsymbol{\mu}_{\mathcal{D}_t},\boldsymbol{\kappa}_{\mathcal{D}_t})$.
8:   else (at least one worker has not been assigned a job)
9:     $\mathcal{D}_t\leftarrow\mathcal{D}_{t-1}$ and $\mathcal{MOGP}_t\leftarrow\mathcal{MOGP}_{t-1}$.
10:  end if
11:  $\boldsymbol{x}_t\leftarrow\operatorname{argmax}_{\boldsymbol{x}\in\mathcal{X}}\alpha_{\text{NS}}(\boldsymbol{x}|\boldsymbol{g},\mathcal{D}_t)$, where $\boldsymbol{g}\sim\mathcal{MOGP}_t$ is a posterior sample.
12:  Re-deploy worker to evaluate $\boldsymbol{f}$ at $\boldsymbol{x}_t$.
13: end for
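For readers who prefer code to pseudocode, the sketch below runs a sequential ($M=1$) variant of the BEACON loop over a finite candidate pool with a simple single-output GP; the stand-in outcome function, kernel, and hyperparameters are illustrative and not those used in our experiments.

```python
import numpy as np

def rbf(A, B, ls=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def f_true(X):                           # expensive black-box outcome (here: a cheap stand-in)
    return np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

rng = np.random.default_rng(1)
Xcand = rng.uniform(0, 2, size=(500, 2))             # discrete candidate pool (e.g., a materials library)
idx = list(rng.choice(500, 5, replace=False))        # initial design
X, y = Xcand[idx], f_true(Xcand[idx]) + 0.05 * rng.standard_normal(5)
noise, k, T = 1e-2, 3, 30

for t in range(T):
    K = rbf(X, X) + noise * np.eye(len(y))
    sol = np.linalg.solve(K, y)
    mu_obs = rbf(X, X) @ sol                          # denoised outcomes at observed inputs
    Kc = rbf(Xcand, X)
    mu_c = Kc @ sol
    cov_c = rbf(Xcand, Xcand) - Kc @ np.linalg.solve(K, Kc.T)
    g = rng.multivariate_normal(mu_c, cov_c + 1e-8 * np.eye(len(Xcand)))   # Thompson sample
    alpha = np.array([np.sort(np.abs(mu_obs - gi))[:k].mean() for gi in g])
    alpha[idx] = -np.inf                              # do not re-query evaluated candidates
    j = int(np.argmax(alpha)); idx.append(j)          # maximize the NS acquisition (line 11)
    X = np.vstack([X, Xcand[j]])                      # evaluate and append the noisy observation
    y = np.append(y, f_true(Xcand[j:j + 1]) + 0.05 * rng.standard_normal())
```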

[Figure 1: visual illustration of BEACON applied to a simple test problem.]

4.4 Gradient-based Maximization of Proposed Acquisition Function

An important step in Algorithm 1 is the maximization of our proposed acquisition function $\alpha_{\text{NS}}$ in line 11. To ensure this step can be carried out efficiently, we would like to take advantage of gradient-based deterministic optimization algorithms, such as L-BFGS-B [26], with multiple randomly generated restarts. It turns out that this is possible by re-expressing (3) in terms of the sort operator:

$$\alpha_{\text{NS}}(\boldsymbol{x}|\boldsymbol{g},\mathcal{D}) = \frac{1}{k}\,\mathbf{e}_k^{\top}\,\text{sort}\left(\left[\text{dist}(\boldsymbol{g}(\boldsymbol{x}),\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_1)),\ldots,\text{dist}(\boldsymbol{g}(\boldsymbol{x}),\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_N))\right]^{\top}\right), \tag{4}$$

where $\mathbf{e}_k$ is a vector whose first $k$ entries are equal to 1 and whose remaining entries are equal to 0, and $\text{sort}(\cdot)$ denotes a mapping that sorts its input in ascending order, so that the first $k$ entries correspond to the $k$-nearest neighbors. As discussed in [27], the standard sort operator is continuous and almost everywhere differentiable (with non-zero gradients).
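A minimal PyTorch sketch of Eq. (4) is given below; it checks that gradients propagate through the sort back to the candidate input, using a toy differentiable stand-in for the posterior sample $\boldsymbol{g}(\boldsymbol{x})$.

```python
import torch

def alpha_ns(g_x, mu_obs, k=10):
    """Eq. (4): average of the k smallest distances, written with a sort so that
    autograd can propagate gradients through the (almost everywhere differentiable) sort."""
    d = torch.linalg.vector_norm(mu_obs - g_x, dim=-1)   # distances to posterior-mean outcomes
    d_sorted, _ = torch.sort(d)                          # ascending order
    e_k = torch.zeros_like(d_sorted); e_k[:k] = 1.0
    return (e_k * d_sorted).sum() / k

# Toy check that gradients flow back to the candidate input through a differentiable g(x).
mu_obs = torch.randn(50, 2)                              # posterior means of observed outcomes
x = torch.tensor([0.2, 0.8], requires_grad=True)
W = torch.randn(2, 2)
g_x = torch.tanh(x @ W)                                  # stand-in for a pathwise posterior sample g(x)
alpha_ns(g_x, mu_obs, k=5).backward()
print(x.grad)                                            # non-zero gradient -> usable with L-BFGS-B
```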

The cost of the sort operator in (4) grows with the size of its input, which can become significant for large $N$. Although we are motivated by such applications, it is worth noting that this cost can be reduced by only keeping track of outcomes that lead to unique behaviors and then calculating the $k$-nearest neighbors among only those outcomes. The size of this distance vector is at most $|\mathcal{B}|$, which is constant (independent of $N$). In Appendix E, we show how combining this idea with sparse GP approximations allows BEACON to scale to high evaluation budgets (on the order of 10,000 or more).
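A hedged sketch of this pruning step: collapse the posterior-mean outcomes to one representative per occupied $\epsilon$-sized behavior cell before forming the distance vector (the gridding scheme here is illustrative).

```python
import numpy as np

def behavior_representatives(mu_obs, eps=0.5):
    """Collapse posterior-mean outcomes to one representative per occupied behavior cell,
    so the distance vector passed to the sort in Eq. (4) has length at most |B|."""
    cells = {}
    for m in mu_obs:
        key = tuple(np.floor(m / eps).astype(int))   # epsilon-sized behavior cell
        cells.setdefault(key, m)                     # keep the first outcome seen in each cell
    return np.stack(list(cells.values()))

mu_obs = np.random.rand(10_000, 2) * 3               # many observed outcomes, few distinct behaviors
print(behavior_representatives(mu_obs).shape)        # at most (3 / 0.5)^2 = 36 representatives
```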

4.5 Tackling High-Dimensional Problems using Strong Priors

An advantage of BEACON is its close relationship to the BO framework, which lets it leverage recent advances that have produced significant performance boosts in various settings. We describe two particularly useful examples in this section that can help BEACON scale to high-dimensional problems ($d\gg 10$) when certain substructure in $\boldsymbol{f}$ can be exploited. Other advances, such as accounting for unknown GP hyperparameters [28] and black-box constraints [29], can in principle also be incorporated into BEACON with relatively minor modifications.

SAAS.

The sparse axis-aligned subspace (SAAS) function prior was introduced in [30] as a straightforward way to introduce a flexible amount of sparsity into the GP surrogate model. The core assumption in SAAS is that the inputs exhibit a hierarchy of relevance (i.e., certain inputs are more important than others for predicting $\boldsymbol{f}$). SAAS achieves this by enforcing a strong prior on the lengthscale hyperparameters of the GP, with the default position being that any given dimension is mostly unimportant. Since it is a general-purpose prior, we recommend using SAAS as the default choice for problems with a large number of inputs. We study the impact of SAAS on a high-dimensional molecular discovery problem in Section 5.3.
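As a rough sketch of how a SAAS surrogate could be slotted into BEACON, the snippet below fits BoTorch's fully Bayesian SAAS GP on a toy high-dimensional problem; the class and function names follow recent BoTorch releases and may differ across versions, so treat this as an assumption rather than our exact implementation.

```python
# Hedged sketch: swap a SAAS-prior GP into the surrogate-fitting step (names assume recent BoTorch).
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts

train_X = torch.rand(30, 100, dtype=torch.double)           # 100-dimensional inputs, few samples
train_Y = train_X[:, :3].sum(dim=-1, keepdim=True)          # only 3 dimensions actually matter
train_Y = train_Y + 0.01 * torch.randn_like(train_Y)

model = SaasFullyBayesianSingleTaskGP(train_X, train_Y)
fit_fully_bayesian_model_nuts(model, warmup_steps=256, num_samples=128, thinning=16)
# The inferred lengthscales concentrate on the few relevant dimensions; posterior samples from this
# surrogate can then play the role of g in the BEACON acquisition.
```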

GAUCHE.

The SAAS prior may not perform well if the hierarchy-of-relevance assumption is not satisfied. This is likely to be the case when the high-dimensional inputs are highly correlated, such as image data, in which the pixels typically exhibit strong spatial correlation. In such cases, one must resort to an alternative approach. Since our work is partly motivated by material and molecule discovery problems, we briefly describe an interesting alternative suited to these applications. Specifically, the GAUCHE library was recently developed in [31]; it provides a GP framework specifically designed for non-continuous molecular inputs (including protein and chemical reaction representations). We can therefore directly port over any such representation in GAUCHE to be used by BEACON, which we explore on an aqueous solubility problem in Section 5.3.
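For reference, the Tanimoto similarity underlying fingerprint-style kernels has a simple closed form; the NumPy sketch below is illustrative (GAUCHE provides its own kernel implementations).

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto (Jaccard) similarity between binary fingerprint/fragprint vectors:
    k(x, x') = <x, x'> / (||x||^2 + ||x'||^2 - <x, x'>)."""
    inner = A @ B.T
    sq = (A ** 2).sum(-1)[:, None] + (B ** 2).sum(-1)[None, :]
    return inner / (sq - inner)

# Random 2133-bit fragprint-style vectors (dimension borrowed from the ESOL experiment for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4, 2133)).astype(float)
print(tanimoto_kernel(X, X))    # symmetric, ones on the diagonal
```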

5 Numerical Experiments

In this section, we compare BEACON against established NS algorithms and their variants, including the standard evolutionary-based method (NS-EA) [6], a distance-enhanced evolutionary method (NS-DEA) [32], and a feature space-based method (NS-FS). We also compare with the performance of three other algorithms: one that chooses points uniformly at random over $\mathcal{X}$ (RS); one that chooses points using low-discrepancy quasi-random Sobol samples over $\mathcal{X}$ (Sobol); and a maximum-variance active learning strategy (MaxVar). Implementation details for BEACON and the benchmark algorithms are provided in Appendix A.

We evaluate performance on 10 synthetic and 8 real-world problems, which are described below or in Appendices C and D (the latter covers a reinforcement learning problem for maze navigation). In all problems, we generate an initial set of 10 points uniformly at random over $\mathcal{X}$ as a first stage to seed the algorithms. Then, we use the algorithms to select an additional 80 to 300 points for evaluation, as depicted in the figures below. All experiments are replicated 20 times. We use reachability as our performance metric, defined as $\text{Reach}_t = 1 - \text{BG}_t$ (the fraction of total behaviors observed). Figures 2 and 3 plot the mean of $\text{Reach}_t$ plus and minus one standard deviation computed over the independent replicates. Information on runtimes is provided in the supplement, and the code needed to replicate our results is provided in a supplemental zip file.

5.1 Synthetic Functions

We create three synthetic test functions from widely used benchmark functions in the global optimization literature – Ackley, Rosenbrock, and Styblinski-Tang – by adapting them to the NS problem setting. We treat the output of these functions as our outcomes and partition the range of outcomes into 25 equally-spaced intervals to define the corresponding behaviors; further details of these functions are provided in Appendix C. The reachability performance of all algorithms across 9 problems (3 synthetic functions, each evaluated at input dimensions $d\in\{4,8,12\}$) is shown in Figure 2, as is the benchmark-construction sketch below. We see that BEACON consistently outperforms the other algorithms in all cases, achieving near-perfect reachability values of 1 in several cases. We further see that the performance gap increases as problem dimension increases. To test BEACON in the multi-output setting, we also created a challenging Multi-Output Plus function with a long-tail joint distribution. The results are shown in Appendix C.1.4, where we again observe superior performance by BEACON.
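A sketch of how such an NS benchmark can be constructed: the scalar Ackley output is treated as the outcome and its observed range is partitioned into 25 bins. The input bounds and sampling here are illustrative, with the exact setup given in Appendix C.

```python
import numpy as np

def ackley(X, a=20.0, b=0.2, c=2 * np.pi):
    """Standard d-dimensional Ackley function, used here as the scalar outcome f(x)."""
    d = X.shape[-1]
    s1 = np.sqrt((X ** 2).sum(-1) / d)
    s2 = np.cos(c * X).sum(-1) / d
    return -a * np.exp(-b * s1) - np.exp(s2) + a + np.e

# Treat the Ackley output on an illustrative box [-5, 5]^4 as the outcome and split its range into 25 bins.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(100_000, 4))
y = ackley(X)
edges = np.linspace(y.min(), y.max(), 26)
behaviors = np.clip(np.digitize(y, edges), 1, 25)
print(np.bincount(behaviors)[1:])   # most random samples land in a few bins -> diverse behaviors are rare
```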

[Figure 2: reachability of all algorithms on the nine synthetic NS problems (Ackley, Rosenbrock, and Styblinski-Tang with d ∈ {4, 8, 12}).]

5.2 Discovery of Metal Organic Frameworks with Diverse Properties

Metal organic frameworks (MOFs) are an extremely diverse set of porous crystalline materials made from metal ions and organic linkers. There is currently no theoretical framework for evaluating the chemical/property diversity of the effectively infinite number of MOFs that can be made by tuning the available degrees of freedom (different linkers, metal nodes, defects, and so forth) [33]. We argue that NS could provide a pathway to generate more diverse libraries of MOFs. For example, recent work [34] has shown how biases in existing libraries can lead to incorrect conclusions when those libraries are used to train machine learning models for screening. Since the properties of MOFs can be expensive to evaluate computationally and experimentally, BEACON is an attractive NS approach for these settings. We consider four MOF-related problems below that are based on existing experimentally-determined databases (further details in Appendix C).

Hydrogen uptake capacity.

Hydrogen uptake capacity is an important property in clean energy applications. We use the dataset from [35] that consists of 98,000 unique MOF structures. We develop a 7-dimensional feature representation for the input MOF.

Nitrogen uptake capacity.

Nitrogen uptake capacity is an important property for reducing the cost of natural gas production from renewable feedstocks. We use the dataset from [36] consisting of 5,224 unique MOF structures, with a 20-dimensional feature representation for the input MOF.

Sour gas sweetening.

Sour natural gas contains significant amounts of hydrogen sulfide, which poses significant health and safety risks. MOFs are actively being considered as a material for adsorption-based separation processes that can efficiently remove hydrogen sulfide at low cost. We use the dataset from [37] that consists of 1,600 unique MOF structures. We develop a 12-dimensional feature representation for the input MOF, consisting of both chemical and structural features.

Joint CO2 and CH4 gas uptake capacity.

We also consider a multi-output MOF discovery application that treats both carbon dioxide (CO2) uptake capacity $y_1$ and methane (CH4) uptake capacity $y_2$ as outcomes. We use the dataset from [34], which consists of 7,000 unique MOF structures, and develop a 25-dimensional feature representation for the input MOF. The highly skewed joint outcome distribution is displayed in Appendix C.2.4.

The performance of the algorithms on all MOF-related experiments is shown in Figure 3a–d. Since the MOF problems are defined over discrete input spaces, we are unable to run NS-EA, NS-DEA, and NS-FS, as they are only applicable to continuous inputs. We still observe that BEACON outperforms all other methods, achieving (on average) the highest possible reachability of 1 in all cases.

[Figure 3: reachability of all algorithms on the MOF discovery problems (a–d) and the solubility problems (e–f).]

5.3 Solubility of Small-Molecule Organic Compounds

Solubility is a chemical property describing how well a substance (the solute) forms a solution with another substance (the solvent), and is important in many applications including drug discovery, pollutant transport, and nuclear reprocessing, to name a few [38]. We explore the application of NS to three problems for finding organic compounds with diverse solubility properties.

Water solubility.

We consider the dataset from [39], consisting of 900 organic compounds. We select a 14-dimensional representation of the compounds using molecular descriptors.

ESOL.

We consider the aqueous solubility dataset with 1128 organic small molecules from [40]. To demonstrate BEACON’s ability to scale to high-dimensional problems, we do not use molecular descriptors and instead use a 2133-dimensional representation derived from one-hot encoding fragprint vectors. We use the Tanimoto-fragprint covariance function [31] to handle this binary input.

LogD.

The octanol-water partition distribution coefficient (LogD) is important in drug discovery applications since it influences several of a drug candidate's properties, such as solubility, permeability, and toxicity. LogD is influenced by many factors, so we cannot directly choose a low-dimensional representation using molecular descriptors. Instead, as done in previous work [41], we consider a 125-dimensional feature representation, for a dataset consisting of 2,070 molecules, that is not amenable to standard GP modeling. Similarly to [42], we rely on the SAAS function prior, which assumes only a small number of sensitive features exist.

The performance of the ESOL and LogD experiments can be found in Figure 3e–f (results for the water solubility case are shown in Appendix C.3.1). Again, we find that BEACON significantly outperforms all other methods for all cases. We also see that the SAAS prior and Tanimoto-fragprint covariance function improve the performance of BEACON in high-dimensional settings.

6 Conclusion

This paper develops an approach for efficient novelty search (NS) in systems whose outcomes are defined in terms of expensive black-box functions with noisy observations. Such problems arise in several important applications including material design, drug discovery, and reinforcement learning. Current NS methods, which are based on sample-inefficient evolutionary algorithms, struggle in the noisy low-data regime of interest in this work. Inspired by the principles of Bayesian optimization, we develop a surrogate-based NS algorithm that intelligently selects new sample locations given past observations. We further discuss how our approach can be easily modified to scale to problems with high-dimensional input spaces and large datasets. Our extensive numerical experiments show that our proposed approach can dramatically outperform existing NS methods on synthetic and real-world problems. This includes, to our knowledge, a first-time application of NS to the discovery of metal organic frameworks, which are important materials in emerging clean energy technologies.

Although we empirically observe significant benefits with our proposed approach, there are two limitations worth noting. First, it requires more computation than alternative NS methods, which we discuss in the supplement. This is not a concern for expensive systems (evaluations on the order of several minutes or longer), as the improved sample efficiency more than compensates for the increased computation; however, for cheaper functions, it is not clear if our approach would still be favored. This computation gap also widens as more data is collected, though this can be controlled through appropriate choice of surrogate model (also explored in the supplement). Second, we have not established any theoretical convergence results for our method. There has been some recent work toward building a theoretical framework for NS [43], but additional work is needed to fit our approach into this framework. We believe, however, this is an exciting direction for future work.

Since our approach can accelerate the discovery of diverse behaviors in expensive black-box systems, it can be useful in many applications. This includes several areas that benefit society like mitigating climate change (through better material design) and improving public health (through faster drug discovery) but also potentially harmful areas such as weapon design. Therefore, it is important for society to establish guardrails to help ensure appropriate use of our proposed methodology.

References

  • [1] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2016.
  • [2] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • [3] Peter I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • [4] Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. Differentiable expected hypervolume improvement for parallel multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 33:9851–9864, 2020.
  • [5] Joel Lehman and Kenneth O. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011.
  • [6] Joel Lehman and Kenneth O. Stanley. Novelty search and the problem with objectives. Genetic Programming Theory and Practice IX, pages 37–56, 2011.
  • [7] Jonathan Grizou, Laurie J. Points, Abhishek Sharma, and Leroy Cronin. A curious formulation robot enables the discovery of a novel protocell behavior. Science Advances, 6(5):eaay4237, 2020.
  • [8] Kei Terayama, Masato Sumita, Ryo Tamura, Daniel T. Payne, Mandeep K. Chahal, Shinsuke Ishihara, and Koji Tsuda. Pushing property limits in materials discovery via boundless objective-free exploration. Chemical Science, 11(23):5959–5968, 2020.
  • [9] Ethan C. Jackson and Mark Daley. Novelty search for deep reinforcement learning policy network weights by action sequence edit metric distance. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 173–174, 2019.
  • [10] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
  • [11] A. G. Zhilinskas. Single-step Bayesian search method for an extremum of functions of a single variable. Cybernetics, 11(1):160–166, 1975.
  • [12] Jonas Mockus. On Bayesian methods for seeking the extremum. In Proceedings of the IFIP Technical Conference, pages 400–404, 1974.
  • [13] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
  • [14] Natalie Maus, Kaiwen Wu, David Eriksson, and Jacob Gardner. Discovering many diverse solutions with Bayesian optimization. arXiv preprint arXiv:2210.10953, 2022.
  • [15] Sebastian Risi, Charles E. Hughes, and Kenneth O. Stanley. Evolving plastic neural networks with novelty search. Adaptive Behavior, 18(6):470–491, 2010.
  • [16] Jean-Baptiste Mouret. Novelty-based multiobjectivization. In New Horizons in Evolutionary Robotics: Extended Contributions from the 2009 EvoDeRob Workshop, pages 139–154. Springer, 2011.
  • [17] Jorge Gomes, Pedro Mariano, and Anders Lyhne Christensen. Devising effective novelty search algorithms: A comprehensive empirical study. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pages 943–950, 2015.
  • [18] Joel Lehman and Kenneth O. Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 211–218, 2011.
  • [19] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
  • [20] Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018.
  • [21] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853. PMLR, 2017.
  • [22] Akshay Kudva, Wei-Ting Tang, and Joel A. Paulson. Robust Bayesian optimization for flexibility analysis of expensive simulation-based models with rigorous uncertainty bounds. Computers & Chemical Engineering, 181:108515, 2024.
  • [23] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
  • [24] James Wilson, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Deisenroth. Efficiently sampling functions from Gaussian process posteriors. In International Conference on Machine Learning, pages 10292–10302. PMLR, 2020.
  • [25] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142. PMLR, 2018.
  • [26] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
  • [27] Sebastian Prillo and Julian Eisenschlos. SoftSort: A continuous relaxation for the argsort operator. In International Conference on Machine Learning, pages 7793–7802. PMLR, 2020.
  • [28] Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. No-regret Bayesian optimization with unknown hyperparameters. Journal of Machine Learning Research, 20(50):1–24, 2019.
  • [29] Jacob R. Gardner, Matt J. Kusner, Zhixiang Eddie Xu, Kilian Q. Weinberger, and John P. Cunningham. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, volume 2014, pages 937–945, 2014.
  • [30] David Eriksson and Martin Jankowiak. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in Artificial Intelligence, pages 493–503. PMLR, 2021.
  • [31] Ryan-Rhys Griffiths, Leo Klarner, Henry Moss, Aditya Ravuri, Sang Truong, Yuanqi Du, Samuel Stanton, Gary Tom, Bojana Rankovic, Arian Jamasb, et al. GAUCHE: A library for Gaussian processes in chemistry. Advances in Neural Information Processing Systems, 36, 2024.
  • [32] Stephane Doncieux, Giuseppe Paolo, Alban Laflaquière, and Alexandre Coninx. Novelty search makes evolvability inevitable. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pages 85–93, 2020.
  • [33] Sangwon Lee, Baekjun Kim, Hyun Cho, Hooseung Lee, Sarah Yunmi Lee, Eun Seon Cho, and Jihan Kim. Computational screening of trillions of metal–organic frameworks for high-performance methane storage. ACS Applied Materials & Interfaces, 13(20):23647–23654, 2021.
  • [34] Seyed Mohamad Moosavi, Aditya Nandy, Kevin Maik Jablonka, Daniele Ongari, Jon Paul Janet, Peter G. Boyd, Yongjin Lee, Berend Smit, and Heather J. Kulik. Understanding the diversity of the metal-organic framework ecosystem. Nature Communications, 11(1):1–10, 2020.
  • [35] Sumedh Ghude and Chandra Chowdhury. Exploring hydrogen storage capacity in metal-organic frameworks: A Bayesian optimization approach. Chemistry–A European Journal, 29(69):e202301840, 2023.
  • [36] Hilal Daglar and Seda Keskin. Combining machine learning and molecular simulations to unlock gas separation potentials of MOF membranes and MOF/polymer MMMs. ACS Applied Materials & Interfaces, 14(28):32134–32148, 2022.
  • [37] Yongchul G. Chung, Emmanuel Haldoupis, Benjamin J. Bucior, Maciej Haranczyk, Seulchan Lee, Hongda Zhang, Konstantinos D. Vogiatzis, Marija Milisavljevic, Sanliang Ling, Jeffrey S. Camp, et al. Advances, updates, and analytics for the computation-ready, experimental metal–organic framework database: CoRE MOF 2019. Journal of Chemical & Engineering Data, 64(12):5985–5998, 2019.
  • [38] Ketan T. Savjani, Anuradha K. Gajjar, and Jignasa K. Savjani. Drug solubility: Importance and enhancement techniques. International Scholarly Research Notices, 2012, 2012.
  • [39] Samuel Boobier, David R. J. Hose, A. John Blacker, and Bao N. Nguyen. Machine learning with physicochemical relationships: solubility prediction in organic solvents and water. Nature Communications, 11(1):5753, 2020.
  • [40] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.
  • [41] Zaw-Myo Win, Allen M. Y. Cheong, and W. Scott Hopkins. Using machine learning to predict partition coefficient (Log P) and distribution coefficient (Log D) with molecular descriptors and liquid chromatography retention time. Journal of Chemical Information and Modeling, 63(7):1906–1913, 2023.
  • [42]Farshud Sorourifar, Thomas Banker, and JoelA Paulson.Accelerating black-box molecular property optimization by adaptivelylearning sparse subspaces.arXiv preprint arXiv:2401.01398, 2024.
  • [43]Stephane Doncieux, Alban Laflaquière, and Alexandre Coninx.Novelty search: A theoretical perspective.In Proceedings of the 2019 Genetic and Evolutionary ComputationConference, pages 99–106, 2019.
  • [44]Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham,AndrewG Wilson, and Eytan Bakshy.BoTorch: A framework for efficient Monte-Carlo Bayesianoptimization.Advances in Neural Information Processing Systems,33:21524–21538, 2020.
  • [45]Niranjan Srinivas, Andreas Krause, ShamM Kakade, and Matthias Seeger.Gaussian process optimization in the bandit setting: No regret andexperimental design.arXiv preprint arXiv:0912.3995, 2009.
  • [46]Il’yaMeerovich Sobol’.On the distribution of points in a cube and the approximateevaluation of integrals.Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki,7(4):784–802, 1967.
  • [47]Momin Jamil and Xin-She Yang.A literature survey of benchmark functions for global optimisationproblems.International Journal of Mathematical Modelling and NumericalOptimisation, 4(2):150–194, 2013.
  • [48]SachinP Shet, SShanmuga Priya, KSudhakar, and Muhammad Tahir.A review on current trends in potential use of metal-organicframework for hydrogen storage.International Journal of Hydrogen Energy, 46(21):11782–11803,2021.
  • [49]Xiaofei Wu, Bin Yuan, Zongbi Bao, and Shuguang Deng.Adsorption of carbon dioxide, methane and nitrogen on anultramicroporous copper metal–organic framework.Journal of Colloid and Interface Science, 430:78–84, 2014.
  • [50]Ryther Anderson, Achay Biong, and DiegoA Gómez-Gualdrón.Adsorption isotherm predictions for multiple molecules in MOFsusing the same deep learning model.Journal of Chemical Theory and Computation, 16(2):1271–1283,2020.
  • [51]Antonio Peluso, Nicola Gargiulo, Paolo Aprea, Francesco Pepe, and DomenicoCaputo.Nanoporous materials as H2S adsorbents for biogas purification: Areview.Separation & Purification Reviews, 48(1):78–89, 2019.
  • [52]MansiS Shah, Michael Tsapatsis, and JIlja Siepmann.Hydrogen sulfide capture: From absorption in polar liquids to oxide,zeolite, and metal–organic framework adsorbents and membranes.Chemical Reviews, 117(14):9755–9803, 2017.
  • [53]EunHyun Cho, Xuepeng Deng, Changlong Zou, and Li-Chiang Lin.Machine learning-aided computational study of metal–organicframeworks for sour gas sweetening.The Journal of Physical Chemistry C, 124(50):27580–27591,2020.
  • [54]Yabing He, Wei Zhou, Guodong Qian, and Banglin Chen.Methane storage in metal–organic frameworks.Chemical Society Reviews, 43(16):5657–5678, 2014.
  • [55]Xuanjun Wu, Sichen Xiang, Jiaqi Su, and Weiquan Cai.Understanding quantitative relationship between methane storagecapacities and characteristic properties of metal–organic frameworks basedon machine learning.The Journal of Physical Chemistry C, 123(14):8550–8559, 2019.
  • [56]Michael Fernandez, PeterG Boyd, ThomasD Daff, MohammadZein Aghaji, and TomKWoo.Rapid and accurate machine learning recognition of high performingmetal organic frameworks for CO2 capture.The Journal of Physical Chemistry Letters, 5(17):3056–3060,2014.
  • [57]Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi.Deep architectures and deep learning in chemoinformatics: Theprediction of aqueous solubility for drug-like molecules.Journal of Chemical Information and Modeling, 53(7):1563–1575,2013.
  • [58]Tao Wang, Mian-Bin Wu, Ri-Hao Zhang, Zheng-Jie Chen, Chen Hua, Jian-Ping Lin,and Li-Rong Yang.Advances in computational structure-based drug design and applicationin drug discovery.Current Topics in Medicinal Chemistry, 16(9):901–916, 2016.
  • [59]DavidH Kenney, RandyC Paffenroth, MichaelT Timko, and AndrewR Teixeira.Dimensionally reduced machine learning model for predicting singlecomponent octanol–water partition coefficients.Journal of Cheminformatics, 15(1):9, 2023.
  • [60]Nadin Ulrich, Kai-Uwe Goss, and Andrea Ebert.Exploring the octanol–water partition coefficient dataset using deeplearning techniques and data augmentation.Communications Chemistry, 4(1):90, 2021.
  • [61]Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman,Jie Tang, and Wojciech Zaremba.Openai gym.arXiv preprint arXiv:1606.01540, 2016.
  • [62]DonaldR Jones, Matthias Schonlau, and WilliamJ Welch.Efficient global optimization of expensive black-box functions.Journal of Global Optimization, 13:455–492, 1998.
  • [63]Jacob Gardner, Geoff Pleiss, KilianQ Weinberger, David Bindel, and AndrewGWilson.Gpytorch: Blackbox matrix-matrix Gaussian process inference withGPU acceleration.Advances in Neural Information Processing Systems, 31, 2018.
  • [64]Sattar Vakili, Henry Moss, Artem Artemev, Vincent Dutordoir, and VictorPicheny.Scalable Thompson sampling using sparse Gaussian process models.Advances in Neural Information Processing Systems,34:5631–5643, 2021.
  • [65]Alexander G deG Matthews, Mark Van DerWilk, Tom Nickson, Keisuke Fujii,Alexis Boukouvalas, Pablo Le, Zoubin Ghahramani, James Hensman, etal.GPflow: A Gaussian process library using TensorFlow.Journal of Machine Learning Research, 18(40):1–6, 2017.

Appendix A Additional Implementation Details

A.1 BEACON

All MOGPs in our experiments have a covariance function $\kappa((\boldsymbol{x},j),(\boldsymbol{x}^{\prime},j^{\prime})) = \delta_{jj^{\prime}}\,\kappa_{j}(\boldsymbol{x},\boldsymbol{x}^{\prime})$, where $\delta_{jj^{\prime}}$ is the Kronecker delta, which implies that all outputs are modeled independently. For the synthetic experiments, the GPs have a constant mean function and a standard radial basis function (RBF) covariance function of the form:

\[
\kappa_{\text{RBF}}(\boldsymbol{x},\boldsymbol{x}^{\prime}) = \theta_{0}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}^{\prime})^{\top}\Theta(\boldsymbol{x}-\boldsymbol{x}^{\prime})\right),
\]

where $\Theta = \operatorname{diag}(\ell_{1},\ldots,\ell_{d})$, with $\ell_{i}$ denoting the lengthscale parameter for the $i$th dimension, and $\theta_{0}$ is an output scale parameter. For the real-world MOF and water solubility case studies, we replace the RBF kernel with a standard Matérn 5/2 covariance function. For the ESOL and LogD case studies, we use a SAAS function prior [30] and a Tanimoto-fragprint covariance function [31], respectively, to account for their high-dimensional nature. We estimate the GP hyperparameters (and the noise variance $\sigma^{2}$) using the fit_gpytorch_mll function in BoTorch [44] (which corresponds to maximum likelihood estimation) with standard settings.

For problems with continuous $\mathcal{X}$, we use the efficient Thompson sampling (TS) method in [24], which yields a high-accuracy, continuously differentiable approximation of $\boldsymbol{g} \sim \boldsymbol{f}\,|\,\mathcal{D}$. We then develop a PyTorch-based implementation of $\alpha_{\text{NS}}$ in (4), which we maximize using the SciPy implementation of L-BFGS-B [26] over a random set of multi-start initial conditions, as done in BoTorch's optimize_acqf function. For problems with discrete $\mathcal{X}$, we simply evaluate the TS exhaustively at all candidates $\boldsymbol{x} \in \mathcal{X}$ and select the one that yields the largest $\alpha_{\text{NS}}$.
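To make this loop concrete, the following is a minimal sketch of a single BEACON iteration for the discrete-candidate case with a single outcome. It fits a BoTorch GP, draws one joint posterior (Thompson) sample over all candidates, and scores each candidate by the mean distance between its sampled outcome and its $k$ nearest posterior-mean outcomes at previously evaluated points. The helper name and the specific form of the novelty score are illustrative, not the exact implementation of (4).

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood


def beacon_step_discrete(train_X, train_Y, candidates, k=10, seed=0):
    """train_X: (N, d), train_Y: (N, 1) single outcome, candidates: (M, d)."""
    # Fit an exact GP surrogate (default stationary ARD kernel) by maximum likelihood.
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

    torch.manual_seed(seed)
    with torch.no_grad():
        g = model.posterior(candidates).rsample(torch.Size([1]))[0]  # (M, 1) joint TS draw
        mu_obs = model.posterior(train_X).mean                       # (N, 1) de-noised outcomes

    # Novelty score: mean distance to the k nearest observed (posterior-mean) outcomes.
    dists = torch.cdist(g, mu_obs)                                   # (M, N) outcome distances
    knn = torch.topk(dists, k=min(k, mu_obs.shape[0]), dim=1, largest=False).values
    alpha_ns = knn.mean(dim=1)                                       # novelty value per candidate
    return candidates[alpha_ns.argmax()]                             # next point to evaluate
```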

A.2 NS-EA

NS-EA refers to a standard evolutionary algorithm for NS applied to the novelty metric in (2), originally proposed in [6]. We follow the implementation of NS-EA described in [32], which starts with an initial population of size $n_{\text{pop}}$ and then evolves each generation by randomly mutating the existing genomes. The algorithm selectively retains only the most novel genomes from the current population and the offspring, maintaining a constant population size throughout the evolutionary process. The specific Python implementation of NS-EA that we use is available at: https://github.com/alaflaquiere/simple-ns.git.

A.3 NS-DEA

NS-DEA refers to a modified version of NS-EA from [32] that replaces the novelty metric in (2) with a new metric ‘Distance to Explored Area’ (DEA) that measures the distance between an individual’s outcome and the convex hull of previously sampled outcomes. The central idea behind NS-DEA is that we want to sample points clearly outside of the connected region of behaviors that have already been observed. Similarly to NS-EA, we use the Python implementation of NS-DEA available at: https://github.com/alaflaquiere/simple-ns.git.

A.4 NS-FS

We propose a simple variant of NS-EA, referred to as NS-FS, that modifies the novelty metric (2) to be defined over the input space instead of the outcome space. The acquisition function for this case is defined as

\[
\alpha_{\text{NS-FS}}(\boldsymbol{x}) = \frac{1}{k}\sum_{i=1}^{k}\text{dist}_{\mathcal{X}}(\boldsymbol{x},\boldsymbol{x}_{i}^{\star\star}),
\]

where $\text{dist}_{\mathcal{X}}(\cdot)$ denotes a distance function that operates over the input space $\mathcal{X}$ and $\{\boldsymbol{x}_{1}^{\star\star},\ldots,\boldsymbol{x}_{k}^{\star\star}\}$ denotes the top $k$ nearest neighbors to $\boldsymbol{x}$ as measured by $\text{dist}_{\mathcal{X}}(\cdot)$. This metric incentivizes exploration of new regions of the input space; however, it is independent of the observed outcome values, so we expect it to perform worse than outcome-based novelty metrics. Since $\alpha_{\text{NS-FS}}$ is defined only in terms of the input, it can easily be maximized for continuous $\boldsymbol{x} \in \mathcal{X}$ using gradient-based optimization algorithms.
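A minimal sketch of this input-space novelty score, assuming Euclidean distance for $\text{dist}_{\mathcal{X}}$ (the function name is illustrative):

```python
import numpy as np


def alpha_ns_fs(x, sampled_X, k=10):
    """x: (d,) candidate input, sampled_X: (N, d) previously evaluated inputs."""
    dists = np.linalg.norm(sampled_X - x, axis=1)  # Euclidean distance to every past input
    k = min(k, len(dists))
    return np.sort(dists)[:k].mean()               # average distance to the k nearest neighbors
```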

A.5 MaxVar

The MaxVar algorithm is an approach commonly utilized in the field of active learning that aims to explore regions of the sample space whose predictions are most uncertain according to some model. This approach has also been explored in the context of BO [45]. Here, we take a standard MaxVar approach defined in terms of the same MOGP model used by BEACON, but replace the acquisition function with the sum of the diagonal of the predicted posterior covariance matrix, i.e.,

\[
\alpha_{\text{MaxVar}}(\boldsymbol{x}\,|\,\mathcal{D}) = \text{tr}\left(\boldsymbol{\Sigma}_{\mathcal{D}}(\boldsymbol{x})\right),
\]

where $\boldsymbol{\Sigma}_{\mathcal{D}}(\boldsymbol{x}) = \boldsymbol{\kappa}_{\mathcal{D}}(\boldsymbol{x},\boldsymbol{x})$ and $\text{tr}(\cdot)$ denotes the trace operator. The logic behind using MaxVar for NS tasks is that, once we have constructed a globally accurate (cheap) surrogate model, we can easily identify $\boldsymbol{x} \in \mathcal{X}$ that lead to new behaviors. Although sufficient for NS, it is not necessary to learn such an accurate model of $\boldsymbol{f}$ to find new behaviors, which is the key advantage of BEACON.
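Since the MOGP models all outputs independently, the trace reduces to a sum of per-output posterior variances. A minimal sketch under that assumption (helper name illustrative):

```python
import torch


def alpha_maxvar(models, x):
    """models: list of fitted BoTorch/GPyTorch GPs (one per outcome), x: (1, d) candidate."""
    with torch.no_grad():
        variances = [m.posterior(x).variance.squeeze() for m in models]
    return torch.stack(variances).sum()  # tr(Sigma_D(x)) for independently modeled outputs
```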

A.6 Sobol

Sobol refers to the simple algorithm in which one draws a sequence of quasi-random low-discrepancy Sobol points from the input domain $\mathcal{X}$ [46] to generate the points $\{\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{T}\}$. Sobol sampling is a commonly used baseline method in the global optimization literature since it is guaranteed to densely sample $\mathcal{X}$ in the limit of infinite experimental budgets $T \to \infty$.
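A minimal sketch of this baseline using PyTorch's built-in Sobol engine; the scrambling option and the helper name are our own choices for illustration, not details specified above.

```python
import torch
from torch.quasirandom import SobolEngine


def sobol_design(bounds, T, seed=0):
    """bounds: (2, d) tensor of lower/upper bounds for X, T: number of evaluations."""
    d = bounds.shape[1]
    unit = SobolEngine(dimension=d, scramble=True, seed=seed).draw(T)  # (T, d) points in [0, 1)
    return bounds[0] + (bounds[1] - bounds[0]) * unit                  # rescale to the box X
```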

A.7 RS

RS refers to the simple algorithm in which one independently draws a sequence of points from a uniform distribution defined over the input domain $\mathcal{X}$ to generate the points $\{\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{T}\}$. Similarly to Sobol, RS densely samples $\mathcal{X}$ in the limit of infinite experimental budgets $T \to \infty$.

A.8 Runtimes and Licenses

The average runtimes of the considered NS methods for the synthetic problems are summarized in Table 1. These times were computed by running the algorithms on an Intel(R) Core(TM) i7-10700K 3.8 GHz CPU. Although BEACON is more expensive than the other standard NS methods, the average runtime per iteration remains less than 1 second. This additional computation is more than compensated for by BEACON's significantly improved performance, especially for problems where each outcome evaluation takes several minutes or more. For example, in material and drug discovery/design applications, it is not uncommon for an evaluation to take multiple hours to days (through high-fidelity simulations or experiments). It is also worth noting that BEACON incurs less computational cost than MaxVar; the major cost of both methods is the training of the GP model and the optimization of the acquisition function. This highlights the importance of the choice of acquisition function (as BEACON consistently outperforms MaxVar) and the computational efficiency of our proposed $\alpha_{\text{NS}}$ in (4), which stems from efficient TS and the simple sort formulation.

Table 1: Average runtime per iteration (in seconds) for each method on the synthetic test problems.

Problem               BEACON   MaxVar   NS-EA   NS-DEA   NS-FS   Sobol   RS
Ackley 4D             0.161    0.142    0.001   0.001    0.001   0.000   0.000
Ackley 8D             0.233    0.339    0.001   0.001    0.001   0.000   0.000
Ackley 12D            0.336    0.777    0.001   0.001    0.001   0.000   0.000
Rosenbrock 4D         0.437    0.804    0.001   0.001    0.001   0.000   0.000
Rosenbrock 8D         0.703    1.180    0.001   0.001    0.001   0.000   0.000
Rosenbrock 12D        1.018    1.293    0.001   0.001    0.001   0.000   0.000
Styblinski-Tang 4D    0.228    0.250    0.001   0.001    0.001   0.000   0.000
Styblinski-Tang 8D    0.540    0.845    0.001   0.001    0.001   0.000   0.000
Styblinski-Tang 12D   1.048    0.916    0.001   0.001    0.001   0.000   0.000

We use BoTorch [44] to develop BEACON, which is released under the MIT license. The synthetic test problems all have implementations in BoTorch, while the real-world problems are all based on publicly available datasets that are cited in Appendix C below. Our code implementation is included in the supplemental material and will be released under the MIT license.

Appendix B Ablation Studies

B.1 Impact of the Size of Behavior Space

In this section, we study the impact of the choice of $\epsilon$ (directly related to the grid size) that defines how nearby points in the outcome space $\mathcal{O}$ are divided into new behaviors $\mathcal{B}$. In practice, a user does not have to select an actual value for $\epsilon$; as long as some "clusters" exist in the outcome space that (once observed) can be treated as behaviors, BEACON will eventually uncover them through exploration of $\mathcal{O}$. Therefore, $\epsilon$ only impacts how we calculate reachability (or, equivalently, the behavior gap). We plot the reachability performance for 3 different grid values (10, 50, 100) across the 4-dimensional versions of the Ackley and Rosenbrock synthetic test problems in Figure 4. BEACON continues to be the best-performing algorithm for all grid sizes, highlighting BEACON's robustness to the choice of $\mathcal{B}$. It is interesting to note that the gap between BEACON and the other algorithms does appear to increase with increasing grid size; however, more analysis would be needed to see if this trend holds for a larger set of problems of varying dimensions and complexity.

[Figure 4: Reachability performance of all methods on the 4D Ackley and Rosenbrock problems for grid sizes of 10, 50, and 100.]

B.2 Impact of Number of Nearest Neighbors

The choice of the number of nearest neighbors $k$ is a hyperparameter of our algorithm. We selected it to be $k=10$ in all case studies, as we found that to be a robust choice. We study the impact of $k$ on the 12-dimensional versions of the synthetic test problems in Figure 5 by calculating performance for $k \in \{1, 5, 10, 20\}$. In this experiment, we use 50 initial datapoints to train the GP and use 25 equally-spaced intervals to divide outcomes into behaviors. Surprisingly, we find that even a choice of $k=1$ does reasonably well on Rosenbrock and Styblinski-Tang, though performance does start to drop for Ackley. We see negligible difference in performance between $k=10$ and $k=20$, so it appears $k=10$ is a sufficiently large value in practice.

[Figure 5: Reachability performance on the 12D synthetic test problems for different numbers of nearest neighbors $k \in \{1, 5, 10, 20\}$.]

B.3 Importance of Filtering Observation Noise

As discussed in Section 4, the presence of noise in the outcome evaluation can lead to challenges in NS algorithms due to miscategorization of behaviors, which we explore in this section. Specifically, we compare the performance of BEACON (Algorithm 1) to a noiseless variant (BEACON-noiseless) in which the acquisition function in (3) is modified by replacing $\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_{i}^{\star})$ with $\boldsymbol{y}_{i}^{\star}$. We perform experiments on the 4d Ackley problem in which 50 initial data points are available, and we divide the outcome space into 50 equally-spaced intervals to calculate reachability. Figure 6 shows the performance of BEACON and BEACON-noiseless for four different values of the noise standard deviation $\sigma$. We see that BEACON outperforms BEACON-noiseless in all cases and that the gap between them widens as $\sigma$ increases. This study emphasizes the importance of accounting for observation noise in NS, which, to our knowledge, has not been considered by existing NS algorithms.

[Figure 6: Reachability performance of BEACON and BEACON-noiseless on the 4D Ackley problem for four values of the noise standard deviation $\sigma$.]

B.4 Hyperparameter Study for NS-EA

NS-EA is one of the most popular NS algorithms, and it does have some hyperparameters that could potentially be tuned to improve performance in specific cases. Our results in Section 5 are based on the default settings of the implementation of NS-EA described in Appendix A. To ensure that these settings are reasonably robust for the problems considered in this work, we studied the impact of two key hyperparameters, namely the population size (default value of 10) and the offspring size (default value of 10), in this section. The reachability performance of NS-EA for three population sizes $\{10, 20, 40\}$ (default offspring size of 10) on the synthetic test problems is shown in Figure 7. We see that the performance is fairly similar across these different population sizes, with smaller values doing slightly better on 4d problems and larger values doing slightly better on 12d problems. We similarly plot the reachability performance of NS-EA for three offspring sizes $\{10, 20, 40\}$ (default population size of 10) on the synthetic test problems in Figure 8. Again, performance is similar in all cases, with the default value yielding equal or better results in most cases. This study suggests that our performance analysis in Section 5 is not sensitive to the particular settings of the benchmark algorithms.

[Figures 7 and 8: Reachability performance of NS-EA on the synthetic test problems for population sizes {10, 20, 40} and offspring sizes {10, 20, 40}, respectively.]

Appendix C Additional Details on Numerical Experiments

C.1 Details on Synthetic Test Problems

Here, we provide specific details for the synthetic test functions.

C.1.1 Ackley

The Ackley function is a widely used benchmark in the global optimization literature due to its highly multi-modal nature [47]. The $D$-dimensional Ackley function, with $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, is defined by

\[
f(\boldsymbol{x}) = -a\exp\left(-b\sqrt{\frac{1}{D}\sum_{i=1}^{D}x_{i}^{2}}\right) - \exp\left(\frac{1}{D}\sum_{i=1}^{D}\cos(cx_{i})\right) + a + \exp(1),
\]

where $a=-20$, $b=0.2$, and $c=2\pi$.

C.1.2 Rosenbrock

The Rosenbrock function is a widely used benchmark in the global optimization literature that is valley-shaped [47]. The $D$-dimensional Rosenbrock function, with $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, is defined by

\[
f(\boldsymbol{x}) = \sum_{i=1}^{D-1}\left[100(x_{i+1}-x_{i}^{2})^{2} + (1-x_{i})^{2}\right].
\]

C.1.3 Styblinski-Tang

The Styblinski-Tang function is a widely used benchmark in the global optimization literature [47]. The $D$-dimensional Styblinski-Tang function, with $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, is defined by

\[
f(\boldsymbol{x}) = \frac{1}{2}\sum_{i=1}^{D}\left(x_{i}^{4} - 16x_{i}^{2} + 5x_{i}\right).
\]
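For reference, the three synthetic test functions above can be written as vectorized NumPy functions. This is a minimal sketch; the constants follow the definitions stated in this appendix.

```python
import numpy as np


def ackley(X, a=-20.0, b=0.2, c=2 * np.pi):
    """X: (N, D) array; uses the constants stated above (note a = -20 here)."""
    D = X.shape[1]
    term1 = -a * np.exp(-b * np.sqrt((X ** 2).sum(axis=1) / D))
    term2 = -np.exp(np.cos(c * X).sum(axis=1) / D)
    return term1 + term2 + a + np.e


def rosenbrock(X):
    """X: (N, D) array; sum of valley-shaped terms over consecutive coordinates."""
    return (100.0 * (X[:, 1:] - X[:, :-1] ** 2) ** 2 + (1.0 - X[:, :-1]) ** 2).sum(axis=1)


def styblinski_tang(X):
    """X: (N, D) array."""
    return 0.5 * (X ** 4 - 16.0 * X ** 2 + 5.0 * X).sum(axis=1)
```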

C.1.4 Multi-Output Plus

We constructed our own synthetic function defined over $D=6$ inputs, i.e., $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, with $n=2$ outputs, i.e., $\boldsymbol{y} = (y_{1},y_{2})$. It is defined as follows:

\begin{align*}
y_{1} &= \sin(x_{1})\cos(x_{2}) + x_{3}\exp(-x_{1}^{2})\cos(x_{1}+x_{2}) + 0.01\sin(x_{4}+x_{5}+x_{6}), \\
y_{2} &= \sin(x_{4})\cos(x_{5}) + x_{6}\exp(-x_{4}^{2})\cos(x_{4}+x_{5}) + 0.01\cos(x_{1}+x_{2}+x_{3}).
\end{align*}

The distribution of outcomes $\mathcal{O}$ based on 10,000 input samples drawn from the space $\mathcal{X} = [-5,5]^{D}$ is shown in Figure 9. Notice how most of the outcomes lie near the center square around the origin, implying jointly long-tailed distributions. This type of distribution is difficult for standard NS algorithms to deal with, as they tend to explore the high-density center zone of the outcome distribution. This is apparent in the reachability performance results shown in the right plot of Figure 9, wherein BEACON outperforms the other algorithms by discovering a larger number of behaviors within a limited evaluation budget (finding nearly 80% while some established NS methods find less than 10%). The reachability metric is computed with respect to 100 intervals defined on a $10 \times 10$ grid over the two outcomes.

[Figure 9: Distribution of outcomes for the Multi-Output Plus function (left) and reachability performance of all methods (right).]
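A minimal NumPy sketch of this two-output function, together with the uniform sampling used to generate the outcome cloud in Figure 9 (the random seed is arbitrary):

```python
import numpy as np


def multi_output_plus(X):
    """X: (N, 6) array of inputs; returns (N, 2) outcomes y = (y1, y2)."""
    x1, x2, x3, x4, x5, x6 = X.T
    y1 = (np.sin(x1) * np.cos(x2) + x3 * np.exp(-x1 ** 2) * np.cos(x1 + x2)
          + 0.01 * np.sin(x4 + x5 + x6))
    y2 = (np.sin(x4) * np.cos(x5) + x6 * np.exp(-x4 ** 2) * np.cos(x4 + x5)
          + 0.01 * np.cos(x1 + x2 + x3))
    return np.stack([y1, y2], axis=1)


# 10,000 uniform samples from X = [-5, 5]^6, as used for the outcome distribution above.
rng = np.random.default_rng(0)
outcomes = multi_output_plus(rng.uniform(-5.0, 5.0, size=(10_000, 6)))
```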

C.2 Metal Organic Framework (MOF) Properties

Here, we provide some additional details for the realistic MOF problems.

C.2.1 Hydrogen uptake capacity

Hydrogen is a promising alternative clean energy source; however, it remains a challenge to store it successfully due to its low volumetric density. MOFs have shown great promise as hydrogen gas carriers due to their tunable surface area and porosity [48]. We explore the use of NS to identify MOFs with a wide range of hydrogen uptake capacities. We consider the dataset from [35], which includes 98,000 total unique MOF structures. For this dataset, one can use $d=7$ key features to represent the MOFs, which are summarized in [35, Table 1] (e.g., density, volumetric surface area, pore volume). The outcome distribution for this case is shown in Figure 10, which exhibits a long tail, making it difficult to explore using naive random sampling strategies. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.

C.2.2 Nitrogen uptake capacity

Natural gas, which is primarily composed of methane, is considered a cleaner alternative to fossil fuels due to its lower carbon dioxide emissions during combustion. Methane, however, typically only constitutes 50% to 85% of the natural gas that is produced from biogas processes and landfills [49]. Nitrogen is a significant component of the remaining mixture whose presence decreases the thermal efficiency of natural gas combustion. Consequently, the separation of nitrogen from natural gas has become an important area of research over the past decade [50]. MOFs have emerged as a promising material for separating nitrogen from natural gas due to their ability to reduce energy cost compared to traditional methods such as cryogenic distillation. We explore the use of NS in this application using the dataset from [36], which includes 5,224 unique MOF structures. We use the same set of 20 descriptors reported in [36, Table 1] to represent the MOFs. The outcome distribution for this case is shown in Figure 10. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.

C.2.3 Sour gas sweetening

Raw natural gas, also commonly referred to as sour natural gas, often contains a significant amount of hydrogen sulfide (H2S). Hydrogen sulfide is a colorless, flammable, and extremely hazardous gas known for its pungent "rotten egg" odor at low concentrations [51]. The removal of hydrogen sulfide from raw natural gas is an important unit operation; MOFs have been considered for adsorption-based separation of H2S due to their potential to improve efficiency and reduce cost [52, 53]. Here, we explore the use of NS to identify MOFs with a wide range of hydrogen sulfide to methane selectivity values. We use the dataset from [37], which includes 1,600 unique MOF structures. Previous work constructed a machine learning model for this system in terms of 12 features; we use the same 12 features summarized in [37, Table 1]. The outcome distribution for this case is shown in Figure 10. The reachability metric is computed over 40 equally-spaced intervals over the outcome space.

[Figure 10: Outcome distributions for the hydrogen uptake, nitrogen uptake, and sour gas sweetening MOF datasets.]

C.2.4 Joint carbon dioxide and methane uptake capacity

MOFs are capable of adsorbing both carbon dioxide and methane. Methane is a promising clean energy fuel alternative to fossil fuels; however, its low volumetric energy density poses challenges in storage applications [54]. Previous works have studied the use of machine learning methods for predicting both methane [55] and carbon dioxide [56] storage capacity. To our knowledge, none of these studies have systematically aimed to identify MOFs that cover different regions of the outcome space. This can be important for ensuring training data is not overly biased toward specific regions of the space. We explore the use of NS for this type of application using the dataset provided by [34], which is available online at: https://github.com/byooooo/dispersant_screening_PAL.git. The joint distribution of outcomes for this dataset is shown in Figure 11, in which we see that the distribution of carbon dioxide uptake rates is heavily skewed toward small values. The reachability metric is computed with respect to 100 intervals defined on a $10 \times 10$ grid over the two outcomes.

[Figure 11: Joint distribution of carbon dioxide and methane uptake capacities for the MOF dataset from [34].]

C.3 Solubility of Organic Molecules

Here, we provide some additional details for the realistic solubility problems.

C.3.1 Water solubility

Solubility in water is a critical property for drug discovery and design. Machine learning models have previously been explored for solubility prediction [57]; however, it is important to note that such models may be trained on data biased toward particular regions of the outcome space. In addition to having a larger spread of solubility values for the purposes of better model training, it can be useful to have solutes with a range of solubility (LogS) values for different applications. Therefore, we consider the use of NS in this application. Specifically, we use the "Water_set_wide" dataset from [39], which contains 900 organic compounds. We use the same 14 features as in [39, Table 1]. The outcome distribution is shown in Figure 12 and the reachability performance for all algorithms is shown in Figure 13, where BEACON achieves an average reachability of 1 within 100 iterations. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.

[Figures 12 and 13: Outcome distributions for the solubility case studies, and reachability performance of all methods on the water solubility problem.]

C.3.2 ESOL

The ESOL dataset is from [40] and consists of 1,128 organic molecules. Using ideas from GAUCHE [31], we develop a tailored GP model using the following Tanimoto-fragprint covariance function

\[
\kappa_{\text{Tanimoto}}(\boldsymbol{x},\boldsymbol{x}^{\prime}) = \theta_{0}\,\frac{\boldsymbol{x}^{\top}\boldsymbol{x}^{\prime}}{\|\boldsymbol{x}\|^{2} + \|\boldsymbol{x}^{\prime}\|^{2} - \boldsymbol{x}^{\top}\boldsymbol{x}^{\prime}},
\]

where $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ are binary vectors corresponding to the 2133-dimensional one-hot encoding representation of the fragprints for every molecule in the dataset and $\|\cdot\|$ is the Euclidean norm. Note that $\kappa_{\text{Tanimoto}}$ involves only a single hyperparameter $\theta_{0}$, which is optimized as described in Appendix A.1. The outcome distribution for the ESOL case is shown in Figure 12, which we see is somewhat skewed to the left. The reachability metric is computed over 50 equally-spaced intervals over the outcome space.
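A minimal standalone NumPy sketch of this covariance function applied to binary fragprint matrices; in practice an equivalent GP kernel is available through GAUCHE [31], and the helper name here is illustrative.

```python
import numpy as np


def tanimoto_kernel(X1, X2, theta0=1.0):
    """X1: (N, 2133), X2: (M, 2133) binary fragprint matrices; returns the (N, M) kernel."""
    inner = X1 @ X2.T                              # x^T x' for every pair
    sq1 = (X1 ** 2).sum(axis=1, keepdims=True)     # ||x||^2
    sq2 = (X2 ** 2).sum(axis=1, keepdims=True).T   # ||x'||^2
    return theta0 * inner / (sq1 + sq2 - inner)    # theta_0 * Tanimoto similarity
```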

C.3.3 LogD

Identifying the physicochemical properties of a candidate molecule is critical in the early stages of drug design. The octanol-water distribution coefficient (LogD) stands out as an important indicator, as it helps estimate the drug distribution within the human body [58]. Traditional ways to measure LogD are known to be difficult and time consuming, which has led to increased interest in developing machine learning-based models to predict LogD for untested molecules [59, 60]. As mentioned previously, it is important to have nicely distributed data in the outcome space to reduce the chance of overfitting (leading to bias in the model predictions). One way to achieve this spread is using NS, though it is also useful for exploring new regions of the outcome space. We use the dataset from [41] containing 2,070 molecules characterized by 125 descriptors. Since this is a high-dimensional problem, we resort to the SAAS function prior [30], which is based on the assumption that the descriptors exhibit a hierarchy of relevance when it comes to outcome prediction. SAAS has recently been shown to be effective at dealing with such high-dimensional molecular feature representations [42]. The complete outcome distribution for the LogD case is shown in Figure 12. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.
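For concreteness, a minimal sketch of fitting a GP with a SAAS prior using BoTorch's fully Bayesian model, which is one standard implementation of the SAAS prior [30]; the NUTS settings shown are illustrative and not necessarily the ones used in our experiments.

```python
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts


def fit_saas_gp(train_X, train_Y):
    """train_X: (N, 125) molecular descriptors, train_Y: (N, 1) LogD values."""
    model = SaasFullyBayesianSingleTaskGP(train_X, train_Y)
    # Infer the kernel hyperparameters (with the sparsity-inducing SAAS prior) via NUTS.
    fit_fully_bayesian_model_nuts(model, warmup_steps=256, num_samples=128, thinning=16)
    return model  # model.posterior(...) now averages over the sampled hyperparameters
```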

Appendix D Additional Numerical Example: Maze Problem

[Figure 14: The large point-maze environment, with the green ball at the starting location and the red ball at the target location.]

As an additional experiment, we explore a complex reinforcement learning (RL) problem involving navigating through a maze. We take the large maze configuration from the OpenAI Gymnasium [61], whose documentation can be found here: https://minari.farama.org/datasets/pointmaze/large/. The goal is to navigate a green ball from a starting location to the target location shown by a red ball in Figure 14 in 300 time steps or fewer. We use a linear control policy that maps the measured states to actions, which is defined by eight parameters that need to be optimized. Novelty search (NS) is well-suited for this type of problem, where the objective landscape can be deceptive (it is easy to get stuck in local optima when measuring only the distance between the final and desired locations). Here, as opposed to reachability, we measure performance as follows:

\[
\text{Reward} = \frac{\text{initial distance from the target} - \text{final distance from the target}}{\text{initial distance from the target}}. \tag{5}
\]

This metric quantifies improvement in proximity toward the target and is ideally equal to 1 (meaning we have exactly landed at the final target). In addition to the previous algorithms, we also consider the classical expected improvement (EI) algorithm [62] under a GP model over the reward function.
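A minimal sketch of the reward in (5) and one possible parameterization of the eight-parameter linear policy (a 2x4 weight matrix mapping the planar position and velocity to a clipped 2D action); the state and action shapes are assumptions for illustration, not the exact implementation.

```python
import numpy as np


def maze_reward(start_pos, final_pos, target_pos):
    """Fractional improvement in distance to the target; equals 1 when the target is reached."""
    d0 = np.linalg.norm(np.asarray(start_pos) - np.asarray(target_pos))
    dT = np.linalg.norm(np.asarray(final_pos) - np.asarray(target_pos))
    return (d0 - dT) / d0


def linear_policy(state, W):
    """Map an assumed 4D point-maze state (x, y, vx, vy) to a 2D action; W: (2, 4)."""
    return np.clip(W @ state, -1.0, 1.0)  # the eight entries of W are the decision variables
```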

The results obtained on the maze problem are shown in Figure 15. On the left, we see that BEACON again outperforms all other algorithms, including the objective-focused EI, which is actually the second worst performer. As seen in the plot on the right, BEACON is the only method that consistently escapes the maze (doing so in all 20 replicates). Interestingly, the other NS approaches are able to escape the maze in some of the replicates, with NS-FS and MaxVar both achieving average scores at or above 0.9. However, as seen from the distribution of scores at the final iteration, there are cases where both of these algorithms underperform and barely break 0.5. This case study underscores the value of an efficient NS perspective; BEACON outperforms EI on the very task that EI was designed for simply due to the complexity of the objective landscape, which makes the GP a poor model for the reward function. We expect that other problems with these types of deceptive characteristics would similarly benefit from BEACON.

[Figure 15: Average reward versus iteration for all methods on the maze problem (left) and distribution of final rewards across the 20 replicates (right).]

Appendix E Scaling BEACON to Large Evaluations Budgets

[Figure 16: Reachability performance of BEACON with an SVGP versus the baseline methods on the 20d Rosenbrock problem over 1000 evaluations.]

The computational cost of BEACON is mostly dominated by the acquisition function optimization step, which requires repeated inference with the underlying GP model of the outcome function. It is well known that the exact GP posterior equations (1) exhibit $O(N^{3})$ scaling due to the inversion of the covariance matrix [63]. Although not expected to be a concern for reasonably small $N$ (on the order of hundreds), this can become a challenge when $N$ is on the order of thousands or more. Even though such large budgets are not expected for all types of expensive systems, there are relevant cases where this could occur, such as when the system is modeled by a high-fidelity simulator that can be easily parallelized. Fortunately, there has been a substantial amount of work on sparse approximate GP models that can be leveraged in such cases. To demonstrate this fact, we perform an experiment with the sparse variational Gaussian process (SVGP) described in [64]. We consider a 20-dimensional version of the Rosenbrock problem where 300 initial data points are available. We use the SVGP implementation available at https://github.com/secondmind-labs/trieste, with a Matérn 5/2 covariance function and 100 inducing points allocated using the $k$-means approach in [64]. Note that the trieste package is built on top of GPflow [65]. The reachability performance results for BEACON using SVGP over 1000 evaluations on the 20d Rosenbrock problem are shown in Figure 16. The reachability metric is computed over 100 equally-spaced intervals over the outcome space. We see that BEACON achieves dramatically better results than all other methods, including finding nearly 3x more diverse behaviors than the standard NS-EA method. It is worth noting that the use of the exact GP within BEACON already begins to slow down considerably at this scale, suggesting that the SVGP provides a nice balance between accuracy and computational cost.

To further investigate the improvements that can be gained in terms of execution time, we perform another experiment involving optimization of our proposed novelty-based acquisition function (4) defined using both exact GP and SVGP models given different amounts of training data. The CPU time to optimize the acquisition function using the Adam solver over 100 iterations (averaged over 10 randomly drawn datasets) is shown in Figure 17. We see that the SVGP exhibits much better scaling than the exact GP, with the CPU time for 5000 points with the SVGP being less than that for 1000 points with the exact GP. These times could be further improved by taking advantage of GPUs; however, the key takeaway is that SVGPs provide an effective path toward efficient NS with large evaluation budgets.
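For illustration, a minimal sketch of building and training an SVGP directly with GPflow [65], with k-means inducing-point initialization in the spirit of [64]. This is a simplified stand-in for the trieste-based implementation described above, and the specific settings (kernel, number of inducing points, optimizer) are illustrative.

```python
import gpflow
import numpy as np
from sklearn.cluster import KMeans


def build_svgp(X, Y, num_inducing=100):
    """X: (N, d), Y: (N, 1) NumPy arrays; returns a trained sparse variational GP."""
    # Initialize the inducing-point locations with k-means cluster centers.
    Z = KMeans(n_clusters=num_inducing, n_init=10).fit(X).cluster_centers_
    kernel = gpflow.kernels.Matern52(lengthscales=np.ones(X.shape[1]))  # ARD Matern 5/2
    model = gpflow.models.SVGP(kernel, gpflow.likelihoods.Gaussian(), Z, num_data=X.shape[0])
    # Maximize the ELBO with respect to the kernel, likelihood, and variational parameters.
    gpflow.optimizers.Scipy().minimize(
        model.training_loss_closure((X, Y)), model.trainable_variables)
    return model
```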

[Figure 17: CPU time to optimize the acquisition function with exact GP and SVGP models as a function of the training set size.]
