BEACON: A Bayesian Optimization Strategy for Novelty Search in Expensive Black-Box Systems (2024)

Wei-Ting Tang
The Ohio State University
tang.1856@osu.edu
Ankush Chakrabarty
Mitsubishi Electric Research Laboratories
achakrabarty@ieee.org
Joel A. Paulson
The Ohio State University
paulson.82@osu.edu

Abstract

Novelty search (NS) refers to a class of exploration algorithms that automatically uncover diverse system behaviors through simulations or experiments. Systematically obtaining diverse outcomes is a key component in many real-world design problems such as material and drug discovery, neural architecture search, reinforcement learning, and robot navigation. Since the relationship between the inputs and outputs (i.e., behaviors) of these complex systems is typically not available in closed form, NS requires a black-box perspective. Consequently, popular NS algorithms rely on evolutionary optimization and other meta-heuristics that require intensive sampling of the input space, which is impractical when the system is expensive to evaluate. We propose a Bayesian optimization inspired algorithm for sample-efficient NS that is specifically designed for such expensive black-box systems. Our approach models the input-to-behavior mapping with multi-output Gaussian processes (MOGP) and selects the next point to evaluate by maximizing a novelty metric that depends on a posterior sample drawn from the MOGP, which promotes both exploration and exploitation. By leveraging advances in efficient posterior sampling and high-dimensional Gaussian process modeling, we discuss how our approach can be made scalable with respect to both the amount of data and the number of inputs. We test our approach on ten synthetic benchmark problems and eight real-world problems (with up to 2133 inputs) including new applications such as discovery of diverse metal organic frameworks for use in clean energy technology. We show that our approach greatly outperforms existing NS algorithms by finding substantially larger sets of diverse behaviors under limited sample budgets.

1 Introduction

Search (or optimization) algorithms are used within virtually every field of science and engineering to automatically find a set of high-quality solutions from a (potentially infinite) set of candidates. Most optimization algorithms measure quality as determined by the user, who must define one or more so-called “objective functions” that can be used to rank the candidates [1]. A particularly hard class of problems is when the objective functions are defined as the output of a black-box system (unknown structure, without gradient access) that is noisy and expensive to evaluate. Such problems can be tackled with Bayesian optimization (BO), which has been shown to empirically outperform other derivative-free global optimization methods when the objectives exhibit these challenging characteristics [2, 3]. BO, however, requires an appropriate choice of the objective functions in advance and, unfortunately, the best choice of objectives is not always obvious, especially when simultaneously optimizing several properties (using, e.g., multi-objective BO [4]). Furthermore, the choice of objectives in BO can strongly bias the selection of points. If these objectives are not properly chosen, they may result in missing key (hidden) states or behaviors in the system, which are crucial for unexpected discoveries to occur during the sampling process.

Novelty search (NS) is an idea for overcoming the need to select pre-defined objective functions [5, 6]; it suggests abandoning the goal of improving specific performance metrics and instead intentionally searching for diverse system outcomes. NS-like methods have been shown to be useful in several important real-world application areas including scientific discovery [7], material design [8], and reinforcement learning [9]. The NS perspective has also been found to be very useful in so-called “deceptive problems” wherein the optimization path taken toward a specific objective can easily get stuck in suboptimal solutions. Since the mapping between inputs and system outcomes is generally a black box, established NS methods rely on meta-heuristics, such as evolutionary algorithms, to select new evaluation points. However, these methods are known to be sample inefficient, which limits their use on expensive-to-evaluate systems. Since many NS applications of interest involve expensive evaluations (e.g., experimental discovery of new out-of-trend drug compounds), we need to develop more efficient alternatives for NS. We seek an algorithm capable of achieving high efficiency by preserving the essence of BO while avoiding the need for an explicit goal-oriented objective function.

This paper proposes a new BO-inspired active learning approach for NS. Just like standard BO, we construct Gaussian process (GP) surrogate models [10] for the unknown functions. In our case, however, the unknown functions are not objectives but instead represent outcomes that a user wishes to explore. The outcome space can be freely defined by the user, meaning it can be a vector of as many elements as desired (e.g., the efficacy, size, synthesizability, and stability of a drug compound).

Our main contributions are summarized as follows:

  • We introduce a new active learning approach, referred to as BEACON (Bayesian Exploration Algorithm for outCOme Novelty), for NS over noisy expensive black-box systems that aims to discover unseen behaviors of the system using a minimal number of evaluations.

  • We propose a novel Thompson sampling-based acquisition function for NS applications that, to our knowledge, is the first to address the exploration-exploitation tradeoff, handle stochastic observation noise, and remain amenable to gradient-based optimization.

  • We discuss two different strategies for extending BEACON to high-dimensional problems, including a general approach that uses fully Bayesian sparsity-inducing function priors and a specialized approach for chemistry applications.

  • We conduct extensive experiments on several synthetic and real-world problems that demonstrate the substantial benefits BEACON can achieve, in terms of identified behaviors, over relevant existing NS works. These include a first-time application of NS to the discovery of metal organic frameworks, which are useful materials in clean energy applications. We also consider challenging molecular discovery benchmark problems with up to 2133 dimensions.

2 Related Work

Our work is inspired by the BO framework for global optimization of expensive and noisy black-box functions that originated from [11, 12]. It has regained popularity in recent years due to its outstanding performance on tasks such as hyperparameter tuning in machine learning methods [13]. Readers are referred to [2] and [3] for a recent review and tutorial introduction to BO. BEACON is similar to multi-objective BO [4] in the sense that both develop multi-output GP models for the different functions; however, they differ substantially in their choice of acquisition functions. Multi-objective BO aims to learn the Pareto front between competing objectives while BEACON aims to fully explore unseen regions of the outcome space. We argue, similarly to [7], that the latter is more appropriate in discovery applications. There has been recent work on improving solution diversity in BO through the use of a user-defined diversity constraint [14]. Although similar in spirit to BEACON, this method still uses a specific objective function to guide the search and focuses on a local version of BO. BEACON is an objective-free approach aimed at global search of the design space.

Our work directly builds upon the NS method proposed in [6] (which we refer to as NS-EA), which optimizes a distance-based novelty metric using a traditional evolutionary algorithm through random evolution of a maintained population of points. Since this initial work, there have been many research efforts to deploy NS to solve previously unsolved deceptive problems [15, 16, 17]. There have also been extensions of NS-EA that simultaneously assess novelty and fitness [18]. NS is also closely related to so-called “illumination algorithms” such as MAP-Elites [19], which maintains an archive of diverse solutions with respect to different behaviors of interest. The major downside of all of these methods is that they (i) rely on sample-inefficient strategies to search the input space, so they are not applicable to expensive systems, and (ii) assume the observations are noise-free, so they cannot handle intrinsic noise in the system. BEACON addresses both of these limitations in NS (for the first time, to our knowledge) by leveraging ideas from the BO literature.

3 Problem Setup

We consider the behavior of a system to be characterized by a vector-valued black-box function $\boldsymbol{f}:\mathcal{X}\to\mathcal{O}$ with $\boldsymbol{f}=(f^{(1)},\ldots,f^{(n)})$ that maps an input space $\mathcal{X}$ to a possibly multi-output outcome space $\mathcal{O}$, which are, respectively, compact subsets of $\mathbb{R}^{d}$ and $\mathbb{R}^{n}$. Although $\mathcal{O}$ can be a continuous set, we assume that neighboring values in outcome space share similar behaviors. For example, robot morphologies with similar height, weight, and energy consumption per distance moved are treated as being from the same group. To express this mathematically, we define a discrete (finite) behavior space $\mathcal{B}$ that is an $\epsilon$-cover of $\mathcal{O}$, i.e., for some $\epsilon>0$, $\forall\,\boldsymbol{y}\in\mathcal{O},\ \exists\,\boldsymbol{y}'\in\mathcal{B}$ such that $\|\boldsymbol{y}-\boldsymbol{y}'\|\leq\epsilon$.¹

¹ Since our proposed algorithm works directly in outcome space, the exact choice of $\epsilon$ does not impact the search process (but does impact how performance is measured). We examine how $\epsilon$ impacts performance in Appendix B.1, where we find that our method is fairly robust to the neighborhood size.

We are interested in identifying inputs that cover/spread as much of the behavior space $\mathcal{B}$ as possible. Since $\boldsymbol{f}$ is assumed to be expensive to evaluate and unmodeled, we look to accomplish this task by sequentially evaluating the outcome function (through simulations or experiments) over a finite and discrete number of samples indexed by $t=1,\ldots,T$. For every query point $\boldsymbol{x}_t$, we receive a noisy measurement of the outcome function $\boldsymbol{y}_t=\boldsymbol{f}(\boldsymbol{x}_t)+\boldsymbol{\eta}_t$, where $\boldsymbol{\eta}_t\sim\mathcal{N}(0,\sigma^{2}\mathbf{I}_n)$ is Gaussian noise. Letting $\boldsymbol{\varphi}:\mathcal{O}\to\mathcal{B}$ be the function that maps the outcome of the system to a particular behavior, we introduce the concept of the behavior gap $\text{BG}_t = 1 - |\{\boldsymbol{\varphi}(\boldsymbol{f}(\boldsymbol{x}_i))\}_{i=1}^{t}|/|\mathcal{B}|$, which measures the fraction of unobserved behaviors in the system. We seek to minimize the cumulative behavior gap $\sum_{t=1}^{T}\text{BG}_t$ or, equivalently, maximize the fraction of unique behaviors observed at every iteration.
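To make the behavior-gap bookkeeping concrete, the following minimal Python sketch bins a one-dimensional outcome space into $\epsilon$-sized cells and computes $\text{BG}_t$ from a set of observed outcomes; the bounds, bin count, and function names are illustrative rather than taken from our implementation.

```python
import numpy as np

def behavior_ids(outcomes, lo, hi, n_bins):
    """Map scalar outcomes to discrete behavior indices by binning [lo, hi] into n_bins cells."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # np.digitize returns 1..n_bins for values inside [lo, hi]; clip guards the endpoints.
    return np.clip(np.digitize(outcomes, edges), 1, n_bins)

def behavior_gap(outcomes, lo, hi, n_bins):
    """BG_t = 1 - (# unique behaviors observed) / |B|."""
    return 1.0 - len(set(behavior_ids(outcomes, lo, hi, n_bins))) / n_bins

# Toy usage: 5 noisy outcomes in [0, 10] with 25 behavior cells.
y_observed = np.array([0.3, 0.4, 2.2, 7.9, 8.1])
print(behavior_gap(y_observed, lo=0.0, hi=10.0, n_bins=25))  # fraction of behaviors still unseen
```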

As commonly done in the BO literature, we will model the black-box function $\boldsymbol{f}$ as being drawn from a multi-output Gaussian process (MOGP) prior [20]. This is a valid assumption as long as $\boldsymbol{f}$ satisfies certain smoothness properties such as belonging to some Reproducing Kernel Hilbert Space (RKHS) [21]. The MOGP can be used to derive a posterior distribution $\mathbb{P}(\boldsymbol{f}|\mathcal{D}_t)$ over the true function $\boldsymbol{f}$ given observed data $\mathcal{D}_t=\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^{t}$. The posterior mean and covariance function can still be calculated analytically in the multi-output case, as discussed further in the next section.

4 Proposed Method

This section describes our approach. We first present the statistical model of the outcome function that accounts for its multi-output structure. We then introduce our acquisition function, defined in terms of a Thompson sample, which leverages the inherent randomness in the statistical model to push the query points toward discovering new system behaviors. After summarizing our proposed algorithm, we describe how our acquisition function can be efficiently maximized and end this section by extending the method to high-dimensional problems through the incorporation of strong priors.

4.1 Multi-Output Gaussian Processes

Gaussian process (GP) models are one of the most popular non-parametric regression approaches in machine learning [10]. The learning process operates by assuming the function values at any collection of inputs are multivariate Gaussian random variables. A GP prior is fully specified by a prior mean function $\mu$ and covariance (or kernel) function $\kappa$. A general way to express a MOGP for $\boldsymbol{f}$ is to define a function $h(\boldsymbol{x},j)=f^{(j)}(\boldsymbol{x})$ that takes an additional input $j\in\mathcal{J}=\{1,\ldots,n\}$, which merely returns the corresponding outcome function. We can now equivalently represent the MOGP prior as a single-output GP prior $h\sim\mathcal{GP}(\mu,\kappa)$ over the extended input space $\mathcal{X}\times\mathcal{J}$, e.g., [22]. Since the noise is Gaussian, the posterior $h|\mathcal{A}\sim\mathcal{GP}(\mu_{\mathcal{A}},\kappa_{\mathcal{A}})$ for any collection of $N$ observations $\mathcal{A}=\{(\boldsymbol{x}_i,j_i,y_i)\}_{i=1}^{N}$ from this GP, where $\boldsymbol{x}_i\in\mathcal{X}$, $j_i\in\mathcal{J}$, and $y_i=h(\boldsymbol{x}_i,j_i)+\eta_i^{(j)}\in\mathbb{R}$, remains a GP with mean $\mu_{\mathcal{A}}$ and covariance $\kappa_{\mathcal{A}}$ given by

$$\mu_{\mathcal{A}}(\boldsymbol{x},j) = \mu(\boldsymbol{x},j) + \boldsymbol{\kappa}_{\mathcal{A}}^{\top}(\boldsymbol{x},j)\left(\boldsymbol{K}_{\mathcal{A}}+\sigma^{2}\mathbf{I}_{N}\right)^{-1}\left(\boldsymbol{y}-\boldsymbol{\mu}_{\mathcal{A}}\right), \tag{1a}$$
$$\kappa_{\mathcal{A}}((\boldsymbol{x},j),(\boldsymbol{x}',j')) = \kappa((\boldsymbol{x},j),(\boldsymbol{x}',j')) - \boldsymbol{\kappa}_{\mathcal{A}}^{\top}(\boldsymbol{x},j)\left(\boldsymbol{K}_{\mathcal{A}}+\sigma^{2}\mathbf{I}_{N}\right)^{-1}\boldsymbol{\kappa}_{\mathcal{A}}(\boldsymbol{x}',j'), \tag{1b}$$

where $\boldsymbol{y}=[y_1,\ldots,y_N]^{\top}$, $\boldsymbol{\mu}_{\mathcal{A}}=[\mu(\boldsymbol{x}_1,j_1),\ldots,\mu(\boldsymbol{x}_N,j_N)]^{\top}$, $\boldsymbol{\kappa}_{\mathcal{A}}(\boldsymbol{x},j)\in\mathbb{R}^{N}$ is the vector of covariance values between the test input $(\boldsymbol{x},j)$ and the observed inputs in $\mathcal{A}$, and $\boldsymbol{K}_{\mathcal{A}}\in\mathbb{R}^{N\times N}$ is the covariance matrix between all observed inputs in $\mathcal{A}$.
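As a concrete illustration of Eq. (1), the sketch below implements the extended-input posterior mean and covariance in plain NumPy with a zero prior mean; the squared-exponential input kernel and the constant cross-task correlation are placeholder modeling choices, not the kernels used in our experiments.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel on the continuous part of the extended input (x, j)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def extended_kernel(Xa, ja, Xb, jb, task_corr=0.5):
    """kappa((x, j), (x', j')): product of an input kernel and a simple task correlation."""
    task = np.where(ja[:, None] == jb[None, :], 1.0, task_corr)
    return rbf(Xa, Xb) * task

def posterior(X, j, y, Xs, js, noise=1e-2):
    """Posterior mean/covariance from Eq. (1) with a zero prior mean."""
    K = extended_kernel(X, j, X, j) + noise * np.eye(len(y))
    Ks = extended_kernel(Xs, js, X, j)            # kappa_A(x*, j*) stacked row-wise
    Kss = extended_kernel(Xs, js, Xs, js)
    sol = np.linalg.solve(K, y)
    mean = Ks @ sol                               # Eq. (1a) with mu = 0
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)     # Eq. (1b)
    return mean, cov

# Toy usage: two outputs (n = 2) observed at a handful of inputs.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(6, 2)); j = np.array([1, 1, 1, 2, 2, 2])
y = np.where(j == 1, np.sin(4 * X[:, 0]), np.cos(4 * X[:, 1])) + 0.05 * rng.standard_normal(6)
m, C = posterior(X, j, y, Xs=rng.uniform(0, 1, size=(3, 2)), js=np.array([1, 2, 1]))
```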

4.2 A Thompson Sampling-based Acquisition Function for Novelty Search

We are interested in discovering new behaviors by exploring unseen regions of the outcome space $\mathcal{O}$. An established approach for tackling this problem is to attempt to maximize a novelty metric that biases the search process toward observing new things. Although there are many different ways to measure novelty, a popular approach is to evaluate the average distance to the $k$-nearest neighbors of observed outcomes given any set of $N$ datapoints $\mathcal{D}=\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^{N}$:

$$\rho(\boldsymbol{x}|\mathcal{D}) = \frac{1}{k}\sum_{i=1}^{k}\text{dist}(\boldsymbol{f}(\boldsymbol{x}),\boldsymbol{y}_i^{\star}), \tag{2}$$

where $\{\boldsymbol{y}_1^{\star},\ldots,\boldsymbol{y}_k^{\star}\}\subset\{\boldsymbol{y}_1,\ldots,\boldsymbol{y}_N\}$ denotes the set of $k$ closest outcomes to $\boldsymbol{f}(\boldsymbol{x})$ in the outcome space and $\text{dist}(\cdot)$ is any valid distance metric defined over the outcome space $\mathcal{O}$. Although $\rho(\boldsymbol{x}|\mathcal{D})$ is intuitively a good metric, it exhibits two major challenges in our problem setting. First, $\boldsymbol{f}(\boldsymbol{x})$ is a black-box function, so we cannot actually compute $\rho(\boldsymbol{x}|\mathcal{D})$ until after an outcome function evaluation has taken place. Standard NS methods rely on evolutionary algorithms to attempt to maximize $\rho(\boldsymbol{x}|\mathcal{D})$, which require too many function evaluations for expensive $\boldsymbol{f}$. Second, this metric can degrade in the presence of noisy evaluations, since the true behavior of any observed outcome might be miscategorized, i.e., there may exist realizations of $\boldsymbol{\eta}$ such that $\boldsymbol{\varphi}(\boldsymbol{f}(\boldsymbol{x})+\boldsymbol{\eta})\neq\boldsymbol{\varphi}(\boldsymbol{f}(\boldsymbol{x}))$. We analyze how noise degrades NS performance in Appendix B.3.
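For intuition, the novelty metric in Eq. (2) can be computed in a few lines once outcomes have been observed; the sketch below assumes a Euclidean distance and is purely illustrative.

```python
import numpy as np

def novelty(y_candidate, Y_observed, k=10):
    """Average Euclidean distance from a candidate outcome to its k nearest observed outcomes (Eq. 2)."""
    d = np.linalg.norm(Y_observed - y_candidate, axis=1)
    k = min(k, len(d))
    return np.sort(d)[:k].mean()

Y = np.random.rand(50, 2)                  # previously observed 2-D outcomes
print(novelty(np.array([2.0, 2.0]), Y))    # far from the observed cloud -> large novelty
print(novelty(np.array([0.5, 0.5]), Y))    # inside the observed cloud  -> small novelty
```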

We can overcome these challenges with a MOGP model $\boldsymbol{f}|\mathcal{D}\sim\mathcal{MOGP}(\boldsymbol{\mu}_{\mathcal{D}},\boldsymbol{\kappa}_{\mathcal{D}})$, where $\boldsymbol{\mu}_{\mathcal{D}}(\cdot)=(\mu_{\mathcal{D}}(\cdot,1),\ldots,\mu_{\mathcal{D}}(\cdot,n))$ is a vector function and $\boldsymbol{\kappa}_{\mathcal{D}}(\cdot,\cdot)$ is a matrix function with elements $[\boldsymbol{\kappa}_{\mathcal{D}}(\cdot,\cdot)]_{i,j}=\kappa_{\mathcal{D}}((\cdot,i),(\cdot,j))$ for all $i,j\in\{1,\ldots,n\}$. The MOGP serves two key purposes: (i) it enables us to filter observation noise by replacing measured data with a surrogate prediction and (ii) it provides a way for us to fantasize future outcome realizations while accounting for uncertainty. The latter step is accomplished through Thompson sampling (TS), which is a classical strategy for decision-making under uncertainty [23]. Our acquisition function can thus be expressed as follows:

$$\alpha_{\text{NS}}(\boldsymbol{x}|\boldsymbol{g},\mathcal{D}) = \frac{1}{k}\sum_{i=1}^{k}\text{dist}(\boldsymbol{g}(\boldsymbol{x}),\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_i^{\star})), \tag{3}$$

where $\boldsymbol{g}(\boldsymbol{x})\sim\boldsymbol{f}|\mathcal{D}$ is a sample function realization from the MOGP posterior and $\{\boldsymbol{x}_1^{\star},\ldots,\boldsymbol{x}_k^{\star}\}\subset\{\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N\}$ is the set of $k$ observed inputs whose posterior mean predictions $\{\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_1),\ldots,\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_N)\}$ are closest to $\boldsymbol{g}(\boldsymbol{x})$. Note that we use the method from [24] to efficiently generate posterior samples of the MOGP, which combines weight- and function-space views to perform decoupled sampling. This directly provides differentiable, high-accuracy approximations of $\boldsymbol{g}$; we discuss a differentiable reformulation of the $k$-nearest-neighbor computation in Section 4.4.
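The following single-output, grid-based sketch illustrates the mechanics of Eq. (3): draw one Thompson sample from the GP posterior on a candidate grid, score each candidate by its average distance to the $k$ nearest posterior-mean outcomes, and pick the maximizer. Our actual implementation relies on decoupled pathwise sampling [24] and continuous optimization rather than a grid, so all numerical choices below are illustrative.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

# Toy 1-D, single-output setting: observed data and a dense candidate grid.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 8); y = np.sin(6 * X) + 0.05 * rng.standard_normal(8)
Xc = np.linspace(0, 1, 200)
noise, k = 1e-2, 3

K = rbf(X, X) + noise * np.eye(len(X))
sol = np.linalg.solve(K, y)
mu_obs = rbf(X, X) @ sol                                  # denoised outcomes at observed inputs
Kc = rbf(Xc, X)
mu_c = Kc @ sol                                           # posterior mean on the candidate grid
cov_c = rbf(Xc, Xc) - Kc @ np.linalg.solve(K, Kc.T)

# Thompson sample g ~ f | D on the grid, then score each candidate via Eq. (3).
g = rng.multivariate_normal(mu_c, cov_c + 1e-8 * np.eye(len(Xc)))
alpha = np.array([np.sort(np.abs(mu_obs - gx))[:k].mean() for gx in g])
x_next = Xc[np.argmax(alpha)]                             # next query point
print(x_next)
```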

4.3 The BEACON Algorithm

Algorithm 1, which we dub the ‘Bayesian Exploration Algorithm for outCOme Novelty’ (BEACON), sequentially combines the ideas from the previous section. A visual illustration of BEACON applied to a simple test problem is shown in Figure 1. Inspired by [25], we present an asynchronous form of BEACON that takes advantage of $M\geq 1$ parallel workers wherein, once a worker finishes a job, it becomes available to begin a new evaluation. This form is useful in practice since we are motivated by problems, such as material discovery, that can exhibit asynchronicity.

Algorithm 1: Asynchronous BEACON

1: Input: Outcome function $\boldsymbol{f}$, input domain $\mathcal{X}$, initial data $\mathcal{D}$, budget $T$, number of workers $M$, number of nearest neighbors $k$, outcome distance metric $\text{dist}(\cdot)$, MOGP prior $\mathcal{MOGP}(\boldsymbol{\mu},\boldsymbol{\kappa})$.
2: Initialize: Data $\mathcal{D}_0\leftarrow\mathcal{D}$ and surrogate model $\mathcal{MOGP}_0\leftarrow\mathcal{MOGP}(\boldsymbol{\mu}_{\mathcal{D}_0},\boldsymbol{\kappa}_{\mathcal{D}_0})$.
3: for $t=1,2,\ldots,T$ do
4:   if $t>M$ then
5:     Wait for a worker to finish its evaluation.
6:     $\mathcal{D}_t\leftarrow\mathcal{D}_{t-1}\cup\{(\boldsymbol{x}',\boldsymbol{y}')\}$, where $\boldsymbol{x}'$ is the worker's previous query point and $\boldsymbol{y}'=\boldsymbol{f}(\boldsymbol{x}')+\boldsymbol{\eta}$ is its corresponding potentially noisy observation.
7:     Update surrogate model with new data: $\mathcal{MOGP}_t\leftarrow\mathcal{MOGP}(\boldsymbol{\mu}_{\mathcal{D}_t},\boldsymbol{\kappa}_{\mathcal{D}_t})$.
8:   else (at least one worker has not been assigned a job)
9:     $\mathcal{D}_t\leftarrow\mathcal{D}_{t-1}$ and $\mathcal{MOGP}_t\leftarrow\mathcal{MOGP}_{t-1}$.
10:  end if
11:  $\boldsymbol{x}_t\leftarrow\operatorname{argmax}_{\boldsymbol{x}\in\mathcal{X}}\alpha_{\text{NS}}(\boldsymbol{x}|\boldsymbol{g},\mathcal{D}_t)$, where $\boldsymbol{g}\sim\mathcal{MOGP}_t$ is a posterior sample.
12:  Re-deploy worker to evaluate $\boldsymbol{f}$ at $\boldsymbol{x}_t$.
13: end for
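For readers who prefer code to pseudocode, the sketch below runs a sequential ($M=1$) variant of the BEACON loop over a finite candidate pool with a simple single-output GP; the stand-in outcome function, kernel, and hyperparameters are illustrative and not those used in our experiments.

```python
import numpy as np

def rbf(A, B, ls=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def f_true(X):                           # expensive black-box outcome (here: a cheap stand-in)
    return np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])

rng = np.random.default_rng(1)
Xcand = rng.uniform(0, 2, size=(500, 2))             # discrete candidate pool (e.g., a materials library)
idx = list(rng.choice(500, 5, replace=False))        # initial design
X, y = Xcand[idx], f_true(Xcand[idx]) + 0.05 * rng.standard_normal(5)
noise, k, T = 1e-2, 3, 30

for t in range(T):
    K = rbf(X, X) + noise * np.eye(len(y))
    sol = np.linalg.solve(K, y)
    mu_obs = rbf(X, X) @ sol                          # denoised outcomes at observed inputs
    Kc = rbf(Xcand, X)
    mu_c = Kc @ sol
    cov_c = rbf(Xcand, Xcand) - Kc @ np.linalg.solve(K, Kc.T)
    g = rng.multivariate_normal(mu_c, cov_c + 1e-8 * np.eye(len(Xcand)))   # Thompson sample
    alpha = np.array([np.sort(np.abs(mu_obs - gi))[:k].mean() for gi in g])
    alpha[idx] = -np.inf                              # do not re-query evaluated candidates
    j = int(np.argmax(alpha)); idx.append(j)          # maximize the NS acquisition (line 11)
    X = np.vstack([X, Xcand[j]])                      # evaluate and append the noisy observation
    y = np.append(y, f_true(Xcand[j:j + 1]) + 0.05 * rng.standard_normal())
```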

[Figure 1: visual illustration of BEACON applied to a simple test problem.]

4.4 Gradient-based Maximization of Proposed Acquisition Function

An important step in Algorithm 1 is the maximization of our proposed acquisition function $\alpha_{\text{NS}}$ in line 11. To ensure this step can be carried out efficiently, we would like to take advantage of gradient-based deterministic optimization algorithms, such as L-BFGS-B [26], with multiple randomly generated restarts. It turns out that this is possible by re-expressing (3) in terms of the sort operator:

$$\alpha_{\text{NS}}(\boldsymbol{x}|\boldsymbol{g},\mathcal{D}) = \frac{1}{k}\,\mathbf{e}_k^{\top}\,\text{sort}\left(\left[\text{dist}(\boldsymbol{g}(\boldsymbol{x}),\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_1)),\ldots,\text{dist}(\boldsymbol{g}(\boldsymbol{x}),\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_N))\right]^{\top}\right), \tag{4}$$

where $\mathbf{e}_k$ is a vector whose first $k$ entries are equal to 1 and whose remaining entries are equal to 0, and $\text{sort}(\cdot)$ denotes a mapping that sorts its input in ascending order, so that the first $k$ entries correspond to the $k$-nearest neighbors. As discussed in [27], the standard sort operator is continuous and almost everywhere differentiable (with non-zero gradients).
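A minimal PyTorch sketch of Eq. (4) is given below; it checks that gradients propagate through the sort back to the candidate input, using a toy differentiable stand-in for the posterior sample $\boldsymbol{g}(\boldsymbol{x})$.

```python
import torch

def alpha_ns(g_x, mu_obs, k=10):
    """Eq. (4): average of the k smallest distances, written with a sort so that
    autograd can propagate gradients through the (almost everywhere differentiable) sort."""
    d = torch.linalg.vector_norm(mu_obs - g_x, dim=-1)   # distances to posterior-mean outcomes
    d_sorted, _ = torch.sort(d)                          # ascending order
    e_k = torch.zeros_like(d_sorted); e_k[:k] = 1.0
    return (e_k * d_sorted).sum() / k

# Toy check that gradients flow back to the candidate input through a differentiable g(x).
mu_obs = torch.randn(50, 2)                              # posterior means of observed outcomes
x = torch.tensor([0.2, 0.8], requires_grad=True)
W = torch.randn(2, 2)
g_x = torch.tanh(x @ W)                                  # stand-in for a pathwise posterior sample g(x)
alpha_ns(g_x, mu_obs, k=5).backward()
print(x.grad)                                            # non-zero gradient -> usable with L-BFGS-B
```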

The cost of the sort operator in (4) grows with the size of its input, which can become significant for large $N$. Although we are motivated by such applications, it is worth noting that this cost can be reduced by only keeping track of outcomes that lead to unique behaviors and then calculating the $k$-nearest neighbors among only those outcomes. The size of this distance vector is at most $|\mathcal{B}|$, which is constant (independent of $N$). In Appendix E, we show how combining this idea with sparse GP approximations allows BEACON to scale to high evaluation budgets (on the order of 10,000 or more).
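A hedged sketch of this pruning step: collapse the posterior-mean outcomes to one representative per occupied $\epsilon$-sized behavior cell before forming the distance vector (the gridding scheme here is illustrative).

```python
import numpy as np

def behavior_representatives(mu_obs, eps=0.5):
    """Collapse posterior-mean outcomes to one representative per occupied behavior cell,
    so the distance vector passed to the sort in Eq. (4) has length at most |B|."""
    cells = {}
    for m in mu_obs:
        key = tuple(np.floor(m / eps).astype(int))   # epsilon-sized behavior cell
        cells.setdefault(key, m)                     # keep the first outcome seen in each cell
    return np.stack(list(cells.values()))

mu_obs = np.random.rand(10_000, 2) * 3               # many observed outcomes, few distinct behaviors
print(behavior_representatives(mu_obs).shape)        # at most (3 / 0.5)^2 = 36 representatives
```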

4.5 Tackling High-Dimensional Problems using Strong Priors

An advantage of BEACON is its close relationship to the BO framework, which lets it leverage recent advances that have produced significant performance boosts in various settings. We describe two particularly useful examples in this section that can help BEACON scale to high-dimensional problems ($d\gg 10$) when certain substructure in $\boldsymbol{f}$ can be exploited. Other advances, such as accounting for unknown GP hyperparameters [28] and black-box constraints [29], can in principle also be incorporated into BEACON with relatively minor modifications.

SAAS.

The sparse axis-aligned subspace (SAAS) function prior was introduced in [30] as a straightforward way to introduce a flexible amount of sparsity into the GP surrogate model. The core assumption in SAAS is that the inputs exhibit a hierarchy of relevance (i.e., certain inputs are more important than others for predicting $\boldsymbol{f}$). SAAS achieves this by enforcing a strong prior on the lengthscale hyperparameters of the GP, with the default position being that any given dimension is mostly unimportant. Since it is a general-purpose prior, we recommend using SAAS as the default choice for problems with a large number of inputs. We study the impact of SAAS on a high-dimensional molecular discovery problem in Section 5.3.
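As a rough sketch of how a SAAS surrogate could be slotted into BEACON, the snippet below fits BoTorch's fully Bayesian SAAS GP on a toy high-dimensional problem; the class and function names follow recent BoTorch releases and may differ across versions, so treat this as an assumption rather than our exact implementation.

```python
# Hedged sketch: swap a SAAS-prior GP into the surrogate-fitting step (names assume recent BoTorch).
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts

train_X = torch.rand(30, 100, dtype=torch.double)           # 100-dimensional inputs, few samples
train_Y = train_X[:, :3].sum(dim=-1, keepdim=True)          # only 3 dimensions actually matter
train_Y = train_Y + 0.01 * torch.randn_like(train_Y)

model = SaasFullyBayesianSingleTaskGP(train_X, train_Y)
fit_fully_bayesian_model_nuts(model, warmup_steps=256, num_samples=128, thinning=16)
# The inferred lengthscales concentrate on the few relevant dimensions; posterior samples from this
# surrogate can then play the role of g in the BEACON acquisition.
```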

GAUCHE.

The SAAS prior may not perform well if the hierarchy-of-relevance assumption is not satisfied. This is likely to be the case when the high-dimensional inputs are highly correlated, such as image data, in which the pixels typically exhibit strong spatial correlation. In such cases, one must resort to an alternative approach. Since our work is partly motivated by material and molecule discovery problems, we briefly describe an interesting alternative suited to these applications. Specifically, the GAUCHE library was recently developed in [31]; it provides a GP framework specifically designed for non-continuous molecular inputs (including protein and chemical reaction representations). We can therefore directly port over any such representation in GAUCHE to be used by BEACON, which we explore on an aqueous solubility problem in Section 5.3.
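For reference, the Tanimoto similarity underlying fingerprint-style kernels has a simple closed form; the NumPy sketch below is illustrative (GAUCHE provides its own kernel implementations).

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto (Jaccard) similarity between binary fingerprint/fragprint vectors:
    k(x, x') = <x, x'> / (||x||^2 + ||x'||^2 - <x, x'>)."""
    inner = A @ B.T
    sq = (A ** 2).sum(-1)[:, None] + (B ** 2).sum(-1)[None, :]
    return inner / (sq - inner)

# Random 2133-bit fragprint-style vectors (dimension borrowed from the ESOL experiment for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4, 2133)).astype(float)
print(tanimoto_kernel(X, X))    # symmetric, ones on the diagonal
```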

5 Numerical Experiments

In this section, we compare BEACON against established NS algorithms and their variants, including the standard evolutionary-based method (NS-EA) [6], a distance-enhanced evolutionary method (NS-DEA) [32], and a feature space-based method (NS-FS). We also compare with the performance of three other algorithms: one that chooses points uniformly at random over $\mathcal{X}$ (RS); one that chooses points using low-discrepancy quasi-random Sobol samples over $\mathcal{X}$ (Sobol); and a maximum-variance active learning strategy (MaxVar). Implementation details for BEACON and the benchmark algorithms are provided in Appendix A.

We evaluate performance on 10 synthetic and 8 real-world problems, which are described below or in Appendices C and D (the latter covers a reinforcement learning problem for maze navigation). In all problems, we generate an initial set of 10 points uniformly at random over $\mathcal{X}$ as a first stage to seed the algorithms. Then, we use the algorithms to select an additional 80 to 300 points for evaluation, as depicted in the figures below. All experiments are replicated 20 times. We use reachability as our performance metric, defined as $\text{Reach}_t = 1 - \text{BG}_t$ (the fraction of total behaviors observed). Figures 2 and 3 plot the mean of $\text{Reach}_t$ plus and minus one standard deviation computed over the independent replicates. Information on runtimes is provided in the supplement, and the code needed to replicate our results is provided in a supplemental zip file.

5.1 Synthetic Functions

We create three synthetic test functions from widely used benchmark functions in the global optimization literature – Ackley, Rosenbrock, and Styblinski-Tang – by adapting them to the NS problem setting. We treat the output of these functions as our outcomes and partition the range of outcomes into 25 equally-spaced intervals to define the corresponding behaviors; further details of these functions are provided in Appendix C. The reachability performance of all algorithms across 9 problems (3 synthetic functions, each evaluated at input dimensions $d\in\{4,8,12\}$) is shown in Figure 2, as is the benchmark-construction sketch below. We see that BEACON consistently outperforms the other algorithms in all cases, achieving near-perfect reachability values of 1 in several cases. We further see that the performance gap increases as problem dimension increases. To test BEACON in the multi-output setting, we also created a challenging Multi-Output Plus function with a long-tail joint distribution. The results are shown in Appendix C.1.4, where we again observe superior performance by BEACON.
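A sketch of how such an NS benchmark can be constructed: the scalar Ackley output is treated as the outcome and its observed range is partitioned into 25 bins. The input bounds and sampling here are illustrative, with the exact setup given in Appendix C.

```python
import numpy as np

def ackley(X, a=20.0, b=0.2, c=2 * np.pi):
    """Standard d-dimensional Ackley function, used here as the scalar outcome f(x)."""
    d = X.shape[-1]
    s1 = np.sqrt((X ** 2).sum(-1) / d)
    s2 = np.cos(c * X).sum(-1) / d
    return -a * np.exp(-b * s1) - np.exp(s2) + a + np.e

# Treat the Ackley output on an illustrative box [-5, 5]^4 as the outcome and split its range into 25 bins.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(100_000, 4))
y = ackley(X)
edges = np.linspace(y.min(), y.max(), 26)
behaviors = np.clip(np.digitize(y, edges), 1, 25)
print(np.bincount(behaviors)[1:])   # most random samples land in a few bins -> diverse behaviors are rare
```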

[Figure 2: reachability of all algorithms on the nine synthetic NS problems (Ackley, Rosenbrock, and Styblinski-Tang with d ∈ {4, 8, 12}).]

5.2 Discovery of Metal Organic Frameworks with Diverse Properties

Metal organic frameworks (MOFs) are an extremely diverse set of porous crystalline materials made from metal ions and organic linkers. There is currently no theoretical framework for evaluating the chemical/property diversity of the effectively infinite number of MOFs that can be made by tuning the available degrees of freedom (different linkers, metal nodes, defects, and so forth) [33]. We argue that NS could provide a pathway to generate more diverse libraries of MOFs. For example, recent work [34] has shown how biases in existing libraries can lead to incorrect conclusions when those libraries are used to train machine learning models for screening. Since the properties of MOFs can be expensive to evaluate computationally and experimentally, BEACON is an attractive NS approach for these settings. We consider four MOF-related problems below that are based on existing experimentally-determined databases (further details in Appendix C).

Hydrogen uptake capacity.

Hydrogen uptake capacity is an important property in clean energy applications. We use the dataset from [35] that consists of 98,000 unique MOF structures. We develop a 7-dimensional feature representation for the input MOF.

Nitrogen uptake capacity.

Nitrogen uptake capacity is an important property for reducing the cost of natural gas production from renewable feedstocks. We use the dataset from [36] consisting of 5,224 unique MOF structures, with a 20-dimensional feature representation for the input MOF.

Sour gas sweetening.

Sour natural gas contains significant amounts of hydrogen sulfide, which poses significant health and safety risks. MOFs are actively being considered as a material for adsorption-based separation processes that can efficiently remove hydrogen sulfide at low cost. We use the dataset from [37] that consists of 1,600 unique MOF structures. We develop a 12-dimensional feature representation for the input MOF, consisting of both chemical and structural features.

Joint CO2 and CH4 gas uptake capacity.

We also consider a multi-output MOF discovery application that treats both carbon dioxide (CO2) uptake capacity $y_1$ and methane (CH4) uptake capacity $y_2$ as outcomes. We use the dataset from [34], which consists of 7,000 unique MOF structures, and develop a 25-dimensional feature representation for the input MOF. The highly skewed joint outcome distribution is displayed in Appendix C.2.4.

The performance of the algorithms on all MOF-related experiments is shown in Figure 3a–d. Since the MOF problems are defined over discrete input spaces, we are unable to run NS-EA, NS-DEA, and NS-FS, as they are only applicable to continuous inputs. We still observe that BEACON outperforms all other methods, achieving (on average) the highest possible reachability of 1 in all cases.

[Figure 3: reachability of all algorithms on the MOF discovery problems (a–d) and the solubility problems (e–f).]

5.3 Solubility of Small-Molecule Organic Compounds

Solubility is a chemical property describing how well a substance (the solute) forms a solution with another substance (the solvent), and is important in many applications including drug discovery, pollutant transport, and nuclear reprocessing, to name a few [38]. We explore the application of NS to three problems for finding organic compounds with diverse solubility properties.

Water solubility.

We consider the dataset from [39], consisting of 900 organic compounds. We select a 14-dimensional representation of the compounds using molecular descriptors.

ESOL.

We consider the aqueous solubility dataset with 1128 organic small molecules from [40]. To demonstrate BEACON’s ability to scale to high-dimensional problems, we do not use molecular descriptors and instead use a 2133-dimensional representation derived from one-hot encoding fragprint vectors. We use the Tanimoto-fragprint covariance function [31] to handle this binary input.

LogD.

The octanol-water partition distribution coefficient (LogD) is important in drug discovery applications since it influences several of a drug candidate's properties, such as solubility, permeability, and toxicity. LogD is influenced by many factors, so we cannot directly choose a low-dimensional representation using molecular descriptors. Instead, as done in previous work [41], we consider a 125-dimensional feature representation, for a dataset consisting of 2,070 molecules, that is not amenable to standard GP modeling. Similarly to [42], we rely on the SAAS function prior, which assumes only a small number of sensitive features exist.

The performance of the ESOL and LogD experiments can be found in Figure 3e–f (results for the water solubility case are shown in Appendix C.3.1). Again, we find that BEACON significantly outperforms all other methods for all cases. We also see that the SAAS prior and Tanimoto-fragprint covariance function improve the performance of BEACON in high-dimensional settings.

6 Conclusion

This paper develops an approach for efficient novelty search (NS) in systems whose outcomes are defined in terms of expensive black-box functions with noisy observations. Such problems arise in several important applications including material design, drug discovery, and reinforcement learning. Current NS methods, which are based on sample-inefficient evolutionary algorithms, struggle in the noisy low-data regime of interest in this work. Inspired by the principles of Bayesian optimization, we develop a surrogate-based NS algorithm that intelligently selects new sample locations given past observations. We further discuss how our approach can be easily modified to scale to problems with high-dimensional input spaces and large datasets. Our extensive numerical experiments show that our proposed approach can dramatically outperform existing NS methods on synthetic and real-world problems. This includes, to our knowledge, a first-time application of NS to the discovery of metal organic frameworks, which are important materials in emerging clean energy technologies.

Although we empirically observe significant benefits with our proposed approach, there are two limitations worth noting. First, it requires more computation than alternative NS methods, which we discuss in the supplement. This is not a concern for expensive systems (evaluations on the order of several minutes or longer), as the improved sample efficiency more than compensates for the increased computation; however, for cheaper functions, it is not clear if our approach would still be favored. This computation gap also widens as more data is collected, though this can be controlled through appropriate choice of surrogate model (also explored in the supplement). Second, we have not established any theoretical convergence results for our method. There has been some recent work toward building a theoretical framework for NS [43], but additional work is needed to fit our approach into this framework. We believe, however, this is an exciting direction for future work.

Since our approach can accelerate the discovery of diverse behaviors in expensive black-box systems, it can be useful in many applications. This includes several areas that benefit society like mitigating climate change (through better material design) and improving public health (through faster drug discovery) but also potentially harmful areas such as weapon design. Therefore, it is important for society to establish guardrails to help ensure appropriate use of our proposed methodology.

References

  • [1] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2016.
  • [2] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • [3] Peter I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • [4] Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. Differentiable expected hypervolume improvement for parallel multi-objective Bayesian optimization. Advances in Neural Information Processing Systems, 33:9851–9864, 2020.
  • [5] Joel Lehman and Kenneth O. Stanley. Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation, 19(2):189–223, 2011.
  • [6] Joel Lehman and Kenneth O. Stanley. Novelty search and the problem with objectives. Genetic Programming Theory and Practice IX, pages 37–56, 2011.
  • [7] Jonathan Grizou, Laurie J. Points, Abhishek Sharma, and Leroy Cronin. A curious formulation robot enables the discovery of a novel protocell behavior. Science Advances, 6(5):eaay4237, 2020.
  • [8] Kei Terayama, Masato Sumita, Ryo Tamura, Daniel T. Payne, Mandeep K. Chahal, Shinsuke Ishihara, and Koji Tsuda. Pushing property limits in materials discovery via boundless objective-free exploration. Chemical Science, 11(23):5959–5968, 2020.
  • [9] Ethan C. Jackson and Mark Daley. Novelty search for deep reinforcement learning policy network weights by action sequence edit metric distance. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 173–174, 2019.
  • [10] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.
  • [11] A. G. Zhilinskas. Single-step Bayesian search method for an extremum of functions of a single variable. Cybernetics, 11(1):160–166, 1975.
  • [12] Jonas Mockus. On Bayesian methods for seeking the extremum. In Proceedings of the IFIP Technical Conference, pages 400–404, 1974.
  • [13] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
  • [14] Natalie Maus, Kaiwen Wu, David Eriksson, and Jacob Gardner. Discovering many diverse solutions with Bayesian optimization. arXiv preprint arXiv:2210.10953, 2022.
  • [15] Sebastian Risi, Charles E. Hughes, and Kenneth O. Stanley. Evolving plastic neural networks with novelty search. Adaptive Behavior, 18(6):470–491, 2010.
  • [16] Jean-Baptiste Mouret. Novelty-based multiobjectivization. In New Horizons in Evolutionary Robotics: Extended Contributions from the 2009 EvoDeRob Workshop, pages 139–154. Springer, 2011.
  • [17] Jorge Gomes, Pedro Mariano, and Anders Lyhne Christensen. Devising effective novelty search algorithms: A comprehensive empirical study. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, pages 943–950, 2015.
  • [18] Joel Lehman and Kenneth O. Stanley. Evolving a diversity of virtual creatures through novelty search and local competition. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 211–218, 2011.
  • [19] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
  • [20] Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output Gaussian process regression. Knowledge-Based Systems, 144:102–121, 2018.
  • [21] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853. PMLR, 2017.
  • [22] Akshay Kudva, Wei-Ting Tang, and Joel A. Paulson. Robust Bayesian optimization for flexibility analysis of expensive simulation-based models with rigorous uncertainty bounds. Computers & Chemical Engineering, 181:108515, 2024.
  • [23] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
  • [24] James Wilson, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Deisenroth. Efficiently sampling functions from Gaussian process posteriors. In International Conference on Machine Learning, pages 10292–10302. PMLR, 2020.
  • [25] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142. PMLR, 2018.
  • [26] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
  • [27] Sebastian Prillo and Julian Eisenschlos. SoftSort: A continuous relaxation for the argsort operator. In International Conference on Machine Learning, pages 7793–7802. PMLR, 2020.
  • [28] Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. No-regret Bayesian optimization with unknown hyperparameters. Journal of Machine Learning Research, 20(50):1–24, 2019.
  • [29] Jacob R. Gardner, Matt J. Kusner, Zhixiang Eddie Xu, Kilian Q. Weinberger, and John P. Cunningham. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, volume 2014, pages 937–945, 2014.
  • [30] David Eriksson and Martin Jankowiak. High-dimensional Bayesian optimization with sparse axis-aligned subspaces. In Uncertainty in Artificial Intelligence, pages 493–503. PMLR, 2021.
  • [31] Ryan-Rhys Griffiths, Leo Klarner, Henry Moss, Aditya Ravuri, Sang Truong, Yuanqi Du, Samuel Stanton, Gary Tom, Bojana Rankovic, Arian Jamasb, et al. GAUCHE: A library for Gaussian processes in chemistry. Advances in Neural Information Processing Systems, 36, 2024.
  • [32] Stephane Doncieux, Giuseppe Paolo, Alban Laflaquière, and Alexandre Coninx. Novelty search makes evolvability inevitable. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, pages 85–93, 2020.
  • [33] Sangwon Lee, Baekjun Kim, Hyun Cho, Hooseung Lee, Sarah Yunmi Lee, Eun Seon Cho, and Jihan Kim. Computational screening of trillions of metal–organic frameworks for high-performance methane storage. ACS Applied Materials & Interfaces, 13(20):23647–23654, 2021.
  • [34] Seyed Mohamad Moosavi, Aditya Nandy, Kevin Maik Jablonka, Daniele Ongari, Jon Paul Janet, Peter G. Boyd, Yongjin Lee, Berend Smit, and Heather J. Kulik. Understanding the diversity of the metal-organic framework ecosystem. Nature Communications, 11(1):1–10, 2020.
  • [35] Sumedh Ghude and Chandra Chowdhury. Exploring hydrogen storage capacity in metal-organic frameworks: A Bayesian optimization approach. Chemistry–A European Journal, 29(69):e202301840, 2023.
  • [36] Hilal Daglar and Seda Keskin. Combining machine learning and molecular simulations to unlock gas separation potentials of MOF membranes and MOF/polymer MMMs. ACS Applied Materials & Interfaces, 14(28):32134–32148, 2022.
  • [37] Yongchul G. Chung, Emmanuel Haldoupis, Benjamin J. Bucior, Maciej Haranczyk, Seulchan Lee, Hongda Zhang, Konstantinos D. Vogiatzis, Marija Milisavljevic, Sanliang Ling, Jeffrey S. Camp, et al. Advances, updates, and analytics for the computation-ready, experimental metal–organic framework database: CoRE MOF 2019. Journal of Chemical & Engineering Data, 64(12):5985–5998, 2019.
  • [38] Ketan T. Savjani, Anuradha K. Gajjar, and Jignasa K. Savjani. Drug solubility: Importance and enhancement techniques. International Scholarly Research Notices, 2012, 2012.
  • [39] Samuel Boobier, David R. J. Hose, A. John Blacker, and Bao N. Nguyen. Machine learning with physicochemical relationships: solubility prediction in organic solvents and water. Nature Communications, 11(1):5753, 2020.
  • [40] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.
  • [41] Zaw-Myo Win, Allen M. Y. Cheong, and W. Scott Hopkins. Using machine learning to predict partition coefficient (Log P) and distribution coefficient (Log D) with molecular descriptors and liquid chromatography retention time. Journal of Chemical Information and Modeling, 63(7):1906–1913, 2023.
  • [42]Farshud Sorourifar, Thomas Banker, and JoelA Paulson.Accelerating black-box molecular property optimization by adaptivelylearning sparse subspaces.arXiv preprint arXiv:2401.01398, 2024.
  • [43]Stephane Doncieux, Alban Laflaquière, and Alexandre Coninx.Novelty search: A theoretical perspective.In Proceedings of the 2019 Genetic and Evolutionary ComputationConference, pages 99–106, 2019.
  • [44]Maximilian Balandat, Brian Karrer, Daniel Jiang, Samuel Daulton, Ben Letham,AndrewG Wilson, and Eytan Bakshy.BoTorch: A framework for efficient Monte-Carlo Bayesianoptimization.Advances in Neural Information Processing Systems,33:21524–21538, 2020.
  • [45]Niranjan Srinivas, Andreas Krause, ShamM Kakade, and Matthias Seeger.Gaussian process optimization in the bandit setting: No regret andexperimental design.arXiv preprint arXiv:0912.3995, 2009.
  • [46]Il’yaMeerovich Sobol’.On the distribution of points in a cube and the approximateevaluation of integrals.Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki,7(4):784–802, 1967.
  • [47]Momin Jamil and Xin-She Yang.A literature survey of benchmark functions for global optimisationproblems.International Journal of Mathematical Modelling and NumericalOptimisation, 4(2):150–194, 2013.
  • [48]SachinP Shet, SShanmuga Priya, KSudhakar, and Muhammad Tahir.A review on current trends in potential use of metal-organicframework for hydrogen storage.International Journal of Hydrogen Energy, 46(21):11782–11803,2021.
  • [49]Xiaofei Wu, Bin Yuan, Zongbi Bao, and Shuguang Deng.Adsorption of carbon dioxide, methane and nitrogen on anultramicroporous copper metal–organic framework.Journal of Colloid and Interface Science, 430:78–84, 2014.
  • [50]Ryther Anderson, Achay Biong, and DiegoA Gómez-Gualdrón.Adsorption isotherm predictions for multiple molecules in MOFsusing the same deep learning model.Journal of Chemical Theory and Computation, 16(2):1271–1283,2020.
  • [51]Antonio Peluso, Nicola Gargiulo, Paolo Aprea, Francesco Pepe, and DomenicoCaputo.Nanoporous materials as H2S adsorbents for biogas purification: Areview.Separation & Purification Reviews, 48(1):78–89, 2019.
  • [52]MansiS Shah, Michael Tsapatsis, and JIlja Siepmann.Hydrogen sulfide capture: From absorption in polar liquids to oxide,zeolite, and metal–organic framework adsorbents and membranes.Chemical Reviews, 117(14):9755–9803, 2017.
  • [53]EunHyun Cho, Xuepeng Deng, Changlong Zou, and Li-Chiang Lin.Machine learning-aided computational study of metal–organicframeworks for sour gas sweetening.The Journal of Physical Chemistry C, 124(50):27580–27591,2020.
  • [54]Yabing He, Wei Zhou, Guodong Qian, and Banglin Chen.Methane storage in metal–organic frameworks.Chemical Society Reviews, 43(16):5657–5678, 2014.
  • [55]Xuanjun Wu, Sichen Xiang, Jiaqi Su, and Weiquan Cai.Understanding quantitative relationship between methane storagecapacities and characteristic properties of metal–organic frameworks basedon machine learning.The Journal of Physical Chemistry C, 123(14):8550–8559, 2019.
  • [56]Michael Fernandez, PeterG Boyd, ThomasD Daff, MohammadZein Aghaji, and TomKWoo.Rapid and accurate machine learning recognition of high performingmetal organic frameworks for CO2 capture.The Journal of Physical Chemistry Letters, 5(17):3056–3060,2014.
  • [57]Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi.Deep architectures and deep learning in chemoinformatics: Theprediction of aqueous solubility for drug-like molecules.Journal of Chemical Information and Modeling, 53(7):1563–1575,2013.
  • [58]Tao Wang, Mian-Bin Wu, Ri-Hao Zhang, Zheng-Jie Chen, Chen Hua, Jian-Ping Lin,and Li-Rong Yang.Advances in computational structure-based drug design and applicationin drug discovery.Current Topics in Medicinal Chemistry, 16(9):901–916, 2016.
  • [59]DavidH Kenney, RandyC Paffenroth, MichaelT Timko, and AndrewR Teixeira.Dimensionally reduced machine learning model for predicting singlecomponent octanol–water partition coefficients.Journal of Cheminformatics, 15(1):9, 2023.
  • [60]Nadin Ulrich, Kai-Uwe Goss, and Andrea Ebert.Exploring the octanol–water partition coefficient dataset using deeplearning techniques and data augmentation.Communications Chemistry, 4(1):90, 2021.
  • [61]Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman,Jie Tang, and Wojciech Zaremba.Openai gym.arXiv preprint arXiv:1606.01540, 2016.
  • [62]DonaldR Jones, Matthias Schonlau, and WilliamJ Welch.Efficient global optimization of expensive black-box functions.Journal of Global Optimization, 13:455–492, 1998.
  • [63]Jacob Gardner, Geoff Pleiss, KilianQ Weinberger, David Bindel, and AndrewGWilson.Gpytorch: Blackbox matrix-matrix Gaussian process inference withGPU acceleration.Advances in Neural Information Processing Systems, 31, 2018.
  • [64]Sattar Vakili, Henry Moss, Artem Artemev, Vincent Dutordoir, and VictorPicheny.Scalable Thompson sampling using sparse Gaussian process models.Advances in Neural Information Processing Systems,34:5631–5643, 2021.
  • [65]Alexander G deG Matthews, Mark Van DerWilk, Tom Nickson, Keisuke Fujii,Alexis Boukouvalas, Pablo Le, Zoubin Ghahramani, James Hensman, etal.GPflow: A Gaussian process library using TensorFlow.Journal of Machine Learning Research, 18(40):1–6, 2017.

Appendix A Additional Implementation Details

A.1 BEACON

All MOGPs in our experiments have a covariance function $\kappa((\boldsymbol{x},j),(\boldsymbol{x}^{\prime},j^{\prime})) = \delta_{jj^{\prime}}\,\kappa_{j}(\boldsymbol{x},\boldsymbol{x}^{\prime})$, where $\delta_{jj^{\prime}}$ is the Kronecker delta, which implies that all outputs are modeled independently. For the synthetic experiments, the GPs have a constant mean function and a standard radial basis function (RBF) covariance function of the form:

\[
\kappa_{\text{RBF}}(\boldsymbol{x},\boldsymbol{x}^{\prime}) = \theta_{0}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}^{\prime})^{\top}\Theta(\boldsymbol{x}-\boldsymbol{x}^{\prime})\right),
\]

where $\Theta = \operatorname{diag}(\ell_{1},\ldots,\ell_{d})$, with $\ell_{i}$ denoting the lengthscale parameter for the $i$th dimension, and $\theta_{0}$ is an output scale parameter. For the real-world MOF and water solubility case studies, we replace the RBF kernel with a standard Matérn 5/2 covariance function. For the ESOL and LogD case studies, we use a SAAS function prior [30] and a Tanimoto-fragprint covariance function [31], respectively, to account for their high-dimensional nature. We estimate the GP hyperparameters (and the noise variance $\sigma^{2}$) using the fit_gpytorch_mll function in BoTorch [44] (which corresponds to maximum likelihood estimation) with standard settings.

For problems with continuous $\mathcal{X}$, we use the efficient Thompson sampling (TS) method in [24], which yields a high-accuracy, continuously differentiable approximation of $\boldsymbol{g} \sim \boldsymbol{f}\,|\,\mathcal{D}$. We then develop a PyTorch-based implementation of $\alpha_{\text{NS}}$ in (4), which we maximize using the SciPy implementation of L-BFGS-B [26] over a random set of multi-start initial conditions, as done in BoTorch's optimize_acqf function. For problems with discrete $\mathcal{X}$, we simply evaluate the TS exhaustively at all candidates $\boldsymbol{x} \in \mathcal{X}$ and select the one that yields the largest $\alpha_{\text{NS}}$.
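To make this loop concrete, the following is a minimal sketch of a single BEACON iteration for the discrete-candidate case with a single outcome. It fits a BoTorch GP, draws one joint posterior (Thompson) sample over all candidates, and scores each candidate by the mean distance between its sampled outcome and its $k$ nearest posterior-mean outcomes at previously evaluated points. The helper name and the specific form of the novelty score are illustrative, not the exact implementation of (4).

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.mlls import ExactMarginalLogLikelihood


def beacon_step_discrete(train_X, train_Y, candidates, k=10, seed=0):
    """train_X: (N, d), train_Y: (N, 1) single outcome, candidates: (M, d)."""
    # Fit an exact GP surrogate (default stationary ARD kernel) by maximum likelihood.
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))

    torch.manual_seed(seed)
    with torch.no_grad():
        g = model.posterior(candidates).rsample(torch.Size([1]))[0]  # (M, 1) joint TS draw
        mu_obs = model.posterior(train_X).mean                       # (N, 1) de-noised outcomes

    # Novelty score: mean distance to the k nearest observed (posterior-mean) outcomes.
    dists = torch.cdist(g, mu_obs)                                   # (M, N) outcome distances
    knn = torch.topk(dists, k=min(k, mu_obs.shape[0]), dim=1, largest=False).values
    alpha_ns = knn.mean(dim=1)                                       # novelty value per candidate
    return candidates[alpha_ns.argmax()]                             # next point to evaluate
```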

A.2 NS-EA

NS-EA refers to a standard evolutionary algorithm for NS applied to the novelty metric in (2), originally proposed in [6]. We follow the implementation of NS-EA described in [32], which starts with an initial population of size $n_{\text{pop}}$ and then evolves each generation by randomly mutating the existing genomes. The algorithm selectively retains only the most novel genomes from the current population and the offspring, maintaining a constant population size throughout the evolutionary process. The specific Python implementation of NS-EA that we use is available at: https://github.com/alaflaquiere/simple-ns.git.

A.3 NS-DEA

NS-DEA refers to a modified version of NS-EA from [32] that replaces the novelty metric in (2) with a new metric ‘Distance to Explored Area’ (DEA) that measures the distance between an individual’s outcome and the convex hull of previously sampled outcomes. The central idea behind NS-DEA is that we want to sample points clearly outside of the connected region of behaviors that have already been observed. Similarly to NS-EA, we use the Python implementation of NS-DEA available at: https://github.com/alaflaquiere/simple-ns.git.

A.4 NS-FS

We propose a simple variant of NS-EA, referred to as NS-FS, that modifies the novelty metric (2) to be defined over the input space instead of the outcome space. The acquisition function for this case is defined as

\[
\alpha_{\text{NS-FS}}(\boldsymbol{x}) = \frac{1}{k}\sum_{i=1}^{k}\text{dist}_{\mathcal{X}}(\boldsymbol{x},\boldsymbol{x}_{i}^{\star\star}),
\]

where $\text{dist}_{\mathcal{X}}(\cdot)$ denotes a distance function that operates over the input space $\mathcal{X}$ and $\{\boldsymbol{x}_{1}^{\star\star},\ldots,\boldsymbol{x}_{k}^{\star\star}\}$ denotes the top $k$ nearest neighbors to $\boldsymbol{x}$ as measured by $\text{dist}_{\mathcal{X}}(\cdot)$. This metric incentivizes exploration of new regions of the input space; however, it is independent of the observed outcome values, so we expect it to perform worse than outcome-based novelty metrics. Since $\alpha_{\text{NS-FS}}$ is defined only in terms of the input, it can easily be maximized for continuous $\boldsymbol{x} \in \mathcal{X}$ using gradient-based optimization algorithms.
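A minimal sketch of this input-space novelty score, assuming Euclidean distance for $\text{dist}_{\mathcal{X}}$ (the function name is illustrative):

```python
import numpy as np


def alpha_ns_fs(x, sampled_X, k=10):
    """x: (d,) candidate input, sampled_X: (N, d) previously evaluated inputs."""
    dists = np.linalg.norm(sampled_X - x, axis=1)  # Euclidean distance to every past input
    k = min(k, len(dists))
    return np.sort(dists)[:k].mean()               # average distance to the k nearest neighbors
```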

A.5 MaxVar

The MaxVar algorithm is an approach commonly utilized in the field of active learning that aims to explore regions of the sample space whose predictions are most uncertain according to some model. This approach has also been explored in the context of BO [45]. Here, we take a standard MaxVar approach defined in terms of the same MOGP model used by BEACON, but replace the acquisition function with the sum of the diagonal of the predicted posterior covariance matrix, i.e.,

\[
\alpha_{\text{MaxVar}}(\boldsymbol{x}\,|\,\mathcal{D}) = \text{tr}\left(\boldsymbol{\Sigma}_{\mathcal{D}}(\boldsymbol{x})\right),
\]

where $\boldsymbol{\Sigma}_{\mathcal{D}}(\boldsymbol{x}) = \boldsymbol{\kappa}_{\mathcal{D}}(\boldsymbol{x},\boldsymbol{x})$ and $\text{tr}(\cdot)$ denotes the trace operator. The logic behind using MaxVar for NS tasks is that, once we have constructed a globally accurate (cheap) surrogate model, we can easily identify $\boldsymbol{x} \in \mathcal{X}$ that lead to new behaviors. Although sufficient for NS, it is not necessary to learn such an accurate model of $\boldsymbol{f}$ to find new behaviors, which is the key advantage of BEACON.
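Since the MOGP models all outputs independently, the trace reduces to a sum of per-output posterior variances. A minimal sketch under that assumption (helper name illustrative):

```python
import torch


def alpha_maxvar(models, x):
    """models: list of fitted BoTorch/GPyTorch GPs (one per outcome), x: (1, d) candidate."""
    with torch.no_grad():
        variances = [m.posterior(x).variance.squeeze() for m in models]
    return torch.stack(variances).sum()  # tr(Sigma_D(x)) for independently modeled outputs
```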

A.6 Sobol

Sobol refers to the simple algorithm in which one draws a sequence of quasi-random low-discrepancy Sobol points from the input domain $\mathcal{X}$ [46] to generate the points $\{\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{T}\}$. Sobol sampling is a commonly used baseline method in the global optimization literature since it is guaranteed to densely sample $\mathcal{X}$ in the limit of infinite experimental budgets $T \to \infty$.
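A minimal sketch of this baseline using PyTorch's built-in Sobol engine; the scrambling option and the helper name are our own choices for illustration, not details specified above.

```python
import torch
from torch.quasirandom import SobolEngine


def sobol_design(bounds, T, seed=0):
    """bounds: (2, d) tensor of lower/upper bounds for X, T: number of evaluations."""
    d = bounds.shape[1]
    unit = SobolEngine(dimension=d, scramble=True, seed=seed).draw(T)  # (T, d) points in [0, 1)
    return bounds[0] + (bounds[1] - bounds[0]) * unit                  # rescale to the box X
```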

A.7 RS

RS refers to the simple algorithm in which one independently draws a sequence of points from a uniform distribution defined over the input domain $\mathcal{X}$ to generate the points $\{\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{T}\}$. Similarly to Sobol, RS densely samples $\mathcal{X}$ in the limit of infinite experimental budgets $T \to \infty$.

A.8 Runtimes and Licenses

The average runtimes of the considered NS methods for the synthetic problems are summarized in Table 1. These times were computed by running the algorithms on an Intel(R) Core(TM) i7-10700K 3.8 GHz CPU. Although BEACON is more expensive than the other standard NS methods, the average runtime per iteration remains less than 1 second. This additional computation is more than compensated for by BEACON's significantly improved performance, especially for problems where each outcome evaluation takes several minutes or more. For example, in material and drug discovery/design applications, it is not uncommon for an evaluation to take multiple hours to days (through high-fidelity simulations or experiments). It is also worth noting that BEACON incurs less computational cost than MaxVar; the major cost of both methods is the training of the GP model and the optimization of the acquisition function. This highlights the importance of the choice of acquisition function (as BEACON consistently outperforms MaxVar) and the computational efficiency of our proposed $\alpha_{\text{NS}}$ in (4), which stems from efficient TS and the simple sort formulation.

Table 1: Average runtime per iteration (in seconds) for each method on the synthetic test problems.

Problem               BEACON   MaxVar   NS-EA   NS-DEA   NS-FS   Sobol   RS
Ackley 4D             0.161    0.142    0.001   0.001    0.001   0.000   0.000
Ackley 8D             0.233    0.339    0.001   0.001    0.001   0.000   0.000
Ackley 12D            0.336    0.777    0.001   0.001    0.001   0.000   0.000
Rosenbrock 4D         0.437    0.804    0.001   0.001    0.001   0.000   0.000
Rosenbrock 8D         0.703    1.180    0.001   0.001    0.001   0.000   0.000
Rosenbrock 12D        1.018    1.293    0.001   0.001    0.001   0.000   0.000
Styblinski-Tang 4D    0.228    0.250    0.001   0.001    0.001   0.000   0.000
Styblinski-Tang 8D    0.540    0.845    0.001   0.001    0.001   0.000   0.000
Styblinski-Tang 12D   1.048    0.916    0.001   0.001    0.001   0.000   0.000

We use BoTorch [44] to develop BEACON, which is released under the MIT license. The synthetic test problems all have implementations in BoTorch, while the real-world problems are all based on publicly available datasets that are cited in Appendix C below. Our code implementation is included in the supplemental material and will be released under the MIT license.

Appendix B Ablation Studies

B.1 Impact of the Size of Behavior Space

In this section, we study the impact of the choice of $\epsilon$ (directly related to the grid size) that defines how nearby points in the outcome space $\mathcal{O}$ are divided into new behaviors $\mathcal{B}$. In practice, a user does not have to select an actual value for $\epsilon$; as long as some "clusters" exist in the outcome space that (once observed) can be treated as behaviors, BEACON will eventually uncover them through exploration of $\mathcal{O}$. Therefore, $\epsilon$ only impacts how we calculate reachability (or, equivalently, the behavior gap). We plot the reachability performance for 3 different grid values (10, 50, 100) across the 4-dimensional versions of the Ackley and Rosenbrock synthetic test problems in Figure 4. BEACON continues to be the best-performing algorithm for all grid sizes, highlighting BEACON's robustness to the choice of $\mathcal{B}$. It is interesting to note that the gap between BEACON and the other algorithms does appear to increase with increasing grid size; however, more analysis would be needed to see if this trend holds for a larger set of problems of varying dimensions and complexity.

[Figure 4: Reachability performance of all methods on the 4D Ackley and Rosenbrock problems for grid sizes of 10, 50, and 100.]

B.2 Impact of Number of Nearest Neighbors

The choice of the number of nearest neighbors $k$ is a hyperparameter of our algorithm. We selected it to be $k=10$ in all case studies, as we found that to be a robust choice. We study the impact of $k$ on the 12-dimensional versions of the synthetic test problems in Figure 5 by calculating performance for $k \in \{1, 5, 10, 20\}$. In this experiment, we use 50 initial datapoints to train the GP and use 25 equally-spaced intervals to divide outcomes into behaviors. Surprisingly, we find that even a choice of $k=1$ does reasonably well on Rosenbrock and Styblinski-Tang, though performance does start to drop for Ackley. We see negligible difference in performance between $k=10$ and $k=20$, so it appears $k=10$ is a sufficiently large value in practice.

[Figure 5: Reachability performance on the 12D synthetic test problems for different numbers of nearest neighbors $k \in \{1, 5, 10, 20\}$.]

B.3 Importance of Filtering Observation Noise

As discussed in Section 4, the presence of noise in the outcome evaluation can lead to challenges in NS algorithms due to miscategorization of behaviors, which we explore in this section. Specifically, we compare the performance of BEACON (Algorithm 1) to a noiseless variant (BEACON-noiseless) in which the acquisition function in (3) is modified by replacing $\boldsymbol{\mu}_{\mathcal{D}}(\boldsymbol{x}_{i}^{\star})$ with $\boldsymbol{y}_{i}^{\star}$. We perform experiments on the 4d Ackley problem in which 50 initial data points are available, and we divide the outcome space into 50 equally-spaced intervals to calculate reachability. Figure 6 shows the performance of BEACON and BEACON-noiseless for four different values of the noise standard deviation $\sigma$. We see that BEACON outperforms BEACON-noiseless in all cases and that the gap between them widens as $\sigma$ increases. This study emphasizes the importance of accounting for observation noise in NS, which, to our knowledge, has not been considered by existing NS algorithms.

[Figure 6: Reachability performance of BEACON and BEACON-noiseless on the 4D Ackley problem for four values of the noise standard deviation $\sigma$.]

B.4 Hyperparameter Study for NS-EA

NS-EA is one of the most popular NS algorithms, and it does have some hyperparameters that could potentially be tuned to improve performance in specific cases. Our results in Section 5 are based on the default settings of the implementation of NS-EA described in Appendix A. To ensure that these settings are reasonably robust for the problems considered in this work, we studied the impact of two key hyperparameters, namely the population size (default value of 10) and the offspring size (default value of 10), in this section. The reachability performance of NS-EA for three population sizes $\{10, 20, 40\}$ (default offspring size of 10) on the synthetic test problems is shown in Figure 7. We see that the performance is fairly similar across these different population sizes, with smaller values doing slightly better on 4d problems and larger values doing slightly better on 12d problems. We similarly plot the reachability performance of NS-EA for three offspring sizes $\{10, 20, 40\}$ (default population size of 10) on the synthetic test problems in Figure 8. Again, performance is similar in all cases, with the default value yielding equal or better results in most cases. This study suggests that our performance analysis in Section 5 is not sensitive to the particular settings of the benchmark algorithms.

[Figures 7 and 8: Reachability performance of NS-EA on the synthetic test problems for population sizes {10, 20, 40} and offspring sizes {10, 20, 40}, respectively.]

Appendix C Additional Details on Numerical Experiments

C.1 Details on Synthetic Test Problems

Here, we provide specific details for the synthetic test functions.

C.1.1 Ackley

The Ackley function is a widely used benchmark in the global optimization literature due to its highly multi-modal nature [47]. The $D$-dimensional Ackley function, with $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, is defined by

\[
f(\boldsymbol{x}) = -a\exp\left(-b\sqrt{\frac{1}{D}\sum_{i=1}^{D}x_{i}^{2}}\right) - \exp\left(\frac{1}{D}\sum_{i=1}^{D}\cos(cx_{i})\right) + a + \exp(1),
\]

where $a=-20$, $b=0.2$, and $c=2\pi$.

C.1.2 Rosenbrock

The Rosenbrock function is a widely used benchmark in the global optimization literature that is valley-shaped [47]. The $D$-dimensional Rosenbrock function, with $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, is defined by

\[
f(\boldsymbol{x}) = \sum_{i=1}^{D-1}\left[100(x_{i+1}-x_{i}^{2})^{2} + (1-x_{i})^{2}\right].
\]

C.1.3 Styblinski-Tang

The Styblinski-Tang function is a widely used benchmark in the global optimization literature [47]. The $D$-dimensional Styblinski-Tang function, with $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, is defined by

\[
f(\boldsymbol{x}) = \frac{1}{2}\sum_{i=1}^{D}\left(x_{i}^{4} - 16x_{i}^{2} + 5x_{i}\right).
\]
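For reference, the three synthetic test functions above can be written as vectorized NumPy functions. This is a minimal sketch; the constants follow the definitions stated in this appendix.

```python
import numpy as np


def ackley(X, a=-20.0, b=0.2, c=2 * np.pi):
    """X: (N, D) array; uses the constants stated above (note a = -20 here)."""
    D = X.shape[1]
    term1 = -a * np.exp(-b * np.sqrt((X ** 2).sum(axis=1) / D))
    term2 = -np.exp(np.cos(c * X).sum(axis=1) / D)
    return term1 + term2 + a + np.e


def rosenbrock(X):
    """X: (N, D) array; sum of valley-shaped terms over consecutive coordinates."""
    return (100.0 * (X[:, 1:] - X[:, :-1] ** 2) ** 2 + (1.0 - X[:, :-1]) ** 2).sum(axis=1)


def styblinski_tang(X):
    """X: (N, D) array."""
    return 0.5 * (X ** 4 - 16.0 * X ** 2 + 5.0 * X).sum(axis=1)
```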

C.1.4 Multi-Output Plus

We constructed our own synthetic function defined over $D=6$ inputs, i.e., $\boldsymbol{x} = (x_{1},\ldots,x_{D})$, with $n=2$ outputs, i.e., $\boldsymbol{y} = (y_{1},y_{2})$. It is defined as follows:

\begin{align*}
y_{1} &= \sin(x_{1})\cos(x_{2}) + x_{3}\exp(-x_{1}^{2})\cos(x_{1}+x_{2}) + 0.01\sin(x_{4}+x_{5}+x_{6}), \\
y_{2} &= \sin(x_{4})\cos(x_{5}) + x_{6}\exp(-x_{4}^{2})\cos(x_{4}+x_{5}) + 0.01\cos(x_{1}+x_{2}+x_{3}).
\end{align*}

The distribution of outcomes $\mathcal{O}$ based on 10,000 input samples drawn from the space $\mathcal{X} = [-5,5]^{D}$ is shown in Figure 9. Notice how most of the outcomes lie near the center square around the origin, implying jointly long-tailed distributions. This type of distribution is difficult for standard NS algorithms to deal with, as they tend to explore the high-density center zone of the outcome distribution. This is apparent in the reachability performance results shown in the right plot of Figure 9, wherein BEACON outperforms the other algorithms by discovering a larger number of behaviors within a limited evaluation budget (finding nearly 80% while some established NS methods find less than 10%). The reachability metric is computed with respect to 100 intervals defined on a $10 \times 10$ grid over the two outcomes.

[Figure 9: Distribution of outcomes for the Multi-Output Plus function (left) and reachability performance of all methods (right).]
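A minimal NumPy sketch of this two-output function, together with the uniform sampling used to generate the outcome cloud in Figure 9 (the random seed is arbitrary):

```python
import numpy as np


def multi_output_plus(X):
    """X: (N, 6) array of inputs; returns (N, 2) outcomes y = (y1, y2)."""
    x1, x2, x3, x4, x5, x6 = X.T
    y1 = (np.sin(x1) * np.cos(x2) + x3 * np.exp(-x1 ** 2) * np.cos(x1 + x2)
          + 0.01 * np.sin(x4 + x5 + x6))
    y2 = (np.sin(x4) * np.cos(x5) + x6 * np.exp(-x4 ** 2) * np.cos(x4 + x5)
          + 0.01 * np.cos(x1 + x2 + x3))
    return np.stack([y1, y2], axis=1)


# 10,000 uniform samples from X = [-5, 5]^6, as used for the outcome distribution above.
rng = np.random.default_rng(0)
outcomes = multi_output_plus(rng.uniform(-5.0, 5.0, size=(10_000, 6)))
```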

C.2 Metal Organic Framework (MOF) Properties

Here, we provide some additional details for the realistic MOF problems.

C.2.1 Hydrogen uptake capacity

Hydrogen is a promising alternative clean energy source; however, it remains a challenge to store it successfully due to its low volumetric density. MOFs have shown great promise as hydrogen gas carriers due to their tunable surface area and porosity [48]. We explore the use of NS to identify MOFs with a wide range of hydrogen uptake capacities. We consider the dataset from [35], which includes 98,000 total unique MOF structures. For this dataset, one can use $d=7$ key features to represent the MOFs, which are summarized in [35, Table 1] (e.g., density, volumetric surface area, pore volume). The outcome distribution for this case is shown in Figure 10, which exhibits a long tail, making it difficult to explore using naive random sampling strategies. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.

C.2.2 Nitrogen uptake capacity

Natural gas, which is primarily composed of methane, is considered a cleaner alternative to fossil fuels due to its lower carbon dioxide emissions during combustion. Methane, however, typically only constitutes 50% to 85% of the natural gas that is produced from biogas processes and landfills [49]. Nitrogen is a significant component of the remaining mixture whose presence decreases the thermal efficiency of natural gas combustion. Consequently, the separation of nitrogen from natural gas has become an important area of research over the past decade [50]. MOFs have emerged as a promising material for separating nitrogen from natural gas due to their ability to reduce energy cost compared to traditional methods such as cryogenic distillation. We explore the use of NS in this application using the dataset from [36], which includes 5,224 unique MOF structures. We use the same set of 20 descriptors reported in [36, Table 1] to represent the MOFs. The outcome distribution for this case is shown in Figure 10. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.

C.2.3 Sour gas sweetening

Raw natural gas, also commonly referred to as sour natural gas, often contains a significant amount of hydrogen sulfide (H2S). Hydrogen sulfide is a colorless, flammable, and extremely hazardous gas known for its pungent "rotten egg" odor at low concentrations [51]. The removal of hydrogen sulfide from raw natural gas is an important unit operation; MOFs have been considered for adsorption-based separation of H2S due to their potential to improve efficiency and reduce cost [52, 53]. Here, we explore the use of NS to identify MOFs with a wide range of hydrogen sulfide to methane selectivity values. We use the dataset from [37], which includes 1,600 unique MOF structures. Previous work constructed a machine learning model for this system in terms of 12 features; we use the same 12 features summarized in [37, Table 1]. The outcome distribution for this case is shown in Figure 10. The reachability metric is computed over 40 equally-spaced intervals over the outcome space.

[Figure 10: Outcome distributions for the hydrogen uptake, nitrogen uptake, and sour gas sweetening MOF datasets.]

C.2.4 Joint carbon dioxide and methane uptake capacity

MOFs are capable of adsorbing both carbon dioxide and methane. Methane is a promising clean energy fuel alternative to fossil fuels; however, its low volumetric energy density poses challenges in storage applications [54]. Previous works have studied the use of machine learning methods for predicting both methane [55] and carbon dioxide [56] storage capacity. To our knowledge, none of these studies have systematically aimed to identify MOFs that cover different regions of the outcome space. This can be important for ensuring training data is not overly biased toward specific regions of the space. We explore the use of NS for this type of application using the dataset provided by [34], which is available online at: https://github.com/byooooo/dispersant_screening_PAL.git. The joint distribution of outcomes for this dataset is shown in Figure 11, in which we see that the distribution of carbon dioxide uptake rates is heavily skewed toward small values. The reachability metric is computed with respect to 100 intervals defined on a $10 \times 10$ grid over the two outcomes.

[Figure 11: Joint distribution of carbon dioxide and methane uptake capacities for the MOF dataset from [34].]

C.3 Solubility of Organic Molecules

Here, we provide some additional details for the realistic solubility problems.

C.3.1 Water solubility

Solubility in water is a critical property for drug discovery and design. Machine learning models have previously been explored for solubility prediction [57]; however, it is important to note that such models may be trained on data biased toward particular regions of the outcome space. In addition to having a larger spread of solubility values for the purposes of better model training, it can be useful to have solutes with a range of solubility (LogS) values for different applications. Therefore, we consider the use of NS in this application. Specifically, we use the "Water_set_wide" dataset from [39], which contains 900 organic compounds. We use the same 14 features as in [39, Table 1]. The outcome distribution is shown in Figure 12 and the reachability performance for all algorithms is shown in Figure 13, where BEACON achieves an average reachability of 1 within 100 iterations. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.

[Figures 12 and 13: Outcome distributions for the solubility case studies, and reachability performance of all methods on the water solubility problem.]

C.3.2 ESOL

The ESOL dataset is from [40] and consists of 1,128 organic molecules. Using ideas from GAUCHE [31], we develop a tailored GP model using the following Tanimoto-fragprint covariance function

\[
\kappa_{\text{Tanimoto}}(\boldsymbol{x},\boldsymbol{x}^{\prime}) = \theta_{0}\,\frac{\boldsymbol{x}^{\top}\boldsymbol{x}^{\prime}}{\|\boldsymbol{x}\|^{2} + \|\boldsymbol{x}^{\prime}\|^{2} - \boldsymbol{x}^{\top}\boldsymbol{x}^{\prime}},
\]

where $\boldsymbol{x}$ and $\boldsymbol{x}^{\prime}$ are binary vectors corresponding to the 2133-dimensional one-hot encoding representation of the fragprints for every molecule in the dataset and $\|\cdot\|$ is the Euclidean norm. Note that $\kappa_{\text{Tanimoto}}$ involves only a single hyperparameter $\theta_{0}$, which is optimized as described in Appendix A.1. The outcome distribution for the ESOL case is shown in Figure 12, which we see is somewhat skewed to the left. The reachability metric is computed over 50 equally-spaced intervals over the outcome space.
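A minimal standalone NumPy sketch of this covariance function applied to binary fragprint matrices; in practice an equivalent GP kernel is available through GAUCHE [31], and the helper name here is illustrative.

```python
import numpy as np


def tanimoto_kernel(X1, X2, theta0=1.0):
    """X1: (N, 2133), X2: (M, 2133) binary fragprint matrices; returns the (N, M) kernel."""
    inner = X1 @ X2.T                              # x^T x' for every pair
    sq1 = (X1 ** 2).sum(axis=1, keepdims=True)     # ||x||^2
    sq2 = (X2 ** 2).sum(axis=1, keepdims=True).T   # ||x'||^2
    return theta0 * inner / (sq1 + sq2 - inner)    # theta_0 * Tanimoto similarity
```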

C.3.3 LogD

Identifying the physicochemical properties of a candidate molecule is critical in the early stages of drug design. The octanol-water distribution coefficient (LogD) stands out as an important indicator, as it helps estimate the drug distribution within the human body [58]. Traditional ways to measure LogD are known to be difficult and time consuming, which has led to increased interest in developing machine learning-based models to predict LogD for untested molecules [59, 60]. As mentioned previously, it is important to have nicely distributed data in the outcome space to reduce the chance of overfitting (leading to bias in the model predictions). One way to achieve this spread is using NS, though it is also useful for exploring new regions of the outcome space. We use the dataset from [41] containing 2,070 molecules characterized by 125 descriptors. Since this is a high-dimensional problem, we resort to the SAAS function prior [30], which is based on the assumption that the descriptors exhibit a hierarchy of relevance when it comes to outcome prediction. SAAS has recently been shown to be effective at dealing with such high-dimensional molecular feature representations [42]. The complete outcome distribution for the LogD case is shown in Figure 12. The reachability metric is computed over 25 equally-spaced intervals over the outcome space.
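For concreteness, a minimal sketch of fitting a GP with a SAAS prior using BoTorch's fully Bayesian model, which is one standard implementation of the SAAS prior [30]; the NUTS settings shown are illustrative and not necessarily the ones used in our experiments.

```python
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts


def fit_saas_gp(train_X, train_Y):
    """train_X: (N, 125) molecular descriptors, train_Y: (N, 1) LogD values."""
    model = SaasFullyBayesianSingleTaskGP(train_X, train_Y)
    # Infer the kernel hyperparameters (with the sparsity-inducing SAAS prior) via NUTS.
    fit_fully_bayesian_model_nuts(model, warmup_steps=256, num_samples=128, thinning=16)
    return model  # model.posterior(...) now averages over the sampled hyperparameters
```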

Appendix D Additional Numerical Example: Maze Problem

[Figure 14: The large point-maze environment, with the green ball at the starting location and the red ball at the target location.]

As an additional experiment, we explore a complex reinforcement learning (RL) problem involving navigating through a maze. We take the large maze configuration from the OpenAI Gymnasium [61], whose documentation can be found here: https://minari.farama.org/datasets/pointmaze/large/. The goal is to navigate a green ball from a starting location to the target location shown by a red ball in Figure 14 in 300 time steps or fewer. We use a linear control policy that maps the measured states to actions, which is defined by eight parameters that need to be optimized. Novelty search (NS) is well-suited for this type of problem, where the objective landscape can be deceptive (it is easy to get stuck in local optima when measuring only the distance between the final and desired locations). Here, as opposed to reachability, we measure performance as follows:

\[
\text{Reward} = \frac{\text{initial distance from the target} - \text{final distance from the target}}{\text{initial distance from the target}}. \tag{5}
\]

This metric quantifies improvement in proximity toward the target and is ideally equal to 1 (meaning we have exactly landed at the final target). In addition to the previous algorithms, we also consider the classical expected improvement (EI) algorithm [62] under a GP model over the reward function.
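A minimal sketch of the reward in (5) and one possible parameterization of the eight-parameter linear policy (a 2x4 weight matrix mapping the planar position and velocity to a clipped 2D action); the state and action shapes are assumptions for illustration, not the exact implementation.

```python
import numpy as np


def maze_reward(start_pos, final_pos, target_pos):
    """Fractional improvement in distance to the target; equals 1 when the target is reached."""
    d0 = np.linalg.norm(np.asarray(start_pos) - np.asarray(target_pos))
    dT = np.linalg.norm(np.asarray(final_pos) - np.asarray(target_pos))
    return (d0 - dT) / d0


def linear_policy(state, W):
    """Map an assumed 4D point-maze state (x, y, vx, vy) to a 2D action; W: (2, 4)."""
    return np.clip(W @ state, -1.0, 1.0)  # the eight entries of W are the decision variables
```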

The results obtained on the maze problem are shown in Figure 15. On the left, we see that BEACON again outperforms all other algorithms, including the objective-focused EI, which is actually the second worst performer. As seen in the plot on the right, BEACON is the only method that consistently escapes the maze (doing so in all 20 replicates). Interestingly, the other NS approaches are able to escape the maze in some of the replicates, with NS-FS and MaxVar both achieving average scores at or above 0.9. However, as seen from the distribution of scores at the final iteration, there are cases where both of these algorithms underperform and barely break 0.5. This case study underscores the value of an efficient NS perspective; BEACON outperforms EI on the very task that EI was designed for simply due to the complexity of the objective landscape, which makes the GP a poor model for the reward function. We expect that other problems with these types of deceptive characteristics would similarly benefit from BEACON.

[Figure 15: Average reward versus iteration for all methods on the maze problem (left) and distribution of final rewards across the 20 replicates (right).]

Appendix E Scaling BEACON to Large Evaluations Budgets

[Figure 16: Reachability performance of BEACON with an SVGP versus the baseline methods on the 20d Rosenbrock problem over 1000 evaluations.]

The computational cost of BEACON is mostly dominated by the acquisition function optimization step, which requires repeated inference with the underlying GP model of the outcome function. It is well known that the exact GP posterior equations (1) exhibit $O(N^{3})$ scaling due to the inversion of the covariance matrix [63]. Although not expected to be a concern for reasonably small $N$ (on the order of hundreds), this can become a challenge when $N$ is on the order of thousands or more. Even though such large budgets are not expected for all types of expensive systems, there are relevant cases where this could occur, such as when the system is modeled by a high-fidelity simulator that can be easily parallelized. Fortunately, there has been a substantial amount of work on sparse approximate GP models that can be leveraged in such cases. To demonstrate this fact, we perform an experiment with the sparse variational Gaussian process (SVGP) described in [64]. We consider a 20-dimensional version of the Rosenbrock problem where 300 initial data points are available. We use the SVGP implementation available at https://github.com/secondmind-labs/trieste, with a Matérn 5/2 covariance function and 100 inducing points allocated using the $k$-means approach in [64]. Note that the trieste package is built on top of GPflow [65]. The reachability performance results for BEACON using SVGP over 1000 evaluations on the 20d Rosenbrock problem are shown in Figure 16. The reachability metric is computed over 100 equally-spaced intervals over the outcome space. We see that BEACON achieves dramatically better results than all other methods, including finding nearly 3x more diverse behaviors than the standard NS-EA method. It is worth noting that the use of the exact GP within BEACON already begins to slow down considerably at this scale, suggesting that the SVGP provides a nice balance between accuracy and computational cost.

To further investigate the improvements that can be gained in terms of execution time, we perform another experiment involving optimization of our proposed novelty-based acquisition function (4) defined using both exact GP and SVGP models given different amounts of training data. The CPU time to optimize the acquisition function using the Adam solver over 100 iterations (averaged over 10 randomly drawn datasets) is shown in Figure 17. We see that the SVGP exhibits much better scaling than the exact GP, with the CPU time for 5000 points with the SVGP being less than that for 1000 points with the exact GP. These times could be further improved by taking advantage of GPUs; however, the key takeaway is that SVGPs provide an effective path toward efficient NS with large evaluation budgets.
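For illustration, a minimal sketch of building and training an SVGP directly with GPflow [65], with k-means inducing-point initialization in the spirit of [64]. This is a simplified stand-in for the trieste-based implementation described above, and the specific settings (kernel, number of inducing points, optimizer) are illustrative.

```python
import gpflow
import numpy as np
from sklearn.cluster import KMeans


def build_svgp(X, Y, num_inducing=100):
    """X: (N, d), Y: (N, 1) NumPy arrays; returns a trained sparse variational GP."""
    # Initialize the inducing-point locations with k-means cluster centers.
    Z = KMeans(n_clusters=num_inducing, n_init=10).fit(X).cluster_centers_
    kernel = gpflow.kernels.Matern52(lengthscales=np.ones(X.shape[1]))  # ARD Matern 5/2
    model = gpflow.models.SVGP(kernel, gpflow.likelihoods.Gaussian(), Z, num_data=X.shape[0])
    # Maximize the ELBO with respect to the kernel, likelihood, and variational parameters.
    gpflow.optimizers.Scipy().minimize(
        model.training_loss_closure((X, Y)), model.trainable_variables)
    return model
```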

[Figure 17: CPU time to optimize the acquisition function with exact GP and SVGP models as a function of the training set size.]
