## Abstract

A decent number of lower bounds for non-elitist population-based evolutionary
algorithms have been shown by now. Most of them are technically demanding due to
the (hard to avoid) use of negative drift theorems—general results which
translate an expected movement away from the target into a high hitting time. We
propose a simple negative drift theorem for multiplicative drift scenarios and
show that it can simplify existing analyses. We discuss in more detail
Lehre's (2010) *negative
drift in populations* method, one of the most general tools to prove
lower bounds on the runtime of non-elitist mutation-based evolutionary
algorithms for discrete search spaces. Together with other arguments, we obtain
an alternative and simpler proof of this result, which also strengthens and
simplifies this method. In particular, now only three of the five technical
conditions of the previous result have to be verified. The lower bounds we
obtain are explicit instead of only asymptotic. This allows us to compute
concrete lower bounds for concrete algorithms, but also enables us to show that
super-polynomial runtimes appear already when the reproduction rate is only a $(1-\omega(n^{-1/2}))$ factor below the threshold. For the special case of algorithms using standard
bit mutation with a random mutation rate (called uniform mixing in the language
of hyper-heuristics), we prove the result stated by Dang and Lehre (2016b) and extend it to mutation rates other
than $\Theta (1/n)$,
which includes the heavy-tailed mutation operator proposed by Doerr et al.
(2017). We finally use our method and
a novel domination argument to show an exponential lower bound for the runtime
of the mutation-only simple genetic algorithm on OneMax for arbitrary
population size.

## 1 Introduction

Lower bounds for the runtimes of evolutionary algorithms are important as they can warn the algorithm user that certain algorithms or certain parameter settings will not lead to good solutions in acceptable time. Unfortunately, the existing methods to obtain such results, for non-elitist algorithms in particular, are very technical and thus difficult to use.

One reason for this high complexity is the use of drift analysis, which seems hard to circumvent. Drift analysis (see Lengler, 2020) is a set of tools that all try to derive useful information on a hitting time (e.g., the first time a solution of a certain quality is found) from information on the expected progress in one iteration. The hope is that the progress in a single iteration can be analyzed with only moderate difficulty and then the drift theorem does the remaining work. While more direct analysis methods exist and have been successfully used for simple algorithms, for population-based algorithms and in particular non-elitist ones, it is hard to imagine that the complicated population dynamics can be captured in proofs not using more advanced tools such as drift analysis.

Drift analysis has been used with great success to prove upper bounds on runtimes of
evolutionary algorithms. Tools such as the additive (He and Yao, 2001), multiplicative (Doerr et al., 2012), and variable drift theorem (Mitavskiy et
al., 2009; Johannsen, 2010) all allow us to easily obtain an upper bound on a hitting
time solely from the expected progress in one iteration. Unfortunately, proving
matching lower bounds is much harder since here the drift theorems require
additional technical assumptions on the distribution of the progress in one
iteration. This is even more true in the case of so-called *negative
drift*, where the drift is away from the target and we aim at proving a
high lower bound on the hitting time.
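
As a small illustration of this workflow (a hedged sketch, not tied to any algorithm analyzed in this paper), the following Python snippet simulates a toy process whose one-step progress satisfies the condition of the multiplicative drift theorem (Doerr et al., 2012), $E[X_t - X_{t+1} \mid X_t] \ge \delta X_t$, and compares the empirical hitting time of $0$ with the guaranteed bound $E[T] \le (1 + \ln(X_0))/\delta$ for integer-valued processes with minimum positive state $1$. All names are ours.

```python
import math
import random

def drift_upper_bound(x0, delta):
    # Multiplicative drift theorem (Doerr et al., 2012) for an
    # integer-valued process with minimum positive state 1:
    # E[T] <= (1 + ln(x0)) / delta.
    return (1 + math.log(x0)) / delta

def hitting_time(x0, delta, rng):
    # Toy process: each of the X_t "units" survives independently with
    # probability 1 - delta, so E[X_{t+1} | X_t] = (1 - delta) X_t,
    # i.e., the expected one-step progress is delta * X_t.
    x, t = x0, 0
    while x > 0:
        x = sum(1 for _ in range(x) if rng.random() < 1 - delta)
        t += 1
    return t

rng = random.Random(42)
x0, delta, runs = 100, 0.3, 2000
empirical = sum(hitting_time(x0, delta, rng) for _ in range(runs)) / runs
bound = drift_upper_bound(x0, delta)
```

With these parameters the bound is roughly $18.7$ iterations, and the empirical mean stays clearly below it, as the theorem guarantees.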

In this work, we propose a very simple negative drift theorem for the case of
multiplicative drift (Lemma 3). We
briefly show that this result can ease two classic lower bound analyses (also in
Section 3).

In more detail, we use the new drift theorem (and some more arguments) to rework
Lehre's *negative drift in populations* method (Lehre, 2010). This highly general analysis method can
show exponential lower bounds on the runtime of a large class of evolutionary
algorithms solely by comparing the so-called reproduction rate of individuals in the
population with a threshold that depends only on the mutation rate.

The downside of Lehre's method is that both the result and its proof are very technical. To apply the general result (and not the specialization to algorithms using standard bit mutation), five technical conditions need to be verified, which requires the user to choose suitable values for six different constants; these have an influence on the lower bound one obtains. This renders the method of Lehre hard to use. Among the 54 citations to Lehre (2010) (according to Google Scholar on June 9, 2020), only the two works (Lehre, 2011) and (Dang and Lehre, 2016b) apply this method. To hopefully ease future analyses of negative drift in populations, we revisit this method and obtain the following improvements.

*A simpler result:* We manage to show essentially the same lower bound
by only verifying three of the five conditions of Lehre's result (Theorems 4 and 5). This also reduces the number of constants one needs to
choose from six to four.

*A non-asymptotic result:* Our result gives explicit lower bounds,
that is, free from asymptotic notation or unspecified constants. Consequently, our
specialization to algorithms using standard bit mutation (Theorem 6) also gives explicit bounds. This allows
one to prove concrete bounds for specific situations [e.g., that the $(\mu ,\lambda )$ EA with $\lambda =2\mu $ needs more than 13 million fitness evaluations to find the optimum of the OneMax problem defined over bit strings of length $n=500$;
see the example following Theorem 6] and
gives more fine-grained theoretical results [by choosing Lehre's constant $\delta $ as a suitable function of the problem size, we show that a super-polynomial runtime
behavior is observed already when the reproduction rate is only a $(1-\omega(n^{-1/2}))$ factor below the threshold; see Corollary 7]. With the absence of asymptotic notation, we can also analyze
algorithms using standard bit mutation with a mutation rate chosen randomly from a
discrete set of alternatives (Section 6). Such
a result was stated by Dang and Lehre (2016b),
however only for mutation rates that are $\Theta (1/n)$.
Our result does not need this restriction and thus, for example, applies also to the
heavy-tailed mutation operator proposed by Doerr et al. (2017).

*A simple proof:* Besides the important aspect that a proof guarantees
the result to be mathematically correct, an understandable proof can also tell us
why a result is correct and give further insights into working principles of
algorithms. While every reader will have a different view on what the ideal proof
looks like, we felt that Lehre's proof, combining several deep and abstract
tools such as multi-type branching processes, eigenvalue arguments, and
Hajek's drift theorem (Hajek, 1982),
does not easily give a broader understanding of the proof mechanics and the working
principles of the algorithms analyzed. We hope that our proof, based on a simple
potential function argument together with our negative drift theorem, is more
accessible.

Finally, we analyze an algorithm using fitness proportionate selection. The negative drift in populations method is not immediately applicable to such algorithms since it is hard to provide a general unconditional upper bound on the reproduction rate: If all but one individual have a very low fitness, then this best individual has a high reproduction rate. We therefore show that at all times all search points are at least as good (in a stochastic domination sense) as random search points. This allows us to argue that the reproduction rates are low and then gives a simple proof of an exponential lower bound for the mutation-only simple genetic algorithm (simple GA) with arbitrary population size optimizing the simple OneMax benchmark, improving over the mildly sub-exponential lower bound in Neumann et al. (2009) and the exponential lower bound for large population sizes only in Lehre (2011).

### 1.1 Related Works

A number of different drift theorems dealing with negative drift have been proven so far, among others, in Happ et al. (2008), Oliveto and Witt (2011, 2012a), Rowe and Sudholt (2014), Oliveto and Witt (2015), Kötzing (2016), Lengler and Steger (2018), and Witt (2019) (note that in early works, the name “simplified drift theorem” was used for such results). They all require some additional assumptions on the distribution of the one-step progress, which makes them non-trivial to use. We refer to Lengler (2020, Section 2.4.3) for more details. Another approach to negative drift was used in Antipov et al. (2019) and Doerr (2019b, 2020a). There the original process was transformed suitably (via an exponential function), but in a way that the drift of the new process still is negative or at most a small constant. To this transformed process the lower bound version of the additive drift theorem (He and Yao, 2001) was applied, which gave large lower bounds since the target, due to the exponential rescaling, now was far from the starting point of the process.

In terms of lower bounds for non-elitist algorithms, besides Lehre's general result (Lehre, 2010), the following results for particular algorithms exist (always, $n$ is the problem size, $\varepsilon$ can be any positive constant, and $e \approx 2.718$ is the base of the natural logarithm). Jägersküpper and Storch (2007, Theorem 1) showed that the $(1,\lambda )$ EA with $\lambda \le \frac{1}{14} \ln(n)$ is inefficient on any pseudo-Boolean function with a unique optimum. The asymptotically tight condition $\lambda \le (1-\varepsilon) \log_{\frac{e}{e-1}} n$ to yield a super-polynomial runtime was given by Rowe and Sudholt (2014). Happ et al. (2008) showed that two simple (1+1)-type hillclimbers with fitness proportionate selection cannot efficiently optimize any linear function with positive weights. Neumann et al. (2009) showed that a mutation-only variant of the simple GA with fitness proportionate selection is inefficient on the OneMax function when the population size $\mu $ is at most polynomial, and it is inefficient on any pseudo-Boolean function with a unique global optimum when $\mu \le \frac{1}{4} \ln(n)$. The mildly subexponential lower bound for OneMax was improved to an exponential lower bound by Lehre (2011), but only for $\mu \ge n^3$. In a series of remarkable works up to Oliveto and Witt (2015), Oliveto and Witt showed that the true simple GA using crossover cannot optimize OneMax efficiently when $\mu \le n^{\frac{1}{4}-\varepsilon}$. None of these results gives an explicit lower bound or specifies the base of the exponential function. In Antipov et al. (2019), an explicit lower bound for the runtime of the $(\mu ,\lambda )$ EA is proven (but stated only in the proof of Theorem 3.1 in Antipov et al., 2019). Section 3 of Antipov et al.
(2019) bears some similarity with ours; in fact, one can argue that our work extends (Antipov et al., 2019, Section 3) from a particular algorithm to the general class of population-based processes regarded by Lehre (2010) (where, naturally, Antipov et al., 2019 did not have the negative multiplicative drift result and therefore did not obtain bounds that hold with high probability).

This work is an extended version of a paper that appeared in the proceedings of *Parallel Problem Solving from Nature* 2020 (Doerr, 2020c). This version contains all proofs
(for reasons of space, in Doerr, 2020c,
only Lemma 1 and Theorem 2 were
proven), a new section on standard bit mutation with random mutation rates, and
several additional details.

## 2 Notation and Preliminaries

We denote the *Hamming distance* of two bit strings $x, y \in \{0,1\}^n$ by $H(x,y) := |\{i \in [1..n] \mid x_i \ne y_i\}|$, the number of positions in which they differ.

The classic mutation operator *standard bit mutation* creates an
offspring by flipping each bit of the parent independently with some probability $p$,
which is called *mutation rate*.

We shall twice need the notion of stochastic domination and its relation to standard bit mutation, so we quickly collect these ingredients of our proofs. We refer to Doerr (2019a) for more details on stochastic domination and its use in runtime analysis.

For two real-valued random variables $X$ and $Y$,
we say that $Y$ *stochastically dominates* $X$, written as $X \preceq Y$,
if for all $\lambda \in \mathbb{R}$ we have $\Pr[Y \le \lambda] \le \Pr[X \le \lambda]$.
Stochastic domination is a very flexible way of saying that $Y$ is larger than $X$. It
implies $E[X] \le E[Y]$, but more generally we also have $E[f(X)] \le E[f(Y)]$ for any monotonically increasing function $f$, as the following lemma states.

Let $X, Y$ be two random variables taking values in some set $\Omega \subseteq \mathbb{R}$ with $X \preceq Y$. Let $f: \Omega \to \mathbb{R}$ be monotonically increasing, that is, we have $f(x) \le f(y)$ for all $x, y \in \Omega$ with $x \le y$. Then $E[f(X)] \le E[f(Y)]$.
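
This definition and lemma can be checked mechanically on small discrete distributions. The following sketch (illustrative only; the two example distributions are ours) verifies the CDF condition for an explicit pair $X \preceq Y$ and the resulting inequality $E[f(X)] \le E[f(Y)]$ for a monotone $f$, using exact rational arithmetic.

```python
from fractions import Fraction

# Distributions as {value: probability}; Y is X shifted up by one.
X = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
Y = {2: Fraction(1, 3), 3: Fraction(1, 3), 4: Fraction(1, 3)}

def cdf(dist, lam):
    """Pr[value <= lam]."""
    return sum(p for v, p in dist.items() if v <= lam)

def stochastically_dominates(Y, X):
    """X ⪯ Y iff Pr[Y <= lam] <= Pr[X <= lam] for all lam."""
    return all(cdf(Y, lam) <= cdf(X, lam) for lam in sorted(set(X) | set(Y)))

def expectation(dist, f):
    return sum(p * f(v) for v, p in dist.items())

dominated = stochastically_dominates(Y, X)
f = lambda v: v * v  # monotonically increasing on the nonnegative support
gap = expectation(Y, f) - expectation(X, f)  # should be >= 0 by the lemma
```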

Significantly improving over previous related arguments in Droste et al. (2000, Section 5) and Doerr et al. (2012, Lemma 13), Witt (2013, Lemma 6.1) showed the following natural domination argument for offspring generated via standard bit mutation with mutation rate at most $\frac{1}{2}$. We note that the result is formulated in Witt (2013) only for $x^* = (1, \ldots, 1)$, but the proof in Witt (2013) or a symmetry argument immediately shows the following general version.
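
A hedged numerical sketch of this domination statement: for standard bit mutation with $p \le \frac{1}{2}$, the exact distribution of the offspring's Hamming distance to $x^*$ can be computed from two binomials, and comparing CDFs confirms that a parent closer to $x^*$ produces a stochastically dominated offspring distance. This checks one small $n$ and two values of $p$; it is a cross-check, not a proof.

```python
from math import comb
from itertools import accumulate

def pmf_binom(k, m, p):
    return comb(m, k) * p**k * (1 - p)**(m - k)

def offspring_distance_dist(n, d, p):
    """Exact distribution of H(mut(x), x*) when H(x, x*) = d: flipping
    one of the d differing bits decreases the distance, flipping one of
    the n - d agreeing bits increases it."""
    dist = [0.0] * (n + 1)
    for b1 in range(d + 1):          # repaired bits ~ Bin(d, p)
        for b2 in range(n - d + 1):  # broken bits   ~ Bin(n - d, p)
            dist[d - b1 + b2] += pmf_binom(b1, d, p) * pmf_binom(b2, n - d, p)
    return dist

def dominated_pairwise(n, p):
    """For every d, the CDF of the closer parent's offspring distance is
    pointwise >= that of the farther parent's (stochastic domination)."""
    ok = True
    for d in range(n):
        cdf_near = list(accumulate(offspring_distance_dist(n, d, p)))
        cdf_far = list(accumulate(offspring_distance_dist(n, d + 1, p)))
        ok = ok and all(a >= b - 1e-12 for a, b in zip(cdf_near, cdf_far))
    return ok
```

For example, `dominated_pairwise(10, 0.2)` evaluates to `True`, in line with Witt's lemma.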

## 3 Negative Multiplicative Drift

The following elementary result allows us to prove lower bounds on the time to reach a target in the presence of multiplicative drift away from the target. While looking innocent, it has the potential to replace the more complicated lower bound arguments previously used in analyses of non-elitist algorithms. We discuss this briefly at the end of this section.

Let $(X_t)_{t \ge 0}$ be a random process taking values in a finite set of nonnegative reals. Let $\Delta > 0$ and $0 < \delta \le 1$ be such that $E[X_0] \le \frac{\Delta}{\delta}$ and, for all $t \ge 0$, $E[X_{t+1} \mid X_t] \le (1 - \delta) X_t + \Delta$. Then the following two statements hold.

- For all $t \ge 0$, $E[X_t] \le \frac{\Delta}{\delta}$.
- Let $M > \frac{\Delta}{\delta}$ and $T = \min\{t \ge 0 \mid X_t \ge M\}$. Then for all integers $L \ge 0$, $\Pr[T \ge L] \ge 1 - \frac{L \Delta}{\delta M}$, and $E[T] \ge \frac{\delta M}{2 \Delta} - \frac{1}{2}$.

The proof is an easy computation of expectations and an application of Markov's inequality similar to the direct proof of the multiplicative drift theorem in Doerr and Goldberg (2013). We do not see a reason why the result should not also hold for processes taking more than a finite number of values, but since we are only interested in the finite setting, we spare ourselves the more complicated world of continuous probability spaces.
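
To make the statement concrete, here is an illustrative simulation (a toy integer process of our own choosing, not one of the paper's algorithms) whose conditional expectation satisfies $E[X_{t+1} \mid X_t] = (1-\delta)X_t + \Delta$ exactly; the empirical probability of reaching a high target $M$ within $L$ steps stays below the bound $\frac{L\Delta}{\delta M}$.

```python
import random

def step(x, delta, Delta, rng):
    # Survivors ~ Bin(x, 1 - delta), plus a deterministic influx Delta,
    # so E[X_{t+1} | X_t = x] = (1 - delta) x + Delta exactly.
    survivors = sum(1 for _ in range(x) if rng.random() < 1 - delta)
    return survivors + Delta

def hits_target(x0, delta, Delta, M, L, rng):
    x = x0
    for _ in range(L):
        if x >= M:
            return True
        x = step(x, delta, Delta, rng)
    return x >= M

rng = random.Random(7)
delta, Delta = 0.2, 1
x0 = int(Delta / delta)        # start at the stationary bound Delta/delta
M, L, runs = 5000, 100, 1000
empirical = sum(hits_target(x0, delta, Delta, M, L, rng)
                for _ in range(runs)) / runs
markov_bound = L * Delta / (delta * M)   # = 0.1 with these parameters
```

The process hovers around $\Delta/\delta = 5$, so in practice it never reaches $M = 5000$; the lemma's bound of $0.1$ is comfortably respected.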

In applications of Lemma 3 (such as the proof of Theorem 4), we expect to see the condition that for all $t \ge 0$, $E[X_{t+1} \mid X_t] \le (1 - \delta) X_t + \Delta$.

We now argue that our negative multiplicative drift theorem is likely to find applications beyond ours to the negative drift in populations method in the following section. To this aim, we regard two classic lower bound analyses of non-elitist algorithms and point out where our drift theorem would have eased the analysis.

Neumann et al. (2009) show that the variant of the simple genetic algorithm (simple GA) not using crossover needs time $2^{n^{1 - O(1/\log\log n)}}$ to optimize the simple OneMax benchmark. The key argument in Neumann et al. (2009) is as follows. The potential $X_t$ of the population $P(t)$ in iteration $t$ is defined as $X_t = \sum_{x \in P(t)} 8^{\mathrm{OneMax}(x)}$. For this potential, it is shown (Neumann et al., 2009, Lemma 7) that if $X_t \ge 8^{0.996n}$, then $E[X_{t+1}] \le (1 - \delta) X_t$ for some constant $\delta > 0$. By bluntly estimating $E[X_{t+1}]$ in the case that $X_t < 8^{0.996n}$, this bound could easily be extended to $E[X_{t+1} \mid X_t] \le (1 - \delta) X_t + \Delta$ for some number $\Delta$. This suffices to employ our negative drift theorem and obtain the desired lower bound. Without our drift theorem at hand, in Neumann et al. (2009) the potential $Y_t = \log_8(X_t)$ was considered; it was argued that it displays an additive drift away from the target and that $Y_t$ satisfies certain concentration statements necessary for the subsequent use of a negative drift theorem for additive drift.

A second example where we feel that our drift theorem can ease the analysis is the work of Oliveto and Witt (2014, 2015) on the simple GA with crossover optimizing OneMax. Due to the use of crossover, this work is much more involved, so we shall not go into detail and simply point the reader to the location where negative drift occurs. In Lemma 19 of Oliveto and Witt (2015), a multiplicative drift statement (away from the target) is proven. To use a negative drift theorem for additive drift (Oliveto and Witt, 2015, Theorem 2), in the proof of Lemma 20 the logarithm of the original process is regarded. So here again, we think that a direct application of our drift theorem would have eased the analysis.

## 4 Negative Drift in Populations Revisited

In this section, we use our negative multiplicative drift result and some more
arguments to rework Lehre's negative drift in populations method (Lehre, 2010) and obtain Theorem 4 further below. This method allows an
analysis of a broad class of evolutionary algorithms, namely all that can be
described via the following type of population process.

### 4.1 Population Selection-Mutation Processes

A *population selection-mutation (PSM) process* (called *population selection-variation algorithm* in Lehre, 2010) is the following type of random
process. Let $\Omega$ be a finite set. We call $\Omega$ the *search space* and its elements *solution
candidates* or *individuals*. Let $\lambda \in \mathbb{N}$ be called the *population size* of the process. An ordered multi-set of
cardinality $\lambda$, in other words, a $\lambda$-tuple,
over the search space $\Omega$ is called a *population*. Let $\mathcal{P} = \Omega^\lambda$ be the set of all populations. For $P \in \mathcal{P}$, we write $P_1, \ldots, P_\lambda$ to denote the elements of $P$.
We also write $x \in P$ to denote that there is an $i \in [1..\lambda]$ such that $x = P_i$.

A PSM process starts with some, possibly random, population $P(0)$.
In each iteration $t = 1, 2, \ldots$,
a new population $P(t)$ is generated from the previous one $P(t-1)$ as follows. Via a (possibly) randomized *selection operator* $sel(\cdot)$,
a $\lambda$-tuple of individuals is selected and then each of them creates an offspring through
the application of a randomized *mutation operator* $mut(\cdot)$.

The *selection operator* can be arbitrary except that it only
selects individuals from $P(t-1)$.
In particular, we do not assume that the selected individuals are independent.
Formally speaking, the outcome of the selection process is a random $\lambda$-tuple $Q = sel(P(t-1)) \in [1..\lambda]^\lambda$ such that $P_{Q_1}(t-1), \ldots, P_{Q_\lambda}(t-1)$ are the selected parents.

From each selected parent $P_{Q_i}(t-1)$,
a single offspring $P_i(t)$ is generated via a randomized *mutation operator*: $P_i(t) = mut(P_{Q_i}(t-1))$.
Formally speaking, for each $x \in \Omega$, $mut(x)$ is a probability distribution on $\Omega$ and we write $y = mut(x)$ to indicate that $y$ is
sampled from this distribution. We assume that each sample, that is, each call
of a mutation operator, uses independent randomness. With this notation, we can
write the new population as $P(t) = (mut(P_{Q_1}(t-1)), \ldots, mut(P_{Q_\lambda}(t-1)))$ with $Q = sel(P(t-1))$.
From the definition it is clear that a PSM process is a Markov process with
state space $\mathcal{P}$. A pseudocode description of PSM
processes is given in Algorithm 1.
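
The iteration of Algorithm 1 can be sketched in a few lines of Python (a schematic rendering; `select` and `mutate` are placeholders supplied by the caller, and the toy instantiation below, with uniform selection and a one-bit flip, is ours):

```python
import random

def psm_step(P, select, mutate, rng):
    """One iteration of a PSM process: select a lambda-tuple of parent
    indices (arbitrarily, possibly with dependencies among them), then
    apply the mutation operator independently to each selected parent."""
    Q = select(P, rng)                    # indices into P, len(Q) == len(P)
    return [mutate(P[i], rng) for i in Q]

# Toy instantiation: uniform selection, one-bit flip on bit strings.
def uniform_select(P, rng):
    return [rng.randrange(len(P)) for _ in P]

def one_bit_flip(x, rng):
    i = rng.randrange(len(x))
    return x[:i] + (1 - x[i],) + x[i + 1:]

rng = random.Random(0)
P = [(0, 0, 0, 0) for _ in range(6)]      # lambda = 6, n = 4
for _ in range(10):
    P = psm_step(P, uniform_select, one_bit_flip, rng)
```

The population size is preserved in every iteration, as required by the definition.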

The following characteristic of the selection operator was found to be crucial
for the analysis of PSM processes in Lehre (2010). Let $P \in \mathcal{P}$ and $i \in [1..\lambda]$.
Then the random variable $R(i,P) = |\{j \in [1..\lambda] \mid sel(P)_j = i\}|$,
called *reproduction number of the $i$-th individual in $P$*,
denotes the number of times $P_i$ was selected from $P$ as
parent. Its expectation $E[R(i,P)]$ is called the *reproduction rate*.

Example: We now describe how the $(\mu ,\lambda )$ EA fits into this framework. That it fits into this framework and that the reproduction rate is $\frac{\lambda}{\mu}$ was already stated in Lehre (2010), but how exactly this works out, to the best of our knowledge, was never made precise so far, and is also not totally trivial.

We specify that when talking about the $(\mu ,\lambda )$ EA, given in pseudocode in Algorithm 2, we mean the basic EA which starts with a parent population of $\mu$ search points chosen independently and uniformly at random from $\{0,1\}^n$. In each iteration, $\lambda$ offspring are generated, each by selecting a parent individual uniformly at random (with repetition) and mutating it via standard bit mutation with mutation rate $p$. The next parent population is selected from these $\lambda$ offspring by taking $\mu$ best individuals, breaking ties randomly.

This algorithm can be modeled as a PSM process with population size $\lambda$ (not $\mu$). To do so, we need a slightly non-standard initialization of the population. We generate $P(0)$ by first taking $\mu$ random search points and then generating each $P_i(0)$, $i \in [1..\lambda]$, by choosing (with replacement) a random one of the $\mu$ base individuals and mutating it. With this definition, each individual in $P(0)$ is uniformly distributed in $\{0,1\}^n$, but these individuals are not independent.

Given a population $P$ consisting of $\lambda$ individuals, the selection operator first selects a set $P_0$ of $\mu$ best individuals from $P$, breaking ties randomly. Formally speaking, this is a tuple $(i_1, \ldots, i_\mu)$ of indices in $[1..\lambda]$. Then a random vector $(j_1, \ldots, j_\lambda) \in [1..\mu]^\lambda$ is chosen and the selected parents are taken as $Q = (P_{i_{j_1}}, \ldots, P_{i_{j_\lambda}})$. The next population $P'$ is obtained by applying the mutation operator to each of these, that is, $P' = (mut(P_{i_{j_1}}), \ldots, mut(P_{i_{j_\lambda}}))$, where $mut(\cdot)$ denotes standard bit mutation with mutation rate $p$.

From this description, it is clear that each individual of each population of the $(\mu ,\lambda )$ EA has a reproduction rate of at most $\frac{\lambda}{\mu}$.
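
A quick empirical sanity check of this value (illustrative; parameter names are ours): with $\lambda$ independent uniform draws (with repetition) from the $\mu$ selected parents, each selected parent is drawn $\frac{\lambda}{\mu}$ times in expectation.

```python
import random

def reproduction_number_of_first(mu, lam, rng):
    """One (mu, lambda) selection step: lam independent uniform draws
    (with repetition) from [0..mu-1]; count how often parent 0 is drawn."""
    return sum(1 for _ in range(lam) if rng.randrange(mu) == 0)

rng = random.Random(3)
mu, lam, trials = 4, 8, 20000
avg = sum(reproduction_number_of_first(mu, lam, rng)
          for _ in range(trials)) / trials
# E[R] = lam * (1/mu) = 2 exactly; avg should be close to that.
```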

### 4.2 Our “Negative Drift in Populations” Result

We prove the following version of the negative drift in populations method.

Consider a PSM process $(P(t))_{t \ge 0}$ with associated reproduction numbers $R(\cdot,\cdot)$ as defined in Section 4.1. Let $g: \Omega \to \mathbb{Z}_{\ge 0}$, called the potential function, and $a, b \in \mathbb{Z}_{\ge 0}$ with $a \le b$. Assume that for all $x \in P(0)$ we have $g(x) \ge b$. Let $T = \min\{t \ge 0 \mid \exists i \in [1..\lambda]: g(P_i(t)) \le a\}$ be the first time we have a search point with potential $a$ or less in the population. Assume that the following three conditions are satisfied.

- There is an $\alpha \ge 1$ such that for all populations $P \in \mathcal{P}$ with $\min\{g(P_i) \mid i \in [1..\lambda]\} > a$ and all $i \in [1..\lambda]$ with $g(P_i) < b$, we have $E[R(i,P)] \le \alpha$.

- There is a $\kappa > 0$ and a $0 < \delta < 1$ such that for all $x \in \Omega$ with $a < g(x) < b$ we have $E[\exp(-\kappa g(mut(x)))] \le \frac{1}{\alpha}(1 - \delta) \exp(-\kappa g(x))$.
- There is a $D \ge \delta$ such that for all $x \in \Omega$ with $g(x) \ge b$, we have $E[\exp(-\kappa g(mut(x)))] \le D \exp(-\kappa b)$.

Then

- $E[T] \ge \frac{\delta}{2 D \lambda} \exp(\kappa (b - a)) - \frac{1}{2}$, and

- for all $L \ge 1$, we have $\Pr[T < L] \le \frac{L \lambda D}{\delta} \exp(-\kappa (b - a))$.

Before proceeding with the proof, we compare our result with Lehre (2010, Theorem 1). We first note that, apart from a technicality which we discuss toward the end of this comparison, the assumptions of our result are weaker than the ones in Lehre (2010) since we do not need the technical fourth and fifth assumption of Lehre (2010), which in our notation would read as follows.

- There is a $\delta_2 > 0$ such that for all $i \in [a..b]$ and all $k, \ell \in \mathbb{Z}$ with $1 \le k + \ell$ and all $x, y \in \Omega$ with $g(x) = i$ and $g(y) = i - \ell$ we have $\Pr[g(mut(x)) = i - \ell \wedge g(mut(y)) = i - \ell - k] \le \exp(\kappa (1 - \delta_2)(b - a)) \Pr[g(mut(x)) = i - k - \ell]$.
- There is a $\delta_3 > 0$ such that for all $i, j, k, \ell \in \mathbb{Z}$ with $a \le i \le b$ and $1 \le k + \ell \le j$ and all $x, y \in \Omega$ with $g(x) = i$ and $g(y) = i - k$ we have $\Pr[g(mut(x)) = i - j] \le \delta_3 \Pr[g(mut(y)) = i - k - \ell]$.

The one point where the result of Lehre (2010) potentially is stronger is that it needs assumptions only on the “average drift” from the random search point at time $t$ conditional on having a fixed potential, whereas we require the same bound on the “point-wise drift,” that is, conditional on the current search point being equal to a particular search point of this potential. Let us make this more precise. Lehre uses the notation $(X_t)_{t \ge 0}$ to denote the Markov process on $\Omega$ associated with the mutation operator [unfortunately, it is not said in Lehre (2010) what $X_0$ is, that is, how this process is started]. Then $\Delta_t(i) = (g(X_{t+1}) - g(X_t) \mid g(X_t) = i)$ defines the potential gain in step $t$ when the current state has potential $i$. With this notation, instead of our second and third conditions, Lehre (2010) requires only the weaker conditions (here again translated into our notation).

- (ii') For all $t \ge 0$ and all $a < i < b$, $E[\exp(-\kappa \Delta_t(i))] < \frac{1}{\alpha}(1 - \delta)$.

- (iii') For all $t \ge 0$, $E[\exp(-\kappa (g(X_{t+1}) - b)) \mid g(X_t) \ge b] < D$.

So Lehre only requires that the random individual at time $t$, conditional on having a certain potential, gives rise to a certain drift, whereas we require that each particular individual with this potential gives rise to this drift. On the formal level, Lehre's condition is much weaker than ours (assuming that the unclear point of what $X_0$ is can be fixed). That said, to exploit such weaker conditions, one would need to be able to compute such average drifts and they would need to be smaller than the worst-case point-wise drift. We are not aware of many examples where average drift was successfully used in drift analysis (one is Jägersküpper's remarkable analysis of the linear functions problem; Jägersküpper, 2008) despite the fact that many classic drift theorems only require conditions on the average drift to hold.

We now prove Theorem 4. Before
stating the formal proof, we describe on a high level its main ingredients and
how it differs from Lehre's proof.

The main challenge when using drift analysis is designing a potential function that suitably measures the progress. For simple hillclimbers and optimization problems, the fitness of the current solution may suffice, but already the analysis of the $(1+1)$ EA on linear functions resisted such easy approaches (He and Yao, 2001; Droste et al., 2002; Doerr et al., 2012; Witt, 2013). For population-based algorithms, the additional challenge is to capture the quality of the whole population in a single number. We note at this point that the notion of “negative drift in populations” was used by Lehre (2010) to informally describe the characteristic of the population processes regarded, but drift analysis as a mathematical tool was employed only on the level of single individuals, and the resulting findings were lifted to the whole population via advanced tools like branching processes and eigenvalue arguments.

To prove upper bounds, in Witt (2006), Chen et al. (2009), Lehre (2011), Dang and Lehre (2016a), Corus et al. (2018), Antipov et al. (2018), and Doerr and Kötzing (2019), implicitly or explicitly potential functions were used that build on the fitness of the best individual in the population and the number of individuals having this fitness. Since they regard only the current-best individuals, these potential functions might not be suitable for lower bound proofs.

The lower bound proofs in Neumann et al. (2009), Oliveto and Witt (2014, 2015), and Antipov et al.
(2019) all define a natural potential
for single individuals, namely the Hamming distance to the optimum, and then
lift this potential to populations by summing over all individuals an
exponential transformation of their base potential (this ingenious definition
was, to the best of our knowledge, not known in the theory of evolutionary
algorithms before the work of Neumann et al., 2009). This is the type of potential we shall use as well, and given
the assumptions of Theorem ^{4}, it is
not surprising that $\u2211x\u2208Pexp(-\kappa g(x))$ is a good choice. For this potential, we shall then show with only mild effort
that it satisfies the assumptions of our drift theorem, which yields the desired
lower bounds on the runtime (using that a single good solution in the population
already requires a very high potential due to the exponential scaling). We now
give the details of this proof idea.

*Proof of Theorem 4:*

Let $\Delta = \lambda D \exp(-\kappa b)$.
Since $P(0)$ contains no individual with potential below $b$,
we have $X_0 \le \lambda \exp(-\kappa b) = \frac{\Delta}{D} \le \frac{\Delta}{\delta}$.
Hence, also the assumption $E[X_0] \le \frac{\Delta}{\delta}$ of Lemma 3 is fulfilled.

An application of the negative multiplicative drift theorem (Lemma 3) gives the claimed bounds.

We note that the proof above actually shows the following slightly stronger statement, which can be useful when working with random initial populations (as, for example, in the following section).

Theorem 4 remains valid when the
assumption that all initial individuals have potential at least $b$ is replaced by the assumption $\sum_{i=1}^{\lambda} E[\exp(-\kappa g(P_i(0)))] \le \frac{\lambda D \exp(-\kappa b)}{\delta}$.

## 5 Processes Using Standard Bit Mutation

Since many EAs use standard bit mutation, as in Lehre (2010) we now simplify our main result for processes using standard bit mutation and for $g$ being the Hamming distance to a target solution. Hence, in this section, we have $\Omega = \{0,1\}^n$ and $y = mut(x)$ is obtained from $x$ by flipping each bit of $x$ independently with probability $p$. Since our results are non-asymptotic, we can work with any $p \le \frac{1}{2}$.

Consider a PSM process (see Section 4.1) with search space $\Omega = \{0,1\}^n$, using standard bit mutation with mutation rate $p \in [0, \frac{1}{2}]$ as mutation operator, and such that $P_i(0)$ is uniformly distributed in $\Omega$ for each $i \in [1..\lambda]$ (possibly with dependencies among the individuals). Let $x^* \in \Omega$ be the target of the process. For all $x \in \Omega$, let $g(x) := H(x, x^*)$ denote the Hamming distance from the target.

Let $\alpha \ge 1$ and $0 < \delta < 1$ be such that $\ln(\frac{\alpha}{1-\delta}) < pn$, that is, such that $1 - \frac{1}{pn}\ln(\frac{\alpha}{1-\delta}) =: \varepsilon > 0$. Let $B = \frac{2}{\varepsilon}$. Let $a, b$ be integers such that $0 \le a < b$ and $b \le \tilde{b} := \frac{n}{B^2 - 1}$.

Selection condition: Assume that for all populations $P \in \mathcal{P}$ with $\min\{g(P_i) \mid i \in [1..\lambda]\} > a$ and all $i \in [1..\lambda]$ with $g(P_i) < b$, we have $E[R(i,P)] \le \alpha$.

The proof of this result is a reduction to Theorem 4. To show that the second and third condition of Theorem 4 are satisfied, one has to estimate $E[\exp(-\kappa (g(mut(x)) - g(x)))]$,
which is not difficult since $g(mut(x)) - g(x)$ can be written as a sum of independent random variables. With a similar computation
and some elementary calculus, we show that the weaker starting condition of Theorem 5 is satisfied.

*Proof of Theorem 6:*

We apply Theorem 4. To show the
second and third condition of the theorem, let $x \in \Omega$ and let $y = mut(x)$ be the random offspring generated from $x$.
We use the shorthand $d = g(x)$.
We note that $g(y) - g(x) = g(y) - d$ can be expressed as a sum of $n$ independent random variables $Z_1, \ldots, Z_n$ such that for $i \in [1..d]$,
we have $\Pr[Z_i = -1] = p$ and $\Pr[Z_i = 0] = 1 - p$,
and for $i \in [d+1..n]$,
we have $\Pr[Z_i = +1] = p$ and $\Pr[Z_i = 0] = 1 - p$.
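
Since the $Z_i$ are independent, the exponential moment factorizes into $E[\exp(-\kappa(g(y) - d))] = (1 - p + p e^{\kappa})^d (1 - p + p e^{-\kappa})^{n-d}$. The sketch below (our own cross-check, not code from the paper) verifies this product formula against a direct summation over the joint distribution of repaired and broken bits.

```python
import math
from math import comb

def pmf_binom(k, m, p):
    return comb(m, k) * p**k * (1 - p)**(m - k)

def mgf_direct(n, d, p, kappa):
    """E[exp(-kappa (g(y) - d))] by summing over B1 ~ Bin(d, p) repaired
    bits and B2 ~ Bin(n - d, p) broken bits, with g(y) - d = B2 - B1."""
    return sum(pmf_binom(b1, d, p) * pmf_binom(b2, n - d, p)
               * math.exp(-kappa * (b2 - b1))
               for b1 in range(d + 1) for b2 in range(n - d + 1))

def mgf_product(n, d, p, kappa):
    """Product formula from the n independent Z_i:
    E[e^{-kappa Z_i}] = 1 - p + p e^{kappa} for the d 'wrong' bits and
    1 - p + p e^{-kappa} for the n - d 'correct' bits."""
    return ((1 - p + p * math.exp(kappa)) ** d
            * (1 - p + p * math.exp(-kappa)) ** (n - d))
```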

This establishes the second condition of Theorem 4 for $\kappa = \ln(B)$.

To show that the third condition of Theorem 4 is satisfied, assume that $g(x) \ge b$. We first note the following. Let $x' \in \Omega$ with $g(x') = b$ and let $y' = mut(x')$. By Lemma 2, $g(y)$ stochastically dominates $g(y')$. Consequently, by Lemma 1, $E[\exp(-\kappa g(y))] \le E[\exp(-\kappa g(y'))]$, which yields the third condition of Theorem 4 (including the requirement $D \ge \delta$).

It remains to show that the weaker starting condition of Theorem 5 is satisfied. Using the moment-generating function of a binomially distributed random variable (which is nothing more than the arguments used in (3)), this follows immediately from an elementary estimate of $E[\exp(-\kappa g(x))]$, valid for a uniformly random search point $x$.

^{5}. From the conclusion of Theorem

^{4}, we obtain
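As a quick numerical sanity check of the decomposition used in this proof (our own illustration, not part of the argument; all names are ours), the following sketch compares a Monte Carlo estimate of $E[\exp(-\kappa(g(y) - d))]$ with the closed-form product that the independence of the $Z_i$ yields:

```python
import math
import random

def mgf_exact(n, d, p, kappa):
    # E[exp(-kappa * Z_i)] for the two kinds of summands:
    #   i in [1..d]:   Z_i = -1 with prob. p (a wrong bit is corrected)
    #   i in [d+1..n]: Z_i = +1 with prob. p (a correct bit is flipped)
    factor_wrong = 1 - p + p * math.exp(kappa)
    factor_correct = 1 - p + p * math.exp(-kappa)
    return factor_wrong ** d * factor_correct ** (n - d)

def mgf_monte_carlo(n, d, p, kappa, trials=200_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        delta = sum(-1 for _ in range(d) if rng.random() < p) \
              + sum(+1 for _ in range(n - d) if rng.random() < p)
        total += math.exp(-kappa * delta)
    return total / trials

n, d, p, kappa = 20, 8, 0.1, math.log(2)
exact = mgf_exact(n, d, p, kappa)
estimate = mgf_monte_carlo(n, d, p, kappa)
print(exact, estimate)  # both close to 1.158
```

The agreement of the two values illustrates that the drift computation reduces to elementary products of per-bit moment-generating functions.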

As a simple **example** for an application of this result, let us consider
the classic $(\mu, \lambda)$ EA (with uniform selection for variation, truncation selection for inclusion into
the next generation, and mutation rate $p = \frac{1}{n}$)
with $\lambda = 2\mu$ optimizing some function $f : \{0,1\}^n \to \mathbb{R}$, $n = 500$,
with unique global optimum. For simplicity, let us take as performance measure $\lambda T$,
that is, the number of fitness evaluations in all iterations up to the one in which
the optimum was found. Since $\lambda = 2\mu$,
we have $\alpha = 2$.
By taking $\delta = 0.01$,
we obtain a concrete lower bound of an expected number of more than 13 million
fitness evaluations until the optimum is found (regardless of $\mu$ and $f$).
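The intermediate quantities of Theorem 6 for this example follow directly from the definitions above; the following sketch computes $\varepsilon$, $B$, and the largest admissible starting point $\tilde b$ (the final constant of 13 million additionally uses the runtime bound of the theorem, which we do not reproduce here):

```python
import math

# Parameters of the example: (mu, lambda) EA with lambda = 2*mu and p = 1/n
n = 500
p = 1 / n
alpha = 2.0     # bound on the reproduction rate, since lambda = 2*mu
delta = 0.01

# Quantities as defined in Theorem 6
eps = 1 - math.log(alpha / (1 - delta)) / (p * n)   # must be positive
B = 2 / eps
b_tilde = n / (B ** 2 - 1)   # largest admissible starting point b

print(f"eps = {eps:.4f}, B = {B:.3f}, b_tilde = {b_tilde:.2f}")
```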

Since Theorem 6 is slightly technical, we
now formulate the following corollary, which removes the variable $\delta$ without significantly weakening the result. We note that the proof of this result
applies Theorem 6 with a non-constant $\delta$,
so we do not see how such a result could have been proven from Lehre (2010).

Consider a PSM process as in Theorem 6. Let $x^* \in \Omega$ be the target of the process. For all $x \in \Omega$,
let $g(x) := H(x, x^*)$ denote the Hamming distance from the target. Assume that there is an $\alpha \ge 1$ such that

$\ln(\alpha) \le p(n-1)$, which is equivalent to $\gamma := 1 - \frac{\ln \alpha}{pn} \ge \frac{1}{n}$;

there is an $a \le b := \lfloor (1 - \frac{4}{n}) \frac{n}{4/\gamma^2 - 1} \rfloor$ such that for all populations $P \in \mathcal{P}$ with $\min\{g(P_i) \mid i \in [1..\lambda]\} > a$ and for all $i \in [1..\lambda]$, we have $E[R(i, P)] \le \alpha$.

The main argument is to apply Theorem 6 with $\delta = \frac{p}{2n}$ and to compute that this small $\delta$ has no significant influence on the exponential term of the bounds.

Proof of Corollary 7:

We apply Theorem 6 with $\delta = \frac{p}{2n}$. Since $\delta \le \frac{1}{2}$, we have $1 - \delta \ge \exp(-2\delta)$ and thus $\ln(\frac{\alpha}{1-\delta}) = \ln(\alpha) - \ln(1-\delta) \le \ln(\alpha) + 2\delta$. Consequently, $\varepsilon := 1 - \frac{1}{pn} \ln(\frac{\alpha}{1-\delta})$ defined as in Theorem 6 satisfies $\varepsilon \ge \gamma - \frac{2\delta}{pn} = \gamma - \frac{1}{n^2}$, and with this estimate the bounds of Theorem 6 become the bounds claimed here.

We note that $b = \Theta(n \gamma^2)$ since $\gamma$ is always at most one. By assumption, $b - a = \Theta(b)$. Assume first that $\gamma = \omega(n^{-1/2})$. If $\gamma \le n^{-1/4}$, then $\exp(\ln(2/\gamma)(b-a)) = (2/\gamma)^{b-a} \ge (2 n^{1/4})^{\omega(1)}$, which is super-polynomial. If $\gamma \ge n^{-1/4}$, then $b - a = \Omega(n^{1/2})$ and $\exp(\ln(2/\gamma)(b-a)) \ge 2^{b-a}$ is again super-polynomial. This shows the claimed super-polynomiality for $\gamma = \omega(n^{-1/2})$.

For $\gamma = \Omega(1)$, we have $b - a = \Theta(b) = \Theta(n)$, and thus $\exp(\ln(2/\gamma)(b-a)) \ge \exp(\ln(2)(b-a)) = \exp(\Theta(n))$ is exponential in $n$.

The asymptotic statements of the with-high-probability claims follow analogously.

## 6 Standard Bit Mutation with Random Mutation Rate

To analyze a *uniform mixing hyper-heuristic* which uses standard bit
mutation with a mutation rate randomly chosen from a finite set of alternatives,
Dang and Lehre (2016b, Theorem 2) extend
Theorem 6 to such mutation operators.
They do not give a proof of their result, stating that it would be similar to the
proof of the result for classic standard bit mutation (Lehre, 2010, Theorem 4).
Since we did not find this so obvious, we reprove this result now with our methods.
The non-asymptoticity of our result allows us to extend it to super-constant numbers of
mutation rates and to mutation rates other than $\Theta(1/n)$.
We note that such situations appear naturally with the heavy-tailed mutation
operator proposed in Doerr et al. (2017).

We show the following result, which extends Theorem 6.

Let $n \in \mathbb{N}$. Let $m \in \mathbb{N}$, $p_1, \ldots, p_m \in [0, \frac{1}{2}]$, and $q_1, \ldots, q_m \in [0,1]$ such that $\sum_{i=1}^{m} q_i = 1$. Let $\operatorname{mut}$ be the mutation operator which, in each application independently, chooses an $I \in [1..m]$ with probability $\Pr[I = i] = q_i$ for all $i \in [1..m]$ and then applies standard bit mutation with mutation rate $p_I$.

Consider a PSM process (see Section 4.1) with search space $\Omega = \{0,1\}^n$, using this mutation operator $\operatorname{mut}(\cdot)$, and such that each initial individual is uniformly distributed in $\Omega$ (not necessarily independently). Let $x^* \in \Omega$ be the target of the process. For all $x \in \Omega$, let $g(x) := H(x, x^*)$ denote the Hamming distance from the target.

Selection condition: Assume that for all populations $P \in \mathcal{P}$ with $\min\{g(P_i) \mid i \in [1..\lambda]\} > a$ and all $i \in [1..\lambda]$ with $g(P_i) < b$, we have $E[R(i, P)] \le \alpha$.
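A minimal implementation of the mixed mutation operator defined above can look as follows (an illustration only; the function name is ours):

```python
import random

def mixed_standard_bit_mutation(x, rates, weights, rng=None):
    """Choose a rate p_I with Pr[I = i] = weights[i], then flip each bit
    of x independently with probability rates[I] (standard bit mutation)."""
    rng = rng or random.Random()
    p = rng.choices(rates, weights=weights, k=1)[0]
    return [bit ^ (rng.random() < p) for bit in x]

# Example: rates 1/n and 2/n, each chosen with probability 1/2
n = 50
offspring = mixed_standard_bit_mutation([0] * n, [1 / n, 2 / n], [0.5, 0.5])
print(sum(offspring))  # number of flipped bits
```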

It is clear that when using standard bit mutation with a random mutation rate,
the drift, regardless of whether we just regard the fitness or an exponential
transformation of it, is a convex combination of the drift values of
the individual mutation operators. The reason why this argument does not immediately
extend Theorem 6 to random mutation
rates is that the mutation rate also occurs in the exponential term $\exp(pn(-1 + \frac{2}{B}))$ in Equation (4). Apart from this
difficulty, however, we can reuse large parts of the proof of Theorem 6.

Proof of Theorem 8:

As in the proof of Theorem 6, we obtain the second condition of Theorem 4 and, with the same domination argument as in the proof of Theorem 6 and $D = \max\{(1-\delta)\frac{1}{\alpha}, \delta\}$, also the third condition of Theorem 4. The starting condition of Theorem 5 follows as in the proof of Theorem 6 by noting that again $B \ge 2$. Now Theorem 4 is applicable, and as in the last few lines of the proof of Theorem 6 we show our claim (note that now it suffices to simply replace $\kappa$ by $\ln B$ and not by the more complicated logarithmic term there).

Equation (5) defining the admissible values for $B$ and thus for the starting point $b$ of the negative drift regime is not very convenient to work with in general. We stated it nevertheless because in particular situations it might be useful, for example, to show an inapproximability result, that is, that a certain algorithm cannot come closer to the optimum than by a certain margin in subexponential time. The following weaker assumption is easier to work with and should, in most cases, give satisfactory results as well.

Under the assumptions of Theorem 8, Lemma 9 gives such an easier sufficient condition; it bounds the admissible reproduction rate $\alpha$ in terms of the expression $\sum_{i=1}^{m} q_i \exp(-p_i n)$.

If in an asymptotic setting $\gamma$ and $\alpha$ can be taken as constants, then this lemma and Theorem 8 show an exponential lower bound on the runtime. This proves
(Dang and Lehre, 2016b, Theorem 2) and extends
it to mutation rates that are not necessarily $\Theta(1/n)$.

As an example where mutation rates other than $\Theta(1/n)$ occur, we now regard the heavy-tailed mutation operator proposed in Doerr et al. (2017). This operator was shown to give a uniformly good performance of the $(1+1)$ EA on all jump functions, whereas each fixed mutation rate was seen to be good only for a small range of jump sizes. The heavy-tailed operator and variations of it have shown a good performance also in other works, for example, Mironovich and Buzdalov (2017); Friedrich et al. (2018a, 2018b); Friedrich, Quinzan et al. (2018); Wu et al. (2018); Antipov et al. (2020a, 2020b); Antipov and Doerr (2020); and Ye et al. (2020). The heavy-tailed mutation operator is nothing else than standard bit mutation with a random mutation rate, chosen from a heavy-tailed distribution. Doerr et al. (2017) defined it as follows. Let $\beta > 1$ be a constant. This will be the only parameter of the mutation operator, however, one with not too much importance, so Doerr et al. (2017) simply propose to take $\beta = 1.5$. In each invocation of the mutation operator, a number $\alpha \in [1..N]$, $N := \lfloor \frac{1}{2} n \rfloor$, is chosen from the power-law distribution with exponent $\beta$. Hence, $\Pr[\alpha = i] = (C_N^\beta)^{-1} i^{-\beta}$, where $C_N^\beta$ is the normalizing constant $C_N^\beta := \sum_{i=1}^{N} i^{-\beta}$. Once $\alpha$ is determined, standard bit mutation with mutation rate $p = \frac{\alpha}{n}$ is employed.
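The sampling step of this operator can be sketched as follows (an illustration; the function name is ours):

```python
import random

def heavy_tailed_rate(n, beta=1.5, rng=None):
    """Sample alpha in [1..N], N = floor(n/2), from the power law with
    Pr[alpha = i] proportional to i**(-beta), and return the rate alpha/n."""
    rng = rng or random.Random()
    N = n // 2
    weights = [i ** (-beta) for i in range(1, N + 1)]
    alpha = rng.choices(range(1, N + 1), weights=weights, k=1)[0]
    return alpha / n

rng = random.Random(1)
rates = [heavy_tailed_rate(100, rng=rng) for _ in range(5000)]
print(min(rates), max(rates))  # rates lie in [1/n, 1/2]
```

Note that the smallest rate $1/n$ receives the largest probability mass, so most mutations behave like classic standard bit mutation, while occasionally much larger rates occur.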

For fixed $N$,
the expression on the left-hand side in Lemma 9 is $A_N = \sum_{i=1}^{N} (C_N^\beta)^{-1} i^{-\beta} e^{-i}$.
This is a convex combination of $e^{-i}$ terms, and by comparing the coefficients, we easily see that this expression is
decreasing in $N$.
Computing $A_{100} < 0.178$,
we see that for all $n \ge 200$,
a reproduction number of at most $\alpha \le \frac{1}{0.178} \approx 5.618$ is small enough to lead to exponential runtimes. This is higher than for standard
bit mutation with mutation rate $p = \frac{1}{n}$,
where only $\alpha \le e \approx 2.718$ suffices to show exponential runtimes. This observation fits our general impression
that larger mutation rates can be destructive, something from which non-elitist
algorithms in particular suffer.

For the limiting value $A = \lim_{N \to \infty} A_N$, we note that $A \ge A^- := \sum_{i=1}^{100} (C_\infty^\beta)^{-1} i^{-\beta} \exp(-i) \approx 0.164004$ and $A \le A^+ := \sum_{i=1}^{100} (C_\infty^\beta)^{-1} i^{-\beta} \exp(-i) + \exp(-101) \approx 0.164004$. Hence, for $n$ sufficiently large, even a value of $\alpha \approx \frac{1}{0.164004} \approx 6.0974$ admits exponential lower bounds for runtimes.
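These numerical values are easy to reproduce; the following sketch computes $A_{100}$ and approximates the limit (the truncation point $10^6$ and the integral tail correction for $\zeta(1.5)$ are our choices):

```python
import math

def A(N, beta=1.5):
    """A_N = sum_{i=1}^N (C_N^beta)^(-1) * i^(-beta) * exp(-i)."""
    C = sum(i ** (-beta) for i in range(1, N + 1))
    return sum(i ** (-beta) * math.exp(-i) for i in range(1, N + 1)) / C

A100 = A(100)
print(A100)  # below 0.178, as used in the text

# Limit N -> infinity: approximate C_infinity = zeta(1.5) by truncating
# at M = 10**6 and adding the integral bound 2/sqrt(M) for the tail.
M = 10 ** 6
C_inf = sum(i ** (-1.5) for i in range(1, M + 1)) + 2 / math.sqrt(M)
A_inf = sum(i ** (-1.5) * math.exp(-i) for i in range(1, 101)) / C_inf
print(A_inf)  # close to 0.164
```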

Without going into details, and in particular without full proofs, we note that these estimates are tight, and this for all mutation operators of the type discussed in this section. Consider such a mutation operator such that $\sum_{i=1}^{m} q_i \exp(-p_i n) \ge (1+\delta) \frac{1}{\alpha}$ for some $\delta > 0$. We take the $(\mu, \lambda)$ EA optimizing OneMax as example. Assume that at some time we have in our parent population $k$ individuals on the highest non-empty fitness level $L$. In expectation, each of them generates $\alpha = \lambda/\mu$ offspring. Each of these offspring is an exact copy of the parent with probability $\sum_{i=1}^{m} q_i \exp(-p_i n) \ge (1+\delta) \frac{1}{\alpha}$. Consequently, in the next generation the expected number of individuals on level $L$ or higher (as long as level $L$ is not full) is $(1+\delta)k$. This is enough to show polynomial runtimes via the level-based method (Lehre, 2011; Dang and Lehre, 2016a; Corus et al., 2018; Doerr and Kötzing, 2019) when $\lambda$ is large enough. For example, the computation just made shows that condition (G2) in Doerr and Kötzing (2019, Theorem 3.2) is satisfied.

We note that this tightness stems from the fact that the term $\sum_{i=1}^{m} q_i \exp(-p_i n)$ appears both in the drift computation here and, via the probability to generate a copy of a parent, in the fitness level method for populations. We do not think that this is a coincidence, but leave working out the details to future work.

## 7 Fitness Proportionate Selection

In this section, we apply our method to a mutation-only version of the simple genetic algorithm (simple GA). We note that this algorithm traditionally is used with crossover (Goldberg, 1989). The mutation-only version has been regarded in the runtime analysis community mostly because runtime analyses for crossover-based algorithms are extremely difficult. While the first runtime analysis of the mutation-only version (Neumann et al., 2009) appeared in 2009 and showed a near-exponential lower bound on OneMax for arbitrary polynomially bounded population sizes, the first analysis of the crossover-based version from 2012 (Oliveto and Witt, 2012b) could only show a significantly sub-exponential lower bound ($2^{n^c}$ for a constant $c$ which is at most $\frac{1}{80}$) and this only for population sizes below $n^{1/8}$. We note that the current best result (Oliveto and Witt, 2015) gives a similar lower bound for population sizes below $n^{1/4}$. Both works call these runtimes exponential, and we acknowledge that this definition of exponential runtimes exists, but given the substantial difference between $2^{n^{1/80}}$ (which is less than 3.5 for all $n \le 10^{20}$) and $\exp(\Theta(n))$, we prefer to reserve the notion “exponential” for the latter.

The mutation-only version of the simple GA with population size $\mu \in \mathbb{N}$ is described in Algorithm 3. This algorithm starts with a population $P(0)$ of $\mu$ random individuals from $\{0,1\}^n$. In each iteration $t = 1, 2, 3, \ldots$, it computes from the previous population $P(t-1)$ a new population $P(t)$ by $\mu$ times independently selecting an individual from $P(t-1)$ via fitness proportionate selection and mutating it via standard bit mutation with mutation rate $p = \frac{1}{n}$.
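One generation of this algorithm can be sketched in Python as follows (a minimal illustration only; Algorithm 3 in the paper is the authoritative description, and the uniform selection for an all-zero population follows the convention used for fitness proportionate selection later in this section):

```python
import random

def simple_ga_generation(pop, p, rng):
    """One generation of the mutation-only simple GA: mu times, choose a
    parent by fitness proportionate selection (uniformly if all fitness
    values are zero) and apply standard bit mutation with rate p."""
    fits = [sum(x) for x in pop]          # OneMax fitness
    total = sum(fits)
    new_pop = []
    for _ in range(len(pop)):
        if total > 0:
            parent = rng.choices(pop, weights=fits, k=1)[0]
        else:
            parent = rng.choice(pop)
        new_pop.append([bit ^ (rng.random() < p) for bit in parent])
    return new_pop

n, mu = 30, 10
rng = random.Random(42)
pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
for _ in range(5):
    pop = simple_ga_generation(pop, p=1 / n, rng=rng)
print(max(sum(x) for x in pop))
```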

The precise known results for the performance of Algorithm 3 on the OneMax benchmark are the following. Neumann et al. (2009, Theorem 8) showed that
with $\mu \le \operatorname{poly}(n)$ it needs with high probability more than $2^{n^{1 - O(1/\log\log n)}}$ iterations to find the optimum of the OneMax function or any search point
in Hamming distance at most $0.003n$ from it. This is only a subexponential lower bound. In Lehre (2011, Corollary 13), building on the lower bound method from
Lehre (2010), a truly exponential lower bound
is shown for the weaker task of finding a search point in Hamming distance at most $0.029n$ from the optimum, but only for a relatively large population size of $\mu \ge n^3$ (and again $\mu \le \operatorname{poly}(n)$).

We now extend this result to arbitrary $\mu$,
that is, we remove the conditions $\mu \ge n^3$ and $\mu \le \operatorname{poly}(n)$.
To obtain the best known constant 0.029 for how close to the optimum the algorithm
cannot get in subexponential time, we have to compromise on the constants in the
runtime, which consequently are only of theoretical interest. We therefore do not
specify the base of the exponential function or the leading constant. We note that
this would have been easily possible since we only use a simple additive Chernoff
bound and Corollary 7. We further note
that Lehre (2011) also shows lower bounds for
a scaled version of fitness proportionate selection and a general $\Theta(1/n)$ mutation rate. This would also be possible with our approach and would again remove
the conditions on $\mu$,
but we do not see that the additional effort is justified here.

There is a $T = \exp(\Omega(n))$ such that the mutation-only simple GA optimizing OneMax with any population size $\mu$ with probability $1 - \exp(-\Omega(n))$ does not find any solution $x$ with $\operatorname{OneMax}(x) \ge 0.971 n$ within $T$ fitness evaluations.

The main difficulty in proving lower bounds for algorithms using fitness proportionate selection is that the reproduction number is non-trivial to estimate. If all but one individual have a fitness of zero, then this individual is selected $\mu$ times. Hence, $\mu$ is the only general upper bound for the reproduction number. The previous works and ours overcome this difficulty by arguing that the average fitness in the population cannot significantly drop below the initial value of $n/2$, which immediately yields that an individual with fitness $k$ has a reproduction number of roughly at most $\frac{k}{n/2}$.

While it is natural that the typical fitness of an individual should not drop far
below $n/2$,
making this argument precise is not completely trivial. In Neumann et al. (2009, Lemma 6), it was informally argued that the situation with fitness
proportionate selection cannot be worse than with uniform selection. For the latter
situation, a union bound over all lineages of individuals is employed and a
negative-drift analysis from Oliveto and Witt (2008, Section 3) is used for a
single lineage. The analysis in Lehre (2011,
Lemma 9) builds on the (positive) drift
stemming from standard bit mutation when the fitness is below $n/2$ (this argument needs a mutation rate of at least $\Omega(1/n)$)
and the independence of the offspring (here the lower bound $\mu \ge n^3$ is needed to admit the desired Chernoff bound estimates).

Our proof relies on a natural domination argument which shows that at all times all
individuals are at least as good as random individuals in the sense of stochastic
domination in fitness. This allows us to use a simple Chernoff and union bound to argue
that with high probability, for a long time all individuals have a fitness of at
least $(\frac{1}{2} - \varepsilon)n$.
The remainder of the proof is an application of Corollary 7. Here the lower bound (Lehre, 2010, Theorem 4) would have
been applicable as well, with the main difference that there one has to deal with the
constant $\delta$,
which does not exist in Corollary 7.

We start by proving the key argument used in the proof of Theorem 10, namely that at each time $t$, for each individual $i \in [1..\mu]$, the fitness stochastically dominates (see Section 2) the one of a random individual. We denote by $\operatorname{Bin}(n, p)$ the binomial distribution with parameters $n$ and $p$. With a slight abuse of notation, we write $\operatorname{Bin}(n, p) \preceq Y$ to denote that $Y$ stochastically dominates $X$ when $X$ is binomially distributed with parameters $n$ and $p$.

In this notation, our goal is to show that for all times $t$ and all $i \in [1..\mu]$, we have $\operatorname{Bin}(n, \frac{1}{2}) \preceq \operatorname{OneMax}(P_i(t))$. This statement appears easy to believe since fitness proportionate selection, favoring better individuals at least slightly, should not be able to make the population worse. To be on the safe side, we nevertheless prove this statement formally (after the following remark).

We note that another statement that might be easy to believe is not true, namely that the sum of the fitness values of a population at all times $t \ge 1$ dominates the sum of the fitness values of a random population (such as the initial population), that is, that $\operatorname{Bin}(\mu n, \frac{1}{2}) \preceq \sum_{i=1}^{\mu} \operatorname{OneMax}(P_i(t))$. As a counterexample, let $n$ be a multiple of 10 and let us consider the simple GA with $\mu = n$ after one iteration. Let $Y = \sum_{i=1}^{\mu} \operatorname{OneMax}(P_i(1))$. We estimate the probability of the event $Y \le 0.4 n \mu$. With probability at least $2^{-n}$, we have $\operatorname{OneMax}(P_1(0)) = 0.4n$. For each $i = 2, \ldots, \mu$, we have $\operatorname{OneMax}(P_i(0)) \le 0.5n$ with probability at least $0.5$ by the symmetry of the binomial distribution with parameter $p = 0.5$. These events are all independent, so with probability at least $2^{-n-\mu+1}$, all of them occur. In this case, for each $i = 1, \ldots, \mu$ independently, with probability at least $0.4n / (0.4n + 0.5n(\mu - 1)) \ge 0.8/\mu$, the $i$-th parent chosen in iteration 1 is $P_1(0)$, and with probability at least $(1 - 1/n)^n \ge 1/4$, the offspring generated from it equals the parent. All these events together occur with probability at least $2^{-n-\mu+1} (0.8/\mu)^{\mu} (1/4)^{\mu} \ge (20n)^{-n}$ (recall that $\mu = n$), which shows $\Pr[Y \le 0.4 n^2] \ge (20n)^{-n}$. Now for $X \sim \operatorname{Bin}(n\mu, \frac{1}{2})$, a simple Chernoff bound argument, e.g., via the additive Chernoff bound (Doerr, 2020d, Theorem 1.10.7), shows that $\Pr[X \le 0.4 n\mu] \le \exp(-2 (0.1 n\mu)^2 / (n\mu)) = \exp(-0.02 n\mu) = \exp(-0.02 n^2)$. Since this is (much) smaller than $(20n)^{-n}$ for $n$ sufficiently large, we do not have $X \preceq Y$.

Consider a run of the simple GA (Algorithm 3) on the OneMax benchmark. Then for each $t \ge 0$ and each $i \in [1..\mu]$, we have $\operatorname{Bin}(n, \frac{1}{2}) \preceq \operatorname{OneMax}(P_i(t))$.

To prove this result, we use the following auxiliary result, which states that a number uniformly chosen from a collection of non-negative numbers is stochastically dominated by a number chosen from the same collection via an analogue of fitness proportionate selection. To define the latter formally, let $n_1, \ldots, n_\mu \in \mathbb{R}_{\ge 0}$. For a random variable $v$, we write $v \sim \operatorname{fp}(n_1, \ldots, n_\mu)$ if

in the case that $n_i > 0$ for at least one $i \in [1..\mu]$, we have $\Pr[v = i] = \frac{n_i}{\sum_{j=1}^{\mu} n_j}$ for all $i \in [1..\mu]$, and

in the case that $n_i = 0$ for all $i \in [1..\mu]$, we have $\Pr[v = i] = \frac{1}{\mu}$ for all $i \in [1..\mu]$.

Let $n_1, \ldots, n_\mu \in \mathbb{R}_{\ge 0}$. Let $u \in [1..\mu]$ be uniformly chosen and $U = n_u$. Let $v \sim \operatorname{fp}(n_1, \ldots, n_\mu)$ and $V = n_v$. Then $U \preceq V$.
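This domination can be verified exactly for concrete fitness vectors; the following sketch (our own illustration, not part of the proof) compares the survival functions $\Pr[U \ge t]$ and $\Pr[V \ge t]$ over all thresholds:

```python
def dominates(values):
    """Exact check that the fitness proportionately selected value V
    stochastically dominates the uniformly selected value U, that is,
    Pr[U >= t] <= Pr[V >= t] for every threshold t."""
    mu = len(values)
    total = sum(values)
    uni = [1 / mu] * mu
    fp = [v / total for v in values] if total > 0 else uni
    for t in sorted(set(values)):
        p_uni = sum(q for v, q in zip(values, uni) if v >= t)
        p_fp = sum(q for v, q in zip(values, fp) if v >= t)
        if p_uni > p_fp + 1e-12:
            return False
    return True

print(dominates([0, 3, 5, 5, 12]))  # True
```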

We now show Lemma 11.

Proof of Lemma 11:

We show the claim via induction over time. For the random initial population $P(0)$,
the claim is obviously true. Assume that in some iteration $t+1$,
the parent population $P(t)$ satisfies that for all $i \in [1..\mu]$,
we have $\operatorname{Bin}(n, \frac{1}{2}) \preceq \operatorname{OneMax}(P_i(t))$.
We show that the same is true for $P(t+1)$.
Since all individuals of $P(t+1)$ are identically distributed, we consider how one of them is generated. Let $u \in [1..\mu]$ be uniformly random and let $v \sim \operatorname{fp}(\operatorname{OneMax}(P_1(t)), \ldots, \operatorname{OneMax}(P_\mu(t)))$ be the index of the parent individual selected for the generation of the offspring. By our
inductive assumption and Lemma 12,
we have $\operatorname{Bin}(n, \frac{1}{2}) \preceq \operatorname{OneMax}(P_u(t)) \preceq \operatorname{OneMax}(P_v(t))$.
Let $x$ be
a uniformly random individual and $y = P_v(t)$ be the parent just selected. Let $x'$ and $y'$ be the results of applying standard bit mutation to $x$ and $y$.
Since $\operatorname{OneMax}(x) \preceq \operatorname{OneMax}(y)$,
by Lemma 2 we have $\operatorname{OneMax}(x') \preceq \operatorname{OneMax}(y')$.
Now $y'$ is equal (in distribution) to the offspring we just regard, and $x'$ is (still) a uniformly random bit string. Hence, $\operatorname{Bin}(n, \frac{1}{2}) \sim \operatorname{OneMax}(x') \preceq \operatorname{OneMax}(y')$, as desired.

We are now ready to give the formal proof of Theorem 10.

Proof of Theorem 10:

Consider a run of the simple GA with population size $\mu$.
By the domination argument of Lemma 11, the fitness of a particular solution $P_i(t)$ dominates a sum of $n$ independent uniform $\{0,1\}$-valued
random variables. Hence, using the additive Chernoff bound (Doerr, 2020d, Theorem 1.10.7), we see that $\operatorname{OneMax}(P_i(t)) \le (\frac{1}{2} - \varepsilon)n =: s$ with probability at most $\exp(-2 \varepsilon^2 n)$, for all $\varepsilon > 0$.

To avoid working in conditional probability spaces, let us consider a
modification of the simple GA. It is identical to the original algorithm up to
the point when the fitness of an individual $P_i(t)$ for the first time goes below $s$.
From that time on, the algorithm selects the parents uniformly. Such an
artificial continuation of a process beyond the time horizon of interest
was, to the best of our knowledge, first used in the theory of evolutionary algorithms
in Doerr et al. (2011). For
our modified simple GA, the reproduction rate of any individual in a population
with all elements having fitness less than $n - a$, where $a \le n - s$,
is at most $\frac{n-a}{s} =: \alpha$.
Hence, we can apply Corollary 7 with
this $\alpha$.
Taking, similarly as in Lehre (2011), $\varepsilon = 0.0001$ and $a = 0.029 n$,
we can work with $\alpha = \frac{1 - 0.029}{0.5 - \varepsilon} \approx 1.942388$ and thus $\gamma \approx 0.336082$.
For $n$ sufficiently large, this allows us to use $b = \lceil 0.02905 n \rceil$.
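The constants just stated can be reproduced directly from the definitions in Corollary 7 (a sketch; note that $pn = 1$ since $p = 1/n$):

```python
import math

eps = 0.0001
a_frac = 0.029                       # distance threshold a = 0.029 * n
alpha = (1 - a_frac) / (0.5 - eps)   # bound on the reproduction rate
gamma = 1 - math.log(alpha)          # gamma = 1 - ln(alpha)/(p*n), with p*n = 1
b_coeff = 1 / (4 / gamma ** 2 - 1)   # b is roughly b_coeff * n by Corollary 7

print(alpha, gamma, b_coeff)
```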

While an exponential runtime on OneMax is not an exciting performance, for
the sake of completeness we note that the runtime of the simple GA on OneMax is not worse than exponential. A runtime of $\exp(O(n))$ can be shown with the methods of Doerr (2020b) (with some adaptations). The key observation is that, similar to
property (A) in Doerr (2020b, Theorem 3), if at some time $t$ the population contains an individual $x$ with fitness at least $n/3$,
then in the next iteration this individual is chosen as parent at least once with at
least constant probability and, conditional on this, with probability $\Omega(\frac{1}{n})$ a particular better Hamming neighbor of $x$ is
generated from $x$.

## 8 Conclusion and Outlook

In this work, we have proven two technical tools which might ease future lower bound proofs in discrete evolutionary optimization. The negative multiplicative drift theorem has the potential to replace the more technical negative drift theorems used so far in different contexts. Our strengthening and simplification of the negative drift in populations method should help increase our not very well developed understanding of population-based algorithms in the future. Clearly, it is restricted to mutation-based algorithms—providing such a tool for crossover-based algorithms and extending our understanding of how to prove lower bounds for these beyond the few results (Doerr and Theile, 2009; Oliveto and Witt, 2015; Sutton and Witt, 2019; Doerr, 2020e) would be great progress.

## Acknowledgments

This work was supported by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.

## References

*Genetic and Evolutionary Computation Conference (GECCO)*

*Parallel Problem Solving from Nature, Part II*

*Parallel Problem Solving from Nature, Part II*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*IEEE Transactions on Systems, Man, and Cybernetics, Part B*

*IEEE Transactions on Evolutionary Computation*

*Algorithmica*

*Parallel Problem Solving from Nature*

*Theoretical Computer Science*

*Foundations of Genetic Algorithms*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Parallel Problem Solving from Nature, Part II*

*Parallel Problem Solving from Nature, Part II*

*Theory of evolutionary computation: Recent developments in discrete optimization*

*CoRR*

*Algorithmica*

*Evolutionary Computation*

*Algorithmica*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Conference of the IEEE Industrial Electronics Society*

*Theoretical Computer Science*

*CoRR*

*Parallel Problem Solving from Nature, Part I*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Genetic algorithms in search, optimization and machine learning*

*Advances in Applied Probability*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Artificial Intelligence*

*Parallel Problem Solving from Nature*

*Foundations of Computational Intelligence*

*Random combinatorial structures and randomized search heuristics*

*Algorithmica*

*Parallel Problem Solving from Nature*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Theory of evolutionary computation: Recent developments in discrete optimization*

*Combinatorics, Probability & Computing*

*Genetic and Evolutionary Computation Conference (GECCO), Companion Material*

*International Journal on Intelligent Computing and Cybernetics*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Parallel Problem Solving from Nature*

*Algorithmica*

*CoRR*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Theoretical Computer Science*

*Theoretical Computer Science*

*Theoretical Computer Science*

*Genetic and Evolutionary Computation Conference (GECCO)*

*Evolutionary Computation*

*Combinatorics, Probability & Computing*

*Algorithmica*

*Intelligent Computing Methodologies, Part III*

*Parallel Problem Solving from Nature, Part II*

## Author notes

^{*}The author can be contacted at
doerr (at) lix.polytechnique (dot) fr.