Nonequilibrium candidate Monte Carlo

In a recent Computational Statistics Club meeting, we covered the topic of Nonequilibrium candidate Monte Carlo (NCMC), by Jerome Nilmeier, Gavin Crooks, David Minh, and John Chodera. Our lab is interested in predicting equilibrium thermodynamic properties of materials, such as binding affinities of small molecules to proteins. Using the framework of statistical mechanics, we can frame these prediction problems as (intractable) integrals over all possible configurations of the flexible protein and flexible ligand, weighted by their Boltzmann weight (the negative, exponentiated reduced potential energy ). For realistic systems, these integrals are complicated enough that there is no analytical solution. Therefore, our objective is to approximate these integrals in clever ways. One way is to construct and simulate a stochastic process whose time averages eventually equal configuration averages. If we simulate long enough, time averages become good estimates for configuration averages. If the stochastic process we construct exhibits slow transitions between configurations (for example, due to high energy barriers between configurations), “long enough” can become prohibitive in cost. We’re interested in coming up with more efficient sampling methods to tackle this issue.

What’s new in NCMC?

NCMC allows us to turn incomplete knowledge about a system into an efficient proposal. Sometimes information is available about a part of the system in isolation, or we know which parts of a potential induce the biggest barriers in our system. NCMC constructs valid acceptance criteria when we use nonequilibrium driven processes as proposals. This is useful because there are often natural ways to turn incomplete information about the system into a gradual nonequilibrium “driving” process, even in cases where it would be difficult to construct a successful instantaneous move using that same information. Nonequilibrium processes allow us to use the information we have to drive over or perturb barriers, and relax the system into equilibrium in a new state. The NCMC acceptance criteria rely on quantifying the divergence from equilibrium, which is achieved by bookkeeping the amount of perturbation/work over the course of a nonequilibrium trajectory.

What are concrete examples of NCMC algorithms / design choices?

Example from paper: WCA dimer

The example discussed in the paper is a bistable dimer in a crowded solvent environment. Our default stochastic process for simulating this system might be Generalized Hamiltonian Monte Carlo (an unbiased version of thermostatted molecular dynamics). However, we have to simulate a very long time to sample the different “conformations” of the dimer. Can we do better, by exploiting some insight into the system?


Our isolated “insight” into the system comes in the form of knowing something about the energy term for the bond length. We know it has minima (and where), and we know there’s a big (5kT) barrier between the minima. In effect, we know something about the independent distribution of the bond length, if it were in isolation. This allows us to construct an instantaneous Monte Carlo move that extends / compacts the dimer, in effect “tunneling” through the 5kT barrier to a distant region of high probability. In a vacuum, this observation serves us very well: by including this instantaneous proposal along with MD, we reduce our correlation time from 60 to 0.


However, if we naively try to apply this insight to the solvated version of our system, our instantaneous proposals would almost always be rejected (only accepting ~1 in proposals!), since solvent is in the way, and our overall correlation time increases over MD.


What’s going wrong? There’s a very large volume of configuration space consistent with the dimer having any chosen bond length, but most of that volume is “empty,” i.e. if we drew the positions of the solvent at random, then we’d usually not have a reasonable configuration.

NCMC allows us to turn observations like this into computable proposals. In this case, the authors constructed a “deterministic drive” NCMC move, where we gradually modify the bond length and let the rest of the system evolve as usual. In other words, we never modify the potential, we just “do work” by dragging one degree of freedom (DOF) and letting the rest of the DOF relax.


The experiments in the paper show that there’s a “sweet spot” for efficiency – if your proposals have many steps, then the solvent almost always has time to relax, giving you a very high acceptance rate.


On the other hand, proposal trajectories with many steps are more computationally expensive.


As an analogy, imagine you’re trying to push through a crowded subway station. If you instantly “teleported” to a random point in the crowd, you would probably have a bad time / end up overlapping with another commuter. If you try to sprint, you’ll have to work a lot harder bowling people over. If you’re patient and walk at roughly the same speed as everyone around you, you don’t have to work as hard to get where you’re going, but it may take longer.


Example: thermodynamic perturbation, 1D toy potential

Another way we can modify a system is to change its potential energy function.

For biomolecular systems we care about, the potential energy function is typically a sum of a large number of low-order terms. For example, we might have terms that depend on the distances between pairs of atoms, terms for the angles formed by bonded triples of atoms, and terms for the dihedral angles formed by bonded quartets of atoms. Some of these terms are “stiffer” and more difficult to sample than others. For example, the bonded pair potential is typically a very stiff spring, and the dihedral terms are typically multimodal.

One way to ease the sampling problem is to temporarily turn off parts of the potential that contribute a lot to our sampling difficulty, for example, by making the distribution more multimodal.

As a 1D example of this, consider this toy potential:


Which induces the following equilibrium distribution:


If only the first term were there, we’d have an easy sampling problem (a Gaussian!). However, with the second term active, we have a tricky multimodal distribution. In other words, the potential has an “easy bit” and a “difficult bit” – this suggests we might be able to ease our problem by temporarily turning off the “difficult bit” then turning it back on again.


We can exploit this insight by attaching a control parameter to our system, so we can turn off the second term:

Now, we can construct NCMC protocols that temporarily “soften” the distribution.


This is an example of a “thermodynamic perturbation” NCMC protocol. Here we’re exploiting knowledge about what’s producing the barriers, not necessarily the locations of the modes.


Our lab is applying NCMC in many active projects:

In the CSC meeting, we discussed some the practical challenges of optimizing NCMC protocols. In particular, we discussed the question: “what goes into the ‘objective function’ we’re trying to optimize?” There are a few objectives that are in tension. Overall, we want to achieve high computational efficiency: uncorrelated samples per second of computer time. This entails increasing the acceptance rate (by minimizing the dissipative work performed by the protocol by balancing perturbation vs. relaxation), minimizing autocorrelation time (by proposing “distant” configurations and retaining a high acceptance rate), and minimizing protocol length (to minimize computational cost).

All of this raises some challenges.

First, measuring the autocorrelation time is difficult! In the case of the WCA dimer in solvent, we have reliable prior knowledge of a 1D slow coordinate to monitor (namely, the bond length of the dimer). Absent this knowledge, which coordinate (or combination of coordinates) should we monitor? We could resort to picking an arbitrary coordinate (e.g. the potential energy), but this can lead to wildly incorrect estimates. (For example, if our energy surface is “bumpy,” with many symmetric energy wells, we would estimate a correlation time on the order of the per-well dwell time, which would be much lower than the global correlation time. On the other hand, if our energy surface is perfectly uniform, the observed potential energy would never change over the course of our simulation, possibly causing us to estimate an infinite correlation time.) A potential solution is to apply tools developed for studying molecular kinetics (such as Markov State Models, time-structured independent component analysis, and the variational approach), which allow us to pick optimal slow coordinates automatically, and lower-bound the actual mixing time.

Second, there’s a tension between computational cost and protocol work, as illustrated in WCA dimer example. By making the NCMC trajectories longer / the transformations more gradual, we can reduce the dissipative protocol work, at the cost of increased computational effort. It seems difficult to predict which parameters will lead to a favorable tradeoff a priori.

Finally, perhaps there are ways to sample near-optimal NCMC protocols automatically, following the recent work of Todd Gingrich, Grant Rotskoff, Gavin Crooks, and Phillip Geissler on sampling low-dissipation protocols. That body of work considers the problem of transforming from one distribution into a different distribution, and characterizing the ensemble of protocols that perform this transformation. One of the key findings is that the “volume” of near-optimal protocols can be very large, so it can be much easier to find a near-optimal protocol than an optimal one. However, our problem is slightly different, since the initial and target distributions are the same – i.e. our protocols transform samples from p into samples from p. In particular, an optimal protocol in terms of dissipation is simply to return the sample unchanged. Our objective is for our protocols to produce uncorrelated samples from the same distribution.

We may need to adopt a slightly different definition of the start and end states, for example, we want the protocol to produce a sample from the same distribution, but also “far away” from the input, in terms of whatever function we are using to measure autocorrelation.