Performance significantly affected by "flatness" of the compounds #757

chrisjonesBSU · 2023-08-23T19:35:41Z

@marjanAlbouye and I were experiencing some very slow performance with gmso.parameterization.apply and have been doing some digging into what the problem could be.

It doesn't seem to be an issue of number of particles or molecules, and given the examples below, the only thing I can think of is that flat, planar molecules, or in this case polymers, are causing issues.

I'm curious to see if anyone else can re-create this, or has any ideas of what the issue could be. Possibly something related to graphs and sub graphs as a function of topology size and structure?

Examples:

1. Creating a 20mer poly(ethylene) from SMILES compared to a 20mer using the Polymer Builder.

Creating the chain from SMILES gives a very windy, twisty chain. Creating it from the polymer builder gives a flat, straight chain.
Applying a forcefield to the topology for the molecules from SMILES takes a few seconds. I'm not sure how long it takes for the system using the Polymer Builder because after letting it run for about an hour, it still isn't done.

import mbuild as mb
from mbuild.lib.recipes import Polymer
import gmso
from gmso.external import from_mbuild
from gmso.parameterization import apply
import forcefield_utilities as ffutils

import time

oplsaa = ffutils.FoyerFFs().load('oplsaa').to_gmso_ff()

# Polyethylene from SMILES

pe = mb.load("C"*40, smiles=True)
pe_box = mb.fill_box(compound=pe, box=[20, 20, 20], n_compounds=11)
pe_top = from_mbuild(pe_box)
print(pe_box.n_particles)

start = time.time()
apply(top=pe_top, forcefields=oplsaa, identify_connections=True)
print(time.time() - start)

1342 # Num particles
5.7355194091796875 # Time (seconds)

# Polyethylene from Polymer Builder

monomer = mb.load("CC", smiles=True)

chain = Polymer()
chain.add_monomer(monomer, indices=[2, -2], separation=0.145)
chain.build(n=20)

chain_box = mb.fill_box(compound=chain, n_compounds=11, box=[20, 20, 20])
chain_top = from_mbuild(chain_box)
print(chain_box.n_particles)

start = time.time()
apply(top=chain_top, forcefields=oplsaa, identify_connections=True)
print(time.time() - start)

I don't think it is an issue specific to the Polymer builder.

Example 2: Poly(benzene) from the polymer builder

Here, I choose different values for the bonding indices. One gives a twisty chain, one gives a perfectly flat and planar chain. The twisty one finishes in a reasonable amount of time. The flat one never finishes.

# Poly(benzene) from Polymer Builder

benzene = mb.load("c1ccccc1", smiles=True)
chain = Polymer()
chain.add_monomer(benzene, indices=[6, 8], separation=0.14)
chain.build(n=15)

box = mb.fill_box(compound=chain, n_compounds=10, box=[20, 20, 20])
top = from_mbuild(box)

print(box.n_particles)

start = time.time()
apply(top=top, forcefields=oplsaa, identify_connections=True)
finish = time.time()

print(finish-start)

1520 # Num particles
6.535366058349609 # Time (seconds)

# Flat Poly(benzene) from Polymer Builder

benzene = mb.load("c1ccccc1", smiles=True)
chain = Polymer()
chain.add_monomer(benzene, indices=[6, 9], separation=0.14)
chain.build(n=15)

box = mb.fill_box(compound=chain, n_compounds=10, box=[20, 20, 20])
top = from_mbuild(box)

print(box.n_particles)

start = time.time()
apply(top=top, forcefields=oplsaa, identify_connections=True)
finish = time.time()

print(finish-start)

I'm not super familiar with the graph/networkx operations GMSO is doing under the hood, but I wonder if there are some edge cases where you get stuck in an infinite while loop or for loop? For example, I have a bigger system than these examples (but not crazy big, only 13,000 particles) running on a cluster that is approaching 20 hours in the apply step (which is what prompted this issue). The polymer chains are totally flat like this benzene example.

The text was updated successfully, but these errors were encountered:

chrisjonesBSU · 2023-08-23T19:44:38Z

It seems that if I set identify_connected_components=False instead of True then the issue goes away.

daico007 · 2023-08-23T19:48:25Z

Yeah, I can see that issue, the identify_connected_components is a function called on the bond graph (to combine those that have similar bond graph), which supposed to help speed up the apply step. This option, however, only seems to work for solvent box specifically, but easily backfire (i.e., make things extremely slow, like in your case) in other cases.
Probably a good case/reason for it to be turned off by default.

CalCraven · 2023-08-23T19:56:34Z

Hey @chrisjonesBSU, thanks for this issue. These examples are nicely fleshed out. So the GMSO apply step takes advantage of a lot of assumptions about repeated structure in a topology.

My guess is that for some reason, the polymer builder gets the residues correct in the flat molecule, and doesn't get it right in the twisted chain (looking just at example 2). You might gain a lot of information by printing out top.unique_site_labels("molecule"); top.unique_site_labels("residue") to see if that is different between the two examples.

You can also see if the identify_connections=False fixes anything. That step has been shown to be slow in the past, and is different from the atomtyping.

identify_connected_components is only good for lots of small molecules. Small numbers of large molecules is going to be inherently bad for it. For "most" cases it adds a lot of speedups. That's why it's the default. I can see a reason to remove that as the default though.

For this specific case (a polymer with many sub monomers), the ideal method for applying is actually with this argument turned on:
use_molecule_info: bool, optional, default=False A flag to determine whether or not to look at site.residue_name to look parameterize each molecule only once. Currently unused.
This would atomtype every residue the same way, so the monomer would only have to be applied once. But for now, we focused on making it work on many molecules, because that's a much simpler case, so that uses the identify_connected_components. I could also see a reason to rename these two arguments. One could be: many_molecules=True, and the other would be many_residues=True. That's more useful potentially to people that are trying to do the atomtyping with a particular system.

chrisiacovella · 2023-08-23T20:01:52Z

A difference between smiles and polymer builder is the overall packaging (i.e., hierarchy). Could you try using flatten and/or condense on the chain before filling the box? There could be something in identify_connections that is getting stuck in an infinite loop because because of the underlying hierarchy.

chrisjonesBSU · 2023-08-23T20:04:48Z

Thanks for the detailed responses. It seems like a case could be made for either default value for identify_connected_components. It seems like the performance hit is potentially much worse (going from seconds to hours) if someone accidentally leaves it set to True when it should be False than if the opposite were to happen.

Whatever choice is made for the default value, maybe we add some warnings if possible, and definitely some more information to the doc strings?

@chrisiacovella
I called just the identify_connections() method on all of the topologies, and it is similarly fast for all of the examples.

CalCraven · 2023-08-23T20:06:54Z

Also an added point is that use_molecule_info in the apply step claims it is unused in the docstring, but it actually is. In fact, you can turn off identify_connected_components if the molecules are all already labeled correctly in the topology, and you don't have to go down to compare the networkx graphs. Here is where it get's used, and here is the details about it in the parameterizer

daico007 mentioned this issue Aug 29, 2023

Turn off identify_connected_components default #760

Merged

chrisjonesBSU linked a pull request Sep 5, 2023 that will close this issue

Turn off identify_connected_components default #760

Merged

daico007 closed this as completed in #760 Sep 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance significantly affected by "flatness" of the compounds #757

Performance significantly affected by "flatness" of the compounds #757

chrisjonesBSU commented Aug 23, 2023

chrisjonesBSU commented Aug 23, 2023

daico007 commented Aug 23, 2023

CalCraven commented Aug 23, 2023

chrisiacovella commented Aug 23, 2023

chrisjonesBSU commented Aug 23, 2023 •

edited

Loading

CalCraven commented Aug 23, 2023

Performance significantly affected by "flatness" of the compounds #757

Performance significantly affected by "flatness" of the compounds #757

Comments

chrisjonesBSU commented Aug 23, 2023

chrisjonesBSU commented Aug 23, 2023

daico007 commented Aug 23, 2023

CalCraven commented Aug 23, 2023

chrisiacovella commented Aug 23, 2023

chrisjonesBSU commented Aug 23, 2023 • edited Loading

CalCraven commented Aug 23, 2023

chrisjonesBSU commented Aug 23, 2023 •

edited

Loading