It may be possible to reduce load imbalance across a warp by sorting cells by their cost (number of RHS evaluations or substeps) and then looping over them in sorted order. Cells with similar burn costs are then computed together in the same warp, reducing SIMD divergence.
This is intended primarily for chemistry networks, where the cost is largely a function of density (which sets how close the network is to equilibrium).
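
As a rough illustration of the idea, here is a minimal host-side C++ sketch. It is not the implementation proposed here: the names `sort_cells_by_cost` and `lockstep_cost` are hypothetical, the per-cell cost is assumed to be an integer count already available (for example from a previous step), and the "lockstep" model simply assumes that every warp takes as long as its most expensive cell.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: return cell indices ordered by ascending cost.
// cost[i] is assumed to be the number of RHS evaluations (or substeps)
// for cell i.
std::vector<std::size_t> sort_cells_by_cost(const std::vector<int>& cost) {
    std::vector<std::size_t> order(cost.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // A stable sort keeps cells of equal cost in their original order.
    std::stable_sort(order.begin(), order.end(),
                     [&cost](std::size_t a, std::size_t b) {
                         return cost[a] < cost[b];
                     });
    return order;
}

// Hypothetical cost model: cells are assigned to warps in groups of
// warp_size, following `order`, and each warp is charged the maximum
// cost among its cells. Assumes warp_size > 0.
long lockstep_cost(const std::vector<int>& cost,
                   const std::vector<std::size_t>& order,
                   std::size_t warp_size) {
    long total = 0;
    for (std::size_t start = 0; start < order.size(); start += warp_size) {
        const std::size_t end = std::min(start + warp_size, order.size());
        int warp_max = 0;
        for (std::size_t i = start; i < end; ++i) {
            warp_max = std::max(warp_max, cost[order[i]]);
        }
        total += warp_max;
    }
    return total;
}
```

Under this simplified model, eight cells with alternating costs `{1, 9, 1, 9, 1, 9, 1, 9}` and a warp size of 4 cost 18 in their original order and 10 after sorting, since the cheap cells no longer wait on the expensive ones. On a GPU the sort itself would presumably be done on the device, and its overhead has to be weighed against the divergence it saves.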