By the CHARM Tx technology team
Introduction
Ligand-protein cofolding models have the potential to transform structure-based drug discovery, yet so far they have seen limited uptake in real-world discovery campaigns. Despite a growing body of academic work and a surge in both open- and closed-source efforts, few models have demonstrated the accuracy and robustness needed to truly impact decision-making in hit discovery and optimization. Limited accuracy, generalisability, and domain of applicability remain common hurdles to deploying this technology in drug discovery campaigns.
At CHARM Therapeutics, we’ve been advancing the frontier of cofolding with successive iterations of our in-house platform and model, DragonFold, in tandem with active drug discovery campaigns. Here we introduce our latest model and show how it delivers state-of-the-art performance in binding pose prediction.
The Landscape: Where Current Models Fall Short
Since the advent of AlphaFold-2, there has been an abundance of scientific exploration into generative AI models for protein structure prediction and design. More recently, ligand-protein cofolding models have taken the stage, promising to drive early-stage drug discovery by augmenting – or in some cases replacing – time-consuming structural biology efforts. Key players in this space include Boltz [1], Chai [2], Protenix [3] and AlphaFold-3 [4]. Although these models work well in some cases, they do not extrapolate well to unseen data.
Benchmarks: How DragonFold Stacks Up
To assess how DragonFold stacks up against the best available models, we employ Runs-n-Poses, a high-quality benchmark for ligand pose prediction [5].

As noted in the literature, generative AI models predict more accurately on structures similar to those they have seen in training – and less accurately on structures they have barely seen [6]. This is one of the key strengths of the Runs-n-Poses benchmark: it lets us stratify predictions by similarity to the training set. We see that DragonFold outperforms Boltz-1 and matches AlphaFold-3 – even outperforming it in the 60-80% similarity region, a critical area to operate in when exploring therapeutic targets that already have some presence in the literature:

Note that the above benchmark compares against Boltz-1, not Boltz-2. All models in the benchmark were trained on exactly the same PDB data up to 2021 and validated on data deposited after that cutoff, whereas Boltz-2 was trained on much more recent data (up to 2023). As a result, we do not yet have a head-to-head comparison with Boltz-2 on a reliable, established benchmark like Runs-n-Poses.
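To make the stratified evaluation above concrete, here is a minimal sketch of how pose-prediction success rates can be binned by training-set similarity. The function name, bin edges, and the 2 Å RMSD success threshold are our illustrative choices, not the Runs-n-Poses implementation:

```python
import numpy as np

def success_rate_by_similarity(similarity, rmsd,
                               bins=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                               threshold=2.0):
    """Fraction of predicted poses under `threshold` Å RMSD, per similarity bin.

    similarity: per-target similarity to the training set, in [0, 1).
    rmsd: ligand heavy-atom RMSD of the predicted pose, in Å.
    Bins are half-open [lo, hi); empty bins are omitted.
    """
    similarity = np.asarray(similarity, dtype=float)
    rmsd = np.asarray(rmsd, dtype=float)
    rates = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (similarity >= lo) & (similarity < hi)
        if mask.any():
            rates[f"{lo:.0%}-{hi:.0%}"] = float((rmsd[mask] < threshold).mean())
    return rates
```

Comparing two models then reduces to computing this dictionary per model and reading off, e.g., the 60%-80% bin.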
Why This Matters: Implications for Drug Discovery
We show here that DragonFold matches the performance of AlphaFold-3 for pose prediction. Even for these state-of-the-art models, extrapolation to unseen data is challenging; at CHARM we have developed a number of solutions to address this:
- Finetuning of production-level DragonFold models on specific data from discovery programs, triggerable by anyone on the science team and on any subselection of data.
- Fast ligand-based templating at inference time, to bias pose predictions toward particular binding modes (e.g. those of fragments): scientists can steer DragonFold with a relevant reference protein-ligand pose.
- A flexible, easy-to-operate web UI: every scientist at CHARM is in control of DragonFold and uses it for many different types of exploration.
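CHARM's templating machinery is internal, but the underlying idea of biasing a prediction toward a reference pose can be sketched as a simple harmonic restraint on matched ligand atoms. This is a conceptual illustration under our own assumptions, not DragonFold code:

```python
import numpy as np

def template_restraint(pred_coords, ref_coords, k=1.0):
    """Harmonic penalty pulling predicted atoms toward matched reference atoms.

    pred_coords, ref_coords: (N, 3) arrays of corresponding atom positions
    (e.g. the common substructure of a fragment and a follow-up compound).
    Returns (energy, gradient w.r.t. pred_coords); a pose sampler can add
    this term to its score to bias output toward the reference binding mode.
    """
    diff = np.asarray(pred_coords, dtype=float) - np.asarray(ref_coords, dtype=float)
    energy = 0.5 * k * float((diff ** 2).sum())
    grad = k * diff  # d(energy)/d(pred_coords)
    return energy, grad
```

The restraint vanishes when the prediction reproduces the reference pose and grows quadratically with displacement, which is the standard way to softly encode positional priors.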
Affinity prediction with generative AI
Affinity predictions from generative AI suffer from the same extrapolation issue, limiting their utility for real-world drug discovery campaigns unless local finetuning is used. At CHARM, we have trained affinity prediction models similar to the Boltz-2 affinity model, but have found them inadequate for program use due to their sensitivity to the training data. To enable full utility of structure and affinity predictions in the absence of program data, CHARM has built a combined DragonFold-physics method that uses co-folded structures directly for FEP, with demonstrated high accuracy and utility on internal programs without any need for prior potency or affinity data [7].
For the purposes of this blog post, we applied Boltz-2 affinity predictions to an internal target (“Target A”) described in our recent DragonFold-FEP manuscript. Boltz-2 affinity predictions do not correlate with experimental potencies in this case, where none of the chemical matter was in Boltz-2’s training set, in contrast to DragonFold-FEP (“DLF”), which shows excellent correlation:
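The correlation comparison above comes down to a standard rank-correlation computation between predicted and experimental potencies (e.g. pIC50 values). A self-contained sketch, with our own function names and no tie handling, could look like this:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def spearman_rho(pred, expt):
    """Spearman rank correlation: Pearson's r on the rank-transformed data.

    No tie handling, for illustration only; use scipy.stats.spearmanr
    in practice, which averages ranks over ties.
    """
    def ranks(v):
        order = np.argsort(np.asarray(v, dtype=float))
        r = np.empty(len(order), dtype=float)
        r[order] = np.arange(len(order))
        return r
    return pearson_r(ranks(pred), ranks(expt))
```

Applied to a program's predicted and measured potencies, a rho near 1 corresponds to the excellent DLF correlation described above, while a rho near 0 corresponds to the lack of correlation we observed for out-of-training-set Boltz-2 predictions.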
What’s Next
At CHARM Therapeutics, we have learned over the years that structure prediction is a solid foundation, but to really drive drug discovery programs one needs to build accurate affinity prediction models on top of it, whether for hit-finding or for optimization. We have also learned about the generalisation problem: generative AI models degrade rapidly outside what was seen in training. Finetuning and positional templating are pivotal in driving these models to predict accurately for internal programs – even crystallographic fragment screens are useful toward this. A list of hurdles remains for cofolding models, such as water positioning, chirality and unphysical poses. Blending generative AI with physics-based methods is likely key to success in this space.