← Blog

The Case of the Missing Open Source Chemical Descriptor Calculator

Why Cheminformatics Sucks in '26

A polemic, with apologies to DJ Shadow.

tl;dr: By analyzing the gaps in current open-source descriptor calculators and prioritizing the implementation of high-leverage descriptors, we can work towards a more open and accessible cheminformatics ecosystem. New code at DescJocky proposes the shape of development moving forward.

The QSAR Process

Quantitative Structure-Activity Relationship (QSAR) modeling is a computational technique used in cheminformatics to predict the biological activity of chemical compounds based on their molecular structure. A related technique is Quantitative Structure-Property Relationship (QSPR) modeling, which focuses on predicting physical properties of compounds. Both QSAR and QSPR are essential tools in drug discovery, materials science, and environmental chemistry.

The overarching idea that chemical data, such as molecular structure, can be used to predict the activity or properties of compounds can generally be traced back to the work of Corwin Hansch and Toshio Fujita in the 1960s. They developed the Hansch equation, which relates the biological activity of a compound to its physicochemical properties, such as lipophilicity and electronic effects. This pioneering work, beginning with the 1962 paper Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients (written with Peyton Maloney and Robert Muir), laid the foundation for the field of QSAR modeling.

The QSAR process typically involves the following steps:

  1. Data Collection and Preprocessing: Gather a dataset of chemical compounds of interest, along with their known biological or physical properties, from literature or experiment. Let's call the endpoint we want to focus on y.
  2. Descriptor Calculation: Use the chemical structure of the compounds (the 2-D molecular graph, or an optimized 3-D geometry for 3-D descriptors) to calculate a set of molecular descriptors, which are numeric values that describe each chemical structure as a vector of features. This matrix of descriptors can be denoted X.
  3. Feature Selection and Model Building: Use statistical or machine learning techniques to select the most relevant descriptors and build a predictive model. The model is some function f that maps X to y: f(X) = y.
  4. Model Validation: Evaluate the performance of the model using techniques such as cross-validation, leave-one-out, and external validation on a separate test set; in addition, assess the model's applicability domain (with a Williams Plot) and robustness to random noise (with y-Scrambling) to ensure that the model does what you think it does.
  5. Model Interpretation and Application: Interpret the statistically-robust model to understand the relationship between chemical structure and activity or properties, and use this to guide the design of new compounds with desired characteristics.
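The statistical skeleton of steps 2 through 4 is small enough to sketch directly. The following toy example is purely illustrative: the data are invented and a single made-up descriptor (heavy-atom count) stands in for the X matrix. It fits f(X) = y by ordinary least squares and computes a leave-one-out q², the simplest of the validation statistics mentioned in step 4:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (one-descriptor model)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def loo_q2(xs, ys):
    """Leave-one-out q^2: each point is predicted by a model trained without it."""
    press = 0.0
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    for i in range(len(xs)):
        xs_t = xs[:i] + xs[i + 1:]
        ys_t = ys[:i] + ys[i + 1:]
        a, b = fit_line(xs_t, ys_t)
        press += (ys[i] - (a * xs[i] + b)) ** 2
    return 1.0 - press / ss_tot

# Hypothetical dataset: endpoint roughly linear in the descriptor, plus noise.
heavy_atoms = [4, 6, 8, 10, 12, 14]            # the lone descriptor column of X
log_activity = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]  # the endpoint y

a, b = fit_line(heavy_atoms, log_activity)
q2 = loo_q2(heavy_atoms, log_activity)
print(f"slope={a:.3f} intercept={b:.3f} q2={q2:.3f}")
```

A real pipeline swaps the toy fit for a proper learner and adds y-scrambling and applicability-domain checks, but the shape (X in, model out, held-out predictions scored) is the same.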

The idea that chemical structures, databased and prepared in a standardized way, can be used in either a statistical (as with QSAR/QSPR) or simulated (as in quantum chemistry, molecular docking, and molecular dynamics) manner to provide insightful and possibly actionable information about the physical world is the basis of cheminformatics.

Cheminformatics sucks in 2026, mostly because it is still a proprietary, and thus paywalled, field.

How Bioinformatics Went So Right: a short history of institutional choices

Let's contrast cheminformatics with bioinformatics, the field that applies computational techniques (either statistical or simulated) to biological data, such as DNA, RNA, and protein sequences.

Bioinformatics has flourished in the last few decades. Since the publication of the Human Genome Project's results in 2003, bioinformatics has been a field that leverages high-throughput data generation, where data and metadata searches across open access journals and databases are routine procedure and cost nothing, where every solved crystal structure is deposited in the Protein Data Bank under CC0, and where the most widely used tools for sequence alignment, structure prediction, and molecular dynamics are all open source, many on a roughly six-month release cadence. If you want to do bioinformatics, you can do it. Right now. For free. You don't need to ask anyone for permission, and you don't need to ask anyone for money. You have a computer. You have the internet. You have the data. You have the tools. If your particular perspective on the subject is conducive to good scientific practice, you can do good science.

It did not have to be this way. There were a number of historical choices that led to this state of affairs:

  • In 1965, Margaret Dayhoff published the first protein sequence database, a physical tome containing 65 sequences. This Atlas of Protein Sequence and Structure was distributed at cost. The database was later digitized and sold by the National Biomedical Research Foundation as the Protein Information Resource (PIR), which passed to the University of Delaware and only then became freely available.
  • In parallel, the X-ray crystallography community was developing its own data-sharing norms. In 1971, the Protein Data Bank (PDB) was established following the Cold Spring Harbor Symposium on protein structure. The database was initially managed by Brookhaven National Laboratory and was tiny at first, but it served a small community of structural biologists who knew one another personally.
  • By the late 1970s, Sanger sequencing was producing more DNA sequences than could be published, and the need for a computerized sequence database became apparent. Both Dayhoff and Walter Goad independently developed sequence databases, and with them, proposals for the curation and standardization of the sequence data. Dayhoff proposed a system of self-sustaining databases driven by user fees, whereas Goad's proposal was for a publicly funded, open access database distributed over the then-new ARPANET network, leveraging the work already done by Goad's Los Alamos National Laboratory. Goad's proposal won out, and the GenBank database was established in 1982, with funding from the National Institutes of Health (NIH) and the Department of Energy (DOE).
  • By 1988, GenBank, the EMBL Nucleotide Sequence Database, and the DNA Data Bank of Japan (DDBJ) had established the International Nucleotide Sequence Database Collaboration (INSDC), which ensured that the three databases would share data and maintain a common format. They agreed to synchronize their holdings daily, creating a single global pool of sequence data with three mirrors.

This federated, open access model of data sharing and tool development has been the norm in bioinformatics ever since. The Bermuda Principles, adopted in 1996, codified these early choices into a set of principles that have guided the field to this day.

These choices were not inevitable, and they were not universally accepted at the time. The Bermuda Principles were almost immediately tested when Celera Genomics, a private company, announced in 1998 that it would sequence the human genome independently of the HGP, provoking a debate over the patentability of the human genome. The public consortium's commitment to open access and data sharing ultimately prevailed, and the human genome was published in 2003 as a freely available resource. But that outcome was only possible because early guild norms and institutional choices had already established a culture of open access and data sharing in the field. The bioinformatics open-data ecosystem was not inevitable. It was constructed by specific people making specific choices:

  • Elke Jordan and Christine Carrico at NIH chose to fund GenBank as a free resource rather than a cost-recovery service (1982) based on the stronger proposal from Los Alamos.
  • David Lipman built NCBI as a provider of free tools, not just a data warehouse (1988).
  • Altschul, Gish, Miller, Myers, and Lipman made BLAST free and fast enough to be everyone's first tool (1990).
  • Amos Bairoch built Swiss-Prot from his PhD project into the world's protein knowledgebase, keeping it free for academics even when funding was precarious (1986 onwards).
  • John Sulston and Robert Waterston championed daily data release from C. elegans to the human genome (1990s).
  • The Wellcome Trust and NIH used their funding leverage to enforce the Bermuda Principles (1996–1998).
  • The crystallography community established mandatory structure deposition in the PDB, and journals followed (1989 onwards).
  • Robert Gentleman created Bioconductor, embedding open-source tooling into the training pipeline (2001).

Each of the above choices had alternatives, which could have led to a very different ecosystem, and which would have had a profound impact on the development of the field. The genome could have been patented gene by gene. Swiss-Prot nearly died for lack of funding in the mid-1990s. The Bermuda Principles were contentious and their enforcement required hardball negotiation.

The Sorry State of Cheminformatics

Cheminformatics never had a Bermuda moment.

An easy way to see this is to look at the tool OpenBabel. OpenBabel converts between chemical file formats. How many file formats does it support? More than 110. A field that needs a dedicated tool to translate among over a hundred formats is a field that never settled on a single open standard.

There was no single, galvanizing, publicly funded project equivalent to the Human Genome Project that forced the question of data openness. Chemical data remained distributed across commercial databases (CAS, Reaxys, SciFinder) and proprietary software (Dragon, MOE, Schrödinger). The closest analogues, PubChem (launched 2004) and RDKit (open-sourced 2006), arrived a decade later than their biological counterparts and without the policy infrastructure to mandate participation. The field remains in a sorry state.

FOSS exists in cheminformatics, but it mostly consists of niche projects; the field is still dominated by proprietary software and databases. As a result, researchers pay multi-thousand-dollar licenses for proprietary chemical software, often from companies known to blacklist or sue those who try to understand how the software works. These researchers are misled into believing that what they publish is science, when it is not reproducible without joining an expensive club.

Indeed, I would go so far as to argue that without insistence on open methodology, computational chemistry cannot be considered a science. The scientific method relies on reproducibility and falsifiability, which are impossible to achieve when the tools and data are locked behind paywalls or prohibitive licenses.

Chemical Descriptor Software

The Landscape

The calculation of chemical descriptors is a fundamental step in the QSAR/QSPR process, as these descriptors serve as the features that machine learning models use to make predictions.

Dragon is a widely used commercial software package for calculating a vast array of molecular descriptors. It offers over 5,000 descriptors, including constitutional, topological, geometrical, and electronic descriptors. A similar package, AlvaDesc, published by AlvaScience, a company started by many of the same Milan-based researchers who developed Dragon, also offers a large number of descriptors. Both of these software packages are proprietary.

Curiously, unlike other proprietary tools such as Gaussian or Schrödinger, Dragon and AlvaDesc provide detailed documentation of their descriptor calculation methods. There is no technical barrier to implementing these descriptors in an open-source software package.

Indeed, there are some open-source software packages that calculate chemical descriptors, such as RDKit, Mordred, and PaDeL-Descriptor. However, these packages do not implement the full range of descriptors available in Dragon or AlvaDesc. Mordred calculates 1826 descriptors (1613 2-D, 213 3-D), RDKit calculates around 200 descriptors, and PaDeL-Descriptor calculates around 1400 descriptors. There is a great deal of overlap between Mordred and PaDeL.

Gap Analysis

The total numeric gap between Dragon/AlvaDesc and Mordred is roughly 4,000 descriptors. This may seem intimidating. However, this gap is not uniformly distributed.

In fact, much of this difference in count is in a combinatorial expansion of descriptors, where the same basic mathematical operation (spectral moment, eigenvector coefficient, Wiener-like index, etc.) is applied to different molecular matrices. A single, well-designed matrix framework could implement a large number of descriptors.

In the following analysis, I have taken a look at the descriptors provided by AlvaDesc, and have identified the descriptor types and categories which are not currently implemented in open-source software, based on the descriptor family 'blocks' provided by AlvaDesc themselves. I have also prioritized and tiered these descriptor families based on a balance of implementation ease and potential impact on model performance.

It is worth noting that RDKit and Mordred already implement many commonly used descriptors, including constitutional descriptors and 3-D descriptor families such as RDF, 3D-MoRSE, and WHIM. Therefore, the focus of this gap analysis is on descriptor families that are not currently implemented in open-source software, and which have been shown to be useful in QSAR modeling.

Tier 1: High Leverage Gaps

These are descriptor families with proven QSAR utility that are tractable to implement and that no open-source package covers well.

Pharmacophore Descriptors
  • ~200 descriptors in AlvaDesc
  • What: Counts and distance statistics over pharmacophoric atom types: hydrogen-bond donor (D), acceptor (A), positive ionizable (P), negative ionizable (N), lipophilic (L), aromatic (AR). AlvaDesc defines pharmacophoric pairs and triplets at various topological and 3-D distance bins.
  • Coverage: Basically none. Neither Mordred nor RDKit exposes these as scalar features; RDKit instead computes bit-vector fingerprints via rdMolChemicalFeatures, which are not directly comparable to the counts and distance statistics provided by AlvaDesc.
  • Why: Pharmacophore descriptors consistently appear in QSAR variable-importance rankings for biological activity endpoints.
  • Implementation Effort: Medium. We need:
    1. A pharmacophore perception layer similar to RDKit's MolChemicalFeatureFactory.
    2. Pairwise and triplet distance computation over 3-D coordinates.
    3. Binning into chemically meaningful distance intervals.
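Steps 2 and 3 can be sketched in miniature. The sketch below assumes perception (step 1) has already happened: the feature list, type labels, and bin edges are illustrative stand-ins, not AlvaDesc's actual definitions.

```python
from itertools import combinations
from math import dist

# Pharmacophoric atom types from the text; bin edges are assumed, in Angstroms.
TYPES = ["D", "A", "P", "N", "L", "AR"]
BINS = [(0.0, 2.5), (2.5, 5.0), (5.0, 7.5), (7.5, 10.0)]

def pair_descriptors(features):
    """features: list of (ptype, (x, y, z)) perceived feature points.
    Returns {(type1, type2, bin_index): count} over unordered type pairs."""
    counts = {}
    for (t1, p1), (t2, p2) in combinations(features, 2):
        d = dist(p1, p2)                      # Euclidean 3-D distance
        key_types = tuple(sorted((t1, t2)))   # (A, D) == (D, A)
        for b, (lo, hi) in enumerate(BINS):
            if lo <= d < hi:
                key = key_types + (b,)
                counts[key] = counts.get(key, 0) + 1
                break
    return counts

# Toy "molecule": one donor, one acceptor, one lipophilic centre.
feats = [("D", (0.0, 0.0, 0.0)), ("A", (3.0, 0.0, 0.0)), ("L", (0.0, 6.0, 0.0))]
print(pair_descriptors(feats))
```

Triplet descriptors extend the same idea to `combinations(features, 3)` with three binned distances per triple; the perception layer is the hard part, which is why it is listed first.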

CATS 3D Descriptors
  • ~300 descriptors in AlvaDesc
  • What: CATS (Chemically Advanced Template Search) descriptors are another pharmacophore-pair scheme, but using a specific set of five pharmacophore types (D, A, P, N, L) and computing pair distributions over either topological distance (CATS 2D) or Euclidean distance (CATS 3D) in fixed bins. Produces a fixed-length vector regardless of molecule size.
  • Coverage: None. No open-source package computes CATS descriptors as scalar features.
  • Why: CATS descriptors have been shown to be useful in virtual screening and QSAR modeling, particularly because they capture pharmacophore information in a size-independent way. This makes them well suited to scaffold-hopping in virtual screening: they capture pharmacophoric similarity even between structurally unrelated molecules.
  • Implementation effort: Low-medium. Once you have the pharmacophore perception layer from the Pharmacophore Descriptors family above, CATS is essentially a distance histogram with 5×6/2 = 15 pair types × N distance bins. A single function.
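That "single function" claim can be made concrete. A minimal CATS 2D-style sketch, assuming type assignment is already done (type labels and the nine-bin choice here are illustrative, not the published parameterization):

```python
from collections import deque
from itertools import combinations_with_replacement

# Alphabetical order so sorted() pair keys match the precomputed pair list.
CATS_TYPES = ["A", "D", "L", "N", "P"]
PAIRS = list(combinations_with_replacement(CATS_TYPES, 2))  # 15 pair types
MAX_DIST = 9  # topological distance bins 1..9 (assumed)

def topo_dist(adj, src):
    """BFS shortest path lengths (in bonds) from atom src over neighbour lists."""
    d = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in d:
                d[v] = d[u] + 1
                q.append(v)
    return d

def cats2d(adj, types):
    """adj: neighbour list per atom; types: pharmacophore type per atom or None.
    Returns a fixed-size dict of 15 pair types x MAX_DIST bins."""
    vec = {(p, k): 0 for p in PAIRS for k in range(1, MAX_DIST + 1)}
    for i in range(len(adj)):
        if types[i] is None:
            continue
        dists = topo_dist(adj, i)
        for j in range(i + 1, len(adj)):
            if types[j] is None or j not in dists:
                continue
            k = dists[j]
            if 1 <= k <= MAX_DIST:
                pair = tuple(sorted((types[i], types[j])))
                vec[(pair, k)] += 1
    return vec

# Toy chain D-C-C-A: atoms 1 and 2 carry no pharmacophore type.
adj = [[1], [0, 2], [1, 3], [2]]
types = ["D", None, None, "A"]
v = cats2d(adj, types)
print(v[(("A", "D"), 3)])  # 1: one D...A pair three bonds apart
```

The vector length (135 here) is independent of molecule size, which is exactly the size-independence property the text describes.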
2D Atom Pairs
  • ~1600 descriptors in AlvaDesc
  • What: For every pair of atom types (defined by element, hybridization, and number of heavy-atom neighbors), count the number of pairs at each topological (that is, graph) distance. They are a 2-D analogue of the pharmacophore pair descriptors above, but with a much more granular atom-typing scheme.
  • Coverage: None. RDKit has rdMolDescriptors.GetAtomPairFingerprint() which computes atom-pair fingerprints, but as hashed bit vectors, not the scalar pair-counts that Dragon/AlvaDesc export as individual descriptors. Mordred does not implement atom-pair descriptors.
  • Why: Atom-pair descriptors provide a middle ground between fragment search (brittle and sparse) and whole-molecule topological indices (very global). They account for roughly 30% of Dragon's descriptors, and are a major contributor to QSAR benchmark performance.
  • Implementation effort: Medium. We need to define a comprehensive atom-typing scheme, and then compute pairwise topological distances and counts for each pair type. The main challenge is defining a fixed vocabulary of pair types so that the descriptor vector has a consistent length across molecules.
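The counting side is straightforward; a sketch with a deliberately coarse atom type of (element, heavy-neighbour count) — far less granular than Dragon's scheme, purely for illustration — makes the vocabulary problem visible in the keys of the resulting counter:

```python
from collections import deque, Counter

def shortest_paths(adj):
    """All-pairs BFS over neighbour lists (unweighted molecular graph)."""
    n = len(adj)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s][v] is None:
                    dist[s][v] = dist[s][u] + 1
                    q.append(v)
    return dist

def atom_pair_counts(elements, adj):
    """Count atom pairs keyed by (type_i, type_j, topological distance)."""
    types = [(el, len(nbrs)) for el, nbrs in zip(elements, adj)]
    dist = shortest_paths(adj)
    counts = Counter()
    for i in range(len(adj)):
        for j in range(i + 1, len(adj)):
            t1, t2 = sorted((types[i], types[j]))
            counts[(t1, t2, dist[i][j])] += 1
    return counts

# Propan-1-ol heavy-atom graph: C-C-C-O
c = atom_pair_counts(["C", "C", "C", "O"], [[1], [0, 2], [1, 3], [2]])
print(c[(("C", 1), ("O", 1), 3)])  # 1: terminal C to O, three bonds apart
```

A production implementation must enumerate the full key vocabulary up front so every molecule yields a vector of identical length, which is the consistency challenge noted above.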
Edge Adjacency Indices
  • ~300 descriptors in AlvaDesc
  • What: Consider the molecular graph, where atoms are vertices and bonds are edges. The edge adjacency matrix E has entry E[i,j] = 1 if bonds i and j share an atom. Descriptors are then derived from E, for example as eigenvalues and spectral moments, optionally with bond-order or dipole weightings.
  • Coverage: None. RDKit and Mordred do not compute edge adjacency indices. This is a complete blind spot in the current open-source descriptor ecosystem.
  • Why: Edge adjacency indices encode bond-level topological information that vertex-based indices miss. They're particularly useful for distinguishing isomers and for properties sensitive to conjugation patterns.
  • Implementation effort: Medium-Low. If you already have a matrix spectral descriptor framework (which you should build anyway — see Tier 2), the edge adjacency matrix is just a different matrix fed into the same machinery.
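Constructing E from a bond list is a few lines; this minimal, unweighted sketch shows why the implementation effort is low once a spectral framework exists (real implementations would also support bond-order weights):

```python
def edge_adjacency(bonds):
    """bonds: list of (atom_i, atom_j) tuples.
    Returns the edge adjacency matrix E: bonds are the vertices here,
    and two bonds are adjacent when they share an atom."""
    m = len(bonds)
    E = [[0] * m for _ in range(m)]
    for a in range(m):
        for b in range(a + 1, m):
            if set(bonds[a]) & set(bonds[b]):  # shared atom
                E[a][b] = E[b][a] = 1
    return E

# n-butane heavy-atom skeleton: bonds 0-1, 1-2, 2-3
E = edge_adjacency([(0, 1), (1, 2), (2, 3)])
print(E)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]: a path graph on the three bonds
```

E is just another symmetric matrix, so the same spectral machinery used for the adjacency and distance matrices applies unchanged.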
Extended Topochemical Atom (ETA) Indices
  • 39 descriptors in AlvaDesc
  • What: A set of indices based on the concept of "valence connectivity" but with corrections for heteroatom electronegativity effects. Developed by the Roy group. Include ETA_alpha (related to molecular surface area), ETA_beta (related to electron-richness), and various derived ratios.
  • Coverage: PaDeL-Descriptor has ETA indices, but only a subset (ETAalpha, ETAbeta, and a few ratios). Mordred and RDKit do not implement ETA indices at all.
  • Why: ETA indices are compact, fast to compute, and have shown strong predictive power for environmental, toxicity, and biological endpoints.
  • Implementation effort: Low. Purely graph-based computation with atom property lookups, well documented by Roy et al.
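As a taste of how simple the atom-property lookups are, here is a sketch of the simplest ETA term, the summed core count ETA_alpha. The core-count formula used here, alpha = ((Z - Zv)/Zv) * 1/(PN - 1) with Z the atomic number, Zv the valence electron count, and PN the period number, follows my reading of the Roy group's definitions; treat it as an assumption and verify against the original papers before relying on it:

```python
# (Z, Zv, PN) for a few common heavy atoms; hydrogens are excluded because
# ETA indices operate on the hydrogen-suppressed graph.
ATOM_DATA = {"C": (6, 4, 2), "N": (7, 5, 2), "O": (8, 6, 2),
             "S": (16, 6, 3), "Cl": (17, 7, 3)}

def eta_alpha(heavy_atoms):
    """Sum of per-atom core counts over a list of element symbols."""
    total = 0.0
    for el in heavy_atoms:
        z, zv, pn = ATOM_DATA[el]
        total += ((z - zv) / zv) * (1.0 / (pn - 1))
    return total

print(eta_alpha(["C", "C", "O"]))  # ethanol heavy atoms: 0.5 + 0.5 + 1/3
```

The derived ETA ratios and the beta (electron-richness) terms add bond-type bookkeeping, but nothing heavier than this kind of table lookup plus graph traversal.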

Tier 2: Matrix Framework Expansion

Dragon/AlvaDesc's single largest block is the 2-D matrix-based descriptors (~607 in AlvaDesc). Mordred already implements many of these, but the proprietary tools compute a richer combinatorial expansion.

Consider a general matrix-based descriptor engine of the following form. For a given molecular matrix M, compute:

  • Leading eigenvalue: λ₁(M)
  • Spectral diameter: λ₁(M) - λₙ(M), where λₙ is the smallest eigenvalue
  • Graph energy / spectral absolute deviation: ∑|λᵢ(M)|
  • Estrada index (EE): ∑e^(λᵢ(M))
  • Spectral moments: ∑λᵢ(M)ᵏ for k = 2, 3, 4, ...
  • Hosoya-like indices: ∑|λᵢ(M)|ᵏ for k = 2, 3, 4, ...
  • Eigenvector-based indices (VE1, VE2, etc.): functions of the leading eigenvector of M.
  • Wiener-like, Harary-like, Randic-like, and Balaban-like indices: various sums and products over the eigenvalues, often with specific weighting schemes.

Then, apply the above set of spectral and eigenvector-based calculations to each of the following matrices:

  • Adjacency matrix (A) - already present in Mordred
  • Distance matrix (D) - already present in Mordred
  • Detour matrix (Dt) - already present in Mordred
  • Distance/Detour quotient (D/Dt) - not present in open source
  • Laplace matrix (L) - partially present in Mordred (Laplacian eigenvalues only)
  • Chi matrix (X) - not present in open source
  • Reciprocal squared distance (H^2) - not present in open source
  • Barysz matrix weighted by atomic properties - Mordred has a few of these, but not all six weighting schemes seen in Dragon: atomic mass, atomic van der Waals volume, electronegativity, polarizability, ionization potential, and atomic number
  • Burden matrix weighted by the same six atomic properties - likewise only partially covered in open source

A well-designed matrix-based descriptor framework could implement all of the above calculations in a modular way, allowing for easy expansion to new matrices and new spectral/eigenvector calculations. By filling in the gaps in the above matrix-based descriptor calculations, 200-300 additional descriptors can be implemented, many of which have been shown to be useful in QSAR modeling.
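One convenient property for such an engine: the k-th spectral moment ∑λᵢᵏ equals tr(Mᵏ), so a sizeable slice of the framework needs no eigensolver at all. The pure-Python sketch below (illustrative only; a real engine would use a numerical library) computes spectral moments by matrix powers and estimates the leading eigenvalue by power iteration. Any of the matrices listed above can be fed through the same two functions, which is the modularity argument in miniature:

```python
def matmul(A, B):
    """Dense square matrix product over nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def spectral_moments(M, kmax):
    """[tr(M^2), tr(M^3), ..., tr(M^kmax)]: the spectral moments sum(lambda_i^k)."""
    moments, P = [], M
    for _ in range(2, kmax + 1):
        P = matmul(P, M)
        moments.append(sum(P[i][i] for i in range(len(M))))
    return moments

def leading_eigenvalue(M, iters=200):
    """Power-iteration estimate of lambda_1 for a nonnegative symmetric M."""
    v = [1.0] * len(M)
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(len(M))) for i in range(len(M))]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam

# Adjacency matrix of the n-butane skeleton (the path graph on 4 vertices).
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(spectral_moments(A, 4))  # [6, 0, 14]: tr(A^2)=2*edges, tr(A^3)=0 (no triangles)
print(leading_eigenvalue(A))   # ~1.618, the golden ratio, for this path graph
```

Swapping in the distance, detour, or edge adjacency matrix changes only the input M; the Estrada index, graph energy, and eigenvector-based indices would hang off a proper eigendecomposition in the same plugin slot.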

Tier 3: Categorical and Count Descriptors

Various other block families provided by AlvaDesc which are not currently implemented in open-source software:

Atom-Centered Fragments
  • ~100 descriptors in AlvaDesc
  • What: Counts of specific atom-centered fragments, based on Ghose-Crippen-Viswanadhan (GCV) atom types. For example, the number of tertiary carbons, the number of sp2-hybridized nitrogens, etc.
  • Coverage: RDKit has Ghose-Crippen atom type assignments internally (used for LogP/MR). Mordred does not expose fragment counts as individual descriptors.
  • Why: Atom-centered fragment counts are simple, interpretable descriptors that often correlate with specific chemical properties and activities. They can capture local structural features that global indices miss.
  • Implementation effort: Low. This is essentially a subgraph counting problem over a well-defined set of atom types and fragment patterns, much of it reducible to exposing already-existing atom type assignments as scalar descriptors.
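To show how small each individual counter is, here is one fragment count with a deliberately simplified type: a "tertiary carbon" is defined here as a carbon with exactly three heavy-atom neighbours (real GCV typing is far more granular; this is illustrative only):

```python
def count_tertiary_carbons(elements, adj):
    """elements: element symbol per atom; adj: heavy-atom neighbour lists.
    Counts carbons with exactly three heavy-atom neighbours."""
    return sum(1 for el, nbrs in zip(elements, adj)
               if el == "C" and len(nbrs) == 3)

# Isobutane heavy-atom graph: a central carbon bonded to three methyl carbons.
elements = ["C", "C", "C", "C"]
adj = [[1, 2, 3], [0], [0], [0]]
print(count_tertiary_carbons(elements, adj))  # 1
```

The full family is ~100 such functions sharing one atom-typing pass, which is why the effort is low but the testing burden (one golden value per fragment per reference molecule) dominates.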
Functional Group Counts
  • ~150 descriptors in AlvaDesc
  • What: Counts common functional groups: -OH, -NH2, -COOH, etc. AlvaDesc defines a comprehensive set of functional groups based on SMARTS patterns.
  • Coverage: RDKit has rdMolDescriptors.CalcNumAmideBonds() and a few others, plus the Fragments.fr_* descriptors, whereas Mordred has FragmentComplexity; none of these amounts to a systematic set of functional group counts.
  • Why: Functional group counts are fundamental descriptors that often have direct mechanistic relevance to chemical properties and biological activity. They are especially easy to interpret.
  • Implementation effort: Low. This is a straightforward application of SMARTS pattern matching to count occurrences of predefined functional groups.
Drug-Like Indices
  • ~30 descriptors in AlvaDesc
  • What: A set of indices designed to capture "drug-likeness", such as Lipinski's Rule of Five violations, Veber's rotatable bond count, Ghose filter, lead-likeness, PAINS alerts, Brenk alerts, synthetic accessibility score, etc.
  • Coverage: RDKit and Mordred both implement Lipinski descriptors, and RDKit also provides QED (via Chem.QED) and a synthetic accessibility score (sascorer, in RDKit's Contrib area). However, many of the other drug-likeness indices are not implemented in open-source software.
  • Why: Drug-likeness indices often correlate with pharmacokinetic properties and specific biological activities in ligand-based modeling.
  • Implementation effort: Low. Mostly covered. The remaining indices can be implemented as specific functions based on well-defined rules and thresholds.
Randic Molecular Profiles
  • ~50 descriptors in AlvaDesc
  • What: A set of graph-theoretical descriptors based on the Randic connectivity index, which captures branching and cyclicity in the molecular graph; essentially, the sorted eigenvalues of various weighted molecular matrices, summarized as statistics (mean, std, skew, kurtosis, etc.)
  • Coverage: None. RDKit and Mordred do not compute Randic molecular profiles as scalar descriptors.
  • Why: Randic indices have been shown to correlate with various chemical properties and biological activities, particularly those related to molecular branching and cyclicity.
  • Implementation effort: Medium. Requires computation of the Randic connectivity index for various weighted molecular matrices, followed by statistical summarization of the resulting eigenvalues. Would benefit from the framework proposed for Tier 2 matrix-based descriptors.
Chirality Descriptors
  • 69 descriptors in AlvaDesc 3
  • What: Counts and indices encoding stereochemical information: number of chiral centres, presence of axial chirality, E/Z isomerism counts.
  • Coverage: RDKit can detect chiral centres with Chem.FindMolChiralCenters() and perceive E/Z geometry, but does not package these as scalar descriptors.
  • Why: Stereochemical features are critical for many biological and pharmacokinetic properties, as well as for certain quantum chemical properties.
  • Implementation effort: Low. This is essentially a matter of exposing existing stereochemistry detection functionality as scalar descriptors.
Molecular Distance Edge Descriptors
  • 18 descriptors in AlvaDesc
  • What: Encode topological distances similarly to Edge Adjacency Indices, but with a focus on specific edge types (e.g., bond orders) rather than vertex types, and on specific distance bins.
  • Coverage: None.
  • Why: These niche descriptors have shown utility in specific QSAR contexts, particularly for properties sensitive to bond-level topology, e.g. reactivity.
  • Implementation effort: Medium. Similar to Edge Adjacency Indices, but with a different focus on edge types and distance bins.
Weighted Holistic Atom Localization and Entity Shape (WHALES) Descriptors
  • 32 descriptors in AlvaDesc
  • What: A recent descriptor family from the Schneider group which compresses 3-D pharmacophoric information into a fixed-length vector using a partial-charge weighted distance matrix and spectral decomposition.
  • Coverage: None.
  • Why: WHALES descriptors have shown strong performance in deep learning applications, particularly in scaffold-hopping virtual screening, where they outperform traditional 3-D pharmacophore descriptors.
  • Implementation effort: Medium. Requires computation of a partial-charge weighted distance matrix, followed by spectral decomposition and summarization into a fixed-length vector.

Conclusion

The proprietary tools' real moat is not mathematical sophistication: the descriptor equations are all published. The moat is a validated, consistent, fast implementation across thousands of descriptors, backed by thorough testing. This reduces the scope of the problem to engineering a well-designed, modular, and efficient descriptor calculation framework.

Cheminformatics is in a sorry state compared to bioinformatics. Bioinformatics had BLAST, Biopython, and Bioconductor as community-driven foundations. Cheminformatics has RDKit, ASE, OpenBabel, xtb, and GPAW, for now. Mordred does a good job at a basic set of descriptors, but there is a long tail of useful descriptors that are not implemented in any open-source software.

New Tooling

The cheminformatics community should prioritize the development of an open-source chemical descriptor calculator that implements the high-leverage gaps identified above, as well as a comprehensive matrix-based descriptor framework.

I have begun work on this, starting with DescJocky, which provides a modular plugin framework for implementing new descriptors, runs calculations concurrently via a process executor pool, and offers a command-line interface for descriptor calculation with optional semi-empirical geometry optimization using xtb. It currently wraps both RDKit and Mordred descriptors, and provides a standard interface for implementing any new descriptor.