November 14, 2025

Synthetic Peptides, Nucleotides, Antibodies and Their Drug Conjugates: For Biologists, Chemists and Computers

Dr. Alex M. Clark

The informatics realms of chemistry, biochemistry and biology have long existed side by side in the drug discovery industry. Cheminformatics focuses on the molecule as its fundamental object, represented as a graph of atoms and bonds. Bioinformatics focuses on the sequence of either peptides or nucleotides. Biology deals with higher order descriptions of complex organic machinery. Each of these computer-aided drug design specializations has its own unique data representations, software tools, and expert communities which have been highly productive for many decades.

As drug hunters investigate ever more creative ways to deliver therapies, the divisions between these scientific realms have been dissolving, and the corresponding informatics disciplines must follow suit. Any natural biopolymer can be functionalized with chemical groups, which need to be recorded. Modified peptide sequences and synthetic DNA/RNA may still follow the chain-like pattern of the conventional analog, but the details of the modified chemical building blocks are essential. Macrocyclic peptides can be represented as regular molecules, but it is important to capture the analogous natural amino acid composition. Antibodies are enormous biomolecules that are often described by little more than partial sequences, but chain connectivity is fundamental to their structure, and they are increasingly being used to attach chemical warheads for exotic therapies.

All of these use cases, and many more, can be described as chemicals, biochemicals, or biological entities - the representation needs are dependent on the scenario. In an ideal world you would be able to describe everything that you make and test in a way that captures all information and makes it readily accessible for all uses.

Spoiler alert: you can.

Introduction

Most conventional molecular entities belong unambiguously to the realm of cheminformatics or bioinformatics, but it is those which need to span both that are the subject of this article. The figure below illustrates both singular and dual examples:

Biotin (a) is clearly a small molecule. When we want to describe it in a form that can be visualized by chemists, archived in databases, and used for calculations that do not require 3D conformations, the best way to describe it is to use a graph that is made up of atoms connected by bonds. Aesthetic information like 2D diagram layout is usually associated with the representation, which is mainly for the benefit of communication between human scientists.

Biotin receptor (b) is a protein, which is shown represented as a sequence of letter codes. Because it is composed entirely of natural amino acids, this very compact low resolution representation is quite useful: while it says nothing about the folding of the protein, or cross-linking, or aqueous modifications to labile functional groups (such as protonation state and interactions with water molecules), it is nonetheless suitable for archiving and as a minimalistic input to higher order calculations.

Cyclosporin (c) is a molecule that is small enough to be comfortably represented as a molecule, but to do so would leave out the fact that it is composed primarily of naturally occurring amino acids or similar derivatives. Capturing the natural analog characteristics is valuable both to cognitive presentation and to algorithmic analysis.

Eteplirsen (d) is a synthetic nucleic acid fragment which preserves the naturally occurring DNA bases but substitutes the sugar/phosphate backbone with heavy modifications. While it is small enough that most algorithms would have no difficulty treating it as a chemical, it is inconvenient to visualize in this form. It is also very valuable to record what it shares with the corresponding natural analog, and any other related derivatives.

The above figure shows four different stylized representations. For the traditional examples ((a) and (b)) there are standard ways to store and represent them, but for the hybrid examples ((c) and (d)) the diagram is custom drawn for the publication, and the scientist who is recording the data in electronic form will be presented with an awkward choice: as a molecule or a sequence?

Technical Details

One of the popular approaches to describing synthetic analogs of linear biomolecules is to use a sequence that refers to a larger dictionary of monomers than nature usually provides. This works well enough for many use cases, but it suffers from two main problems:

(1) the dictionary of unnatural monomers has to be managed, and if you just made up a new one, you have to register it in a place that is accessible to everybody you want to share your data with

(2) if the structure is not just a simple linear catenation of predefined building blocks, things can get complicated

A well known technology that was invented to solve the second problem is HELM, which can be used to describe complex structures by stitching together monomer units. Thinking further to a more idealized solution, the following properties would be highly desirable for capturing all kinds of macromolecules:

the layout is a sketch: you draw it the way you want it to look
templates are stored within the structure definition rather than referring to a global resource
monomers are represented as chemical structures with all connection points and leaving groups defined
regular chemical fragments (atoms & bonds) can be freely intermixed with monomers
able to represent complicated and unusual structures, not just common biomolecule patterns
uses a publicly documented industry standard format
scales well by storing just the unique templates for the monomers
can be flawlessly converted into the all-atom version of the structure
the bioinformatics sequence(s) can be extracted
higher order biological information can be encoded (see later: antibodies)

It just so happens that there is a file format that meets all of these criteria, and if you haven't heard this one before, it might surprise you: the V3000 molfile specification has provided all of these characteristics for well over a decade. For anybody who is interested, the functionality is "SCSR" (self-contained sequence representation). The only catch is that the documentation is thin gruel, and only one company ever implemented it - until recently. This discovery was made by our erstwhile colleagues from the Ketcher team at EPAM, and we realized that this would serve as the best choice for representing macromolecules, for all of the reasons listed above. The Ketcher team did all the hard work of figuring out how to implement the features correctly, and extending their chemical diagram sketcher to be able to create and edit macromolecules too.

If you are curious to see what it looks like under the hood, the following example demonstrates the basic concepts:

The structure itself (a) is made up of two amino acids (glycine and cysteine) represented as monomer templates. There are three functionalization sites, which are represented by bonding them with ordinary atoms. Column (b) shows the main part of the V3000 molfile, which looks just like a regular one, except that atoms 16 and 17 have some special extra properties to link them to the corresponding monomer template, and make sure the connection points are correctly matched. Columns (c) and (d) show the definitions for the two amino acid templates. These consist of the embedded chemical structure of the fragment, with details about the leaving groups encoded at the end.
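To make the role of templates and leaving groups more concrete, here is a minimal sketch in Python that works only at the level of molecular formulae. It is not the SCSR format itself and not how Vault or Ketcher implement it: the `Template` class, the attachment point labels ("Al", "Br", "Cx") and the example structure (a glycine-cysteine pair with a methyl group on the sulfur) are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical monomer template: full composition plus leaving groups."""
    name: str
    formula: Counter   # composition with every leaving group still attached
    leaving: dict      # attachment point label -> composition of its leaving group


# Each unique template is stored once, alongside the structure that uses it.
TEMPLATES = {
    "Gly": Template("Gly", Counter(C=2, H=5, N=1, O=2),
                    {"Al": Counter(H=1), "Br": Counter(O=1, H=1)}),
    "Cys": Template("Cys", Counter(C=3, H=7, N=1, O=2, S=1),
                    {"Al": Counter(H=1), "Br": Counter(O=1, H=1), "Cx": Counter(H=1)}),
}

# Nodes of the main structure: a placeholder for a template, or ordinary atoms.
NODES = {
    1: ("template", "Gly"),
    2: ("template", "Cys"),
    3: ("atoms", Counter(C=1, H=3)),   # a methyl group drawn with regular atoms
}

# Each bond names the attachment point used on either side (None for ordinary atoms).
BONDS = [
    (1, "Br", 2, "Al"),   # peptide bond: Gly gives up OH, Cys gives up H
    (2, "Cx", 3, None),   # methyl on the sulfur: Cys gives up the thiol H
]


def expand_formula(nodes, bonds, templates):
    """Sum all compositions, then remove the leaving group at every used attachment point."""
    total = Counter()
    for kind, payload in nodes.values():
        total += templates[payload].formula if kind == "template" else payload
    for a, point_a, b, point_b in bonds:
        for node_id, point in ((a, point_a), (b, point_b)):
            kind, payload = nodes[node_id]
            if kind == "template" and point is not None:
                total -= templates[payload].leaving[point]
    return total


print(expand_formula(NODES, BONDS, TEMPLATES))
```

Attachment points that are not used keep their leaving groups, which is why the free amine and carboxylic acid at the two ends survive the expansion.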

Integration into Vault

Bringing all of this macromolecule technology into CDD Vault involves a number of steps. The first step came with the enhancements to Ketcher, which we already use as our method of choice for editing chemical structures. Now that Ketcher can edit macromolecule structures, the key difference is that the Molfiles passed back and forth make use of extra functionality. To us, a macromolecule is just a regular molecule that uses some fancy features that most cheminformatics tools don't yet support.

The second step was to improve our own front-end rendering tools and our back-end cheminformatics algorithms.

Rendering a molecular structure that uses monomer templates starts out straightforward, because each of the atoms in the main part of the structure is either a regular atom or it is a placeholder for a monomer, which is drawn in a cartoony style. Because the format being used is a sketch we already have 2D layout coordinates for all of these atoms, which means we know where to place them in the diagram. The fun part is the connections between monomers: these aren't always just lines that connect the two objects, they are more like connectors for flow-chart diagrams.

The cartoon representation is concise relative to the full molecular structure, and packed full of meaning, but it has the obvious limitation that the chemical composition of each monomer is not shown. To a biochemist viewing structures made up of natural monomers this might not be a problem, but for those of us who lean more toward the chemical persuasion, or for anyone who is using a large library of custom monomers, this needs to be addressed.

The first strategy is to use interactive mouseover views:

In the example above the tooltip popover shows the chemical composition of the monomer. Note that the monomer has been oriented so that its connection points match the orientation of the overall structure, and they are also color-coded to match the neighboring connection.

Observing monomer structures one at a time is a good solution for many use cases, but there are occasions where you just want to see the whole chemical structure. This is particularly relevant for oligomers:

You should notice three things about the chemical structure. The first, and most obvious, is that the structure looks good according to the criteria of chemical diagrams. Secondly, it is correct: all of the atoms and bonds are present and accounted for, including the leaving groups. Thirdly, the alignment of the all-atom layout matches that of the cartoon from which it was derived.

The template definitions for each of the monomers have not just been haphazardly plonked down on the canvas. Rather, they have been elaborately positioned according to criteria that attempt to produce a chemist-friendly layout (e.g. 120 degree angles, uniform bond distances, etc.), traded off against keeping the same relative positions as in the original cartoon. These constraints cannot always be resolved perfectly, and the algorithm does make a few strategic approximations because it is invoked in real time, but generally the aesthetics are quite agreeable.

Our ability to convert the template-containing molecules into the correct all-atom representation means that any cheminformatics algorithm is applicable - from molecular weight calculations, to substructure searching, to advanced property prediction, to generating conformations for physics based models. These calculations work the same way whether your structure is drawn out as a regular molecule made up of atoms and bonds, or whether it subsumes part of the structure into templates: you still get calculated properties, substructure/similarity searching, and unique molecule registration.

In Vault we use RDKit for most of these calculations, and because it is open source, we decided to implement the monomer expansion algorithm and contribute it to the public repository. Since Ketcher is also open source, this means that anyone has access to powerful tools for creating and analyzing macromolecules - while CDD Vault is here to provide a lot of additional value.

Creation at Scale

Using Ketcher to draw macromolecules or other monomer-containing structures is effective for workflows that involve low throughput, or structures that are particularly unusual. One of the things that we learned from our extensive interactions with customers interested in using this technology is that laboratories that are producing macromolecules at scale are mostly creating peptides or nucleotides according to one of several major patterns, with synthetic building blocks. For the most part these scientists are able to use existing tools to string together a linear sequence of monomer codes. By providing a number of categories and configuration options, we can take these incoming sequences and build them into fully defined structures which capture the chemistry and biochemistry, and we can do this to any number of structures in one go.

The simplest case for conversion of sequences to template structures is linear peptides: everything is linked together in just the same way as for the incoming text string, and the cartoon version of the molecule looks rather similar to the text itself:

In the above example performing the markup from text sequence to template structure only codifies what we can infer already from the sequence, but that changes as soon as we bring in custom monomers:

The definitions for the custom monomers can be provided in several different ways within the Vault workflow, and we will describe that in later articles. Suffice it to say that fusing the text sequence with chemical definitions for all of the non-natural monomers yields a self-contained data structure that holds all of the chemical and biochemical data in a form that is ready to use and easy to view, which is a major step forward compared to most contemporary workflows.

For linear peptides the order is amine on the left, carbonyl to the right. The implied leaving group is hydrogen for the nitrogen side (-NH₂) and hydroxy for the carbon side (-CO₂H). Arbitrary fragments with two attachment points can be used as linkers, and those with one attachment point can be used at either terminus.

Cyclic peptides are technologically only slightly different from linear peptides insofar as the two ends of the peptide are joined up and a water molecule is eliminated from the overall molecular formula. The layout uses a circular positioning with clockwise position and ordering being the default setting. Cyclic peptides are a hot area of drug discovery, and being able to represent them using a template-based sketch format with non-natural monomers is a valuable capability.
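The clockwise circular placement can be sketched in a few lines of Python. This is only an illustration of the idea, not the layout code used in Vault: the function name, the choice of putting the first monomer at the top of the ring, and the y-up coordinate convention are all assumptions.

```python
import math


def circular_layout(n, radius=1.0, clockwise=True):
    """Return (x, y) positions for n monomers evenly spaced around a circle.

    Assumes the first monomer sits at the top of the ring and that y points up.
    """
    if n <= 0:
        raise ValueError("need at least one monomer")
    step = 2 * math.pi / n
    direction = -1 if clockwise else 1   # decreasing angle = clockwise when y points up
    return [
        (radius * math.cos(math.pi / 2 + direction * i * step),
         radius * math.sin(math.pi / 2 + direction * i * step))
        for i in range(n)
    ]


# Ten positions, e.g. for a cyclic decapeptide.
for index, (x, y) in enumerate(circular_layout(10), start=1):
    print(f"monomer {index}: ({x:+.3f}, {y:+.3f})")
```

With the default clockwise setting the second monomer lands to the right of the first, so the ring reads in the same direction as the incoming sequence.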

Consider several renditions of Tyrocidine A:

For the published structure shown in (a) it is quite difficult to cognitively process the diagram. Even for a chemist who has memorized the natural amino acid structures, the brain processing cycles needed to perceive the peptide boundaries and rotate/flip the structures onto their monomer units is uncomfortably high. The cartoon representation shown in (b) is a major simplification, and it can be imported into Vault by providing the string "Val,Orn,Leu,D-Phe,Pro,Phe,Phe,Asn,Gln,Tyr" and ensuring the pre-registration of two non-natural amino acids ("Orn" = Ornithine and "D-Phe" = inverted stereoisomer of Phenylalanine). As a data entry technique this is not only far less labor intensive but perhaps more importantly less error prone than drawing out the chemical structure.
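As a rough illustration of why pre-registration matters, the following Python sketch resolves the comma-separated string against a small monomer dictionary and derives the molecular formula, eliminating one water per amide bond (one fewer than the number of monomers for a chain, one per monomer for a ring). It is not Vault's importer: the function names and the two dictionaries are hypothetical, and only compositions are tracked, so stereochemistry is ignored.

```python
from collections import Counter

WATER = Counter(H=2, O=1)

# Compositions of the free natural amino acids used in this example.
NATURAL = {
    "Val": Counter(C=5, H=11, N=1, O=2),
    "Leu": Counter(C=6, H=13, N=1, O=2),
    "Pro": Counter(C=5, H=9, N=1, O=2),
    "Phe": Counter(C=9, H=11, N=1, O=2),
    "Asn": Counter(C=4, H=8, N=2, O=3),
    "Gln": Counter(C=5, H=10, N=2, O=3),
    "Tyr": Counter(C=9, H=11, N=1, O=3),
}

# Non-natural monomers have to be registered before the sequence can be resolved.
CUSTOM = {
    "Orn": Counter(C=5, H=12, N=2, O=2),
    "D-Phe": Counter(C=9, H=11, N=1, O=2),   # same composition as Phe
}

TYROCIDINE_A = "Val,Orn,Leu,D-Phe,Pro,Phe,Phe,Asn,Gln,Tyr"


def peptide_formula(sequence, cyclic=False, custom=None):
    """Molecular formula of a peptide given as comma-separated monomer codes."""
    registry = {**NATURAL, **(custom or {})}
    codes = [code.strip() for code in sequence.split(",")]
    unknown = [code for code in codes if code not in registry]
    if unknown:
        raise KeyError(f"unregistered monomers: {unknown}")
    total = Counter()
    for code in codes:
        total += registry[code]
    # One water is lost per amide bond: n-1 for a chain, n for a ring.
    for _ in range(len(codes) if cyclic else len(codes) - 1):
        total -= WATER
    return total


def hill(formula):
    """Format a composition in Hill order (C, H, then alphabetical)."""
    order = [e for e in ("C", "H") if e in formula]
    order += sorted(e for e in formula if e not in ("C", "H"))
    return "".join(f"{e}{formula[e] if formula[e] > 1 else ''}" for e in order)


print(hill(peptide_formula(TYROCIDINE_A, cyclic=True, custom=CUSTOM)))   # C66H87N13O13
print(hill(peptide_formula(TYROCIDINE_A, cyclic=False, custom=CUSTOM)))  # C66H89N13O14
```

Leaving out the `custom` argument raises a `KeyError` naming "Orn" and "D-Phe", which is the sequence-level equivalent of forgetting to register the two non-natural amino acids.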

One common compromise for building amino acid structures is to concatenate SMILES strings corresponding to each of the building block units. This can make content creation easier, but it does not solve the rendering and comprehension problems: (c) above shows a layout and depiction of a SMILES string for the same cyclic peptide. The atom placement does not guarantee any particular order, orientation or consistency. By contrast, using Vault's algorithm for generating a full-atom version of the cartoon, shown in (d), preserves the orientation and makes a noble effort to present the chemistry according to idealized aesthetics.

Nucleotides add an extra layer of complexity. Creating peptides from an input sequence is relatively straightforward because each sequence code generates a single monomer in the marked up diagram, but for nucleotides each code generates 3 or 6 monomers, depending on whether it is single- or double-stranded:

The individual nucleotide units are broken up into base, sugar and phosphate units, and then reassembled as shown above. One might reasonably ask: why? We could have chosen to represent each nucleotide as a single monomer block, and gone with a 1:1 correlation for single strands, or 1:2 for double strands. That would have been a lot easier, but the reality is that commercially available custom nucleotide fragments can be quite strange:

Sometimes the deliverable unit has all three components (base, sugar and phosphate). Sometimes the phosphate is absent, and so a natural one has to be connected up when building the RNA or DNA strand. And sometimes the incoming fragment is just the base. If there is a phosphate attached, there is no guarantee that it is attached at the 5' position (i.e. it could be on the right hand side rather than the left). The method for assembling monomers into RNA/DNA-analog structures has to know which fragments map to the sugar/phosphate/base roles, and the directionality of the fragment, so that building blocks can be spliced in or cleaved off to get the right outcome.
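The simplest case, where every sequence code is a natural DNA nucleotide, can be sketched as follows. This is a deliberately naive illustration rather than Vault's assembly method: the `Monomer` class and the template names "P" and "dR" are made up for the example, and the awkward cases described above (missing phosphates, base-only fragments, reversed directionality) are not handled.

```python
from dataclasses import dataclass

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


@dataclass(frozen=True)
class Monomer:
    role: str       # "phosphate", "sugar" or "base"
    code: str       # hypothetical template name
    position: int   # 1-based index of the originating sequence code
    strand: int     # 1 = strand as given, 2 = complementary strand


def expand_dna(sequence, double_stranded=False):
    """Expand each sequence code into phosphate, sugar and base monomers."""
    monomers = []
    for position, letter in enumerate(sequence, start=1):
        if letter not in COMPLEMENT:
            raise ValueError(f"unknown code {letter!r} at position {position}")
        strands = [(1, letter)]
        if double_stranded:
            strands.append((2, COMPLEMENT[letter]))
        for strand, base in strands:
            monomers += [
                Monomer("phosphate", "P", position, strand),
                Monomer("sugar", "dR", position, strand),
                Monomer("base", base, position, strand),
            ]
    return monomers


single = expand_dna("ACGT")
double = expand_dna("ACGT", double_stranded=True)
print(len(single), len(double))   # 3 monomers per code, or 6 when double-stranded
```

Tagging every monomer with its role is what would let a fuller implementation splice in a natural phosphate, or cleave one off, when a custom fragment arrives with more or fewer components than expected.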

Antibodies

For the macromolecule use cases described in the previous section the sweet spot is generally in the oligomer category: biomolecule analogs that are just a bit awkward to represent using atoms and bonds, and benefit from storing and displaying the monomer sequences. Things get much more interesting when we look at antibodies. Most of the common antibody types are made up of two light and two heavy chains of amino acids, which may or may not be symmetrical.

When deciding how to represent an antibody for purposes of informatics and representation there is a fork in the road. Assuming that the full sequence information is available for the 4 chains:

Do I know the locations of the disulfide bonds that hold the chains together?
- no: register the antibody using just the sequences
- yes: build a macromolecule structure that uses all this and more

This decision point is critical and relevant, because without knowing the location of the cysteine bridges that hold the antibody together, you simply cannot represent the structure. You can still register the entity using the sequence information to disambiguate. This is an extremely useful workflow capability, but the interesting science & technology part kicks in when that extra information becomes available. As it stands in 2025, most biologists need a little bit of encouragement in order to be explicit about indicating the chain locations of antibody features.

In the absence of a common notation standard we have come up with a very straightforward one of our own, which is a syntax that is tacked onto the end of each of the incoming antibody chains. These features include:

inter-chain cysteine connection sites
intra-chain cysteine-cysteine links
named domain regions (e.g. H for hinge, VL/CL/VH/CH_n)
drug conjugate connection sites

The syntax is not very complicated, but it does require that you indicate sequence positions using their index in the chain (1 being the first), and not according to any of the numerous available classification schemes.
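The actual annotation syntax is not reproduced here, but the consequence of using plain 1-based chain indices is easy to illustrate. The Python sketch below uses a hypothetical `Site` record and toy chain fragments (not real antibody sequences) to show the kind of sanity check that becomes possible: every position named in a disulfide link must exist and must be a cysteine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    chain: str      # chain identifier (hypothetical naming, e.g. "H1", "L1")
    position: int   # 1-based index into that chain's sequence


def residue_at(chains, site):
    """Return the one-letter code at a 1-based position, with bounds checking."""
    sequence = chains[site.chain]
    if not 1 <= site.position <= len(sequence):
        raise IndexError(f"{site.chain}:{site.position} is outside 1..{len(sequence)}")
    return sequence[site.position - 1]


def check_disulfide(chains, a, b):
    """Confirm that both ends of a proposed cysteine bridge really are cysteines."""
    for site in (a, b):
        residue = residue_at(chains, site)
        if residue != "C":
            raise ValueError(f"{site.chain}:{site.position} is {residue!r}, not cysteine")
    return True


# Toy fragments for illustration only.
CHAINS = {"H1": "EVQLCPPC", "L1": "DIQMTQGEC"}

print(check_disulfide(CHAINS, Site("H1", 5), Site("L1", 9)))   # inter-chain link
print(check_disulfide(CHAINS, Site("H1", 5), Site("H1", 8)))   # intra-chain link
```

An off-by-one mistake, such as counting from zero or using a numbering scheme with gaps, would typically point at a non-cysteine residue and be caught by this kind of check.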

When our import composer is given this information it will build a full macromolecule by converting each of the sequences into a chain of monomer units, apply the additional connections, and then arrange the chains into a "Y" shape:

The figure above shows sequences and metadata (a) being converted into a monomer-based structure (b), zoomed in on the hinge region. So far this is following a similar pathway to the linear/cyclic peptides and nucleotides described in the previous section, except with a different layout and bonding pattern for the monomers. And that would be most of the story except that antibodies are big. Their size means that even the cartoon view that is so effective for large oligomers and mini proteins is not viable: it would take a wall-sized screen to be able to view all the monomers at a reasonable resolution.

This is the point where we switch to biology.

Anybody who has read a textbook or done a quick internet search on the subject will have seen numerous figures where a biologist has drawn a very stylized "Y" shaped glyph, and drawn attention to whichever parts of it are needed to make the scientific point:

There are even software tools that will create such a glyph on demand, but what all of these diagrams have in common is that they are bespoke drawings or clip art. What they do not do is explicitly capture the underlying biological sequence connectivity and chemical composition.

Ours does.

Up until now we have been viewing macromolecules that have two levels: the implied underlying chemistry level and the explicit top biochemical level, both of which work together harmoniously. For viewing antibodies we need to add a biology level on top of that.

During the data creation process we record the additional metadata about connectivity and domain regions, and we use this to create a biologist-friendly glyph that is based on the actual data.

The first step in this construction process (a) is to pull out the hinge regions on the heavy chains and line these up, keeping the scale proportional. If there are any further domains indicated we divide up the segments and annotate them (b). Inter-chain linkages are drawn (c), as are any indicated disulfide bonds within individual chains (d). Finally the identity for each amino acid is indicated using a color "bar code" style (e).
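Steps (a) and (b) amount to dividing each chain into segments whose drawn length is proportional to the number of residues they cover. A minimal Python sketch of that bookkeeping is shown below; the function name, the input shape (domain name mapped to an inclusive 1-based range) and the example numbers are assumptions for illustration, and the real glyph construction involves a great deal more geometry.

```python
def segment_chain(length, domains):
    """Split a chain into consecutive segments with proportional widths.

    `domains` maps a name to an inclusive 1-based (start, end) range.
    Unannotated stretches are returned with the name None. Each result is
    (name, start, end, fraction_of_chain).
    """
    segments = []
    cursor = 1
    for name, (start, end) in sorted(domains.items(), key=lambda item: item[1][0]):
        if start < cursor or end < start or end > length:
            raise ValueError(f"domain {name!r} overlaps another or is out of range")
        if start > cursor:
            segments.append((None, cursor, start - 1))   # gap before this domain
        segments.append((name, start, end))
        cursor = end + 1
    if cursor <= length:
        segments.append((None, cursor, length))           # trailing unannotated stretch
    return [(name, s, e, (e - s + 1) / length) for name, s, e in segments]


# Made-up numbers: a 100-residue chain with a variable domain and a hinge.
for name, start, end, fraction in segment_chain(100, {"VH": (1, 40), "H": (61, 70)}):
    print(f"{name or '-':>3} {start:>3}-{end:<3} {fraction:.0%}")
```

Because the fractions are derived from the recorded positions rather than drawn by hand, two antibodies with different domain boundaries would produce visibly different segment lengths.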

Even though the sequence identity is condensed too densely to easily pick out individual monomers, the important point is that the relative shape, size and composition of the diagram are based on the actual composition. If two antibodies have a significantly different variable region, or one of them is asymmetric, the diagrams will reflect this. It is possible to create diagrams that show the differences because the information necessary to do it is actually present within the data structure, so the onion can be peeled all the way down to atoms and bonds, at which point everything is present and accounted for.

In the tradition of saving the best for last, one of the main reasons why antibody informatics is so important right now is the design of antibody-drug conjugates (ADCs), for which some number of bioactive functionalities are attached to positions on the antibody. These drug conjugates (or "warheads" as they are sometimes called) are activated when the antibody itself achieves a recognition event. The biological engineering that goes into inventing these therapeutic agents is something we leave to the experts, but what we can do at CDD is provide ways to register each distinctive ensemble so that they can all be archived, viewed, tracked, and associated with experimental measurements.

Registration of an ADC involves one and a bit extra steps: in addition to providing the sequences for the 4 chains that make up the antibody, the chemical structure of each drug conjugate must be indicated. At a given stage of development the drug conjugate positions may not be known: if not, we will store the information in disconnected form, and optionally capture an estimate of the drug-to-antibody ratio (which does not have to be a whole number). But if you do know where the drugs are connected, we can incorporate that into the diagram:

Just like with the other parts of the antibody, these connected molecular fragments are expressed all the way down to atomic connectivity. It is possible to perform any kind of analysis on the molecule (e.g. calculate molecular weight), analyze the bioinformatics sequences, infer properties from the composition and location of the drug conjugates, or even create a 3D embedding of the entire structure for a molecular dynamics simulation. The most important point is that all of these use cases are possible because we captured all of the information and stored it in a format that doesn't throw anything away.

Summary

The macromolecule functionality that we have been building addresses a number of drug discovery workflows, in a way that allows you to have your cake and eat it too: customizable sketches with aesthetically pleasing graphics; complete information capture for chemists, biochemists and biologists; registration of molecular objects; use of industry standard data formats; and major contributions to open source projects.

We have a lot more coming up on the roadmap. Expect to hear more about creating and managing monomer libraries in the near future. There are a lot of other related projects on our wishlist and we are prioritizing them based on customer demand - so don't be shy, it always helps to weigh in and let us know what you need.

Try CDD Vault with Complex Biomolecules

Model peptides, oligos, ADCs, and more in an environment built for collaborative research.

See how CDD Vault can help your team move faster: book a demo.
