Finding MoRI - Accurately Identifying Molecules of Relevant Interest (MoRI) Using Various Machine Learning Methods

Abstract number
Presentation Form
Poster & Flash Talk
Corresponding Email
[email protected]
Atomic and Molecular Resolution Phenomena via AFM, STM and Scanning Probes
Mr Max Gamill (2), Dr Mauricio Alverez (1), Dr Alice Pyne (2)
1. University of Manchester
2. University of Sheffield

Machine learning, Image analysis, Neural networks, DNA, Topology, Segmentation, AFM, Atomic Force Microscopy

Abstract text

The heterogeneity of many biomolecular structures, such as DNA, is driven by their inherent flexibility. The flexibility of DNA allows it to self-interact within the cell, often interacting with cellular proteins and intertwining its strands into complex topologies. This property is essential to the biological function of DNA but, alongside its small (nanometre) scale, can present challenges for the visualisation of its structure. Atomic force microscopy (AFM) is unique in its ability to image single molecules in liquid with sub-molecular resolution, without the need for labelling or averaging. This enables us to probe biomolecular structures in native-like states, even observing the double-helical structure of DNA on single molecules. However, there remains a lack of generalised automated analysis tools for AFM that can take raw data as input and provide effective structural characterisation. This is due, in part, to AFM’s unique file formats, image artefacts, and unbounded data, which require specialised pre-processing pipelines before image analysis. Furthermore, the range of potential conformations of flexible biomolecules makes high-throughput characterisation difficult, and much of it is currently done by hand.

We have developed an open-source AFM image analysis pipeline, TopoStats, which enables quantitative single-molecule analyses of biomolecular structures and interactions from raw AFM images. DNA is a particularly challenging molecule to characterise and classify because of its flexibility, which drives it into an inherently heterogeneous range of structures that are difficult to differentiate. The addition of proteins to such systems further increases sample heterogeneity. This drives our question: how can one distinguish between different conformations of DNA, or any molecule of relevant interest (MoRI), within AFM images?

This problem is made more challenging by a lack of automated analysis tools in AFM and the slow integration of machine learning, which limit quantitative structural analysis from this information- and conformation-rich source. Classical image processing uses poorly generalisable thresholding methods, based on single-image height and area distributions, to distinguish molecules of interest from other objects, such as contamination or surface features, within an AFM image. These methods require sample domain knowledge to set thresholds, which is not always available, and image means and standard deviations can vary from image to image, so thresholds generalise poorly across a dataset. Machine learning techniques can establish and use patterns within the data to help analyse images by identifying MoRIs for use within AFM image analysis pipelines.
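The classical approach described above can be sketched as follows. This is a minimal illustration only, not the TopoStats implementation: the function names, the `k` multiplier and the `min_area`/`max_area` bounds are hypothetical, and the image is assumed to be an already flattened height map given as a list of rows.

```python
from statistics import mean, pstdev


def threshold_mask(image, k=1.0):
    """Mark pixels higher than mean + k * std of this single image's heights."""
    heights = [h for row in image for h in row]
    cutoff = mean(heights) + k * pstdev(heights)
    return [[h > cutoff for h in row] for row in image]


def label_grains(mask):
    """Group 4-connected foreground pixels; return one pixel list per object."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    grains = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood fill from this unvisited foreground pixel.
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            grains.append(pixels)
    return grains


def select_grains(image, k=1.0, min_area=3, max_area=20):
    """Keep only objects whose pixel area lies within hand-chosen bounds."""
    grains = label_grains(threshold_mask(image, k))
    return [g for g in grains if min_area <= len(g) <= max_area]
```

Both the height cutoff and the area bounds have to be chosen by someone who knows the sample, and the cutoff depends on each image's own mean and standard deviation, which is why such thresholds tend not to transfer between images.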

By employing a variety of machine learning techniques at different levels of prior sample knowledge, we can obtain MoRI segmentations with no prior knowledge, with partial knowledge and, given more sample knowledge, sub-categorise them further. We have shown that recursive density-based clustering (DBSCAN) can distinguish objects according to their relative heights and areas without prior sample knowledge, and can increase image analysis throughput by segmenting each identified cluster in a single pass through the pipeline, without selecting parameters for individual MoRIs. Random forests can segment and classify MoRIs into pre-defined categories, allowing faster analysis by filtering for specific individual MoRIs, with a multi-class recall of 78.5% compared to 71.2% for classical processing, both measured against hand segmentations.
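To illustrate the density-based idea, the sketch below runs a single, non-recursive DBSCAN pass over per-object (height, area) features. It is not the recursive scheme used in this work: the function name, the `eps` and `min_samples` values and the example feature values are assumptions, and the features are assumed to be rescaled to comparable ranges beforehand.

```python
from math import dist


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbours(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical rescaled (height, area) features for seven detected objects.
features = [
    (0.10, 0.10), (0.12, 0.11), (0.11, 0.09),  # small, low objects
    (0.50, 0.52), (0.52, 0.50), (0.51, 0.51),  # taller, larger objects
    (0.95, 0.95),                              # isolated object
]
labels = dbscan(features, eps=0.05, min_samples=2)
```

With these illustrative values the two dense groups receive separate cluster labels and the isolated object is marked as noise, without any height or area threshold being set per object type.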

We also use surveys to guide neural network instance segmentation labelling, in order to empirically distinguish DNA minicircle conformations. We assess the viability of Mask R-CNN, trained via transfer learning, to: localise DNA for further single-molecule analysis, classify conformations to observe distribution changes in DNA-protein interactions, and segment molecules for geometric statistics in large AFM datasets to help unravel the role of structure in DNA interactions.

Generally, these machine learning methods show the capability to increase the throughput of image analysis pipelines through the identification and separation of different biomolecular species within a sample. They help obtain more precise single-molecule analyses further down the pipeline by selecting for certain MoRIs within a sample, given varying levels of prior sample knowledge. The highest level, which requires some image labelling, is able to differentiate topological conformations and thus quantitatively assess changes in molecular interactions, to further understand the influence of proteins on sample topology.


Beton, J. G. et al. TopoStats – A program for automated tracing of biomolecules from AFM images. Methods 193 (2021) 68–7

Arlia, D. & Coppola, M. Experiments in Parallel Clustering with DBSCAN. in Euro-Par 2001 Parallel Processing (eds. Sakellariou, R., Gurd, J., Freeman, L. & Keane, J.) 32–331 (Springer, 2001). doi:10.1007/3-540-44681-8_46.

Schroff, F., Criminisi, A. & Zisserman, A. Object Class Segmentation using Random Forests. in Proceedings of the British Machine Vision Conference 2008 54.1-54.10 (British Machine Vision Association, 2008). doi:10.5244/C.22.54.

He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. Preprint at (2018).