Cryo-electron microscopy (cryo-EM) involves imaging biological samples, flash-frozen at cryogenic temperatures, using a transmission electron microscope (TEM). Preserving the sample in a frozen-hydrated state prevents it from structurally deforming during sample preparation [1]. Unlike traditional TEM or X-ray crystallography, which also offer molecular to atomic level resolution, cryo-EM thus enables the imaging of macromolecular complexes, assemblies, cells and even tissues in a near-native state [2].
Cryo-electron tomography (cryo-ET) collects data by exposing the sample to an electron beam over multiple tilt angles (Fig. 1a), enabling 3-D reconstruction of individual objects from the resulting 2-D projections. This reconstruction, which involves inversion of the 3-D Radon transform, does not require a priori assumptions about the object's structure [2]. However, several factors limit the resolution of cryo-ET reconstructions. First, because the mounting stage can only be tilted through a limited range, the incomplete range of view angles leaves a “missing wedge” in the Fourier (projection) domain data [2]. This in turn makes the resolution of the 3-D reconstruction perpendicular to the sample surface worse than in the plane of the sample surface (Fig. 1b). Second, the total radiation damage is proportional to the number of view angles times the radiation dose of the incident beam at a single view angle [2]. This multiplicative effect imposes strict constraints on the intensity of the incident beam [3]. The resulting low and uneven resolution can make it difficult to identify the structure of individual objects from their cryo-ET reconstructions.
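The two constraints above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a single-axis tilt series sampled evenly over ±θ and a fixed total dose shared equally among views; the function names, the 2° step and the dose budget are illustrative assumptions, not values taken from this study.

```python
def missing_wedge_fraction(max_tilt_deg: float) -> float:
    """Fraction of the Fourier plane perpendicular to the tilt axis that is
    left unsampled when tilting over [-max_tilt_deg, +max_tilt_deg]."""
    if not 0 < max_tilt_deg <= 90:
        raise ValueError("max_tilt_deg must be in (0, 90]")
    return (90.0 - max_tilt_deg) / 90.0


def dose_per_view(total_dose: float, max_tilt_deg: float, step_deg: float) -> float:
    """Dose available to each view when a fixed total dose (hypothetical
    budget, arbitrary units) is split evenly across the tilt series."""
    n_views = int(round(2 * max_tilt_deg / step_deg)) + 1
    return total_dose / n_views


# Assumed example: +/-60 degree tilt range sampled every 2 degrees.
print(missing_wedge_fraction(60))      # about one third of the plane is missing
print(dose_per_view(61.0, 60, 2))      # 61 views share the budget: 1.0 each
```

Widening the tilt range or sampling more finely improves angular coverage, but under a fixed damage budget it also lowers the dose, and hence the signal, available to every individual projection.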

Fig. 1 Cryo-EM tomography. a Schematic of single-axis cryo-EM tomography. The (small) sample is placed on a stage which is progressively tilted along a plane, typically over a range of ±60°, while exposed to an electron beam. Each tilt angle generates a planar projection image; these are collectively processed using an algorithm such as filtered back projection to generate a 3-d reconstruction of the original object. Due to the limited range of tilt angles, the reconstruction does not have uniform resolution across the object. Typically, opposite ends (top and bottom) of the 3-d reconstruction in one direction have poor contrast resolution. b Mid-level cross-section from a 3-d cryo-EM reconstruction of an E. coli cell. The small enclosures (black arrows) are micro-compartments (MC). The red arrow shows the cell membrane. Note that part of the cell membrane is missing. The scale bar is 200 nm.
Here we investigate the structure of bacterial micro-compartments (BMCs): thin-walled protein enclosures inside bacterial cells which separate certain metabolic pathways from the remaining cytoplasm [4]. Previous cryo-ET analyses of a class of BMCs known as carboxysomes in two other strains of bacteria suggest that they have a polyhedral, specifically icosahedral, external structure [5, 6]. Visual inspection of reconstructed slices (Fig. 1b) and 3-d volume rendering (Fig. 2f) of our BMCs also suggest a convex polyhedral structure. In recombinant BMCs in E. coli, we demonstrate large variation in size and shape across copies within the same bacterium.

Fig. 2 Extraction of polyhedral graph from cryo-EM reconstructions. a Upper-level cross-sectional slice showing an MC with boundary partially visible due to poorer resolution. The scale bar is 50 nm. b Hand-drawn segmentation showing interior (deep green) and exterior (yellow). Light green indicates a conservatively drawn region of uncertainty (caused by poor resolution) in which we suspect the object boundary lies. This region of uncertainty is used to constrain the 3-d reconstruction. c Stacked hand-drawn boundaries from slices along the z-axis. Note that boundary information is completely missing for slices above and below this stack. d Volume rendering of regularized least squares reconstruction of object using data from stack in (b). Note missing wedge on right-hand side. e Volume rendering of regularized least squares reconstruction of object using data from stacks of slices along x, y and z-axis. f Ball and stick diagram of polyhedral graph (PG) for object in (e), drawn using Chimera. Blue balls are observed vertices. Red lines are completed edges. Yellow lines are incomplete edges. Green balls are ends of incomplete edges.
Previous methods for identifying structure from cryo-ET reconstructions with a missing wedge involve extracting multiple subvolumes (subtomograms) of the structure of interest and then ‘averaging’ them, after appropriate alignment, to improve the resolution [7]. Another approach matches subtomograms against a high-resolution template [8]. The limitations of present methods are thus: i) the structure of the template needs to be known or guessed in advance, or ii) subtomogram averaging fails to capture variation in shape across multiple copies of the object. Instead, we propose to realize the full potential of cryo-ET by identifying shapes using data from individual objects, without averaging of any sort or a priori assumptions about their shape. Further, we examine how accurately shapes can be identified from poorly resolved reconstructions using this methodology.
With a polyhedral structure in mind, we represent object shapes as incomplete polyhedral graphs (PG), i.e. a set of vertices connected by edges. We have developed a pipeline for extracting the PG from cryo-ET reconstructions. We also identify a library of reference polyhedra to which these objects should belong. Shape identification can thus be achieved by classifying the observed incomplete PG (which can be subject to measurement error) to one of the reference polyhedra. Apart from BMCs, these techniques could potentially also be applied to other biological objects exhibiting polyhedral structure, such as several types of viruses [1, 7, 9, 10] and protein complexes such as clathrin [11].
Developing optimal classification rules in this setting raises a number of methodological questions. First, we need an appropriate stochastic model for PGs. Stochastic models for graphs typically assume that their edges are generated by a random process, e.g. Gaussian graphical models [12], exponential random graph models [13] or stochastic block models [14]. In our library, two regular polyhedra can differ from each other in only a face or two. It would be complicated to define a stochastic model at the edge level which could capture such small differences. Instead, we propose to model the observed PG as an incompletely sampled version of an underlying deterministic complete PG. Based on this model, we propose a method for estimating the sampling distribution of an incomplete PG for cryo-ET reconstructed images. Given that PGs are typically high dimensional, a general non-parametric density estimate would appear to suffer from the curse of dimensionality [15]. However, the highly structured form of PGs allows us to treat the sampling distribution like that of a discrete random variable with limited support, enabling convergence of the proposed density estimate at the parametric rate.
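To see why limited support yields the parametric rate, consider the simplest case; this is a generic illustration, not necessarily the exact estimator used later. Suppose the incomplete PG $G$ takes values in a finite set and $G_1, \dots, G_n$ are independent draws from its sampling distribution $p$ (the symbols $G_j$, $n$ and $\hat{p}_n$ are introduced here for illustration only). The empirical frequency of any configuration $g$ then satisfies:

```latex
\hat{p}_n(g) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{G_j = g\}, \qquad
\mathbb{E}\,\hat{p}_n(g) = p(g), \qquad
\operatorname{Var}\,\hat{p}_n(g) = \frac{p(g)\,\bigl(1 - p(g)\bigr)}{n} \le \frac{1}{4n}.
```

The root-mean-square error at each $g$ is therefore at most $1/(2\sqrt{n})$, whatever the nominal dimension of the graph, in contrast to kernel-type density estimates whose rate degrades as the dimension grows.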
A second issue is how to incorporate information from edges that are only partially visible due to poor resolution (e.g. Fig. 2a): we can only identify one vertex of such an edge. Such edges cannot be incorporated into the adjacency matrix, which is commonly used to encode and analyse graphs [16]. To address this, we develop statistics for incomplete PGs, somewhat analogous to those used for right-censored data.
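The contrast with an adjacency-matrix encoding can be sketched in a few lines. The class below is a hypothetical illustration (the names `IncompletePG`, `adjacency_degree` and `censored_degree` are ours, and the degree statistic is only one simple example of a censoring-aware summary, not necessarily the statistics developed in this work). It stores incomplete edges as counts attached to their single observed vertex.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IncompletePG:
    """Incomplete polyhedral graph: observed vertices, complete edges
    (both ends observed) and incomplete edges (one end observed)."""
    vertices: set = field(default_factory=set)
    complete_edges: set = field(default_factory=set)             # frozenset({u, v})
    incomplete_edges: Counter = field(default_factory=Counter)   # vertex -> count

    def add_complete_edge(self, u, v):
        self.vertices.update((u, v))
        self.complete_edges.add(frozenset((u, v)))

    def add_incomplete_edge(self, u):
        # Only the observed end u is known; the far end is censored.
        self.vertices.add(u)
        self.incomplete_edges[u] += 1

    def adjacency_degree(self, v):
        """Degree as an adjacency matrix would report it (complete edges only)."""
        return sum(1 for e in self.complete_edges if v in e)

    def censored_degree(self, v):
        """Degree that also counts edges whose far end was not observed."""
        return self.adjacency_degree(v) + self.incomplete_edges[v]


# A trivalent corner with two visible neighbours and one edge fading out.
pg = IncompletePG()
pg.add_complete_edge(0, 1)
pg.add_complete_edge(0, 2)
pg.add_incomplete_edge(0)
print(pg.adjacency_degree(0), pg.censored_degree(0))  # 2 3
```

As with a right-censored survival time, the incomplete edge still carries information: the vertex is known to have degree at least three, even though the identity of its third neighbour is unknown.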
Finally, previous approaches to classification with incomplete data involve a strategy of data augmentation, writing the posterior density as $p(y_i \mid x_i^o) = \int p(y_i \mid x_i^o, x_i^m)\, p(x_i^m \mid x_i^o)\, dx_i^m$, where $y_i$ is the $i$-th class label, and $x_i^o$ and $x_i^m$ are the observed and missing features respectively [17]. The difficulty in implementing this approach lies in constructing an appropriate model for $p(x_i^m \mid x_i^o)$. Because the PG is uniquely specified by the polyhedron type, it is natural to first condition on $y_i$, i.e. obtain $p(x_i^m \mid y_i, x_i^o)$, and then take an expectation over the polyhedron class, i.e. $p(x_i^m \mid x_i^o) = \sum_{y_i} p(x_i^m \mid y_i, x_i^o)\, p(y_i)$. The marginal probability $p(x_i^m \mid x_i^o)$ thus becomes dependent on the class of polyhedra chosen, making it a circular formulation. Instead, we propose a simpler procedure based solely on observed (incomplete) data. By modelling the incompleteness as a censoring mechanism, we propose a simulation based estimate of the probability density $p(x_i^o \mid y_i)$. We construct the Bayes classifier using this density estimate and demonstrate that this classifier is accurate for most polyhedra.
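The general idea of a simulation based likelihood followed by a Bayes rule can be sketched as follows. This is a toy illustration under loudly stated assumptions: the three-member library, the censoring model (each vertex is seen independently with probability `p_seen`) and the summary feature (numbers of observed vertices, complete edges and incomplete edges) are all hypothetical stand-ins, not the reference library or the missing-wedge censoring mechanism used in this work.

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical reference library, each polyhedron given as an edge list.
LIBRARY = {
    "tetrahedron": list(combinations(range(4), 2)),
    "cube": [(u, v) for u, v in combinations(range(8), 2)
             if bin(u ^ v).count("1") == 1],
    "octahedron": [(u, v) for u, v in combinations(range(6), 2)
                   if u // 2 != v // 2],
}


def censor(edges, p_seen, rng):
    """Toy censoring: each vertex is observed independently with prob. p_seen.
    Returns (observed vertices, complete edges, incomplete edges)."""
    vertices = sorted({v for e in edges for v in e})
    seen = {v for v in vertices if rng.random() < p_seen}
    complete = sum(1 for u, v in edges if u in seen and v in seen)
    incomplete = sum(1 for u, v in edges if (u in seen) != (v in seen))
    return (len(seen), complete, incomplete)


def estimate_likelihoods(library, p_seen=0.7, n_sim=5000, seed=0):
    """Estimate p(x_o | y) for every class y by relative frequencies."""
    rng = random.Random(seed)
    pmf = {}
    for label, edges in library.items():
        counts = Counter(censor(edges, p_seen, rng) for _ in range(n_sim))
        pmf[label] = {x: c / n_sim for x, c in counts.items()}
    return pmf


def bayes_classify(x_obs, pmf, prior=None):
    """Return the class maximising prior(y) * p(x_obs | y), or None if
    x_obs was never generated under any class."""
    labels = list(pmf)
    prior = prior or {y: 1.0 / len(labels) for y in labels}
    scores = {y: prior[y] * pmf[y].get(x_obs, 0.0) for y in labels}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


pmf = estimate_likelihoods(LIBRARY)
print(bayes_classify((3, 3, 3), pmf))  # tetrahedron: a face plus 3 dangling edges
```

Because the classifier uses only $p(x_i^o \mid y_i)$ and a prior, no model for the missing features given the observed ones is required. The price is that an observation outside the simulated support receives zero likelihood under every class.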
Extraction of the PG from tomographic reconstructions involves a number of processing steps, such as vertex and edge identification, which are liable to error, e.g. missing vertices and edges. We show that the accuracy of the Bayes classifier seriously deteriorates in the presence of such errors. We propose two strategies for robust inference in this setting: i) selection of PG features, such as local topology, which are nearly preserved despite random missing edges or vertices, and ii) use of distance based classification methods, such as support vector machines (SVM), which can recognize near preservation. The methodology is illustrated by application to a set of E. coli MCs and the results are compared to those obtained for other types of bacteria.
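Why a distance based rule tolerates extraction errors better than exact matching can be illustrated with a small sketch. For brevity it swaps in a nearest-reference rule under an L1 distance in place of an SVM. The feature vector (vertex count, edge count, numbers of degree-3 and degree-4 vertices) and the three reference shapes are hypothetical choices made for this example only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical reference shapes, each given as an edge list.
REFERENCES = {
    "tetrahedron": list(combinations(range(4), 2)),
    "cube": [(u, v) for u, v in combinations(range(8), 2)
             if bin(u ^ v).count("1") == 1],
    "octahedron": [(u, v) for u, v in combinations(range(6), 2)
                   if u // 2 != v // 2],
}


def features(edges):
    """(vertices with at least one edge, edges, degree-3 vertices, degree-4 vertices)."""
    degree = Counter(v for e in edges for v in e)
    by_degree = Counter(degree.values())
    return (len(degree), len(edges), by_degree[3], by_degree[4])


def exact_match(edges):
    """Name of the reference whose features agree exactly, else None."""
    f = features(edges)
    return next((name for name, ref in REFERENCES.items()
                 if features(ref) == f), None)


def nearest_reference(edges):
    """Name of the reference closest in L1 distance between feature vectors."""
    f = features(edges)

    def dist(name):
        return sum(abs(a - b) for a, b in zip(f, features(REFERENCES[name])))

    return min(REFERENCES, key=dist)


# A cube with one edge lost during extraction.
damaged_cube = REFERENCES["cube"][1:]
print(exact_match(damaged_cube))        # None
print(nearest_reference(damaged_cube))  # cube
```

A single missing edge changes the feature vector only slightly, so the damaged cube remains far closer to the cube than to any other reference, whereas any rule demanding exact agreement rejects it outright.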
