Gipson, Bryant. Information recovery in the biological sciences : protein structure determination by constraint satisfaction, simulation and automated image processing. 2010, Doctoral Thesis, University of Basel, Faculty of Science.
|
PDF
14Mb |
Official URL: http://edoc.unibas.ch/diss/DissB_9081
Downloads: Statistics Overview
Abstract
Regardless of the field of study or particular problem, any experimental science always poses the same question: ÒWhat object or phenomena generated the data that we see, given what is known?Ó
In the field of 2D electron crystallography, data is collected from a series of two-dimensional images, formed either as a result of diffraction mode imaging or TEM mode real imaging. The resulting dataset is acquired strictly in the Fourier domain as either coupled Amplitudes and Phases (as in TEM mode) or Amplitudes alone (in diffraction mode). In either case, data is received from the microscope in a series of CCD or scanned negatives of images which generally require a significant amount of pre-processing in order to be useful.
Traditionally, processing of the large volume of data collected from the microscope was the time limiting factor in protein structure determination by electron microscopy. Data must be initially collected from the microscope either on film-negatives, which in turn must be developed and scanned, or from CCDs of sizes typically no larger than 2096x2096 (though larger models are in operation). In either case, data are finally ready for processing as 8-bit, 16-bit or (in principle) 32-bit grey-scale images.
Regardless of data source, the foundation of all crystallographic methods is the presence of a regular Fourier lattice. Two dimensional cryo-electron microscopy of proteins introduces special challenges as multiple crystals may be present in the same image, producing in some cases several independent lattices. Additionally, scanned negatives typically have a rectangular region marking the film number and other details of image acquisition that must be removed prior to processing.
If the edges of the images are not down-tapered, vertical and horizontal ÒstreaksÓ will be present in the Fourier transform of the image --arising from the high-resolution discontinuities between the opposite edges of the image. These streaks can overlap with lattice points which fall close to the vertical and horizontal axes and disrupt both the information they contain and the ability to detect them. Lastly, SpotScanning (Downing, 1991) is a commonly used process where-by circular discs are individually scanned in an image. The large-scale regularity of the scanning patter produces a low frequency lattice which can interfere and overlap with any protein crystal lattices.
We introduce a series of methods packaged into 2dx (Gipson, et al., 2007) which simultaneously addresses these problems, automatically detecting accurate crystal lattice parameters for a majority of images. Further a template is described for the automation of all subsequent image processing steps on the road to a fully processed dataset.
The broader picture of image processing is one of reproducibility. The lattice parameters, for instance, are only one of hundreds of parameters which must be determined or provided and subsequently stored and accessed in a regular way during image processing. Numerous steps, from correct CTF and tilt-geometry determination to the final stages of symmetrization and optimal image recovery must be performed sequentially and repeatedly for hundreds of images.
The goal in such a project is then to automatically process as significant a portion of the data as possible and to reduce unnecessary, repetitive data entry by the user. Here also, 2dx (Gipson, et al., 2007), the image processing package designed to automatically process individual 2D TEM images is introduced. This package focuses on reliability, ease of use and automation to produce finished results necessary for full three-dimensional reconstruction of the protein in question.
Once individual 2D images have been processed, they contribute to a larger project-wide 3-dimensional dataset. Several challenges exist in processing this dataset, besides simply the organization of results and project-wide parameters. In particular, though tilt-geometry, relative amplitude scaling and absolute orientation are in principle known (or obtainable from an individual image) errors, uncertainties and heterogeneous data-types provide for a 3D-dataset with many parameters to be optimized. 2dx_merge (Gipson, et al., 2007) is the follow-up to the first release of 2dx which had originally processed only individual images. Based on the guiding principles of the earlier release, 2dx_merge focuses on ease of use and automation. The result is a fully qualified 3D structure determination package capable of turning hundreds of electron micrograph images, nearly completely automatically, into a full 3D structure.
Most of the processing performed in the 2dx package is based on the excellent suite of programs termed collectively as the MRC package (Crowther, et al., 1996). Extensions to this suite and alternative algorithms continue to play an essential role in image processing as computers become faster and as advancements are made in the mathematics of signal processing. In this capacity, an alternative procedure to generate a 3D structure from processed 2D images is presented. This algorithm, entitled ÒProjective Constraint OptimizationÓ (PCO), leverages prior known information, such as symmetry and the fact that the protein is bound in a membrane, to extend the normal boundaries of resolution. In particular, traditional methods (Agard, 1983) make no attempt to account for the Òmissing coneÓ a vast, un-sampled, region in 3D Fourier space arising from specimen tilt limitations in the microscope. Provided sufficient data, PCO simultaneously refines the dataset, accounting for error, as well as attempting to fill this missing cone.
Though PCO provides a near-optimal 3D reconstruction based on data, depending on initial data quality and amount of prior knowledge, there may be a host of solutions, and more importantly pseudo-solutions, which are more-or-less consistent with the provided dataset. Trying to find a global best-fit for known information and data can be a daunting challenge mathematically, to this end the use of meta-heuristics is addressed. Specifically, in the case of many pseudo-solutions, so long as a suitably defined error metric can be found, quasi-evolutionary swarm algorithms can be used that search solution space, sharing data as they go. Given sufficient computational power, such algorithms can dramatically reduce the search time for global optimums for a given dataset.
Once the structure of a protein has been determined, many questions often remain about its function. Questions about the dynamics of a protein, for instance, are not often readily interpretable from structure alone. To this end an investigation into computationally optimized structural dynamics is described. Here, in order to find the most likely path a protein might take through Òconformation spaceÓ between two conformations, a graphics processing unit (GPU) optimized program and set of libraries is written to speed of the calculation of this process 30x. The tools and methods developed here serve as a conceptual template as to how GPU coding was applied to other aspects of the work presented here as well as GPU programming generally.
The final portion of the thesis takes an apparent step in reverse, presenting a dramatic, yet highly predictive, simplification of a complex biological process. Kinetic Monte Carlo simulations idealize thousands of proteins as interacting agents by a set of simple rules (i.e. react/dissociate), offering highly-accurate insights into the large-scale cooperative behavior of proteins. This work demonstrates that, for many applications, structure, dynamics or even general knowledge of a protein may not be necessary for a meaningful biological story to emerge. Additionally, even in cases where structure and function is known, such simulations can help to answer the biological question in its entirety from structure, to dynamics, to ultimate function.
In the field of 2D electron crystallography, data is collected from a series of two-dimensional images, formed either as a result of diffraction mode imaging or TEM mode real imaging. The resulting dataset is acquired strictly in the Fourier domain as either coupled Amplitudes and Phases (as in TEM mode) or Amplitudes alone (in diffraction mode). In either case, data is received from the microscope in a series of CCD or scanned negatives of images which generally require a significant amount of pre-processing in order to be useful.
Traditionally, processing of the large volume of data collected from the microscope was the time limiting factor in protein structure determination by electron microscopy. Data must be initially collected from the microscope either on film-negatives, which in turn must be developed and scanned, or from CCDs of sizes typically no larger than 2096x2096 (though larger models are in operation). In either case, data are finally ready for processing as 8-bit, 16-bit or (in principle) 32-bit grey-scale images.
Regardless of data source, the foundation of all crystallographic methods is the presence of a regular Fourier lattice. Two dimensional cryo-electron microscopy of proteins introduces special challenges as multiple crystals may be present in the same image, producing in some cases several independent lattices. Additionally, scanned negatives typically have a rectangular region marking the film number and other details of image acquisition that must be removed prior to processing.
If the edges of the images are not down-tapered, vertical and horizontal ÒstreaksÓ will be present in the Fourier transform of the image --arising from the high-resolution discontinuities between the opposite edges of the image. These streaks can overlap with lattice points which fall close to the vertical and horizontal axes and disrupt both the information they contain and the ability to detect them. Lastly, SpotScanning (Downing, 1991) is a commonly used process where-by circular discs are individually scanned in an image. The large-scale regularity of the scanning patter produces a low frequency lattice which can interfere and overlap with any protein crystal lattices.
We introduce a series of methods packaged into 2dx (Gipson, et al., 2007) which simultaneously addresses these problems, automatically detecting accurate crystal lattice parameters for a majority of images. Further a template is described for the automation of all subsequent image processing steps on the road to a fully processed dataset.
The broader picture of image processing is one of reproducibility. The lattice parameters, for instance, are only one of hundreds of parameters which must be determined or provided and subsequently stored and accessed in a regular way during image processing. Numerous steps, from correct CTF and tilt-geometry determination to the final stages of symmetrization and optimal image recovery must be performed sequentially and repeatedly for hundreds of images.
The goal in such a project is then to automatically process as significant a portion of the data as possible and to reduce unnecessary, repetitive data entry by the user. Here also, 2dx (Gipson, et al., 2007), the image processing package designed to automatically process individual 2D TEM images is introduced. This package focuses on reliability, ease of use and automation to produce finished results necessary for full three-dimensional reconstruction of the protein in question.
Once individual 2D images have been processed, they contribute to a larger project-wide 3-dimensional dataset. Several challenges exist in processing this dataset, besides simply the organization of results and project-wide parameters. In particular, though tilt-geometry, relative amplitude scaling and absolute orientation are in principle known (or obtainable from an individual image) errors, uncertainties and heterogeneous data-types provide for a 3D-dataset with many parameters to be optimized. 2dx_merge (Gipson, et al., 2007) is the follow-up to the first release of 2dx which had originally processed only individual images. Based on the guiding principles of the earlier release, 2dx_merge focuses on ease of use and automation. The result is a fully qualified 3D structure determination package capable of turning hundreds of electron micrograph images, nearly completely automatically, into a full 3D structure.
Most of the processing performed in the 2dx package is based on the excellent suite of programs termed collectively as the MRC package (Crowther, et al., 1996). Extensions to this suite and alternative algorithms continue to play an essential role in image processing as computers become faster and as advancements are made in the mathematics of signal processing. In this capacity, an alternative procedure to generate a 3D structure from processed 2D images is presented. This algorithm, entitled ÒProjective Constraint OptimizationÓ (PCO), leverages prior known information, such as symmetry and the fact that the protein is bound in a membrane, to extend the normal boundaries of resolution. In particular, traditional methods (Agard, 1983) make no attempt to account for the Òmissing coneÓ a vast, un-sampled, region in 3D Fourier space arising from specimen tilt limitations in the microscope. Provided sufficient data, PCO simultaneously refines the dataset, accounting for error, as well as attempting to fill this missing cone.
Though PCO provides a near-optimal 3D reconstruction based on data, depending on initial data quality and amount of prior knowledge, there may be a host of solutions, and more importantly pseudo-solutions, which are more-or-less consistent with the provided dataset. Trying to find a global best-fit for known information and data can be a daunting challenge mathematically, to this end the use of meta-heuristics is addressed. Specifically, in the case of many pseudo-solutions, so long as a suitably defined error metric can be found, quasi-evolutionary swarm algorithms can be used that search solution space, sharing data as they go. Given sufficient computational power, such algorithms can dramatically reduce the search time for global optimums for a given dataset.
Once the structure of a protein has been determined, many questions often remain about its function. Questions about the dynamics of a protein, for instance, are not often readily interpretable from structure alone. To this end an investigation into computationally optimized structural dynamics is described. Here, in order to find the most likely path a protein might take through Òconformation spaceÓ between two conformations, a graphics processing unit (GPU) optimized program and set of libraries is written to speed of the calculation of this process 30x. The tools and methods developed here serve as a conceptual template as to how GPU coding was applied to other aspects of the work presented here as well as GPU programming generally.
The final portion of the thesis takes an apparent step in reverse, presenting a dramatic, yet highly predictive, simplification of a complex biological process. Kinetic Monte Carlo simulations idealize thousands of proteins as interacting agents by a set of simple rules (i.e. react/dissociate), offering highly-accurate insights into the large-scale cooperative behavior of proteins. This work demonstrates that, for many applications, structure, dynamics or even general knowledge of a protein may not be necessary for a meaningful biological story to emerge. Additionally, even in cases where structure and function is known, such simulations can help to answer the biological question in its entirety from structure, to dynamics, to ultimate function.
Advisors: | Stahlberg, Henning |
---|---|
Committee Members: | Engel, Andreas |
Faculties and Departments: | 05 Faculty of Science > Departement Biozentrum > Structural Biology & Biophysics |
UniBasel Contributors: | Gipson, Bryant and Stahlberg, Henning |
Item Type: | Thesis |
Thesis Subtype: | Doctoral Thesis |
Thesis no: | 9081 |
Thesis status: | Complete |
Number of Pages: | 128 S. |
Language: | English |
Identification Number: |
|
edoc DOI: | |
Last Modified: | 22 Jan 2018 15:51 |
Deposited On: | 23 Jul 2010 07:26 |
Repository Staff Only: item control page