Computational analysis of next generation sequencing data : from transcription start sites in bacteria to human non-coding RNAs
Date Issued
2014
Author(s)
DOI
10.5451/unibas-006482096
Abstract
The advent of next generation sequencing (NGS) technologies has revolutionized the field of molecular biology by providing a wealth of sequence data. “Transcriptomics”, which aims to identify and annotate the complete set of RNA molecules transcribed from a genome, is one of the main applications of these high-throughput methods. Special attention has been paid to determining the exact position of the 5’ ends of RNA transcripts, the transcription start sites (TSSs), and subsequently to identifying the regulatory motifs that ultimately govern gene expression. Recently, a novel experimental approach termed dRNA-seq has emerged that enables TSS identification in prokaryotic genomes on a genome-wide scale. While the experimental procedure has reached maturity, the computational downstream analysis of dRNA-seq data is still in its infancy. Analysis of dRNA-seq data was previously done manually, a tedious task that is prone to errors and biases. To automate this process, we developed a computational tool for accurate and systematic genome-wide identification of TSSs from dRNA-seq data. In particular, we used a Bayesian framework for TSS calling and a Hidden Markov Model to infer the canonical motifs in the promoter regions of TSSs, in order to also capture TSSs that show low evidence of expression. In a second contribution, we exploited the power of next generation sequencing to identify and characterize the expression and processing mechanisms of snoRNAs. SnoRNAs are a particular class of non-protein coding RNAs whose main function is the post-transcriptional modification of other non-protein coding RNAs. SnoRNAs carry out their function as part of ribonucleoprotein complexes (RNPs).
To gain insight into these protein-RNA interactions, we used a technique called PAR-CLIP (Photoactivatable-Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation), which allows the identification of protein-RNA contacts at nucleotide resolution. Using PAR-CLIP data, we were able to demonstrate that snoRNAs undergo precise processing and that many loci in the human genome generate snoRNA-like transcripts whose evolutionary conservation and expression are considerably lower than those of currently catalogued snoRNAs. Finally, we set out to use small RNA-seq data from the ENCODE project to construct a comprehensive catalog of genomic loci that give rise to snoRNAs. In addition, we expanded the current catalog of human snoRNAs and studied the plasticity of snoRNA expression across different cell types. Our analysis confirmed prior observations that several snoRNAs show cell-type-specific expression, mainly in neurons. A more striking observation was that snoRNA expression appears to be strongly dysregulated in cancers, which could lead to the identification of novel biomarkers.
File(s)
Name
my_thesis_problem.pdf
Size
3.11 MB
Format
Adobe PDF
Checksum
(MD5):84c8caf7cbb090655ea8aa2a29d3d8dc