Bayesian methods in transcriptomics

Grobecker, Pascal. Bayesian methods in transcriptomics. 2021, Doctoral Thesis, University of Basel, Faculty of Science.

Official URL: https://edoc.unibas.ch/96229/

Abstract

Transcriptomics techniques provide expression measurements across all genes and are therefore crucial for characterising and understanding cellular states in multicellular organisms. The dominant technique in the last decade has been RNA-seq, which can be applied either in bulk or to single cells. For bulk data, researchers are often interested in identifying marker genes that can be used in subsequent studies to differentiate between two or more classes of samples (e.g. cell types). We developed a novel statistical model for identifying such marker genes from RNA-seq data. Our model is based on a conditional entropy score that works well even when the number of gene expression measurements per class is small and when more than two groups are compared.
Single-cell RNA-seq (scRNA-seq) has become a popular experimental method to study variation of gene expression within a population of cells. A main application of scRNA-seq is to obtain an exhaustive picture of the variation in cell types that exist within a given tissue by clustering cells into subsets with distinct gene expression patterns. One challenge for such analyses is that the measured gene expression states of single cells are subject to substantial unwanted noise, both from inherent stochastic fluctuations due to small mRNA numbers and from technical noise in the experiment. Existing computational pipelines often try to disentangle these unwanted sources of noise from genuine biological signals by applying several layers of ad hoc steps, including feature selection, normalisation, and dimensionality reduction, before clustering cells into subtypes. However, such pre-processing can dramatically distort the measurements by erroneously filtering out true biological variability and introducing artefactual correlations.
Here we propose a new computational method, called cellstates, that takes the raw UMI counts of an scRNA-seq experiment as input and rigorously models the structure of both biological and experimental noise to find maximally resolved clusters of cells, i.e. groups of cells whose gene expression states are statistically indistinguishable. The cellstates method has no tuneable parameters, automatically optimises the number of clusters, and returns directly interpretable results, thereby overcoming many issues of other available tools. In addition, cellstates provides a data analysis toolbox for placing the cellstates within a hierarchy and identifying differentially expressed genes at each level of this hierarchy, along with several novel data visualisations.
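The abstract does not define the conditional entropy score, so the following is only a minimal sketch of the general idea, not the thesis's model. It assumes expression values are discretised by a hypothetical fixed threshold and uses the plain empirical (plug-in) estimate of H(class | expression state); the function names `conditional_entropy` and `binarise` are illustrative. Unlike the Bayesian score described above, a plug-in estimate of this kind is not expected to behave well when there are few measurements per class.

```python
import math
from collections import Counter


def binarise(expression, threshold):
    """Map expression values to 0/1 states (hypothetical fixed threshold)."""
    return [int(x > threshold) for x in expression]


def conditional_entropy(labels, states):
    """Empirical H(labels | states) in bits.

    A value of 0 means the expression state fully determines the class;
    larger values mean the gene is less informative about the class.
    Works for any number of classes.
    """
    n = len(labels)
    joint = Counter(zip(states, labels))   # counts n(state, label)
    marginal = Counter(states)             # counts n(state)
    h = 0.0
    for (state, _label), n_sl in joint.items():
        h -= (n_sl / n) * math.log2(n_sl / marginal[state])
    return h


# Toy data (invented for illustration): six samples from two classes.
labels = ["A", "A", "A", "B", "B", "B"]
gene_marker = [9.1, 8.7, 9.4, 1.2, 0.8, 1.5]   # separates A from B
gene_other = [5.2, 1.1, 6.0, 5.5, 0.9, 6.3]    # unrelated to class

score_marker = conditional_entropy(labels, binarise(gene_marker, 5.0))
score_other = conditional_entropy(labels, binarise(gene_other, 5.0))
```

In this toy example the first gene gets a score of 0 bits and the second a score of 1 bit, so ranking genes by ascending score would put the candidate marker first.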

Advisors:	van Nimwegen, Erik
Committee Members:	Zavolan, Mihaela and Huber, Wolfgang
Faculties and Departments:	05 Faculty of Science > Departement Biozentrum > Computational & Systems Biology > Bioinformatics (Zavolan); 05 Faculty of Science > Departement Biozentrum > Computational & Systems Biology > Bioinformatics (van Nimwegen)
UniBasel Contributors:	van Nimwegen, Erik and Zavolan, Mihaela
Item Type:	Thesis
Thesis Subtype:	Doctoral Thesis
Thesis no:	15265
Thesis status:	Complete
Number of Pages:	iv, 76
Language:	English
Identification Number:	urn: urn:nbn:ch:bel-bau-diss152652
edoc DOI:	10.5451/unibas-ep96229
Last Modified:	08 Feb 2024 05:30
Deposited On:	07 Feb 2024 07:56
