Shankar Raman, Sudhir. Bayesian grouped variable selection. 2012, Doctoral Thesis, University of Basel, Faculty of Science.
Official URL: http://edoc.unibas.ch/diss/DissB_9927
Abstract
Traditionally, variable selection in linear regression has been approached with optimization-based methods such as the classical Lasso. Such methods provide a sparse point estimate of the regression coefficients but give no further information about their distribution, such as expectation or variance estimates. In recent years, there has been some progress on Bayesian formulations of variable selection, for example the Bayesian Lasso. Motivated by these developments, in this thesis we build an omnibus Bayesian framework for grouped-variable selection in linear regression models. This framework summarizes the posterior distribution over the regression coefficients with estimates of its moments and its mode. Inference is carried out using Markov chain Monte Carlo (MCMC) sampling. The estimate of the posterior mode is generated by the same MCMC sampling algorithm, with minimal changes, using simulated annealing.
Going beyond simple linear regression, the framework is extended, with minimal changes, to generalized linear models such as Poisson and binomial models. On the algorithmic side, we develop a highly efficient MCMC sampling algorithm for inference. Apart from the Poisson and binomial models, the framework also incorporates the Weibull model, which is used extensively in survival analysis. This extension is combined with an additional clustering component based on a survival mixture-of-experts model. The clustering component makes it possible to perform variable selection (per cluster) simultaneously with cluster identification using Dirichlet processes, which avoids the need to fix the number of clusters in advance.
The resulting framework has been applied to several biological problems, such as identifying novel compound biomarkers for breast cancer from tissue microarray data and analyzing splice site data to identify distinguishing features of true splice sites. Survival data for breast cancer patients have been used to identify low-risk and high-risk patients and the significant compound markers of each group.
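The abstract notes that one MCMC sampler can deliver both posterior moments and, via simulated annealing, the posterior mode. The following is only a minimal sketch of that general idea and not the algorithm developed in the thesis: the toy log-posterior (Gaussian fit term with an orthonormal design plus a group-norm penalty), the values `B_HAT`, `GROUPS` and `LAM`, the coordinate-wise random-walk Metropolis update, and the geometric cooling schedule are all assumptions chosen for illustration.

```python
import math
import random

# Toy log-posterior (an assumption, not the thesis model): Gaussian fit term
# for an orthonormal design plus a group-norm penalty on each group.
B_HAT = [2.0, -1.5, 0.05, -0.05]   # hypothetical least-squares estimate
GROUPS = [[0, 1], [2, 3]]          # hypothetical grouping of coefficients
LAM = 1.0                          # hypothetical penalty strength


def log_post(beta):
    fit = -0.5 * sum((b - h) ** 2 for b, h in zip(beta, B_HAT))
    penalty = sum(math.sqrt(sum(beta[i] ** 2 for i in g)) for g in GROUPS)
    return fit - LAM * penalty


def metropolis(n_sweeps, temperature, seed=0):
    """Coordinate-wise random-walk Metropolis targeting exp(log_post / T).

    temperature(k) returns T at sweep k. A constant T = 1 samples the
    posterior; a decreasing schedule turns the same loop into simulated
    annealing, which concentrates the chain near the mode.
    """
    rng = random.Random(seed)
    beta = [0.0] * len(B_HAT)
    current = log_post(beta)
    draws = []
    for k in range(n_sweeps):
        t = temperature(k)
        for i in range(len(beta)):
            proposal = list(beta)
            proposal[i] += rng.gauss(0.0, math.sqrt(t))  # step shrinks with T
            candidate = log_post(proposal)
            # Accept with probability min(1, exp((candidate - current) / T)).
            if math.log(1.0 - rng.random()) < (candidate - current) / t:
                beta, current = proposal, candidate
        draws.append(list(beta))
    return draws


N = 5000
# Same sampler, two temperature schedules.
posterior_draws = metropolis(N, lambda k: 1.0)
annealed_draws = metropolis(N, lambda k: 1e-4 ** (k / (N - 1)))
mode_estimate = annealed_draws[-1]

# Posterior mean of the first coefficient, discarding 1000 burn-in sweeps.
kept = posterior_draws[1000:]
mean_first = sum(d[0] for d in kept) / len(kept)

# For this toy posterior only, the mode is known in closed form:
# group-wise soft-thresholding of B_HAT.
exact_mode = [0.0] * len(B_HAT)
for g in GROUPS:
    norm = math.sqrt(sum(B_HAT[i] ** 2 for i in g))
    shrink = max(0.0, 1.0 - LAM / norm)
    for i in g:
        exact_mode[i] = shrink * B_HAT[i]
```

In this toy setting the annealed chain should end close to `exact_mode`, in which the second group is exactly zero as a whole, while the draws at constant temperature 1 can be averaged to estimate posterior moments. A real application would replace `log_post` with the model's actual posterior and would need a tuned proposal and cooling schedule.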
Advisors: | Roth, Volker |
---|---|
Committee Members: | Seeger, Matthias |
Faculties and Departments: | 05 Faculty of Science > Departement Mathematik und Informatik > Informatik > Biomedical Data Analysis (Roth) |
UniBasel Contributors: | Shankar Raman, Sudhir and Roth, Volker |
Item Type: | Thesis |
Thesis Subtype: | Doctoral Thesis |
Thesis no: | 9927 |
Thesis status: | Complete |
Number of Pages: | 138 S. |
Language: | English |
Last Modified: | 22 Jan 2018 15:51 |
Deposited On: | 13 Aug 2012 12:48 |