Ongoing Study

Variable Selection for Nonparametric Quantile Regression

With Dr. Hao Helen Zhang, Dr. Howard D. Bondell and Dr. Hui Zou

The cosso package is now available on CRAN.

Key References:

Li, Y., Liu, Y. and Zhu, J. (2007) Quantile Regression in Reproducing Kernel Hilbert Spaces. J. Am. Stat. Assoc. 102, 255-268

Lin, Y. and Zhang, H. H. (2006) Component Selection and Smoothing in Smoothing Spline Analysis of Variance Models. Ann. Stat. 34, 2272-2297

Storlie, C., Bondell, H., Reich, B. and Zhang, H. H. (2011) Surface Estimation, Variable Selection, and the Nonparametric Oracle Property. Stat. Sinica 21, 679-705

Forward Selection in High-Dimensional Feature Space

With Dr. Howard Bondell, Dr. Hao Helen Zhang and Dr. Leonard Stefanski

Key References:

Fan, J. and Lv, J. (2008) Sure Independence Screening for Ultrahigh Dimensional Feature Space (with Discussion). J. Roy. Stat. Soc. B 70, 849-911

Wang, H. (2009) Forward Regression for Ultra-High Dimensional Variable Screening. J. Am. Stat. Assoc. 104, 1512-1524

Least Squares Approximation

Key Reference:

Wang, H. and Leng, C. (2007) Unified LASSO Estimation by Least Squares Approximation. J. Am. Stat. Assoc. 102, 1039-1048

Previous Studies

Diagnosis of Multivariate Normal Mixture Model

Picture taken from Los Alamos National Laboratory. http://www.lanl.gov

Motivated by the analysis of flow cytometry data, Dr. Lung-An Li (Institute of Statistical Science, Academia Sinica) adopted the multivariate normal mixture model for data of this kind.

In this study, we originally worked with a dataset of 300,000 observations on 9 biomarkers per mouse. Because of the difficulties arising from dimensionality and the questionable validity of the normality assumption, we used only 4 of the 9 biomarkers and further partitioned the data into suitable subsets.

The study aims to provide rapid screening procedures for disease phenotypes via high-throughput methods. With such procedures, biologists can identify the mutant genes responsible for pheno-deviant mouse models.

Mixture models are a well-established modeling technique that has been widely applied in various fields. The statistical challenge in this study is to provide a visual inspection procedure for examining the model fit. For univariate and multivariate normal data, various statistical procedures are readily available for testing normality. For multivariate normal mixture data, however, no such tool is yet available.
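One generic way to inspect a normal mixture fit visually is to assign each observation to its most probable component and compare the squared Mahalanobis distances within each component against chi-square quantiles. The sketch below is offered only as an illustration of that general idea, not as the procedure developed in this study. It assumes a bivariate mixture whose weights, means and covariances are already estimated; all function names are hypothetical. In two dimensions the chi-square quantile has the closed form -2 log(1 - p), which keeps the example free of external libraries.

```python
import math

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from one component."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (dx, dy) * inverse(cov) * (dx, dy)^T for a 2x2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def log_density(x, mean, cov):
    """Log density of a bivariate normal distribution at x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return (-math.log(2.0 * math.pi) - 0.5 * math.log(det)
            - 0.5 * mahalanobis_sq(x, mean, cov))

def qq_pairs(points, weights, means, covs):
    """For each component, return sorted (theoretical, empirical) quantile pairs."""
    groups = [[] for _ in weights]
    for x in points:
        # Hard assignment to the component with the largest posterior weight.
        scores = [math.log(w) + log_density(x, m, c)
                  for w, m, c in zip(weights, means, covs)]
        k = scores.index(max(scores))
        groups[k].append(mahalanobis_sq(x, means[k], covs[k]))
    pairs = []
    for dists in groups:
        dists.sort()
        n = len(dists)
        # Chi-square(2) quantile at probability (i + 0.5) / n is -2 log(1 - p).
        pairs.append([(-2.0 * math.log(1.0 - (i + 0.5) / n), dsq)
                      for i, dsq in enumerate(dists)])
    return pairs
```

Plotting each list of pairs should give points close to the 45-degree line when a component is well described by a normal distribution. Hard assignment truncates overlapping components, so the comparison is most informative when the components are well separated.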

 

Willingness-to-Pay

Willingness-to-pay (WTP) problems are not new to economists. The statistical models proposed in the economics literature are derived from maximizing consumers' utility (Bateman, 2002; Hanemann, Loomis and Kanninen, 1991).

From a statistical point of view, the data collected from dichotomous-choice questions can be treated as incomplete data or, more specifically, censored data. Numerous survival models have been proposed for data of this kind (Kalbfleisch and Prentice, 2002). The difficulty is that a WTP problem often involves three types of consumers: those willing to pay nothing, those willing to pay up to a certain price, and those willing to pay any price. A traditional survival model may not fit well in this case. C. H. Chen et al. (to appear) proposed a three-component mixture model that combines an accelerated failure time (AFT) model with logistic regression for this application.

In my thesis, I studied how large a premium consumers are willing to pay for non-genetically modified agricultural goods over their genetically modified counterparts. Building on an idea similar to the one proposed by Chen, I improved the statistical efficiency and reduced the potential estimation bias of previous studies.
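To make the three-group structure concrete, the following is a minimal sketch of the likelihood for a single-bounded dichotomous-choice question, in which each respondent is offered one bid and answers yes or no. It is an illustration under stated assumptions and not the model of Chen et al.: the intermediate group is assumed to have log-normal WTP (an AFT model with normal errors), the group proportions `p_zero` and `p_any` are treated as constants rather than modeled by logistic regression, and all names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_yes(bid, p_zero, p_any, mu, sigma):
    """P(respondent accepts a positive bid) under the three-group mixture.

    p_zero: assumed share willing to pay nothing (always answers no)
    p_any:  assumed share willing to pay any price (always answers yes)
    mu, sigma: location and scale of log(WTP) in the intermediate group
    """
    p_mid = 1.0 - p_zero - p_any
    # Probability that a log-normal WTP exceeds the offered bid.
    surv = 1.0 - normal_cdf((math.log(bid) - mu) / sigma)
    return p_any + p_mid * surv

def log_likelihood(data, p_zero, p_any, mu, sigma):
    """Log-likelihood for an iterable of (bid, answered_yes) pairs."""
    total = 0.0
    for bid, yes in data:
        p = prob_yes(bid, p_zero, p_any, mu, sigma)
        total += math.log(p if yes else 1.0 - p)
    return total
```

Under these assumptions the acceptance probability falls from `1 - p_zero` for very small bids to `p_any` for very large bids, which is the feature a single survival curve cannot reproduce.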

 

Self-Organized Map (SOM)

 

 

Chromatin Immunoprecipitation (ChIP) Sequencing

Illumina ChIP-Seq technology is a more efficient way to identify the binding sites of transcription factors and other DNA-associated proteins. It overcomes several limitations of traditional ChIP-chip technology, allowing researchers to investigate how proteins interact with DNA to regulate gene expression.

In this study, under the guidance of Dr. Chen-Hsin Chen and Dr. Ueng-Cheng Yang, I was assigned to study the peak-finding algorithms implemented in currently available ChIP-Seq analysis programs, such as CisGenome (Ji and Wong) and Peak Finder (The Wold Lab).
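The basic idea shared by many peak finders can be illustrated with a sliding-window read count. The sketch below is a generic illustration only: it does not reproduce the algorithms implemented in CisGenome or Peak Finder, and the window width, step size and read-count threshold are hypothetical parameters chosen for the example.

```python
from bisect import bisect_left

def find_peaks(read_starts, window=200, step=50, min_reads=10):
    """Return merged (start, end) regions whose windows hold >= min_reads reads."""
    positions = sorted(read_starts)
    if not positions:
        return []
    peaks = []
    # Align the first window to the step grid at or below the first read.
    start = positions[0] - positions[0] % step
    last = positions[-1]
    while start <= last:
        end = start + window
        # Count reads with start <= position < end using binary search.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count >= min_reads:
            if peaks and start <= peaks[-1][1]:
                # Window overlaps the previous peak, so extend that peak.
                peaks[-1] = (peaks[-1][0], end)
            else:
                peaks.append((start, end))
        start += step
    return peaks
```

Real peak callers add steps that this sketch omits, such as comparison against a control sample and a statistical model for the background read count.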

Picture taken from Illumina company. http://www.illumina.com

Course Projects

Smoothing and Nonparametric Regression

 

Sparse Principal Component Analysis


Machine Learning

Several multi-class Support Vector Machine (SVM) techniques are applied to a gene expression dataset to classify normal controls versus bipolar disorder or schizophrenia patients. In addition, two regularized SVM methods, the 1-norm SVM (Zhu, J. et al., 2003) and the SCAD SVM (Zhang, H. H. et al., 2006), are applied to this dataset. The analysis suggests that the regularized SVMs provide satisfactory classification accuracy. However, more work is required to improve performance on the multi-class classification problem.
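As a rough illustration of how an L1 penalty enters an SVM, the sketch below minimizes the mean hinge loss plus an L1 penalty on the weights for a binary problem. It is not the algorithm of Zhu et al. (2003): plain subgradient descent is swapped in here for simplicity, the problem is binary rather than multi-class, and the toy data generator, function names and tuning values are all hypothetical.

```python
import random

def make_toy_data(n=100, seed=0):
    """Hypothetical toy data: feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    y = [1 if i % 2 == 0 else -1 for i in range(n)]
    X = [[2.0 * yi + rng.gauss(0, 0.5), rng.gauss(0, 1)] for yi in y]
    return X, y

def train_l1_svm(X, y, lam=0.01, lr=0.05, epochs=200):
    """Subgradient descent on mean hinge loss + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # hinge loss is active for this observation
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in range(d):
            sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|
            w[j] -= lr * (grad_w[j] + lam * sign)
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify a single observation as +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

On the toy data the weight on the informative feature should dominate the weight on the noise feature, which is the shrinkage behavior that makes L1-type penalties attractive for gene expression data with many irrelevant genes.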

Dimension Reduction

To better predict survival outcomes from microarray data, various dimension reduction methods, including PCA+cSIR (Li and Li, 2004), LA+cSIR (Wu et al., 2008) and Supervised PCA (Bair et al., 2006), are studied and compared using the diffuse large-B-cell lymphoma data of Rosenwald et al. (2002). The results show that the gene signature constructed by Supervised PCA provides the best performance in terms of time-dependent AUC.