Research

Ongoing Work

Statistical Inference for Generalized Linear Models based on Synthetic Data
Phase Diagram of QDA Classifier for High-Dimensional Gaussian Mixtures

MSc Thesis

Title: Binary Classification of High-Dimensional Data using Quadratic Discriminant Analysis

Abstract: Classification of high-dimensional data arises naturally in modern statistical research and clinical studies. Classification becomes difficult, if not impossible, when the number of features, p, exceeds the number of observations, n, a setting known as high dimensionality. High-dimensional classification techniques are applicable in diverse fields, such as genomics, where microarray gene expression data, characterized by thousands of features, are used to classify patients as healthy or unhealthy. However, only a small fraction of those features are informative, with the vast majority contributing negligibly to the classification, if at all, and it is not known before sampling which features are useful. Imposing a multivariate Gaussian distribution on the class-conditional densities yields a family of classifiers known as Discriminant Analysis. Linear Discriminant Analysis (LDA) assumes identical covariance matrices across classes, while Quadratic Discriminant Analysis (QDA) relaxes this assumption. Both QDA and LDA encounter difficulties in the high-dimensional setting, where their performance degrades to the point of being no better than selecting a class at random. In this study, we propose a QDA classifier tailored to three models that incorporate progressively more complicated structures in the contrast mean and precision matrix. By varying the sparsity and weakness levels of the parameters, relative to p, we examine the optimality of QDA under our models. Additionally, we identify regions of the parameter space where successful classification is impossible for any binary classifier, in the sense that the misclassification rate is always bounded below by a positive constant. Building a theoretical framework, we derive and interpret the phase diagram for each model, which quantifies the influence of the magnitude and proportion of useful features on classification success. 
The theoretical results are then validated through simulation studies, with visualizations that provide evidence of convergence in the regions identified by the theory and compare the convergence rates of the misclassification rate across different parts of the parameter space.
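
To make the QDA rule mentioned in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the classifier proposed in the thesis: it uses the true (oracle) parameters, equal class priors, and illustrative values of p, n, the contrast mean, and the precision matrices, all of which are assumptions chosen for this example. The function name qda_score is hypothetical.

```python
import numpy as np

def qda_score(X, mu0, mu1, prec0, prec1):
    """Log-likelihood ratio of class 1 vs. class 0 for each row of X,
    assuming Gaussian class conditionals and equal priors."""
    d0 = X - mu0
    d1 = X - mu1
    # Quadratic forms (x - mu)' Omega (x - mu), one per row of X
    q0 = np.einsum("ij,jk,ik->i", d0, prec0, d0)
    q1 = np.einsum("ij,jk,ik->i", d1, prec1, d1)
    _, logdet0 = np.linalg.slogdet(prec0)
    _, logdet1 = np.linalg.slogdet(prec1)
    return 0.5 * (q0 - q1) + 0.5 * (logdet1 - logdet0)

rng = np.random.default_rng(0)
p, n = 50, 2000                 # illustrative sizes, not taken from the thesis

mu0 = np.zeros(p)
mu1 = np.zeros(p)
mu1[:5] = 1.0                   # sparse contrast mean: only 5 useful features

prec0 = np.eye(p)
diag1 = np.ones(p)
diag1[:10] = 2.0                # precision matrices differ in 10 coordinates
prec1 = np.diag(diag1)

# Draw n observations per class from the two Gaussian components
X0 = rng.multivariate_normal(mu0, np.linalg.inv(prec0), size=n)
X1 = rng.multivariate_normal(mu1, np.linalg.inv(prec1), size=n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# Classify to class 1 when the log-likelihood ratio is positive
pred = (qda_score(X, mu0, mu1, prec0, prec1) > 0).astype(int)
err = np.mean(pred != y)
print(f"Empirical misclassification rate of oracle QDA: {err:.3f}")
```

When prec0 and prec1 are equal, the quadratic terms cancel and the score reduces to the linear LDA rule. Shrinking the nonzero entries of mu1 or reducing their number in this sketch is one informal way to see how weaker and sparser signals push the error rate toward that of random guessing.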

The thesis is available at the University of Calgary's digital repository, PRISM.