Abstract for presentation at 11th International Congress of Human Genetics

Bias in classification based on gene expression data

  • Dr Ian Wood, Queensland University of Technology, Australia
  • Prof Peter Visscher, Queensland Institute of Medical Research, Australia
  • Prof Kerrie Mengersen, Queensland University of Technology, Australia

    Gene expression data offers a large number of potentially useful covariates for classifying patients into classes such as diseased and non-diseased, or into subclasses of a disease. A number of sophisticated techniques have been brought to bear on this type of problem, including nearest shrunken centroids and support vector machines.
    In this presentation we investigate potential sources of bias and misinterpretation in reported estimates of accuracy (that is, one minus the error rate). We illustrate these through two published examples and a simulation study.
    The first data set was originally analysed in Khan et al. (Nature Medicine, 2001, pp. 673-679) and gives the expression levels of 6567 genes in 63 samples from biopsies and cell lines taken from small round blue-cell tumours. The task was to fit a model to classify a sample into one of four tumour types. The second data set was originally analysed in Sharma et al. (Breast Cancer Research, 2005, pp. R634-R644) and contains the expression levels of 1368 genes in 60 labelled peripheral blood samples. The task was to build a classifier to predict whether or not a blood sample came from a patient with breast cancer. We simulated a third data set of 100 observations in which each covariate and the class label were generated independently. Each observation contained 2000 normally distributed covariates and was labelled with one of two equiprobable classes.
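    The simulated data set described above can be sketched as follows. This is a minimal illustration, not the code used for the presentation: the sample sizes come from the abstract, while the seed, the use of standard normal covariates (mean 0, variance 1) and the variable names are assumptions. The last step reports the largest sample correlation between any single covariate and the label, to show how strong a chance association can look when there are many covariates and few observations.

```python
import numpy as np

# Sizes taken from the abstract; the seed is an arbitrary choice.
N_OBS, N_COVARIATES = 100, 2000
rng = np.random.default_rng(1)

# Covariates: independent normal draws. The abstract says only
# "normally distributed", so mean 0 and variance 1 are assumed here.
X = rng.standard_normal((N_OBS, N_COVARIATES))

# Labels: two equiprobable classes, drawn independently of X.
y = rng.integers(0, 2, size=N_OBS)

# Sample correlation of each covariate with the label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
    (Xc ** 2).sum(axis=0) * (yc ** 2).sum()
)

# Every true correlation is zero, yet the largest sample value is not small.
max_abs_corr = np.abs(corr).max()
print(f"largest |correlation| with the label: {max_abs_corr:.2f}")
```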
    We describe a form of selection bias that occurs when the accuracy estimated using cross-validation or similar methods is also used to select the optimal value of a parameter or perform variable selection. We also consider possible bias deriving from the large number of covariates combined with the typically small sample size. Classifiers were refit to a large number of random permutations of the labels to examine properties of the accuracy statistic in the absence of a relationship between the labels and the covariates.
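    The selection bias and the label permutations described above can be illustrated with a small sketch on null data of the same shape as the simulated data set. Everything specific in it is an assumption made for illustration: the nearest-centroid classifier, the rule that keeps the 20 covariates with the largest difference in class means, the 10 folds, the 50 permutations and the function names are stand-ins, not the methods used in the presentation. Variable selection is run either once on all observations before cross-validation (the biased procedure) or separately within each training fold.

```python
import numpy as np

def top_k_features(X, y, k):
    """Indices of the k covariates with the largest |difference in class means|."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(-np.abs(diff))[:k]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class with the closer training centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k, n_folds, rng, select_inside):
    """Cross-validated accuracy with selection inside or outside the folds."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    # Selecting on all observations lets the held-out labels leak into the model.
    global_feats = None if select_inside else top_k_features(X, y, k)
    correct = 0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        if select_inside:
            feats = top_k_features(X[train], y[train], k)
        else:
            feats = global_feats
        pred = nearest_centroid_predict(
            X[train][:, feats], y[train], X[test_idx][:, feats]
        )
        correct += int((pred == y[test_idx]).sum())
    return correct / n

# Null data: labels are independent of every covariate.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))
y = rng.integers(0, 2, size=100)

biased = cv_accuracy(X, y, k=20, n_folds=10, rng=rng, select_inside=False)
unbiased = cv_accuracy(X, y, k=20, n_folds=10, rng=rng, select_inside=True)
print(f"selection outside CV: {biased:.2f}, selection inside CV: {unbiased:.2f}")

# Refit the biased procedure to randomly permuted labels to see how the
# accuracy statistic behaves when no relationship exists.
null_acc = np.array([
    cv_accuracy(X, rng.permutation(y), k=20, n_folds=10, rng=rng,
                select_inside=False)
    for _ in range(50)
])
print(f"mean biased accuracy over permuted labels: {null_acc.mean():.2f}")
```

    Since the labels carry no information, an honest estimate should sit near 0.5. Any excess in the first figure, or in the permutation mean, comes from the selection step rather than from the data.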
