We investigated methods for selecting haplotype tagging SNPs (htSNPs) using Principal Components Analysis (PCA). This work has been summarized in:

Lin and Altman, Finding Haplotype Tagging SNPs by Use of Principal Components Analysis, American Journal of Human Genetics 2004 Nov;75(5):850-61. Epub 2004 Sep 23. [PMID][PDF].

The immense volume and rapid growth of human genomic data, especially single nucleotide polymorphisms (SNPs), present special challenges for both biomedical researchers and automatic algorithms. One such challenge is to select an optimal subset of SNPs, commonly referred to as "haplotype tagging SNPs" (htSNPs), to capture most of the haplotype diversity of each haplotype block or gene-specific region. This information-reduction process facilitates cost-effective genotyping and, subsequently, genotype-phenotype association studies. It also has implications for assessing the risk of identifying research subjects on the basis of SNP information deposited in public domain databases. We have investigated methods for selecting htSNPs by use of principal components analysis (PCA). These methods first identify eigenSNPs and then map them to actual SNPs. We evaluated two mapping strategies, greedy discard and varimax rotation, by assessing the ability of the selected htSNPs to reconstruct genotypes of non-htSNPs. We also compared these methods with two other htSNP finders, one of which is PCA based. We applied these methods to three experimental data sets and found that the PCA-based methods tend to select the smallest set of htSNPs to achieve a 90% reconstruction precision.
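The two-step idea described above (identify eigenSNPs, then map them to actual SNPs) can be illustrated with a minimal Python sketch. This is not the eigen2htSNP source code, and it makes several assumptions that are not taken from the paper: the function names (`eigen_snps`, `varimax`, `select_htsnps`) are hypothetical, genotypes are assumed to be a numeric samples-by-SNPs matrix with no monomorphic columns, PCA is run on the SNP correlation matrix, and each rotated component is mapped to the SNP with the largest absolute loading. The published method may differ in any of these details.

```python
import numpy as np


def eigen_snps(genotypes, var_threshold=0.90):
    """Loadings of the leading components ("eigenSNPs") that together
    explain at least var_threshold of the variance (assumed criterion)."""
    X = np.asarray(genotypes, dtype=float)
    corr = np.corrcoef(X, rowvar=False)        # SNP-by-SNP correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(explained, var_threshold)) + 1, len(eigvals))
    return eigvecs[:, :k] * np.sqrt(eigvals[:k])


def varimax(loadings, max_iter=100, tol=1e-8):
    """Standard SVD-based varimax rotation of a loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ col_ss / p)
        )
        rotation = u @ vt
        if s.sum() < objective * (1 + tol):    # no further improvement
            break
        objective = s.sum()
    return loadings @ rotation


def select_htsnps(genotypes, var_threshold=0.90):
    """Map each rotated eigenSNP to one actual SNP (assumed rule:
    the not-yet-chosen SNP with the largest absolute loading)."""
    rotated = varimax(eigen_snps(genotypes, var_threshold))
    chosen = []
    for j in range(rotated.shape[1]):
        ranked = np.argsort(-np.abs(rotated[:, j]))
        chosen.append(int(next(i for i in ranked if i not in chosen)))
    return sorted(chosen)


# Toy data: SNPs 0-2 are identical copies of one pattern and
# SNPs 3-4 are identical copies of a second, uncorrelated pattern.
a = np.tile([0, 0, 1, 1], 5)
b = np.tile([0, 1, 0, 1], 5)
toy_genotypes = np.column_stack([a, a, a, b, b])
print(select_htsnps(toy_genotypes))
```

On the toy matrix, two components account for all of the variance, so the sketch returns two SNP indices, one from each group of redundant SNPs. Which member of each group is returned depends on tie-breaking and should not be relied on.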

Here are color versions of the figures from the paper.

Here are graphical displays of the htSNPs that we identified in the three experimental data sets, selected to explain 90% of the variance in the data (these display correctly only in Microsoft Internet Explorer):

Download the source code of eigen2htSNP.

A fuller description and an example of finding htSNPs using the varimax rotation algorithm.