
Bioinformatics approaches to medical imaging and microarray studies

Posted on: 2004-11-04
Degree: Ph.D
Type: Dissertation
University: The Catholic University of America
Candidate: Wang, Zuyi
Full Text: PDF
GTID: 1460390011468237
Subject: Engineering
Abstract/Summary:
This dissertation develops analytic methods and tools for exploring large-volume, high-dimensional biological data, such as mammogram mass feature data and gene expression microarray data. The mammogram feature data are used in Computer-Aided Diagnosis (CAD) systems for breast cancer detection. Gene expression microarrays measure the expression levels of thousands of known, sequenced genes simultaneously. The resulting large-volume, high-dimensional microarray data contain rich information for identifying the genetic factors behind cancers and other diseases. Currently, no effective and powerful analytic tool is available for exploring the structure of these types of data. The objectives of this study are: (1) cluster discovery: develop and optimize pattern recognition techniques that discover cluster structure by statistically modeling the data, so as to detect mass/cancer subtypes; (2) gene selection: define criteria and improve algorithms for class separability-based gene selection that may identify the most information-rich gene subset.
A model-supported hierarchical visual data exploration tool is developed in this study for cluster discovery. Its main components are: (1) statistical modeling of the data using a Standard Finite Normal Mixture (SFNM) distribution; (2) discriminatory dimensionality reduction and a visual data mining scheme with multi-schematic unsupervised learning processes; (3) probabilistic clustering through the Expectation-Maximization (EM) algorithm, with model selection validated by Minimum Description Length (MDL); (4) a hierarchical data exploration scheme. Reducing dimensionality enables visualization of the complete data set at the top level; the data set is then partitioned into sub-clusters that can in turn be visualized at lower levels and, if necessary, partitioned again.
A solution for analyzing and handling the singularity problem in the EM algorithm is proposed and implemented. Sub-level probabilistic dimensionality reduction and model selection are implemented to identify the most appropriate mixture sub-model for each sub-cluster. A clustering evaluation framework is proposed and implemented to quantitatively assess the robustness of the probabilistic clustering results. An unsupervised gene selection criterion is proposed and implemented to enable the discovery of unknown cancer subtypes.
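As a rough illustration of the probabilistic clustering step described above, the following minimal sketch fits a finite normal mixture by EM and picks the number of components with an MDL score. It is not the dissertation's implementation. It assumes one-dimensional data, a two-part MDL score of the form -log L + (p/2) log N, quantile-based initialization, and a simple variance floor (`var_floor`) as a stand-in for singularity handling; the abstract does not specify the author's actual solution. All function and parameter names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the data under the mixture."""
    total = 0.0
    for x in data:
        p = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(p + 1e-300)  # guard against log(0)
    return total


def fit_mixture_em(data, k, n_iter=100, var_floor=1e-3):
    """Fit a k-component 1-D normal mixture by EM (illustrative sketch)."""
    n = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    # Assumed initialization: means at evenly spaced quantiles, equal weights.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(dens) + 1e-300
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Assumed safeguard: the floor stops a component from collapsing
            # onto a single point (variance -> 0, likelihood -> infinity).
            variances[j] = max(var, var_floor)
    return weights, means, variances


def mdl_score(data, k, params):
    """Two-part MDL: -log L + (p / 2) * log N, with p = 3k - 1 free parameters."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 independent weights
    return -log_likelihood(data, *params) + 0.5 * n_params * math.log(len(data))


def select_model(data, max_k):
    """Return (best_k, scores), where scores maps each k to its MDL value."""
    scores = {k: mdl_score(data, k, fit_mixture_em(data, k)) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get), scores


# Synthetic demo data: two well-separated groups.
rng = random.Random(0)
demo = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(8.0, 1.0) for _ in range(150)]
best_k, scores = select_model(demo, 4)
print("selected number of components:", best_k)
```

In a hierarchical scheme of the kind described, the same fit-and-select step could be reapplied to each sub-cluster at the next level. Real mammogram or microarray data would need multivariate components and a dimensionality reduction step first.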
Keywords/Search Tags:Data, Gene, Proposed and implemented, Microarray, Selection