| The comparison of samples, or beta diversity, is one of the essential problems in ecological studies. Next-generation sequencing (NGS) technologies make it possible to obtain large amounts of metagenomic and metatranscriptomic short-read sequences across many microbial communities. These short reads are randomly sampled from different parts of the multiple genomes in a microbial community. De novo assembly of the short reads can be especially challenging because the number of genomes and their sequences are generally unknown and the coverage of each genome can be very low. Thus, traditional alignment-based sequence comparison methods cannot be used to compare microbial communities based on NGS read data. Alignment-free approaches based on k-tuple frequencies, on the other hand, have yielded promising results for the comparison of metagenomic samples. However, it is not known whether these approaches can be used to compare metatranscriptomic data from multiple microbial communities, or which dissimilarity measures perform best when used to cluster metatranscriptomic samples. We applied several beta diversity measures based on k-tuple frequencies to real metatranscriptomic datasets to evaluate their effectiveness for clustering metatranscriptomic samples: three d2-type dissimilarity measures, one dissimilarity measure from CVTree, one relative-entropy-based measure S2, and three classical p-norm distances. The results showed that the measure d2S achieves superior performance in clustering metatranscriptomic samples into different groups under different sequencing depths for both 454 and Illumina datasets, in recovering environmental gradients affecting microbial samples, and in classifying coexisting metagenomic and metatranscriptomic datasets, and that it is robust to sequencing errors. We also investigated the effects of tuple size and of the order of the background Markov model, and we built a software pipeline that implements all the analysis steps in this study. 
We then carried out further research by designing three experiments to investigate clustering behavior on similar microbial communities, on similar species, and on sequencing data from different platforms. The results indicate that RNA sequence samples are easier to separate than DNA sequence samples; the experiments also reveal that the sequence signature measures are highly sensitive to the sequencing platform and perform poorly on complex microbial samples. We also tried setting k = 30-40 to analyze the microbial samples and obtained some initial results. The k-tuple based sequence signature measures can effectively reveal major groups and gradient variation among metatranscriptomic samples from NGS reads. The d2S dissimilarity measure performs well in all application scenarios, and its performance is robust with respect to tuple size and the order of the Markov model. However, the sequence signature method still has limitations that need to be addressed, notably its high sensitivity to the sequencing platform and its poor performance when clustering complex microbial samples. |
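
To make the k-tuple based measures above concrete, the following is a minimal Python sketch, not the pipeline built in this study. It counts k-tuples over a set of reads, computes a p-norm distance between two frequency vectors, and computes a d2S-type dissimilarity. Several points are assumptions of this sketch: the function names (`ktuple_counts`, `frequency_vector`, `p_norm_distance`, `d2s`) are hypothetical, the d2S formula is the one commonly given in the alignment-free literature (centered counts, each term scaled by the square root of the summed squares), the background model is restricted to order 0 (independent bases estimated from each sample), and reverse complements are not pooled.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def ktuple_counts(reads, k):
    """Count k-tuples over all reads, skipping tuples with non-ACGT symbols."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if all(c in ALPHABET for c in word):
                counts[word] += 1
    return counts


def frequency_vector(counts, k):
    """Relative k-tuple frequencies in a fixed lexicographic word order."""
    total = sum(counts.values())
    words = ("".join(t) for t in product(ALPHABET, repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]


def p_norm_distance(f, g, p):
    """p-norm distance between two vectors; p = math.inf gives the max norm."""
    diffs = [abs(a - b) for a, b in zip(f, g)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def base_probs(reads):
    """Order-0 background model: base frequencies estimated from the sample."""
    counts = Counter(c for r in reads for c in r.upper() if c in ALPHABET)
    total = sum(counts.values())
    return {c: counts[c] / total for c in ALPHABET}


def d2s(reads_x, reads_y, k):
    """d2S-type dissimilarity under an order-0 background (sketch assumption)."""
    cx, cy = ktuple_counts(reads_x, k), ktuple_counts(reads_y, k)
    px, py = base_probs(reads_x), base_probs(reads_y)
    nx, ny = sum(cx.values()), sum(cy.values())
    num = sx = sy = 0.0
    for t in product(ALPHABET, repeat=k):
        w = "".join(t)
        # Centered counts: observed minus expected under the background model.
        x = cx[w] - nx * math.prod(px[c] for c in w)
        y = cy[w] - ny * math.prod(py[c] for c in w)
        denom = math.sqrt(x * x + y * y)
        if denom == 0.0:
            continue
        num += x * y / denom
        sx += x * x / denom
        sy += y * y / denom
    if sx == 0.0 or sy == 0.0:
        # Degenerate case; treating the correlation as 0 is a choice of this sketch.
        return 0.5
    return 0.5 * (1.0 - num / (math.sqrt(sx) * math.sqrt(sy)))
```

A pairwise matrix of `d2s` values over all samples could then be passed to any standard hierarchical clustering routine. A higher-order Markov background would replace the product of base probabilities with a product of transition probabilities estimated from each sample.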