综述k-mer counting方法

A review of k-mer counting methods

Posted by Xuan on February 14, 2020

A benchmark study of k-mer counting methods for high-throughput sequencing – a review paper

Catagories of k-mer counting methods

Ontology of k-mer counting approaches

  • Using the sorting approach

    sorting all k-mers extracted from each read; Thus k-mer frequencies can be easily counted because, after sorting, repeating k-mers lay at adjacent positions in the sorted list.
    sorting and compaction (SAC) -based algorithm ??
    aTurtle[used] / cTurtle
    Pattern Block Bloom Filter is a cache-friendly variant of the BLoom Filter, which has a very small cache miss ratio.

  • Using a hash table

    k-mer are stored as keys and their counts are stored as values.
    e.g. Jellyfish, lock-free hash table and Compare-and-swap(CAS) assembly instruction Jellyfish2, use an additional Bloom filter-based mode to remove all singleton k-mers. KAT

  • Using a Bloom Filter and its Variants and counting quotient filter

    majority of any genomic dataset is made up of single-frequency k-mers, which mainly attributed to sequencing errors. A bloom filter is a probabilistic data structure used for dynamic membership query lookup. BFCounter. Squeaker: uses a counting filter data structure(counting quotient filter[CQF])

  • Using the concept of super k-mer: minimizers and signatures

    MSPKmerCounter is the 1st tool to implement MSP(Minimum Substring Partitioning) for k-mer counting.
    KMC3.

Computation based on k-mer counts

  • 计算基因组🧬的大小

    把每个k-mer看作基因组中的一个碱基,通过k-mer分布 -> k-mer总数(T)和每个碱基平均出现的次数(u).
    基因组大小 = T/u

  • 重复序列

    ?? 理论认为, 但拷贝序列k-mer,出现在主峰1.6倍以后的概率非常低;认为该区间为重复k-mer.
    重复序列长度 = 重复k-mer序列总数/ u

  • 杂合率估算

1.为啥组装时,k要选择基数?

只有奇数,才能保证每个kmer序列的反向互补Kmer与自身是不同的,从而避免了正反链混淆。
如 :5-mer的 CGCGC,反向互补后是 GCGCG, 它们是不同的;这就不会像 4-mer,CGCG发现它反向互补后仍然是CGCG

Reference

为什么要进行k-mer分析