Bioinformatics
STAT115-2020
Chapter 3.4 Blast and Suffix Array

Chapter-3.4-BLAST-and-suffix-Array

Outline

  • Sequencing technologues.
  • Fastq and FASTQC
  • Sequence mapping algorithm:
    • seed
    • suffix Array
    • Borrows-Wheeler transformation & LF mapping.
    • BW Alignment
  • Alignment output: SAM and BED.

Suppose if we are growing vell of cancer in petridish we want to map with human genome

Read mapping

  • Mapping Hundreds of millions of reads back to the refrence genome is CPU and RAM intensive and slow
  • Most mappers allow ~2 mismatch within first 30bp (4^28 could still uniquely identify most 30bp sequence in a 3GB genome), slower when allowing indels.

seed

  • Bread DB Sequence into K-mer words (seed) and hash their locations to speed later searchers.

BLAST Algorithm steps

  • seed and extend paradigm
  • for each k-mer in query, find possible DB k-mer that matchers well with it.
  • for each DB seq with a high scoring word,try to extend it in both ends.
    • from HSP (High Scoring Segment Pairs)
  • Keep only statistically significant HSPs:
    • Based on the scores of aligning 2 random seqs.
  • Use Smith-Waterman* Algorithm to join the HSPs and get optimal alignemtn.

suffix Tree

Some what related to Binary Tree

  • A tree of all the Siffix of te refrence
    • "BANANA" has suffixes : BANANA, ANANA, NANA,..., A
  • Used in alignemtn tools such as MUMmer
  • O(n) time to build.
    • n = genome length.
  • O(m) time to search.
    • m = query length

Suffix Array (in RNA Sequence)

Some what related to Binary Tree

  • The ith entry corresponds to the i^th smallest suffix.
  • Used in alignment tools such as STAR
  • O(n) time to build.
    • n = genome length
  • O(mlogn) time to search
    • Binary search
    • m = query length
  • Index size is moderate
    • ~15GB