Chapter-3.4-BLAST-and-suffix-Array

Outline

Sequencing technologues.
Fastq and FASTQC
Sequence mapping algorithm:
- seed
- suffix Array
- Borrows-Wheeler transformation & LF mapping.
- BW Alignment
Alignment output: SAM and BED.

Suppose if we are growing vell of cancer in petridish we want to map with human genome

Read mapping

Mapping Hundreds of millions of reads back to the refrence genome is CPU and RAM intensive and slow
Most mappers allow ~2 mismatch within first 30bp (4^28 could still uniquely identify most 30bp sequence in a 3GB genome), slower when allowing indels.

seed

Bread DB Sequence into K-mer words (seed) and hash their locations to speed later searchers.

BLAST Algorithm steps

seed and extend paradigm
for each k-mer in query, find possible DB k-mer that matchers well with it.
for each DB seq with a high scoring word,try to extend it in both ends.
- from HSP (High Scoring Segment Pairs)
Keep only statistically significant HSPs:
- Based on the scores of aligning 2 random seqs.
Use Smith-Waterman* Algorithm to join the HSPs and get optimal alignemtn.

suffix Tree

Some what related to Binary Tree

A tree of all the Siffix of te refrence
- "BANANA" has suffixes : BANANA, ANANA, NANA,..., A
Used in alignemtn tools such as MUMmer
O(n) time to build.
- n = genome length.
O(m) time to search.
- m = query length

Suffix Array (in RNA Sequence)

Some what related to Binary Tree

The ith entry corresponds to the i^th smallest suffix.
Used in alignment tools such as STAR
O(n) time to build.
- n = genome length
O(mlogn) time to search
- Binary search
- m = query length
Index size is moderate
- ~15GB

Chapter 3.2 Fastq and Fastqc Chapter 3.5.1 Bwt and Lf Mapping