Chapter-3.4-BLAST-and-suffix-Array
Outline
- Sequencing technologues.
- Fastq and FASTQC
- Sequence mapping algorithm:
- seed
- suffix Array
- Borrows-Wheeler transformation & LF mapping.
- BW Alignment
- Alignment output: SAM and BED.
Suppose if we are growing vell of cancer in petridish we want to map with human genome
Read mapping
- Mapping Hundreds of millions of reads back to the refrence genome is CPU and RAM intensive and slow
- Most mappers allow ~2 mismatch within first 30bp (4^28 could still uniquely identify most 30bp sequence in a 3GB genome), slower when allowing indels.
seed
- Bread DB Sequence into K-mer words (seed) and hash their locations to speed later searchers.
BLAST Algorithm steps
- seed and extend paradigm
- for each k-mer in query, find possible DB k-mer that matchers well with it.
- for each DB seq with a high scoring word,try to extend it in both ends.
- from HSP (High Scoring Segment Pairs)
- Keep only statistically significant HSPs:
- Based on the scores of aligning 2 random seqs.
- Use Smith-Waterman* Algorithm to join the HSPs and get optimal alignemtn.
suffix Tree
Some what related to Binary Tree
- A tree of all the Siffix of te refrence
- "BANANA" has suffixes : BANANA, ANANA, NANA,..., A
- Used in alignemtn tools such as MUMmer
- O(n) time to build.
- n = genome length.
- O(m) time to search.
- m = query length
Suffix Array (in RNA Sequence)
Some what related to Binary Tree
- The ith entry corresponds to the i^th smallest suffix.
- Used in alignment tools such as STAR
- O(n) time to build.
- n = genome length
- O(mlogn) time to search
- Binary search
- m = query length
- Index size is moderate
- ~15GB