Fundamentals - Rudra Joshi

There are several types of single-cell (sc) sequencing technologies, each providing different insights into the molecular biology of individual cells. The sc* prefix refers to single-cell methods designed to study various aspects of cellular function, including gene expression, chromatin accessibility, epigenetics, and more. Here’s a list of common sc* methods:

1. scRNA-seq (Single-cell RNA sequencing)

Purpose: Measures gene expression by sequencing the transcriptome of individual cells.
Insight: Shows which genes are active in each cell, enabling the identification of cell types and cell states and the study of gene expression heterogeneity across cell populations.

2. scATAC-seq (Single-cell Assay for Transposase-Accessible Chromatin using sequencing)

Purpose: Profiles chromatin accessibility at the single-cell level.
Insight: Identifies open chromatin regions where regulatory elements (promoters, enhancers) are located, providing information on gene regulation and epigenetic states.

3. scDNA-seq (Single-cell DNA sequencing)

Purpose: Sequences the DNA of individual cells.
Insight: Used to study genetic variations, mutations, copy number variations, and clonal evolution, often in cancer research.

4. scChIP-seq (Single-cell Chromatin Immunoprecipitation sequencing)

Purpose: Profiles the binding of DNA-associated proteins like histones and transcription factors at the single-cell level.
Insight: Provides epigenetic information by identifying histone modifications or transcription factor binding sites, revealing how gene expression is regulated.

5. scBS-seq (Single-cell Bisulfite sequencing)

Purpose: Measures DNA methylation at the single-cell level.
Insight: Provides insights into epigenetic modifications by detecting methylated cytosines, which influence gene expression and cell differentiation.

6. scCUT&Tag (Single-cell Cleavage Under Targets and Tagmentation)

Purpose: A targeted method to profile histone modifications and transcription factor binding at the single-cell level.
Insight: This method offers a more focused approach to studying epigenetic modifications in specific chromatin regions with reduced background noise compared to scChIP-seq.

7. scTCR-seq (Single-cell T-cell Receptor sequencing)

Purpose: Analyzes the diversity of T-cell receptors (TCRs) in individual T cells.
Insight: Used to study the immune system’s T-cell receptor repertoire, which is important for understanding immune responses, especially in cancer immunotherapy and infectious diseases.

8. scBCR-seq (Single-cell B-cell Receptor sequencing)

Purpose: Sequences the B-cell receptor (BCR) genes in individual B cells.
Insight: Provides information on the diversity of BCRs, helping to understand how the immune system recognizes antigens and produces antibodies.

9. scMS (Single-cell Mass Spectrometry)

Purpose: Measures the proteome of individual cells.
Insight: Enables the quantification of proteins in single cells, which is useful for studying protein expression and post-translational modifications.

10. scHi-C (Single-cell Hi-C sequencing)

Purpose: Captures chromatin conformation and 3D interactions in individual cells.
Insight: Provides information about chromatin organization, such as long-range interactions between different genomic regions, which can affect gene regulation.

11. scMET-seq (Single-cell Methylation sequencing)

Purpose: Profiles methylation patterns in individual cells.
Insight: Offers data on epigenetic regulation by mapping methylation at specific sites, allowing researchers to explore how DNA methylation influences gene expression and cellular differentiation.

12. scProteomics (Single-cell Proteomics)

Purpose: Measures protein levels and interactions at the single-cell level.
Insight: Provides direct information about protein abundance, modifications, and protein-protein interactions, giving a detailed view of cellular function beyond transcriptomics.

13. CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing)

Purpose: Simultaneously measures RNA and cell-surface protein levels in individual cells, using oligonucleotide-tagged antibodies.
Insight: Combines transcriptomics with targeted protein measurement at the single-cell level, offering a multi-omic view of gene expression and protein expression in the same cell.

14. scRNA-seq + scATAC-seq (Multi-omic Integration)

Purpose: Simultaneously measures gene expression (RNA-seq) and chromatin accessibility (ATAC-seq) in the same cells.
Insight: Provides a more comprehensive understanding of how chromatin accessibility correlates with gene expression, revealing mechanisms of gene regulation.

Summary of sc* Technologies:

Each of these single-cell methods provides unique insights into cellular biology:

Transcriptomics (scRNA-seq): Studies gene expression.
Epigenomics (scATAC-seq, scChIP-seq, scBS-seq): Explores chromatin accessibility, DNA-protein interactions, and methylation.
Genomics (scDNA-seq): Investigates genetic variations and mutations.
Proteomics (scMS, CITE-seq): Focuses on proteins and their roles.
Immune Profiling (scTCR-seq, scBCR-seq): Analyzes immune cell receptors.
Chromatin Organization (scHi-C): Examines 3D chromatin interactions.

The choice of single-cell method depends on the research focus, whether it is gene expression, chromatin dynamics, DNA mutations, or protein quantification.

In RNA sequencing (RNA-seq), several types of data files are generated and used for different stages of analysis. Each file format plays a critical role in storing and interpreting sequencing data, ranging from raw sequence data to processed expression levels. Below is a detailed explanation of the key file types and formats commonly used in RNA-seq analysis, along with how they are useful for a PhD student.

1. FASTQ Files

Format: Text-based file that contains the raw sequence reads generated by the sequencer. Each entry consists of four lines:
1. Read Identifier: Begins with "@" and describes the read (usually includes the instrument ID and read index).
2. Sequence: The actual nucleotide sequence (A, T, C, G, or N for an uncalled base).
3. Separator: A line beginning with "+", optionally followed by a repeat of the identifier.
4. Quality Scores: One ASCII-encoded Phred quality score per base, indicating the probability that the base was called incorrectly.
Purpose: FASTQ files contain the raw sequencing data, and their quality must be assessed and cleaned (trimming low-quality sequences, removing adapters) before further analysis.
Usefulness for PhD students:
- Quality Control: FastQC is often used to assess the quality of the sequencing data, giving insights into read quality, GC content, and sequence duplications.
- Read Trimming: Tools like Trimmomatic or Cutadapt are used to trim low-quality bases and remove adapter sequences.
- Initial Step: This is the starting point for all downstream RNA-seq analyses, making it crucial for data integrity.
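
The record layout and the Phred encoding described above can be illustrated with a minimal sketch. It assumes the common Phred+33 offset and a well-formed FASTQ with exactly four lines per record; the function names and the example read are hypothetical and only serve the illustration.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, sep, qual = record
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Hypothetical example read, for illustration only.
example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call and a score of 30 to 1 in 1000, which is why trimming thresholds are often expressed in these units.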

2. SAM/BAM Files

Format:
- SAM (Sequence Alignment/Map): A text file that contains sequencing reads aligned to a reference genome.
- BAM (Binary Alignment/Map): The binary (compressed) version of the SAM file. It is more efficient in terms of storage and speed.
Purpose: These files store the results of mapping reads to a reference genome or transcriptome using tools like STAR, HISAT2, or Bowtie2. Coordinate-sorted BAM files can be indexed for quick access to specific regions during analysis.
Key Fields:
- Read name: The identifier for each read.
- Flag: Describes features of the alignment (e.g., if a read is paired-end, mapped, or unmapped).
- Reference name: The chromosome or contig to which the read aligns.
- Position: The 1-based leftmost position of the alignment on the reference.
- CIGAR string: Describes how the read aligns with the reference (e.g., matches, insertions, deletions, and skipped regions such as introns in spliced RNA-seq reads).
Usefulness for PhD students:
- Alignment: BAM files are used to analyze how well the RNA-seq reads align to a reference genome.
- Downstream Analysis: Tools like Samtools or Picard are used to manipulate these files, for example to sort them, filter reads, and flag duplicates.
- Visualization: BAM files can be visualized in genome browsers (like IGV or UCSC Genome Browser) to manually inspect the alignments, exon-intron structures, or gene fusions.
- Quantification: They are essential for quantifying gene expression levels and discovering variants.
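
To make the key fields above concrete, here is a minimal sketch that parses one SAM text line, decodes the bitwise flag, and splits the CIGAR string. The flag bit meanings follow the SAM specification; the function names and the example alignment are hypothetical. In practice a library such as pysam would normally be used instead of hand parsing.

```python
import re

# Bit meanings of the SAM FLAG field.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def parse_sam_line(line):
    """Parse the first six mandatory fields of a SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": parse_cigar(fields[5]),
    }


# Hypothetical spliced read: 50 matched bases, a 1000-base intron (N), 50 more.
line = "read1\t99\tchr1\t100\t60\t50M1000N50M\t=\t300\t1250\t*\t*"
record = parse_sam_line(line)
print(record["rname"], record["pos"], decode_flag(record["flag"]), record["cigar"])
```

The `N` operation in the example is what distinguishes spliced RNA-seq alignments from typical DNA alignments: it marks reference bases skipped by the read, usually an intron.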

3. GTF/GFF Files

Format:
- GTF (Gene Transfer Format): Tab-delimited text file that contains gene annotations, such as the locations of genes, transcripts, exons, and coding sequences on a reference genome. Introns are usually implied by the gaps between exons rather than listed explicitly.
- GFF (General Feature Format): Similar to GTF, but with a different syntax for the attributes column (GFF3 uses key=value pairs).
Purpose: GTF/GFF files define the gene models, including the structure of transcripts (exons, UTRs, coding sequences), and are crucial for accurately assigning RNA-seq reads to specific genes and transcripts.
Usefulness for PhD students:
- Gene Annotations: Provides the annotation of genomic elements, making it easier to interpret which regions of the genome correspond to exons, coding regions, and other functional elements.
- Guides Alignment: These files help during alignment and quantification steps, as they define where genes are located in the genome.
- Isoform Analysis: Important for detecting different isoforms (alternative splicing) and for gene expression analysis.
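
As an illustration of the nine-column layout, the following minimal sketch parses a single GTF line. It assumes quoted attribute values (as in `gene_id "..."`) and keeps only the last value if a tag is repeated; the identifiers in the example line are hypothetical.

```python
import re

# Matches attribute pairs of the form: tag "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Parse one GTF feature line into a dict; comments and blanks give None."""
    if line.startswith("#") or not line.strip():
        return None
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": dict(ATTR_RE.findall(attributes)),
    }


# Hypothetical exon record, for illustration only.
line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
exon = parse_gtf_line(line)
exon_length = exon["end"] - exon["start"] + 1  # both coordinates are inclusive
print(exon["attributes"]["gene_id"], exon["feature"], exon_length)
```

Because GTF coordinates are 1-based and inclusive, feature length is `end - start + 1`; this differs from 0-based half-open formats such as BED.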

4. Count Matrix (Gene/Transcript Expression Table)

Format: Tab-delimited or CSV files where rows represent genes (or transcripts), and columns represent individual samples or cells.
- Rows: Genes (or transcripts) in the dataset.
- Columns: Samples (bulk RNA-seq) or cells (single-cell RNA-seq).
- Values: The number of reads or fragments mapped to each gene or transcript, which represents expression levels.
Purpose: The count matrix is used in differential expression analysis or clustering, providing the raw or normalized counts for statistical analyses.
Usefulness for PhD students:
- Differential Expression Analysis: Count matrices are input into tools like DESeq2, edgeR, or limma to identify genes that are differentially expressed between conditions or treatments.
- Single-cell Analysis: For scRNA-seq, tools like Seurat or Scanpy use this matrix to cluster cells, identify marker genes, and infer cell states or types.
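- Normalization: Normalization methods (e.g., TPM, RPKM, FPKM) can be applied to adjust for sequencing depth and gene length, which is useful for comparing and visualizing expression levels. Note that DESeq2 and edgeR expect raw (unnormalized) counts as input and apply their own normalization.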
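
The difference between depth-only and depth-plus-length normalization can be sketched for a single sample as follows. This is a minimal illustration in plain Python, assuming a non-zero total count; the gene names, counts, and lengths are hypothetical.

```python
def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts.values())
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: divide by length in kb, then scale to 1e6."""
    rate = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    scale = sum(rate.values())
    return {gene: r / scale * 1e6 for gene, r in rate.items()}


# Hypothetical counts for one sample: geneB is three times longer than geneA.
counts = {"geneA": 100, "geneB": 300}
lengths_bp = {"geneA": 1000, "geneB": 3000}

print(cpm(counts))              # geneB looks three times higher
print(tpm(counts, lengths_bp))  # equal once gene length is accounted for
```

In this example the longer gene collects three times as many reads, so it appears more highly expressed in CPM, while TPM assigns both genes the same value.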

5. VCF Files (Variant Call Format)

Format: A text file that contains information on genetic variants (SNPs, indels) called from RNA-seq reads.
- CHROM: Chromosome or contig name.
- POS: 1-based position of the variant.
- ID: Variant identifier.
- REF/ALT: Reference and alternative alleles.
- QUAL: Quality score for the variant call.
Purpose: VCF files store information on mutations or genetic variations discovered during RNA-seq, which can reveal transcriptome-wide variants (e.g., in cancer studies or population genetics).
Usefulness for PhD students:
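- Mutation Analysis: Useful for identifying expressed SNVs and small indels, including novel mutations. Splice variants and fusion genes are usually detected with dedicated tools rather than read from a standard variant call file.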
- Genotype-Phenotype Studies: VCF files can be integrated with phenotypic data to study the relationship between genetic variations and disease states.
- Personalized Medicine: In cancer research, for instance, RNA-seq-derived VCFs can be used to tailor therapies based on detected mutations.
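
The fixed columns listed above can be read with a few lines of code. The sketch below parses one VCF data line and gives a rough allele classification; it assumes a well-formed, tab-separated line, and the function names and example variant are hypothetical. For real analyses a dedicated parser (for example pysam or cyvcf2) is the usual choice.

```python
def parse_vcf_line(line):
    """Parse the fixed columns of one VCF data line; header lines give None."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt = fields[:7]
    return {
        "chrom": chrom,
        "pos": int(pos),                     # 1-based position
        "id": None if vid == "." else vid,   # "." means missing
        "ref": ref,
        "alt": alt.split(","),               # several ALT alleles are allowed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
    }


def classify_allele(ref, alt):
    """Rough classification of one REF/ALT pair (a simplification)."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "symbolic_or_breakend"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "indel"


# Hypothetical variant line, for illustration only.
line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\t.\n"
variant = parse_vcf_line(line)
print(variant["chrom"], variant["pos"],
      [classify_allele(variant["ref"], a) for a in variant["alt"]])
```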

6. TSV/CSV Files (Metadata)

Format: Tab-separated (TSV) or comma-separated (CSV) files that contain metadata about the samples, such as experimental conditions, sample identifiers, cell types, or treatments.
Purpose: Used to annotate and group samples or cells based on experimental design, batch effects, or biological conditions.
Usefulness for PhD students:
- Experiment Tracking: Metadata helps ensure that you are properly organizing your data by conditions (e.g., control vs. treated, different time points).
- Batch Effect Analysis: Metadata is essential for correcting batch effects using tools like ComBat or Seurat’s batch correction methods.
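
A sample sheet of this kind can be read and grouped with the standard library alone. The sketch below is a minimal example; the column names (`sample_id`, `condition`, `batch`) and the sample values are assumptions made for illustration, not a required layout.

```python
import csv
import io
from collections import defaultdict


def group_samples(rows, key, id_column="sample_id"):
    """Group sample IDs by a metadata column (e.g. condition or batch)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[id_column])
    return dict(groups)


# Hypothetical sample sheet; a real one would be opened from a file instead.
sheet = io.StringIO(
    "sample_id,condition,batch\n"
    "S1,control,b1\n"
    "S2,treated,b1\n"
    "S3,control,b2\n"
    "S4,treated,b2\n"
)
rows = list(csv.DictReader(sheet))  # pass delimiter="\t" for TSV files

by_condition = group_samples(rows, "condition")
by_batch = group_samples(rows, "batch")
print(by_condition)
print(by_batch)
```

Grouping by both condition and batch is a quick way to check the design: here every batch contains both conditions, so batch and condition are not confounded.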

7. MTX/Matrix File (Sparse Matrix)

Format: Used in single-cell RNA-seq, this is a sparse matrix format to store large but sparse data (most values are zeros due to low expression of many genes).
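- MTX File (matrix.mtx): Lists the non-zero expression values as (row, column, value) entries in Matrix Market format.
- genes.tsv (features.tsv in newer Cell Ranger versions): List of genes corresponding to rows in the MTX file.
- barcodes.tsv: List of cell barcodes corresponding to columns in the MTX file.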
Purpose: Efficient storage of scRNA-seq data where only a small portion of the matrix contains non-zero values (due to the sparsity of gene expression in individual cells).
Usefulness for PhD students:
- Single-cell Analysis: MTX files are common in scRNA-seq studies; they are written by pipelines such as Cell Ranger and read by tools like Seurat and Scanpy for single-cell analysis.
- Efficient Computation: It makes working with large datasets computationally feasible due to the sparse nature of the data.
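
To show how the three files fit together, here is a minimal sketch that parses Matrix Market coordinate text into a dictionary of non-zero entries. It assumes the plain "coordinate" layout (comment lines starting with `%`, one size line, then 1-based `row column value` triples); the gene names, barcodes, and values are hypothetical. In practice a library reader such as `scipy.io.mmread` would typically be used.

```python
def read_mtx(lines):
    """Parse Matrix Market coordinate text into (shape, {(row, col): value}).

    Indices in the file are 1-based; they are converted to 0-based here.
    """
    body = [ln for ln in lines if ln.strip() and not ln.startswith("%")]
    n_rows, n_cols, n_nonzero = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        i, j, value = ln.split()
        entries[(int(i) - 1, int(j) - 1)] = float(value)
    if len(entries) != n_nonzero:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


# Hypothetical 3-gene x 2-cell matrix with three non-zero counts.
example_mtx = [
    "%%MatrixMarket matrix coordinate integer general\n",
    "3 2 3\n",
    "1 1 5\n",
    "3 1 2\n",
    "2 2 7\n",
]
genes = ["geneA", "geneB", "geneC"]   # rows, as listed in genes.tsv
barcodes = ["cell1", "cell2"]         # columns, as listed in barcodes.tsv

shape, entries = read_mtx(example_mtx)
for (g, c), value in sorted(entries.items()):
    print(genes[g], barcodes[c], value)

# Entries that are not stored are zeros.
print(entries.get((1, 0), 0.0))
```

Only three of the six possible values are stored in this example; in real scRNA-seq matrices the stored fraction is usually far smaller, which is where the savings come from.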

8. H5AD Files

Format: The HDF5-based file format of the AnnData object, used by Scanpy to store annotated single-cell data.
Purpose: Stores both the count matrix and metadata (annotations of genes and cells), making it a convenient format for working with single-cell RNA-seq data.
Usefulness for PhD students:
- Single-cell Data Storage: Ideal for storing and manipulating scRNA-seq data efficiently in a compact, accessible manner.
- Integrated Data: Allows storing multiple types of annotations and embeddings (e.g., PCA, UMAP) in a single file.

Summary of File Types and Their Uses:

FASTQ: Raw sequencing data.
SAM/BAM: Aligned reads to a reference genome.
GTF/GFF: Gene annotations, defining gene structures.
Count Matrix: Quantified expression data, useful for differential expression analysis.
VCF: Genetic variants from RNA-seq.
TSV/CSV: Metadata for experimental design and grouping.
MTX: Sparse matrix storage for single-cell RNA-seq.
H5AD: Single-cell RNA-seq data storage in Scanpy.

For a PhD student, understanding and using these file formats effectively will allow you to conduct high-quality RNA-seq analysis, interpret biological data accurately, and publish robust research.

Here’s a comprehensive list of tools commonly used for different types of single-cell and bulk sequencing, categorized by sequencing technology:

1. Single-cell RNA-seq (scRNA-seq)

Data Processing:
- Cell Ranger: Popular tool by 10X Genomics for aligning, filtering, and generating gene-barcode matrices from scRNA-seq data.
- STARsolo: A tool that adds single-cell capabilities to the STAR aligner.
- kallisto | bustools: Fast pseudo-alignment for single-cell RNA-seq.
Analysis:
- Seurat: One of the most widely used R packages for analyzing scRNA-seq data, offering dimensionality reduction, clustering, and differential expression analysis.
- Scanpy: Python-based tool designed for scalable analysis of scRNA-seq data, particularly useful for large datasets.
- Monocle: Used for trajectory analysis to study cell differentiation paths.
- SC3: An R package for unsupervised clustering of scRNA-seq data.
- Velocyto: Estimates RNA velocity from spliced and unspliced transcripts to infer cell trajectories.

2. Single-cell ATAC-seq (scATAC-seq)

Data Processing:
- Cell Ranger ATAC: A tool by 10X Genomics for processing scATAC-seq data.
Analysis:
- SnapATAC: Tool designed to analyze scATAC-seq data, including clustering and trajectory inference.
- ArchR: R-based tool that supports the analysis of large scATAC-seq datasets, including dimensionality reduction, clustering, and pseudo-bulk analyses.
- cisTopic: Performs topic modeling for scATAC-seq data to discover regulatory regions.
- Signac: An extension of Seurat for analyzing chromatin accessibility data from scATAC-seq.

3. Single-cell DNA-seq (scDNA-seq)

Data Processing:
- Ginkgo: A web-based platform for analyzing and visualizing single-cell DNA copy number variation (CNV) data.
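- CONICSmat: Infers copy number variation (CNV) from single-cell gene expression (scRNA-seq) data, which makes it a complement to DNA-based CNV calling rather than a scDNA-seq tool.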
- VarTrix: A tool to extract variant information from aligned single-cell DNA sequencing data.
Analysis:
- SCcaller: Identifies mutations (single-nucleotide variants and short indels) in single-cell DNA-seq data.
- SCOPE: Detects copy number variations (CNVs) in single-cell DNA-seq.

4. Single-cell Epigenomics (e.g., scMethyl-seq)

Data Processing:
- Bismark: For aligning bisulfite sequencing data (used in single-cell methylation).
- methylKit: R package for analyzing methylation data, especially for differential methylation analysis.
- SeSAMe: R package for preprocessing Illumina Infinium DNA methylation array data; it targets arrays rather than single-cell bisulfite sequencing such as scBS-seq.
- scbs: A tool for single-cell bisulfite sequencing data processing.
Analysis:
- methylKit: For comparative methylation studies.
- ChromHMM: For chromatin state modeling and epigenetic data analysis, also applicable to single-cell epigenomics.

5. Single-cell Protein Expression (CITE-seq)

Data Processing:
- Cell Ranger (Feature Barcode analysis): The standard Cell Ranger pipeline counts antibody-derived tags alongside gene expression, integrating RNA and protein data.
- CITE-seq-count: Tool for counting protein (antibody) tags in CITE-seq data.
Analysis:
- Seurat: Provides integration of RNA and protein data for CITE-seq.
- totalVI: A probabilistic model (part of scvi-tools) for jointly analyzing CITE-seq protein and RNA data.

6. Spatial Transcriptomics (spRNA-seq)

Data Processing:
- Space Ranger: From 10x Genomics, for processing spatial transcriptomics data.
- STUtility: R package for spatial transcriptomics analysis.
- SpatialDE: Python-based tool for detecting spatially variable genes.
Analysis:
- Seurat: Also supports spatial transcriptomics, allowing integration of spatial and single-cell RNA-seq data.
- Scanpy: Includes support for spatial transcriptomics data analysis.

7. Bulk RNA-seq

Data Processing:
- STAR: A fast RNA-seq aligner that maps reads to a reference genome.
- HISAT2: Efficient for aligning RNA-seq reads, especially for spliced reads.
- Salmon: A lightweight method for quantifying transcript expression using fast quasi-mapping.
- kallisto: For fast, lightweight transcript-level quantification.
Analysis:
- DESeq2: For differential gene expression analysis from bulk RNA-seq.
- edgeR: Another popular tool for differential expression analysis.
- limma: R package for differential expression analysis, often used in conjunction with RNA-seq and microarray data.
- Ballgown: For analyzing transcript assembly, expression, and differential expression.
- DEXSeq: Specifically designed for analyzing differential exon usage in RNA-seq data.

8. Single-cell Multiome (Combined RNA and ATAC-seq)

Data Processing:
- Cell Ranger ARC: Used for multiome data processing (combined scRNA-seq and scATAC-seq).
- SnapATAC: Can be used for single-cell multiome analysis, especially focusing on ATAC data.
Analysis:
- Seurat: Can handle both RNA and chromatin data when combined with Signac.
- ArchR: Integrates multi-omic data, including RNA and ATAC from single-cell experiments.

9. Single-cell Immune Profiling (scVDJ-seq)

Data Processing:
- Cell Ranger VDJ: For analyzing V(D)J recombination in single cells from 10X Genomics.
- TraCeR: Tool for reconstructing T-cell receptor sequences.
- BraCeR: For reconstructing B-cell receptor sequences.
Analysis:
- VDJtools: Analyzes and compares immune repertoire data.
- Immcantation: Framework for V(D)J recombination analysis and immune repertoire profiling.

10. Single-cell CRISPR Screens

Data Processing:
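- Cell Ranger (CRISPR Guide Capture): Cell Ranger's Feature Barcode analysis processes single-cell CRISPR screening data from 10x Genomics by assigning guide RNAs to cells.
- CROP-seq: An experimental method that makes guide RNAs detectable in scRNA-seq readouts; its data are processed and analyzed with scRNA-seq pipelines plus a guide-assignment step.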
Analysis:
- MAGeCK: For analyzing CRISPR screen data to identify essential genes.
- CRISPResso: Used for analyzing CRISPR editing outcomes (e.g., indels at the target site) from deep sequencing of edited loci.

11. Single-cell RNA Velocity

Tools:
- Velocyto: Python tool for RNA velocity estimation, detecting the dynamics of mRNA transcription.
- scVelo: An extension of Velocyto with a more scalable and flexible framework in Python.
- CellRank: Adds probabilistic state transitions for cell fate mapping using RNA velocity.

12. Single-cell Pseudotime and Trajectory Inference

Tools:
- Monocle: Pseudotime and trajectory inference from scRNA-seq data.
- Slingshot: Provides robust pseudotime and trajectory inference in R.
- SCORPIUS: Another tool for trajectory inference that uses dimensionality reduction and clustering.
- TSCAN: Cluster-based trajectory analysis tool that orders cells along a minimum spanning tree.

Key Recommendations for Starting:

For scRNA-seq, begin with Seurat or Scanpy for data analysis and Cell Ranger for data processing.
For scATAC-seq, use ArchR or Signac (extension of Seurat) for analysis, and Cell Ranger ATAC for processing.
For multi-omics, start with Cell Ranger ARC and integrate RNA/ATAC data using Seurat + Signac or ArchR.

Most of these tools have active user communities and published documentation, which makes them reasonable starting points for the corresponding types of single-cell analysis.
