Bioinformatics
single-cell
Fundamentals

There are several types of single-cell (sc) sequencing technologies, each providing different insights into the molecular biology of individual cells. The sc* prefix refers to single-cell methods designed to study various aspects of cellular function, including gene expression, chromatin accessibility, epigenetics, and more. Here’s a list of common sc* methods:

1. scRNA-seq (Single-cell RNA sequencing)

  • Purpose: Measures gene expression by sequencing the transcriptome of individual cells.
  • Insight: Shows which genes are active in each cell, enabling identification of cell types and cell states and the study of gene expression heterogeneity across cell populations.

2. scATAC-seq (Single-cell Assay for Transposase-Accessible Chromatin using sequencing)

  • Purpose: Profiles chromatin accessibility at the single-cell level.
  • Insight: Identifies open chromatin regions where regulatory elements (promoters, enhancers) are located, providing information on gene regulation and epigenetic states.

3. scDNA-seq (Single-cell DNA sequencing)

  • Purpose: Sequences the DNA of individual cells.
  • Insight: Used to study genetic variations, mutations, copy number variations, and clonal evolution, often in cancer research.

4. scChIP-seq (Single-cell Chromatin Immunoprecipitation sequencing)

  • Purpose: Profiles the binding of DNA-associated proteins like histones and transcription factors at the single-cell level.
  • Insight: Provides epigenetic information by identifying histone modifications or transcription factor binding sites, revealing how gene expression is regulated.

5. scBS-seq (Single-cell Bisulfite sequencing)

  • Purpose: Measures DNA methylation at the single-cell level.
  • Insight: Provides insights into epigenetic modifications by detecting methylated cytosines, which influence gene expression and cell differentiation.

6. scCUT&Tag (Single-cell Cleavage Under Targets and Tagmentation)

  • Purpose: A targeted method to profile histone modifications and transcription factor binding at the single-cell level.
  • Insight: This method offers a more focused approach to studying epigenetic modifications in specific chromatin regions with reduced background noise compared to scChIP-seq.

7. scTCR-seq (Single-cell T-cell Receptor sequencing)

  • Purpose: Analyzes the diversity of T-cell receptors (TCRs) in individual T cells.
  • Insight: Used to study the immune system’s T-cell receptor repertoire, which is important for understanding immune responses, especially in cancer immunotherapy and infectious diseases.

8. scBCR-seq (Single-cell B-cell Receptor sequencing)

  • Purpose: Sequences the B-cell receptor (BCR) genes in individual B cells.
  • Insight: Provides information on the diversity of BCRs, helping to understand how the immune system recognizes antigens and produces antibodies.

9. scMS (Single-cell Mass Spectrometry)

  • Purpose: Measures the proteome of individual cells.
  • Insight: Enables the quantification of proteins in single cells, which is useful for studying protein expression and post-translational modifications.

10. scHi-C (Single-cell Hi-C sequencing)

  • Purpose: Captures chromatin conformation and 3D interactions in individual cells.
  • Insight: Provides information about chromatin organization, such as long-range interactions between different genomic regions, which can affect gene regulation.

11. scMET-seq (Single-cell Methylation sequencing)

  • Purpose: Profiles methylation patterns in individual cells.
  • Insight: Offers data on epigenetic regulation by mapping methylation at specific sites, allowing researchers to explore how DNA methylation influences gene expression and cellular differentiation.

12. scProteomics (Single-cell Proteomics)

  • Purpose: Measures protein levels and interactions at the single-cell level.
  • Insight: Provides direct information about protein abundance, modifications, and protein-protein interactions, giving a detailed view of cellular function beyond transcriptomics.

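13. CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing)

  • Purpose: Simultaneously measures RNA and cell-surface protein levels in individual cells, using oligonucleotide-tagged antibodies that are sequenced alongside the transcriptome.
  • Insight: Combines transcriptomics with targeted protein measurement at the single-cell level, offering a multi-omic view of gene expression and protein expression in the same cell. Protein coverage is limited to the antibody panel used.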

14. scRNA-seq + scATAC-seq (Multi-omic Integration)

  • Purpose: Simultaneously measures gene expression (RNA-seq) and chromatin accessibility (ATAC-seq) in the same cells.
  • Insight: Provides a more comprehensive understanding of how chromatin accessibility correlates with gene expression, revealing mechanisms of gene regulation.

Summary of sc* Technologies:

Each of these single-cell methods provides unique insights into cellular biology:

  • Transcriptomics (scRNA-seq): Studies gene expression.
  • Epigenomics (scATAC-seq, scChIP-seq, scBS-seq): Explores chromatin accessibility, DNA-protein interactions, and methylation.
  • Genomics (scDNA-seq): Investigates genetic variations and mutations.
  • Proteomics (scMS, CITE-seq): Focuses on proteins and their roles.
  • Immune Profiling (scTCR-seq, scBCR-seq): Analyzes immune cell receptors.
  • Chromatin Organization (scHi-C): Examines 3D chromatin interactions.

The choice of single-cell method depends on the research focus, whether it is gene expression, chromatin dynamics, DNA mutations, or protein quantification.

In RNA sequencing (RNA-seq), several types of data files are generated and used for different stages of analysis. Each file format plays a critical role in storing and interpreting sequencing data, ranging from raw sequence data to processed expression levels. Below is a detailed explanation of the key file types and formats commonly used in RNA-seq analysis, along with how they are useful for a PhD student.

1. FASTQ Files

  • Format: Text-based file that contains raw sequence reads generated from the sequencer. Each entry in the file spans four lines:

    1. Read Identifier: Begins with @ and describes the read (usually includes the instrument ID and read index).
    2. Sequence: The actual nucleotide sequence (A, T, C, G, or N for an uncalled base).
    3. Separator: A line beginning with +, optionally followed by a repeat of the read identifier.
    4. Quality String: One ASCII character per base encoding its Phred quality score, which indicates the probability of the base being called incorrectly.
  • Purpose: FASTQ files contain the raw sequencing data, and their quality must be assessed and cleaned (trimming low-quality sequences, removing adapters) before further analysis.

  • Usefulness for PhD students:

    • Quality Control: FastQC is often used to assess the quality of the sequencing data, giving insights into read quality, GC content, and sequence duplications.
    • Read Trimming: Tools like Trimmomatic or Cutadapt are used for trimming reads to remove low-quality sequences.
    • Initial Step: This is the starting point for all downstream RNA-seq analyses, making it crucial for data integrity.
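
The record layout and Phred encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for FastQC or a trimming tool: it assumes uncompressed input, exactly four lines per record, and the Phred+33 encoding used by current Illumina instruments. The reads and the function names (read_fastq, phred_scores, error_probability) are made up for the example.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


# Two made-up reads, purely for illustration.
example = StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGCC\n+\n5555\n")
records = list(read_fastq(example))

for read_id, seq, qual in records:
    scores = phred_scores(qual)
    print(read_id, seq, f"mean Q = {sum(scores) / len(scores):.1f}")
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call and a score of 30 to 1 in 1000, which is why Q20 and Q30 are common trimming and reporting thresholds.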

2. SAM/BAM Files

  • Format:

    • SAM (Sequence Alignment/Map): A text file that contains aligned sequencing reads to a reference genome.
    • BAM (Binary Alignment/Map): The binary (compressed) version of the SAM file. It is more efficient in terms of storage and speed.
  • Purpose: These files store the results of mapping reads to a reference genome or transcriptome using tools like STAR, HISAT2, or Bowtie2. Coordinate-sorted BAM files can be indexed for quick access during analysis.

  • Key Fields:

    • Read name: The identifier for each read.
    • Flag: Describes features of the alignment (e.g., if a read is paired-end, mapped, or unmapped).
    • Chromosome: The chromosome to which the read aligns.
    • Position: The position of the alignment on the chromosome.
    • CIGAR string: Describes how the read aligns with the reference (e.g., matches, insertions, deletions).
  • Usefulness for PhD students:

    • Alignment: BAM files are used to analyze how well the RNA-seq reads align to a reference genome.
    • Downstream Analysis: Tools like Samtools or Picard are used to manipulate these files, for example to sort and filter reads and to flag duplicates.
    • Visualization: BAM files can be visualized in genome browsers (like IGV or UCSC Genome Browser) to manually inspect the alignments, exon-intron structures, or gene fusions.
    • Quantification: They are essential for quantifying gene expression levels and discovering variants.
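
To make the Flag and CIGAR fields above concrete, here is a small sketch that decodes both from a single SAM line. It assumes a plain-text SAM record (BAM needs a library such as pysam or a call to samtools view first); the read shown is invented, and the helper names are illustrative rather than part of any library.

```python
import re

# Bit meanings of the SAM FLAG field, as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered by the alignment (M, D, N, = and X consume the reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


# An invented spliced read: 50 matched bases, a 200-base intron (N), 26 matched bases.
sam_line = "read1\t99\tchr1\t1000\t60\t50M200N26M\t=\t1300\t376\t*\t*"
fields = sam_line.split("\t")
flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]

print(decode_flag(flag))
print(parse_cigar(cigar))
print(f"{chrom}:{pos}-{pos + reference_span(cigar) - 1}")
```

The N operation is what distinguishes spliced RNA-seq alignments from typical DNA alignments: it marks a skipped stretch of the reference, usually an intron.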

3. GTF/GFF Files

  • Format:

    • GTF (Gene Transfer Format): Tab-delimited text file that contains gene annotations, such as the locations of genes, transcripts, exons, and coding sequences on a reference genome. Introns are usually implied by the gaps between exons rather than listed explicitly.
    • GFF (General Feature Format): Similar to GTF, but with a slightly different structure for specifying genomic features.
  • Purpose: GTF/GFF files define the gene models, including the structure of transcripts (exons, UTRs, coding sequences), and are crucial for accurately mapping RNA-seq data to specific genes and transcripts.

  • Usefulness for PhD students:

    • Gene Annotations: Provides the annotation of genomic elements, making it easier to interpret which regions of the genome correspond to exons, coding regions, and other functional elements.
    • Guides Alignment: These files help during alignment and quantification steps, as they define where genes are located in the genome.
    • Isoform Analysis: Important for detecting different isoforms (alternative splicing) and for gene expression analysis.
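
A GTF line has nine tab-separated columns, with the last column holding key "value" pairs such as gene_id and transcript_id. The sketch below parses one line under simplifying assumptions: attribute values contain no semicolons, and repeated keys (for example multiple tag entries) are overwritten rather than collected. The gene and transcript identifiers are placeholders, not real annotations.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine columns and parse the attribute column."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = cols

    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


def feature_length(record):
    """Length in bases; +1 because both start and end are included."""
    return record["end"] - record["start"] + 1


# Placeholder identifiers, for illustration only.
line = ('chr1\texample\texon\t11869\t12227\t.\t+\t.\t'
        'gene_id "GENE0001"; transcript_id "TX0001"; gene_name "EXAMPLE1";')
rec = parse_gtf_line(line)
print(rec["feature"], rec["attributes"]["gene_id"], feature_length(rec))
```

GFF3 uses the same nine columns but writes attributes as key=value pairs, so only the attribute-parsing loop would need to change.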

4. Count Matrix (Gene/Transcript Expression Table)

  • Format: Tab-delimited or CSV files where rows represent genes (or transcripts), and columns represent individual samples or cells.

    • Rows: Genes (or transcripts) in the dataset.
    • Columns: Samples (bulk RNA-seq) or cells (single-cell RNA-seq).
    • Values: The number of reads or fragments mapped to each gene or transcript, which represents expression levels.
  • Purpose: The count matrix is used in differential expression analysis or clustering, providing the raw or normalized counts for statistical analyses.

  • Usefulness for PhD students:

    • Differential Expression Analysis: Count matrices are input into tools like DESeq2, edgeR, or limma to identify genes that are differentially expressed between conditions or treatments.
    • Single-cell Analysis: For scRNA-seq, tools like Seurat or Scanpy use this matrix to cluster cells, identify marker genes, and infer cell states or types.
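    • Normalization: Normalization methods (e.g., CPM, TPM, RPKM, FPKM) adjust for sequencing depth and, in the case of TPM, RPKM, and FPKM, gene length. Count-based tools such as DESeq2 and edgeR expect raw counts as input and apply their own normalization.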
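
As a worked example of depth and length normalization, the sketch below computes CPM (counts per million) and TPM (transcripts per million) for a tiny made-up count matrix. The gene names, counts, and lengths are invented, and the dictionary layout is just one convenient assumption. Real analyses would normally rely on the normalization built into the chosen package.

```python
def cpm(counts):
    """Counts per million: scale each sample (column) so that it sums to 1e6."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [row[j] / totals[j] * 1e6 for j in range(n_samples)]
        for gene, row in counts.items()
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: divide by gene length in kb, then scale each sample to 1e6."""
    n_samples = len(next(iter(counts.values())))
    rates = {
        gene: [count / (lengths_bp[gene] / 1000) for count in row]
        for gene, row in counts.items()
    }
    totals = [sum(rate[j] for rate in rates.values()) for j in range(n_samples)]
    return {
        gene: [rate[j] / totals[j] * 1e6 for j in range(n_samples)]
        for gene, rate in rates.items()
    }


# Rows are genes, columns are two samples; sample 2 was sequenced twice as deeply.
counts = {"geneA": [10, 20], "geneB": [30, 60], "geneC": [60, 120]}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths_bp)
for gene in counts:
    print(gene, [round(v) for v in cpm_values[gene]], [round(v) for v in tpm_values[gene]])
```

Because the second sample is the first one at double depth, both samples end up with identical CPM and TPM values, which is exactly what depth normalization is meant to achieve.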

5. VCF Files (Variant Call Format)

  • Format: A text file that contains information on genetic variants (SNPs, indels) found during RNA-seq.

    • CHROM: Chromosome or contig name.
    • POS: Position of the variant.
    • ID: Variant identifier.
    • REF/ALT: Reference and alternative alleles.
    • QUAL: Quality score for the variant call.
  • Purpose: VCF files store information on mutations or genetic variations discovered during RNA-seq, which can reveal transcriptome-wide variants (e.g., in cancer studies or population genetics).

  • Usefulness for PhD students:

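    • Mutation Analysis: Useful for identifying SNVs and small indels in expressed genes. Splice variants and fusion genes are usually detected with dedicated tools working from the alignments rather than read from a standard VCF.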
    • Genotype-Phenotype Studies: VCF files can be integrated with phenotypic data to study the relationship between genetic variations and disease states.
    • Personalized Medicine: In cancer research, for instance, RNA-seq-derived VCFs can be used to tailor therapies based on detected mutations.
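
The fixed VCF columns listed above can be read with a short parser. This sketch handles only the first eight columns of a data line and ignores the header and per-sample genotype columns. It also treats every ALT entry as a literal allele, so symbolic alleles such as <DEL> are out of scope. The variants shown are invented.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = cols[:8]

    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            info_dict[key] = value if sep else True  # flags have no value

    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": None if variant_id == "." else variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternative alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


def is_snv(record):
    """True when REF and every ALT allele are single bases."""
    return len(record["ref"]) == 1 and all(len(a) == 1 for a in record["alt"])


# Invented variants, for illustration only.
lines = [
    "chr1\t12345\t.\tA\tT\t50\tPASS\tDP=35;AF=0.42",
    "chr1\t20000\t.\tAT\tA\t31.5\tPASS\tDP=18",
]
variants = [parse_vcf_line(line) for line in lines]
for v in variants:
    kind = "SNV" if is_snv(v) else "indel or other"
    print(v["chrom"], v["pos"], v["ref"], ">", ",".join(v["alt"]), kind)
```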

6. TSV/CSV Files (Metadata)

  • Format: Tab-separated (TSV) or comma-separated (CSV) files that contain metadata about the samples, such as experimental conditions, sample identifiers, cell types, or treatments.

  • Purpose: Used to annotate and group samples or cells based on experimental design, batch effects, or biological conditions.

  • Usefulness for PhD students:

    • Experiment Tracking: Metadata helps ensure that you are properly organizing your data by conditions (e.g., control vs. treated, different time points).
    • Batch Effect Analysis: Metadata is essential for correcting batch effects using tools like ComBat or Seurat’s integration methods.
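
A common first step with a metadata table is to check that it matches the columns of the count matrix and to group samples by condition or batch. The sketch below does both with the standard library. The column names (sample_id, condition, batch) and sample IDs are assumptions made for the example, not a required schema.

```python
import csv
from collections import defaultdict
from io import StringIO

# A made-up metadata table; in practice this would be read from a .tsv file.
metadata_tsv = (
    "sample_id\tcondition\tbatch\n"
    "S1\tcontrol\tA\n"
    "S2\ttreated\tA\n"
    "S3\tcontrol\tB\n"
    "S4\ttreated\tB\n"
)


def load_metadata(handle):
    """Read a TSV with a header row into {sample_id: row_dict}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample_id"]: row for row in reader}


def group_by(metadata, column):
    """Map each value of a metadata column to the list of matching sample IDs."""
    groups = defaultdict(list)
    for sample_id, row in metadata.items():
        groups[row[column]].append(sample_id)
    return dict(groups)


def check_samples(metadata, matrix_columns):
    """Return (samples missing from metadata, samples missing from the matrix)."""
    missing_in_metadata = [s for s in matrix_columns if s not in metadata]
    missing_in_matrix = [s for s in metadata if s not in matrix_columns]
    return missing_in_metadata, missing_in_matrix


metadata = load_metadata(StringIO(metadata_tsv))
print(group_by(metadata, "condition"))
print(group_by(metadata, "batch"))
print(check_samples(metadata, ["S1", "S2", "S3", "S5"]))
```

Here condition and batch are balanced (each batch contains one control and one treated sample), which is the situation in which batch correction is most reliable.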

7. MTX/Matrix File (Sparse Matrix)

  • Format: Used in single-cell RNA-seq, this is a sparse matrix format to store large but sparse data (most values are zeros due to low expression of many genes).

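    • MTX File: Contains the non-zero expression values in Matrix Market coordinate format; after a header, each line gives the row index, column index, and value of one non-zero entry.
    • Genes.tsv: List of genes corresponding to rows in the MTX file (named features.tsv in newer Cell Ranger output).
    • Barcodes.tsv: List of cell barcodes corresponding to columns in the MTX file.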
  • Purpose: Efficient storage of scRNA-seq data where only a small portion of the matrix contains non-zero values (due to the sparsity of gene expression in individual cells).

  • Usefulness for PhD students:

    • Single-cell Analysis: MTX files are common in scRNA-seq studies; they are produced by pipelines such as Cell Ranger and read by tools like Seurat and Scanpy.
    • Efficient Computation: It makes working with large datasets computationally feasible due to the sparse nature of the data.
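
To show how the three files fit together, here is a minimal reader for the Matrix Market coordinate layout. It assumes integer counts, an uncompressed file, and genes in rows with cells in columns. The gene names, barcodes, and counts are invented. In practice a library reader such as scipy.io.mmread or scanpy.read_10x_mtx would be used instead of hand-written parsing.

```python
from io import StringIO


def read_mtx(handle):
    """Read a Matrix Market coordinate file into (n_rows, n_cols, {(row, col): value})."""
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("expected a Matrix Market coordinate file")
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())

    entries = {}
    for line in handle:
        if not line.strip():
            continue
        row, col, value = line.split()
        # Indices in the file are 1-based; store them 0-based.
        entries[(int(row) - 1, int(col) - 1)] = int(value)
    if len(entries) != n_entries:
        raise ValueError("number of entries does not match the size line")
    return n_rows, n_cols, entries


# A made-up 3-gene by 2-cell matrix with three non-zero counts.
mtx_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "% rows are genes, columns are cell barcodes\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)
genes = ["GeneA", "GeneB", "GeneC"]   # stands in for genes.tsv / features.tsv
barcodes = ["AAAC-1", "AAAG-1"]       # stands in for barcodes.tsv

n_rows, n_cols, entries = read_mtx(StringIO(mtx_text))


def count_for(gene, barcode):
    """Look up one count; anything not stored is an implicit zero."""
    return entries.get((genes.index(gene), barcodes.index(barcode)), 0)


sparsity = 1 - len(entries) / (n_rows * n_cols)
print(count_for("GeneA", "AAAC-1"), count_for("GeneA", "AAAG-1"), f"sparsity = {sparsity:.2f}")
```

Only the three non-zero counts are stored, so storage grows with the number of detected transcripts rather than with genes times cells.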

8. H5AD Files

  • Format: The on-disk format of the AnnData object used by Scanpy (single-cell RNA-seq), storing annotated data in an HDF5-based structure.

  • Purpose: Stores both the count matrix and metadata (annotations of genes and cells), making it a convenient format for working with single-cell RNA-seq data.

  • Usefulness for PhD students:

    • Single-cell Data Storage: Ideal for storing and manipulating scRNA-seq data efficiently in a compact, accessible manner.
    • Integrated Data: Allows storing multiple types of annotations and embeddings (e.g., PCA, UMAP) in a single file.

Summary of File Types and Their Uses:

  • FASTQ: Raw sequencing data.
  • SAM/BAM: Aligned reads to a reference genome.
  • GTF/GFF: Gene annotations, defining gene structures.
  • Count Matrix: Quantified expression data, useful for differential expression analysis.
  • VCF: Genetic variants from RNA-seq.
  • TSV/CSV: Metadata for experimental design and grouping.
  • MTX: Sparse matrix storage for single-cell RNA-seq.
  • H5AD: Single-cell RNA-seq data storage in Scanpy.

For a PhD student, understanding and using these file formats effectively will allow you to conduct high-quality RNA-seq analysis, interpret biological data accurately, and publish robust research.

Here’s a comprehensive list of tools commonly used for different types of single-cell and bulk sequencing, categorized by sequencing technology:

1. Single-cell RNA-seq (scRNA-seq)

  • Data Processing:
    • Cell Ranger: Popular tool by 10x Genomics for aligning, filtering, and generating gene-barcode matrices from scRNA-seq data.
    • STARsolo: A tool that adds single-cell capabilities to the STAR aligner.
    • kallisto | bustools: Fast pseudo-alignment for single-cell RNA-seq.
  • Analysis:
    • Seurat: One of the most widely used R packages for analyzing scRNA-seq data, offering dimensionality reduction, clustering, and differential expression analysis.
    • Scanpy: Python-based tool designed for scalable analysis of scRNA-seq data, particularly useful for large datasets.
    • Monocle: Used for trajectory analysis to study cell differentiation paths.
    • SC3: An R package for unsupervised clustering of scRNA-seq data.
    • Velocyto: Estimates RNA velocity from spliced and unspliced transcripts to infer cell trajectories.

2. Single-cell ATAC-seq (scATAC-seq)

  • Data Processing:
    • Cell Ranger ATAC: A tool by 10x Genomics for processing scATAC-seq data.
    • SnapATAC: Tool designed to analyze scATAC-seq data, including clustering and trajectory inference.
    • ArchR: R-based tool that supports the analysis of large scATAC-seq datasets, including dimensionality reduction, clustering, and pseudo-bulk analyses.
    • cisTopic: Performs topic modeling for scATAC-seq data to discover regulatory regions.
    • Signac: An extension of Seurat for analyzing chromatin accessibility data from scATAC-seq.

3. Single-cell DNA-seq (scDNA-seq)

  • Data Processing:
    • Ginkgo: A web-based platform for analyzing and visualizing single-cell DNA copy number variation (CNV) data.
    • CONICSmat: Infers copy number variation from single-cell RNA-seq expression data.
    • VarTrix: A tool to extract variant information from aligned single-cell DNA sequencing data.
  • Analysis:
    • SCcaller: Calls single-nucleotide variants and short indels from single-cell DNA-seq data.
    • SCOPE: Detects copy number variations (CNVs) in single-cell DNA-seq.

4. Single-cell Epigenomics (e.g., scMethyl-seq)

  • Data Processing:
    • Bismark: For aligning bisulfite sequencing data (used in single-cell methylation).
    • methylKit: R package for analyzing methylation data, especially for differential methylation analysis.
    • SeSAMe: R package for preprocessing Illumina Infinium methylation array data; it targets arrays rather than single-cell bisulfite sequencing such as scBS-seq.
    • scbs: A tool for single-cell bisulfite sequencing data processing.
  • Analysis:
    • methylKit: For comparative methylation studies.
    • ChromHMM: For chromatin state modeling and epigenetic data analysis, also applicable to single-cell epigenomics.

5. Single-cell Protein Expression (CITE-seq)

  • Data Processing:
    • Cell Ranger (Feature Barcode analysis): Part of the Cell Ranger pipeline that counts antibody-derived tags alongside gene expression, integrating RNA and protein data.
    • CITE-seq-Count: Tool for counting protein (antibody) tags in CITE-seq data.
  • Analysis:
    • Seurat: Provides integration of RNA and protein data for CITE-seq.
    • totalVI: A probabilistic model (part of scvi-tools) for jointly analyzing CITE-seq protein and RNA data.

6. Spatial Transcriptomics (spRNA-seq)

  • Data Processing:
    • Space Ranger: From 10x Genomics, for processing Visium spatial transcriptomics data.
    • STutility: R package for spatial transcriptomics analysis.
    • SpatialDE: Python-based tool for detecting spatially variable genes.
  • Analysis:
    • Seurat: Also supports spatial transcriptomics, allowing integration of spatial and single-cell RNA-seq data.
    • Scanpy: Includes support for spatial transcriptomics data analysis.

7. Bulk RNA-seq

  • Data Processing:
    • STAR: A fast RNA-seq aligner that maps reads to a reference genome.
    • HISAT2: Efficient for aligning RNA-seq reads, especially for spliced reads.
    • Salmon: A lightweight method for quantifying transcript expression using fast quasi-mapping.
    • kallisto: For fast, lightweight transcript-level quantification.
  • Analysis:
    • DESeq2: For differential gene expression analysis from bulk RNA-seq.
    • edgeR: Another popular tool for differential expression analysis.
    • limma: R package for differential expression analysis, often used in conjunction with RNA-seq and microarray data.
    • Ballgown: For analyzing transcript assembly, expression, and differential expression.
    • DEXSeq: Specifically designed for analyzing differential exon usage in RNA-seq data.

8. Single-cell Multiome (Combined RNA and ATAC-seq)

  • Data Processing:
    • Cell Ranger ARC: Used for multiome data processing (combined scRNA-seq and scATAC-seq).
    • SnapATAC: Can be used for single-cell multiome analysis, especially focusing on ATAC data.
  • Analysis:
    • Seurat: Can handle both RNA and chromatin data when combined with Signac.
    • ArchR: Integrates multi-omic data, including RNA and ATAC from single-cell experiments.

9. Single-cell Immune Profiling (scVDJ-seq)

  • Data Processing:
    • Cell Ranger VDJ: For analyzing V(D)J recombination in single cells from 10x Genomics.
    • TraCeR: Tool for reconstructing T-cell receptor sequences.
    • BraCeR: For reconstructing B-cell receptor sequences.
  • Analysis:
    • VDJtools: Analyzes and compares immune repertoire data.
    • Immcantation: Framework for V(D)J recombination analysis and immune repertoire profiling.

10. Single-cell CRISPR Screens

  • Data Processing:
    • Cell Ranger (CRISPR Guide Capture): Cell Ranger's Feature Barcode analysis, for processing single-cell CRISPR screening data from 10x Genomics.
    • CROP-seq: A screening method in which the guide RNA is captured as part of the scRNA-seq readout, so guides are assigned to cells during standard scRNA-seq processing.
  • Analysis:
    • MAGeCK: For analyzing CRISPR screen data to identify essential genes; designed primarily for pooled screens.
    • CRISPResso: Used for analyzing CRISPR editing outcomes from amplicon sequencing data.

11. Single-cell RNA Velocity

  • Tools:
    • Velocyto: Python tool for RNA velocity estimation, detecting the dynamics of mRNA transcription.
    • scVelo: Builds on the RNA velocity approach introduced by Velocyto, with a more scalable and flexible framework in Python that includes a dynamical model of transcription.
    • CellRank: Adds probabilistic state transitions for cell fate mapping using RNA velocity.

12. Single-cell Pseudotime and Trajectory Inference

  • Tools:
    • Monocle: Pseudotime and trajectory inference from scRNA-seq data.
    • Slingshot: Provides robust pseudotime and trajectory inference in R.
    • SCORPIUS: Another tool for trajectory inference that uses dimensionality reduction and clustering.
    • TSCAN: Trajectory analysis tool that orders cells using clustering and a minimum spanning tree.

Key Recommendations for Starting:

  • For scRNA-seq, begin with Seurat or Scanpy for data analysis and Cell Ranger for data processing.
  • For scATAC-seq, use ArchR or Signac (extension of Seurat) for analysis, and Cell Ranger ATAC for processing.
  • For multi-omics, start with Cell Ranger ARC and integrate RNA/ATAC data using Seurat + Signac or ArchR.

Most of these tools are well documented, but they differ in maturity and maintenance status, so check each project's current documentation before building a pipeline around it.