Single-Cell Clustering
Introduction
This dataset contains Single-cell RNA sequencing data coming from FACS sorted cells, sequenced using the SMART-seq2 protocol for the library construction. These cells come from fluorescent transgenic zebrafish lines that label distinct blood cell types. The aim is to study the hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution.
Dataset description
Organism: Danio rerio
Instrument: Illumina NextSeq 500
Library construction: SMART-seq2
Layout: Paired-end. 38 pb / read.
Number of cells: 246
Publication
Tang, Q., Iyer, S., Lobbardi, R., Moore, J., Chen, H., & Lareau, C. et al. (2017). Dissecting hematopoietic and renal cell heterogeneity in adult zebrafish at single-cell resolution using RNA sequencing. Journal Of Experimental Medicine, 214(10), 2875-2887. https://doi.org/10.1084/jem.20170976
Recent advances in single-cell, transcriptomic profiling have provided unprecedented access to investigate cell heterogeneity during tissue and organ development. In this study, we used massively parallel, single-cell RNA sequencing to define cell heterogeneity within the zebrafish kidney marrow, constructing a comprehensive molecular atlas of definitive hematopoiesis and functionally distinct renal cells found in adult zebrafish. Because our method analyzed blood and kidney cells in an unbiased manner, our approach was useful in characterizing immune-cell deficiencies within DNA–protein kinase catalytic subunit (prkdc), interleukin-2 receptor γ a (il2rga), and double-homozygous–mutant fish, identifying blood cell losses in T, B, and natural killer cells within specific genetic mutants. Our analysis also uncovered novel cell types, including two classes of natural killer immune cells, classically defined and erythroid-primed hematopoietic stem and progenitor cells, mucin-secreting kidney cells, and kidney stem/progenitor cells. In total, our work provides the first, comprehensive, single-cell, transcriptomic analysis of kidney and marrow cells in the adult zebrafish.
Original Data
GEO Dataset: GSE100911
Reference Genome: Danio rerio GRCz11 dna_rm primary assembly
Annotation: Danio rerio GRCz11.104 gtf
Bioinformatic Analysis
The first step in Single-Cell RNAseq analysis usually consists of the identification of clusters of cells, that is, groups of cells that share similar expression patterns. This information is useful because these groups putatively correspond to different cell types, which are the object of study. The clustering algorithm needs as input a count table, that is, a table containing genes in rows, cells in columns, and gene expression level in values. Creating the count table consists of two steps: aligning the reads coming from each one of the cells to a reference genome or transcriptome, and counting how many reads align in each gene or transcript.
1.- RNA-seq Alignment
The first step of the pipeline is to align the raw reads against the reference genome or transcriptome. The choice will depend on the data available and whether it is preferable to quantify by gene (then reference genome should be used) or by transcript (then reference transcriptome should be used).
In this example, reads will be mapped against a reference genome. Since the species under study is a eukaryotic organism, that is, it has exons in its genome, the right software to use is STAR. STAR is a fast RNA-Seq read mapper, with support for splice-junction and fusion read detection, and it was designed to align non-contiguous sequences directly to a reference genome.
Application
Input
Sequencing data in FASTQ format (or script to download it from NCBI)
Parameters
Input Files: * specify here all the FASTQ files *
Upstream Files Pattern: _1
Downstream Files Pattern: _2
Provide Annotations: true
Annotation File: Danio_rerio.GRCz11.104.gtf
Overhang: 35
2-pass Mapping: true
Sort by Coordinate: true
Min. Intron Length: 20
Max. Intron Length: 1000000
Max. Distance Between Mates: 1000000
Max. # of Multiple Alignments: 20
Max. # of Mismatches: 999
Include Chimeric Alignments: false
Add Read Group Information: false
Save Splice Junctions: false
Save Unmapped Reads: false
Execution Time
Around 4’5 hours.
Output
BAM files. There’s more information about the format in the OmicsBox manual.
The percentage of reads aligned to only one location in the reference genome is high, indicating that the alignment was good (Figure 1).

Figure 1. Relative alignments per category (unique mapped, multi-mapped, chimeric, unmapped) for each of the cells. Only 30 cells are displayed.
2.- Expression Quantification
The next step is to quantify the gene expression for each one of the cells. The aim is to obtain the count table necessary for the clustering algorithm.
Since we’ve used a reference genome in the alignment step, the option to choose at this point is the “Gene-level Quantification”. This tool expects files with aligned sequencing reads in SAM/BAM format and a GTF/GFF file with coordinates of genomic features. It uses the HTSeq package to count how many reads map to each feature of interest (e.g. genes, exons, etc.).
Application
Create Count Table, Gene-level Quantification
Input
RNA-Seq Alignments in BAM format (from the 1.- RNA-seq Alignment step).
Parameters
Feature File: Danio_rerio.GRCz11.104.gtf
Quantification Level: gene
Group by: gene_name
Strand Specificity: Non Strand Specific
Overlap Mode: Union
Lowest Mapping Quality: 10
Execution Time
Around 20 minutes.
Output
Count Table containing expression values for each gene and cell.
The output is a count table with one cell in each column. It is possible to generate a boxplot with the distribution of the library sizes (total number of reads for each cell) (Figure 2). This output is useful to detect outlier cells, that is, cells with too much or too few counts. It is recommendable to filter them during the clustering algorithm.

Figure 2. Distribution of library sizes (total number of reads for each cell).
3.- scRNA-seq Clustering
The next step is to perform the clustering of cells coming from Single-Cell RNA Sequencing (scRNA-seq) data. That is, find groups of cells that share similar expression patterns, which should correspond to the same cell type or state. To this end, OmicsBox’s tool uses the widely-used Seurat package.
Application
Input
Count Table (from 2.- Expression Quantification step).
Parameters
Input Type: Count Table Project
Minimum Cells: 1
Set Minimum Counts Cutoff: true
Minimum Counts: 100000
Set Maximum Counts Cutoff: true
Maximum Counts: 1500000
Set Minimum Features Cutoff: true
Minimum Features: 500
Set Maximum Features Cutoff: false
Filter by % of Mitochondrial Genes: false
Multi Sample Analysis: false
Normalize Data: true
Normalization Method: Regularized Negative Binomial Regression
High Variable Features: 3000
Scale Data: false
Center Data: true
Regress Out Mitochondrial Genes: false
Regress Out Cell Cycle Genes: false
Principal Components: 50
Define Dimensions by: Manual
Number of Dimensions: 20
k-value: 20
Resolution: 0.8
Point's Minimum Distance: 0.3
Point's Spread: 1.0
Execution Time
Around 10 minutes.
Output
A total of 6 different clusters have been obtained (Figure 3). These 6 groups putatively correspond to different cell types, but we don’t know to which ones for the moment.
In addition, the Elbow Plot helps to assess if the number of dimensions used during the clustering is adequate. In this case, 20 is near the elbow point so it is a good approximation (Figure 4). There is more information about this plot in OmicsBox’s manual.

Figure 3. UMAP of the cells colored by the cluster they belong to. The cluster label has been obtained with the graph-based algorithm used by Seurat.

Figure 4. The amount of variance (Y-axis) explained for each of the dimensions (X-axis) during Principal Component Analysis.
4.- Marker-based Cell Type Identification
The last step in this pipeline is to label the clusters. One approach is to look at the expression of genes characteristic of a cell type, the so-called marker genes. These genes are more expressed in a particular cell type in comparison with the other. Thus, if we identify in which cluster they are more expressed we can annote that cluster with the corresponding cell type. For this example, the list of marker genes is obtained from the original paper (Tang et al., 2017).
OmicsBox offers different visualizations to help with the marker-based cell type identification.
Application
Single-cell RNAseq Clustering - Charts
Input
scRNA-Seq Clustering Results (from 3.- scRNA-seq Clustering step).
Gene list text file.
Parameters
Expression Profile
Input Genes: File
Genes File: genes.txt
Scale Gene Expression: true
UMAP
Color by Gene Expression: true
Gene:
zfpm1
lyz
il7r
cd79b
Execution Time
Seconds.
Output
By looking at the Expression Profile (Figure 5) and the information contained in the original paper (Tang et al., 2017), the cell types can be easily assigned:
cluster_1 → HSCs
cluster_2 → Neutrophils
cluster_3 → B Cells
cluster_4 → T Cells
cluster_5 → NK Cells
cluster_6 → Thrombocytes
This visualization it’s really useful since it allows to check for the expression level of multiple genes at the same time.
Still, another interesting visualization that also helps to the cell type identification is the UMAP colored by the gene expression level (Figure 7). For example, it can be easily seen that the gene lyz is more expressed in cluster_2 in comparizon with the rest of the clusters. Since this is a marker gene for Neutrophils, we can assign this label to cluster_2.
Now that we have assigned a label to each cluster, we can rename de columns of the scRNA-Seq Clustering project so they have a more meaningful name (Figure 6). This can be achieved with the “Rename Cluster” tool available in the Context Menu. The new labels will be applied to the plots as well (Figure 8).

Figure 5. Expression profile of zebrafish clustering results. Marker-genes are in columns, and clusters in rows. The size of the dot represents the percentage of cells expressing the gene in the cluster. The color of the dot represents the average gene expression computed with the cells belonging to each cluster.

Figure 6. Renamed scRNA-Seq Clustering Project.

Figure 7. UMAP with cells colored by the expression level of the genes lyz, zfpm1, il7r, and cd29b.

Figure 8. Renamed UMAP colored by cluster.