Single-cell RNAseq Clustering
This tool is designed to perform the clustering of cells coming from Single-Cell RNA Sequencing (scRNA-seq) data. That is, it aims to find groups of cells that share similar expression patterns, which should correspond to the same cell type or state. Prior to the clustering, this tool allows filtering and preprocessing the data in order to make it suitable for the clustering algorithm (Figure 1). This application is based on the widely-used Seurat package.
Please cite Seurat as:
Butler, A., Hoffman, P., Smibert, P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 36, 411–420 (2018). https://doi.org/10.1038/nbt.4096
Stuart, T., Butler, A., Hoffman, P. et.al. Comprehensive Integration Of Single-Cell Data. Cell 177 (7): 1888-1902.e21 (2019). doi:10.1016/j.cell.2019.05.031.
Figure 1. General workflow for the single-cell RNA seq clustering analysis
Run single-cell RNA sequencing clustering
Go to transcriptomics → Single Cell RNA-seq →scRNA-seq Clustering. A new wizard will be opened. Here you can modify several parameters of the analysis, which are divided into five different sections: Input (Figure 2), Filtering (Figure 4 and 5), Multi-sample Analysis (Figure 8 and 9), Processing (Figure 11 and 12), and Clustering (Figure 14).
On this wizard (Figure 2), you can specify a file with a count table. This count table must contain cells in columns, genes in rows, and each value corresponds to the gene expression level (Figure 3). This gene expression level can be the raw number of reads or the number of UMIs detected for each gene and cell, or even already-normalized values.
It can be added either in a .box or a .txt file via the Count Table Project and Count Table file options, respectively.
The Count Table Project expects a .box file generated by OmicsBox. It can be obtained by:
Loading a text file containing a count table into OmicsBox. It can be achieved by going to “transcriptomics → Load → Load Count Table (expression data)” and saving the object.
Creating it with the "Create Count Table" functionality (see Quantify Expression section).
The Count Table File expects a text file (Figure 3). Moreover, it is possible to specify the column separator in the Column Separator option, so it accepts different file formats. It accepts files with columns separated by TAB (“\t”), Space (“ “), Comma (“,”), and Semicolon (“;”). In addition, the analysis does not allow for NA values, so there are two options to handle them in the NA Values parameter: “Skip Line”, that is, the whole line containing a NA value is removed, or “Assume Zero Values”, in which the NA is replaced by a 0 and the whole line is kept.
The “Create Count Table” tool is only suitable for full-length scRNA-seq data (e.g. SMARTseq, SMARTer, etc.).
Figure 2. Input Page
Figure 3. Count table file example.
Configuration 1 and 2. Filtering.
In addition to filtering low-quality reads like in bulk RNA-seq analysis, it is important to filter low-quality cells. This can be easily achieved by filtering cells with an excessive or low total number of reads or counts. The first case could correspond to doublets or multiples, that is, groups of two or more cells that have been sequenced together. The second case could correspond to broken cells that could have lost part of their RNA content. Broken cells can also be identified by the percentage of total counts that correspond to mitochondrial RNA (mtRNA). It is so because cells with broken cellular membranes could still conserve the mtRNA inside the mitochondrial membrane, so the percentage of mtRNA is greater compared to intact cells. Furthermore, it is also advisable to filter cells with a low number of mapped reads, because it could be indicative that the RNA content was degraded. This last case can be evidenced by a low number of detected features.
The available filtering options are (Figure 4 and 5):
Minimum Cells. Include features (genes, exons, etc.) detected in at least this number of cells. This filter is meant to exclude features that are not very informative. It removes rows from the count table.
The following filters remove cells (columns) from the count table:
Minimum Counts. Discard cells with less than this number of reads/counts.
Maximum Counts. Discard cells with more than this number of reads/counts.
Minimum Features. Discard cells with less than this number of features.
Maximum Features. Discard cells with more than this number of features.
Maximum % Mitochondrial Genes. Discard cells with more than this percentage of mitochondrial genes.
Mitochondrial Genes File. File with a list of mitochondrial genes, one per line (Figure 6).
The gene IDs or names present in the mitochondrial genes file must correspond to the ones used in the count table.
Figure 4. Filtering Data Page I.
Figure 5. Filtering Data Page II.
Figure 6. Mitochondrial Genes File example.
Configuration 3 and 4. Multi-sample Analysis.
This is an additional step needed when the input count table contains data coming from multiple samples e.g. from wild-type and mutant organisms, from control and stimulated samples, etc. All these situations could cause changes in gene expression that could make a joint analysis of all the data difficult, with cells clustering both by condition and by cell type (Figure 7-A). In order to avoid that, the Multi-sample Data Integration step aims to integrate scRNA-seq datasets by identifying common cell types based on common sources of variation. As a result, this step enables the identification of shared populations across datasets (Figure 7-B and C) and thus further downstream comparative analyses. For more details about the integration algorithm please read the papers from the Seurat Team: Butler et al. 2018 and Stuart et al. 2019. In this step, it is possible to modify the following parameters (Figure 8 and 9):
Multi-Sample Analysis. This option must be checked in order to perform the Data Integration step.
Experimental Design File. File with a list of all cells with the condition they belong, one per line. Each line must have two or more columns separated by a tab: the first one with a cell ID and the other ones with the conditions they belong to (Figure 10).
Condition. Choose the condition to integrate datasets by. That is, the condition that you don’t want to interfere during the clustering of cells.
The next parameters modify the behavior of the integration algorithm. In case the data integration has not been successful, it is advisable to modify these parameters to try to improve the results. See Figure 7-A for a “bad” and Figure 7-B for a “good” integrations.
The integration algorithm consists of two steps: it firsts computes a dimensional reduction using Canonical Correlation Analysis (CCA) and then performs a mutual nearest neighbors (MNNs) algorithm to identify shared subpopulations across datasets. In this way it finds “anchors” to integrate the datasets, that is, equivalent cells across datasets.
N. Dimensions for Integration. The number of dimensions to use from the Canonical Correlation Analysis for the integration step.
K Anchor. How many neighbors (k) to use during the MNNs when picking anchors.
K Filter. How many neighbors (k) to use during the MNNs when filtering anchors.
K Score. How many neighbors (k) to use during the MNNs when scoring anchors.
K Weight. How many neighbors (k) to use during the MNNs when weighing anchors.
Figure 7. UMAP plots of 10,799 cells coming from wild type (WT) and Rb mutant Drosophila eye disc, prior to (A) and post (B) alignment. After alignment, cells across conditions are grouped together based on shared cell type, allowing for a single joint clustering (C) to detect 11 populations.
Figure 8. Multi-sample Data Integration Page I.
Figure 9. Multi-sample Data Integration Page II.
Figure 10. Experimental Design File example.
Configuration 5 and 6. Data Processing.
These steps are meant to prepare data for the clustering analysis (Figure 11).
Normalization aims to reduce differences in gene expression due to technical variation. This step tries to ensure that the observed heterogeneity within cells is due to biological reasons, rather than technical biases.
Normalize Data. Check this option to perform the normalization step. Extremely recommendable unless your data is already normalized or you are sure that your data does not need it.
Normalization Method. Which normalization method to use:
Regularized Negative Binomial Regression. This option uses the SCTransform normalization method designed by the Seurat team. This method applies a generalized linear model for each gene, with counts as the response and sequencing depth as the explanatory variable. This is used to learn the parameters of the model that depend on a gene’s average expression. Then, a negative binomial regression is applied and the Person residuals (the difference between the expected and the observed expression) are treated as the normalized expression levels. For more details about the statistical method please see Hafemeister and Satija 2019.
Log Normalization. Feature (gene) counts for each cell are divided by the total counts for that cell and multiplied by a scale factor. This is then natural-log transformed using log1p.
Relative Counts. Feature counts for each cell are divided by the total counts for that cell and multiplied by a scale factor. No log transformation is applied.
Centered Log Ratio Transformation (CLR). For each feature, the CLR-transformation is defined as the logarithm of the feature counts for one cell, divided by the geometric mean of all counts for that cell.
These steps are necessary whether the normalization is done or not, in order to prepare data for the dimensional reduction step.
High Variable Genes. Number of high variable genes to keep for further analysis. Keeping only those genes reduce both computational cost and non-relevant signals.
Scale Data. If checked, it scales features to have unit variance.
Center Data. If checked, it centers features to have zero mean.
Regress Out Mitochondrial Genes. Check this option to reduce the heterogeneity between cells associated with the expression of mitochondrial genes. This could prevent grouping cells during the clustering step that present a higher expression of those genes. In some cases, this information could not be informative since the higher expression of mitochondrial genes could be caused by technical reasons (e.g. because the mRNA has been leaked during cell manipulation) rather than by biological reasons. In order to perform this analysis, a file with a list of mitochondrial genes must be provided with one gene per line (Figure 6). If there’s already a mitochondrial genes file uploaded on the Configuration 2 page (Figure 2), it will appear on the current page as well. It is possible to browse for another file if desired.
Regress Out Cell Cycle Genes. Check this option to reduce the heterogeneity between cells associated with the expression of cell cycle genes. This could prevent grouping cells during the clustering step that are in the same developmental stage, independently of its type. In some cases, these differences in cell cycle may be uninformative but, in other cases, they could be treated as indicative of proliferating cell populations which can be different across treatment conditions, for example. So whether to check this option or not will depend on the dataset under study and the target of the experiment. In order to perform this analysis, a file with cell cycle genes in one column and the cell cycle phase they belong to (S or G2/M) must be provided. One gene per line and columns separated by a tab (Figure 13).
Common single-cell RNAseq analysis implies from hundreds to thousands of cells and tens of thousands of genes. That is, it is really high dimensional data, so it is advisable to reduce the dimensionality of the dataset in order to reduce the computational cost of further analysis. The most widely used method for dimensional reduction is Principal Component Analysis (PCA). Keeping only the firsts Principal Components for further analysis reduces the dimensionality of the data while keeping the heterogeneity of the dataset, since they explain the largest amount of variance present in the sample. For more details please see "Orchestrating Single-Cell Analysis", 2020.
Principal Components. Number of Principal Components to compute from the Principal Component Analysis. It must be smaller than the number of cells present in the count table.
Figure 11. Data Processing Page I.
Figure 12. Data Processing Page II.
Figure 13. Cell Cycle Genes File.
Configuration 7. Clustering Page
This step performs the actual cell clustering. It groups cells with similar expression patterns, which should correspond to the same cell type. For this analysis, Seurat uses a graph-based clustering algorithm. For more information, please see "Orchestrating Single-Cell Analysis", 2020.
These parameters affect the clustering algorithm (Figure 14).
Define Dimensions by. How to decide the number of dimensions to use from the dimensional reduction (PCA) for the clustering step.
Elbow Point. This option automatically decides the number of dimensions to use. The assumption is that each of the top PCs capturing biological signals should explain much more variance than the remaining PCs. Thus, there should be a sharp drop in the percentage of variance explained from the last “biological” PC to the others. This is the so-called “Elow Point”.
Manual. With this option, it is necessary that the user specifies the number of PCs to use in the “Number of Dimensions” parameter.
It should be taken into account that strong biological variation in the early PCs will shift the elbow point, potentially excluding weaker (but still interesting) variation in the next PCs immediately following it. So it could be interesting to repeat the analysis increasing the number of dimensions established by the Elbow Point to see if it improves the results.
Number of Dimensions. The number of dimensions to use from the dimensional reduction for the clustering step. This option is only available if the “Manual” option from the “Dimensions Choice” is selected.
k-value. The number of neighboring cells computed during the clustering algorithm. Normally, a greater k-value would produce a smaller number of clusters, and vice versa.
Resolution. This parameter determines how "fine" the clustering is: values above 1.0 would produce a larger number of clusters, and vice versa.
These parameters affect only the UMAP visualization. The Uniform Manifold Approximation and Projection (UMAP) method is a non-linear dimensionality reduction technique (similar to PCA), but this time it is used to plot the high-dimensional single-cell data into two dimensions. The UMAP is used for visualization purposes because it represents the variability of single-cell data more accurately than the PCA. For more information, please refer to "Orchestrating Single-Cell Analysis", 2020.
Point’s Minimum Distance. This controls how tightly are the points (cells) compressed together.
Point’s Spread. This controls how expanded the points are. In combination with the minimum distance, this determines how clustered the points are.
Figure 14. Clustering Page.
Once the input counts have been processed and analyzed via the "scRNA-seq clustering '' tool, a new tab is opened containing the Clustering Results (Figure 15). This new tab contains a count table but, this time, columns are cell clusters. Each slot holds the mean gene expression of each gene, taking into account all the cells belonging to that cluster.
It also generates a Summary Report (Figure 16):
The “Input” section specifies the name of the input count table file used.
The “Data Overview” table shows the value of some statistics before and after the filtering step: regarding the total number of cells, the total number of counts (or reads) per cell, and the total number of features per cell.
The “Experimental Design” section informs about the conditions present in the Experimental Design File, if specified.
The “Clustering Results” table shows the total number and a link to a list of cells belonging to each cluster. Before the table, there’s one line specifying the number of dimensions from the PCA used for the clustering step.
The “Analysis Parameters” table shows the parameters used for the clustering analysis.
In addition, two charts are generated:
Elbow Plot (Figure 17). It shows the amount of variance (given by the standard deviation) explained by successive PCs. This helps to decide how many principal components to use for the clustering algorithm, in case you want to re-run the clustering with different parameters.
UMAP (Figure 18). Each dot is a cell, colored by the cluster assigned by the graph-based clustering algorithm. The X and Y axes are the values in the UMAP coordinate 1 and 2 of each cell, respectively.
Figure 15. scRNA-seq Clustering results example.
Figure 16. Part of Summary Report example.
Figure 17. Elbow Plot chart example.
Figure 18. UMAP chart example.
Side Panel Features
It shows the Summary Report previously explained in the above “Results” section (Figure 16).
Extract Cluster(s) Counts
It may be the case that one or more big clusters have been obtained during the analysis, so it could be desirable to extract them and re-run the clustering to obtain sub-clusters of cells within it. That could make it possible to find more specific cell types.
With the Extract Cluster(s) feature, it is possible to specify one or more clusters to extract by checking them in the “Cluster(s) to Extract” option of the wizard (Figure 19). The result is a new tab with a normalized count table containing only the cells that belong to the selected cluster(s) (Figure 20). A new clustering analysis can be performed with it.
This job could take a while for big datasets.
This feature extracts the normalized counts, not the raw ones. So in the case of running a new clustering, it is advisable not to normalize the counts again.
Figure 19. Extract Cluster(s) wizard.
Figure 20. Extracted count table.
It is possible to select which cluster(s) to remove by checking them in the “Cluster(s) to Remove” option of the wizard (Figure 21). This feature removes the clusters and cells belonging to them from the clustering object. They won’t appear anymore in the cluster’s count table, UMAP, heatmap, etc.
This option may be useful if you want to remove clusters prior to performing downstream analyses (e.g. non-interesting or contaminant cells, spurious clusters, etc.).
The result is a modified clustering result (Figure 22). It must be saved in order to conserve the modifications.
This job could take a while for big datasets.
Figure 21. Remove Cluster(s) wizard.
Figure 22. Removed clusters 0 and 1 from the Clustering result.
It shows the UMAP plot. It has different coloring options available in the wizard (Figure 23):
Color by Cluster. It shows the same UMAP plot as the one explained in the above “Results” section (Figure 15).
Color by Gene Expression. It shows the UMAP plot but this time cells (dots) are colored according to the expression level of the gene specified in the “Gene” text box (Figure 24-A).
Color by Condition. It shows the UMAP with cells colored by the condition they belong to (Figure 24 -B). This option is only available if an Experimental Design is specified during the Clustering Analysis (Figure 8).
With this feature, you can see the expression levels of known gene markers across the different clusters in order to identify putative cell types. To this end, a Bubble Plot is generated with clusters in rows and the specified genes in columns (Figure 25). The size of the dot represents the percentage of cells expressing the gene, that is, the percentage of cells that have a gene expression level greater than 0. The color represents the average gene expression in that cluster. You can configure the following options in the wizard (Figure 26):
Input Genes. You can specify here which genes to plot. The gene name or ID should correspond to the one in the input count table used during the clustering analysis. They can be specified in two ways:
Text: write the genes to plot in the “Genes List” text box, one per line.
File: specify a file containing the genes to plot, one gene per line.
This affects the visualization.
Scale Gene Expression. When checked, it applies the Z-Score transformation to scale average gene expression across clusters. It allows the visualization of both highly and lowly expressed genes on the same color scale. It should be noted that this may exaggerate the results, but it is still advisable if you are going to plot genes with different expression level ranges. If unchecked, it colors the dots by the raw average gene expression of each cluster.
Figure 23. UMAP plot wizard (A) with no experimental design and (B) with experimental design.
Figure 24. UMAP plots (A) colored by gene expression and (B) colored by condition.
Figure 25. Bubble Plot example.
Figure 26. Expression Profile Wizard.
Export Experimental Design
It will generate a text file in the selected folder with the experimental design containing the cell names, the cluster they belong to, and the conditions (if any) (Figure 27).
Export Normalized Count Table
It will generate a text file in the selected folder with a count table with the normalized values.
It will generate a text file in the selected folder with the cluster counts, that is, clusters in columns, genes in rows, and average gene expression in values.
Figure 27. Experimental Design file.
Context menu options
Once you know the cell type of one cluster, it could be interesting to rename it so it has a more descriptive name. In order to do that, go to the cluster column and click the mouse right button. It will open a list of options. On that list, click on “Rename Cluster”. A new wizard will appear to type the new name. The result is the same project but with the renamed cluster (Figure 27). It has to be saved in order to conserve the modification.
Figure 27. Rename a cluster (column) from a Seurat clustering object.