Genome Assembly Comparison with QUAST

Introduction

This dataset consists of a whole genome sequencing of the N2 strain of Caenorhabditis elegans. The aim of this tutorial is to obtain a de novo genome assembly of this strain.

Dataset Description

  • Organism: Caenorhabditis elegans

  • Instrument: Illumina Genome Analyzer IIx

  • Strategy: WGS

  • Source: Genomic

  • Layout: Paired-end 500pb

Original Data

Bioinformatic Analysis

Sometimes it is still useful to perform a de novo assembly of an already studied species. It could be helpful to study different strains and their genetic variations, for example. In this scenario, it is good practice to try different assembly algorithms with different configurations, since not the same analysis works in the same way for all datasets. Further, the different assemblies can be compared in order to identify the configuration that seems more suitable for our data.

In this tutorial, we will ensemble the genome of the N2 strain of C. elegans with two different algorithms and configurations so assemblies can then be compared using QUAST.

1.- FASTQ Quality Check

This tool performs quality control on the raw sequencing reads. It gives details about different aspects of the reads like per-base quality distribution, adapter content, etc. By looking at the results, it is easier to decide the parameters for further read preprocessing.

Application

FASTQ Quality Check

Input

Parameters

  • Input Reads: * specify here all the FASTQ files *

  • Chart Read Length Binning: true

  • Provide Adapter Sequences: false

  • Provide Contaminant Sequences: false

Execution Time

Around 20 minutes.

Output

Results show that the DRR008445 sample has failed in some of the quality control points (Figure 1). The first thing to notice is the Per Base Sequence Quality (Figure 2). Apparently, the end of the reads isn’t of good quality, so it should be removed before continuing with the assembly.

In addition, the Adapter Content check (Figure 3) shows that a high percentage of reads contains adapter sequence. That is, instead of the entire cDNA insert, the adapter has been sequenced as well.

This and the bad per-base quality mentioned above could be the reasons for the other failed quality checks. Since many reads contain the same adapter sequences, this could shift the base and the GC content. In addition, it could also increase the number of overrepresented sequences, and thus duplication levels.

2.- FASTQ Preprocessing

After detecting the weaknesses of the data, it is highly recommended to remove them. Sequences with poor quality and mainly adapter sequences could interfere with the assembly. It is not desirable to introduce adapters or wrong sequences in the assembly. Thus, the preprocessing of the raw sequencing reads is a really important step if we want to achieve assemblies of good quality.

Application

FASTQ Preprocessing

Input

Parameters

  • Paired-End: * specify here all the FASTQ files *

  • Quality Encoding: Autodetection

  • Remove Adapters: true

  • Use Adapters From: Default Adapter Sequences

  • Adapter Sequences: TruSeq3

  • Seed Mismatches: 2

  • Palindrome Clip Threshold: 30

  • Simple Clip Threshold: 15

  • Minimum Adapter Length: 8

  • Keep Both Reads: true

  • Trimming: true

  • Trimming Option: Sliding Window Trimming

  • Window Size: 4

  • Required Quality: 15

  • Filter By Quality: true

  • Average Quality: 25

  • Filter By Length: true

  • Minimum Length: 36

Execution Time

Around 1 hour.

Output

Trimmomatic removes the parts of the reads with low quality or belong to the adapters. Moreover, it filters out reads that are too short after the trimming. As a result, some reads are removed from the input FASTQ (Figure 4). In addition, FASTQ files with unpaired reads are generated as well. Unpaired reads are reads for which their couple has been removed.

The FASTQ Quality Check can be performed again with the output sequences. After doing it, it is possible to see how the Per Base Sequence Quality of the problematic files has noticeably increased (Figure 5). In addition, the Illumina adapter has been removed completely (Figure 6). However, the analysis has detected a little bit of SOLID adapter, so another round of FASTQ Preprocessing and Quality Check would be recommended.

3.- DNA-Seq de novo Assembly

Now the reads are prepared to perform the de novo assembly. The de novo assembly is a really complex analysis. There are many factors that can influence the result like the algorithm used, the input data, and the parameters. Thus, it is good practice to try different analyses and compare the obtained results. In this tutorial, we will ensemble the reads using two different algorithms (ABySS and SPAdes), both with and without using mate-pair data.

Application

DNA-Seq de Novo Assembly

Input

Parameters

ABySS

  • Paired-End: * specify here the DRR008443 and DRR008444 FASTQ files *

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Use Additional Data: false

  • K-mer Size: 55

  • Use Paired de Bruijn graph: false

  • Minimum Alignment Length: 40

  • Hash Functions: 1

  • K-mer Count Threshold: 2

  • Save Graph Files: false

ABySS + mate-pair

  • Paired-End: * specify here the DRR008443 and DRR008444 FASTQ files *

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Use Additional Data: true

  • Mate-pair: * specify here the DRR008445 FASTQ files *

  • K-mer Size: 55

  • Use Paired de Bruijn graph: false

  • Minimum Alignment Length: 40

  • Hash Functions: 1

  • K-mer Count Threshold: 2

  • Save Graph Files: false

SPAdes

  • Paired-End: * specify here the DRR008443 and DRR008444 FASTQ files *

  • IonTorrent Data: false

  • Single-cell Data: false

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Use Additional Mate-Pair Data: false

  • Use Data for Hybrid Assembly: false

  • Automatic K-mer Sizes: true

  • Read Error Correction: true

  • Mismatch Careful Mode: false

  • Read Coverage Cutoff: Off

  • Save Graph Files: false

  • Autodetected K-mer Sizes: 21, 33, 55

SPAdes + mate-pair

  • Paired-End: * specify here the DRR008443 and DRR008444 FASTQ files *

  • IonTorrent Data: false

  • Single-cell Data: false

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Use Additional Mate-Pair Data: true

  • Mate-pair: * specify here the DRR008445 FASTQ files *

  • Use Data for Hybrid Assembly: false

  • Automatic K-mer Sizes: true

  • Read Error Correction: true

  • Mismatch Careful Mode: false

  • Read Coverage Cutoff: Off

  • Save Graph Files: false

  • Autodetected K-mer Sizes: 21, 33, 55

Execution Time

Around 3h each of the assemblies.

Output

For each algorithm both scaffolds.fasta and contigs.fasta files are generated. In addition, an unitigs.fasta file is generated by ABySS. It is recommended to use scaffolds.fasta files for further analysis, since they contain the results of assembling the obtained contigs. Thus, scaffold sequences are longer and more complete.

In addition to the assembled sequences, a report and an Nx plot are generated as well. The report contains different metrics regarding the assembly (Figure 7), and the Nx plot gives an idea of the continuity of the assemblies (Figure 8). Both are shown for the contigs and scaffolds of each of the assemblies. Thus, it is easy to see that in each case, the scaffolds.fasta contains fewer sequences but they are longer, which is desirable.

However, metrics like Nx are not comparable between assemblies since they are computed taking into account the total assembly size, which is different in each case. This and the fact that the different metrics are distributed in many reports and charts, make the use of software dedicated to the quality assessment of the assemblies necessary.

4.- Genome Assembly Quality Assessment

The Genome Assembly Quality Assessment tool is designed to assess the quality of de novo genome assemblies. It does so by comparing the assemblies against the reference genome. The resulting plots and reports contain the statistics for each assembly, thus making the comparison between them easier.

Application

Genome Assembly Quality Assessment

Input

Parameters

  • Assemblies: * specify here all the FASTA files *

  • Genome Type: Eukaryote

  • Reference Genome: GCF_000002985.6_WBcel235_genomic.fna.gz

  • Fragmented Genome: false

  • Minimum Contig Size: 500

  • Minimum Alignment Length: 65

  • Min. Identity Alignment (%): 95.0

  • Max Extensive Mis. Size: 1000

  • Max Local Mis. Size: 200

  • Max Scaffold Gap Size: 10000

Execution Time

Around 15 minutes.

Output

The main QUAST results are a summary report and an NGx plot (Figure 9). Both can be generated from the OmicsBox object available for download in the link above. The NGx plot shows the same as the Nx but it takes into account the genome size instead of the assembly size. Thus, it can be compared between assemblies. The NGx plot shows that the assembly obtained with ABySS and mate-pair reads contains longer contigs. For example, an NG50 value of 1000 means that 50% of the genome size is covered by contigs of length 1000 pb or greater. So higher NGx values indicate that the assembly is composed of longer contigs than the rest.

In addition, the Genome Fraction plot (Figure 10) shows that a greater percentage of the genome has bases aligned to the assemblies produced by ABySS. That is, ABySS assemblies recover a greater percentage of the assembly. In between the two configurations for the ABySS assembly, it seems that the use of mate-pair reads really decreases the number of missassemblies (Figure 11).

Overall, with this analysis, it was possible to conclude that the ABySS assembly using mate-pair reads was the more suitable strategy for our data.