Long Read Prokaryotic Analysis
Introduction
Genome analysis of Bacillus subtilis.
Dataset Description
DNA sequencing data from different Bacillus subtilis experiments. The dataset comprises a long-read sequencing library generated with PacBio technology and a short-read sequencing library generated with Illumina technology.
Organism: Bacillus subtilis.
Instrument: Illumina HiSeq 1500 and PacBio RS II.
Layout: Paired-end (Illumina) and long single reads (PacBio).
Publication
Note that each sequencing sample comes from a different experiment. The article cited here describes recent advances in the B. subtilis genome but is not related to any of the sequencing libraries in this dataset.
Original Data
Illumina ENA Run: ERR2935851.
PacBio ENA Run: SRR7498042.
NCBI Genome: Bacillus subtilis subsp. subtilis str. 168.
Bioinformatic Analysis
1- DNA-Seq de novo Assembly
Application
DNA-Seq de novo Assembly (Flye).
Input
PacBio sequencing data in FASTQ format: SRR7498042.fastq.gz.
Parameters
Input Reads: [PacBio Raw] SRR7498042.fastq.gz
Provide Genome Size: false
Automatic Minimum Overlap: true
Polishing: true
Number of Polishing Iterations: 1
Plasmids: false
Keep Haplotypes: false
Trestle: true
Assembly Fasta: assembly.fasta
Save Graph File: true
Graph File: assembly_graph.gfa
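For readers who want to repeat this step outside the graphical interface, the parameters above can be translated into a Flye command line. The sketch below builds that argument list in Python. The mapping from parameter labels to flags is an assumption based on the Flye 2.x command-line options (`--plasmids` and `--trestle` are not available in every release), and `flye_out` is a hypothetical output directory.

```python
def flye_command(reads, out_dir, iterations=1, genome_size=None,
                 min_overlap=None, plasmids=False, keep_haplotypes=False,
                 trestle=False, threads=4):
    """Build an argv list for Flye from the parameters of this step.

    Assumed mapping (not stated in the text):
      Input Reads [PacBio Raw]        -> --pacbio-raw
      Number of Polishing Iterations  -> --iterations
      Provide Genome Size             -> --genome-size (omitted when None)
      Automatic Minimum Overlap       -> --min-overlap omitted when None
    """
    cmd = ["flye", "--pacbio-raw", reads, "--out-dir", out_dir,
           "--iterations", str(iterations), "--threads", str(threads)]
    if genome_size is not None:
        cmd += ["--genome-size", str(genome_size)]
    if min_overlap is not None:
        cmd += ["--min-overlap", str(min_overlap)]
    if plasmids:
        cmd.append("--plasmids")
    if keep_haplotypes:
        cmd.append("--keep-haplotypes")
    if trestle:
        cmd.append("--trestle")
    return cmd


# Values taken from the parameter list of this step.
cmd = flye_command("SRR7498042.fastq.gz", "flye_out", iterations=1, trestle=True)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where Flye is installed. Setting `iterations=0` would correspond to disabling polishing.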
Execution Time
30-40 minutes.
Output
assembly_graph.gfa: Text file containing the repeat graph in GFA format.
nx_plot_flye.box: Line chart about DNA-Seq de novo Assembly results.
report_flye.box: Report about DNA-Seq de novo Assembly results.
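The Nx line chart summarizes assembly contiguity. As a minimal sketch of the statistic it is usually built from, the function below computes Nx (for example N50) from a list of contig lengths. The contig lengths in the example are made up for illustration and do not come from this dataset.

```python
def nx(lengths, x=50):
    """Return Nx: the largest length L such that contigs of length >= L
    together cover at least x percent of the total assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


def nx_curve(lengths, step=10):
    """Points (x, Nx) for x = step, 2*step, ..., 100, as plotted in an Nx chart."""
    return [(x, nx(lengths, x)) for x in range(step, 101, step)]


toy_contigs = [400, 300, 200, 100]  # hypothetical lengths in bp
print(nx(toy_contigs, 50))
print(nx_curve(toy_contigs, step=25))
```

A single-contig assembly, which is the ideal outcome for a bacterial chromosome, gives the same value for every x.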
2- DNA-Seq Alignment
Application
DNA-Seq Alignment (BWA).
Input
Illumina sequencing data in FASTQ format: ERR2935851_1.fastq.gz and ERR2935851_2.fastq.gz.
Assembled genome in FASTA format (from the 1- DNA-Seq de novo Assembly step).
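The two Illumina files differ only in their `_1` and `_2` suffixes, which is what the Upstream and Downstream Files Pattern parameters of this step refer to. A minimal sketch of how paired files could be matched by such suffixes is shown below; the function name and the `.fastq.gz` extension default are illustrative assumptions, not the application's actual implementation.

```python
from collections import defaultdict


def pair_reads(filenames, upstream="_1", downstream="_2", ext=".fastq.gz"):
    """Group FASTQ files into (upstream, downstream) pairs by shared sample name."""
    found = defaultdict(dict)
    for name in filenames:
        if not name.endswith(ext):
            continue
        stem = name[:-len(ext)]
        if stem.endswith(upstream):
            found[stem[:-len(upstream)]]["upstream"] = name
        elif stem.endswith(downstream):
            found[stem[:-len(downstream)]]["downstream"] = name
    # Keep only samples for which both mates were found.
    return {sample: (mates["upstream"], mates["downstream"])
            for sample, mates in found.items() if len(mates) == 2}


pairs = pair_reads(["ERR2935851_1.fastq.gz", "ERR2935851_2.fastq.gz"])
print(pairs)
```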
Parameters
Paired-End: ERR2935851_1.fastq.gz and ERR2935851_2.fastq.gz
Upstream Files Pattern: _1
Downstream Files Pattern: _2
Genome Sequences: assembly.fasta
Minimum Seed Length: 19
Band Width: 100
Z-dropoff: 100
Trigger Re-seeding: 1.5
Seed Occurrence: 20
Skip Seeds: 500
Drop Chains: 0.5
Discard Chains: 0
Mate Rescue Rounds: 50
Skip Mate Rescue: false
Skip Pairing: false
Matching Score: 1
Mismatch Penalty: 4
Gap Open Penalty (DEL): 6
Gap Open Penalty (INS): 6
Gap Extension Penalty (DEL): 1
Gap Extension Penalty (INS): 1
5'-end Clipping Penalty: 5
3'-end Clipping Penalty: 5
Unpaired Read Penalty: 17
Minimum Score: 30
Split Alignments as Primary: false
MapQ of Supp. Alignments: false
Output All Alignments: false
Soft Clipping for Supp.: false
Shorter Split Hits as Secondary: false
Sort BAM File: By Coordinates
Add Read Group Information: false
Output BAMs: alignments
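The values listed above coincide with the defaults of `bwa mem`. The sketch below shows one way to express them as an explicit command line, assuming the step wraps `bwa mem`; the label-to-flag mapping is an assumption derived from the BWA manual, not something the text states.

```python
# (label in this section, assumed bwa mem flag, value)
VALUE_OPTIONS = [
    ("Minimum Seed Length", "-k", "19"),
    ("Band Width", "-w", "100"),
    ("Z-dropoff", "-d", "100"),
    ("Trigger Re-seeding", "-r", "1.5"),
    ("Seed Occurrence", "-y", "20"),
    ("Skip Seeds", "-c", "500"),
    ("Drop Chains", "-D", "0.5"),
    ("Discard Chains", "-W", "0"),
    ("Mate Rescue Rounds", "-m", "50"),
    ("Matching Score", "-A", "1"),
    ("Mismatch Penalty", "-B", "4"),
    ("Gap Open Penalty (DEL,INS)", "-O", "6,6"),
    ("Gap Extension Penalty (DEL,INS)", "-E", "1,1"),
    ("5'/3'-end Clipping Penalty", "-L", "5,5"),
    ("Unpaired Read Penalty", "-U", "17"),
    ("Minimum Score", "-T", "30"),
]

# (label in this section, assumed bwa mem switch, enabled?)
SWITCHES = [
    ("Skip Mate Rescue", "-S", False),
    ("Skip Pairing", "-P", False),
    ("Split Alignments as Primary", "-5", False),
    ("MapQ of Supp. Alignments", "-q", False),
    ("Output All Alignments", "-a", False),
    ("Soft Clipping for Supp.", "-Y", False),
    ("Shorter Split Hits as Secondary", "-M", False),
]


def bwa_mem_command(genome, reads_1, reads_2, threads=4):
    """Build an argv list for a paired-end `bwa mem` run."""
    cmd = ["bwa", "mem", "-t", str(threads)]
    for _label, flag, value in VALUE_OPTIONS:
        cmd += [flag, value]
    cmd += [flag for _label, flag, enabled in SWITCHES if enabled]
    return cmd + [genome, reads_1, reads_2]


cmd = bwa_mem_command("assembly.fasta",
                      "ERR2935851_1.fastq.gz", "ERR2935851_2.fastq.gz")
print(" ".join(cmd))
```

When run by hand, the reference would first need to be indexed with `bwa index assembly.fasta`, and the SAM output would typically be piped through `samtools sort` to obtain a coordinate-sorted BAM.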
Execution Time
15-20 minutes.
Output
3- Polishing
Application
Input
Assembled genome (from the 1- DNA-Seq de novo Assembly step).
DNA-Seq Alignments in BAM format (from the 2- DNA-Seq Alignment step).
Parameters
Input Fasta: assembly.fasta
Input BAMs: ERR2935851.bam
Diploid: false
Issues to Fix: snps,indels,gaps,local
Duplicates: false
IUPAC: false
Failed Sequencer Quality: false
Default Quality: 10
Flank: 10
Gap Margin: 100000
K-mer Size: 47
Minimum Depth: 0.1
Unclosed Gaps: 10
Minimum Mapping Quality: 0
Minimum Base Quality: 0
Skip Stray Pairs Identification: false
Output FASTA: polished_sequences.fasta
Save Changes: true
Output Changes: changes.txt
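This section does not name the polishing tool, but the parameter set closely matches the command-line options of Pilon. Under that assumption, the sketch below maps the values above onto Pilon flags. The `pilon` executable name (a wrapper script around the Pilon jar) and the flag mapping are both assumptions; Pilon derives its output file names from the `--output` prefix, so the file names it writes may differ from those listed in this section.

```python
def pilon_command(genome="assembly.fasta", bam="ERR2935851.bam",
                  output_prefix="polished_sequences"):
    """Build an argv list for Pilon (assumed tool) from this step's parameters."""
    values = {
        "--fix": "snps,indels,gaps,local",  # Issues to Fix
        "--defaultqual": 10,                # Default Quality
        "--flank": 10,                      # Flank
        "--gapmargin": 100000,              # Gap Margin
        "--K": 47,                          # K-mer Size
        "--mindepth": 0.1,                  # Minimum Depth
        "--mingap": 10,                     # Unclosed Gaps
        "--minmq": 0,                       # Minimum Mapping Quality
        "--minqual": 0,                     # Minimum Base Quality
    }
    switches = {
        "--changes": True,      # Save Changes
        "--diploid": False,     # Diploid
        "--duplicates": False,  # Duplicates
        "--iupac": False,       # IUPAC
        "--nonpf": False,       # Failed Sequencer Quality
        "--nostrays": False,    # Skip Stray Pairs Identification
    }
    # --frags is Pilon's input option for paired-end fragment libraries.
    cmd = ["pilon", "--genome", genome, "--frags", bam, "--output", output_prefix]
    for flag, value in values.items():
        cmd += [flag, str(value)]
    cmd += [flag for flag, enabled in switches.items() if enabled]
    return cmd


cmd = pilon_command()
print(" ".join(cmd))
```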
Execution Time
10-15 minutes.
Output
polished_sequences.fasta: FASTA file containing polished sequences.
changes.txt: Text file containing a space-delimited record of every change made in the assembly.
dna_seq_polishing_results.box: Report about DNA-Seq Polishing results.
nx_plot_polishing.box: Line chart about DNA-Seq Polishing results.
fix_type_distribution.box: Pie chart that summarizes the fix types performed during polishing.
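A fix-type summary like the pie chart can be approximated from the changes file. The sketch below assumes each line has four space-delimited fields (original location, new location, original sequence, new sequence) with `.` standing for an absent sequence, which is the layout Pilon uses. The category names are illustrative and may not match those shown in the chart, and the example lines are invented.

```python
from collections import Counter


def classify_change(original, replacement):
    """Assign a change to a coarse fix type based on the two sequences."""
    if original == ".":
        return "insertion"
    if replacement == ".":
        return "deletion"
    if len(original) == 1 and len(replacement) == 1:
        return "snp"
    return "substitution"


def fix_type_counts(lines):
    """Count fix types over the lines of a changes file."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        counts[classify_change(fields[2], fields[3])] += 1
    return counts


example_lines = [  # hypothetical records, not taken from this dataset
    "contig_1:1200 contig_1_pilon:1200 A G",
    "contig_1:5000 contig_1_pilon:5000 . T",
    "contig_1:7310-7311 contig_1_pilon:7311 AC .",
]
print(fix_type_counts(example_lines))
```

With a real file, the same function can be applied to an open file handle, for example `fix_type_counts(open("changes.txt"))`.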
4- Gene Finding
Application
Prokaryotic Gene Finding by Glimmer.
Input
Polished assembly in FASTA format (from the 3- Polishing step).
Parameters
Input Sequences: polished_sequences.fasta
Select the genetic code: The Bacterial, Archaeal and Plant Plastid Code
Minimum gene length: 110
Maximum gene overlap: 30
Minimum gene score: 30
Select the genome shape: Circular
Choose ICM use option: Create new ICM model
Set advanced ICM parameters: false
Save ICM model: false
Select the run model: Iterated
Define the start codons: false
Define the stop codons: false
Define GC content: false
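With a circular genome shape, a predicted gene may span the origin, so its length cannot always be computed as a simple difference of coordinates. The sketch below illustrates how the minimum length and minimum score thresholds above could be applied to such predictions. It assumes 1-based inclusive coordinates in which `start` is the 5' end of the gene (so `start > end` on the reverse strand); the function names are hypothetical and this is not Glimmer's own code.

```python
def gene_length(start, end, strand, genome_length):
    """Length in bp of a gene on a circular genome.

    `start` is the 5' end of the gene, so on the reverse strand the
    coordinates decrease from start to end.
    """
    span = end - start if strand == "+" else start - end
    if span < 0:  # the gene crosses the origin of the circular sequence
        span += genome_length
    return span + 1


def passes_thresholds(length, score, min_length=110, min_score=30):
    """Apply the Minimum gene length and Minimum gene score values of this step."""
    return length >= min_length and score >= min_score


# Forward-strand gene crossing the origin of a hypothetical 1,000 bp genome.
print(gene_length(950, 49, "+", 1000))
```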
Execution Time
~5 minutes.
Output
seqs_pgf.box: Sequence project containing predicted gene sequences.
gff_pgf.box: GFF project containing predicted gene coordinates.
report_pgf.box: Report about Prokaryotic Gene Finding results.
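If the predicted gene coordinates are exported to a plain GFF file, they can be inspected with a few lines of code. The sketch below parses the nine tab-separated GFF columns into dictionaries; it assumes a standard GFF3 layout, and the example records are invented for illustration.

```python
def parse_gff(lines):
    """Parse GFF3 feature lines into a list of dictionaries."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(item.split("=", 1)
                     for item in cols[8].split(";") if "=" in item)
        features.append({
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "id": attrs.get("ID"),
        })
    return features


example_gff = [  # hypothetical records
    "##gff-version 3",
    "contig_1\tGlimmer\tgene\t100\t399\t12.5\t+\t0\tID=orf00001",
    "contig_1\tGlimmer\tgene\t500\t900\t9.8\t-\t0\tID=orf00002;Note=example",
]
genes = parse_gff(example_gff)
print(len(genes), [g["id"] for g in genes])
```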
5- BLAST & InterProScan
Application
CloudBLAST & InterProScan Annotation.
Input
Predicted gene sequences in an OmicsBox project (from the 4- Gene Finding step).
Parameters
CloudBLAST
Blast Program: blastx-fast
Blast DB: Non-redundant protein sequences (nr v5)
Taxonomy Filter: 186817 Bacillaceae
Filter option: Blast against a subset of taxonomies
Blast Expectation Value (e-Value): 1.0E-3
Number of Blast Hits: 20
Blast Description Annotator: true
Word Size: 6
Low Complexity Filter: true
HSP Length Cutoff: 33
HSP-Hit Coverage: 0
Filter by Description: No filter
Save results as XML2 files: false
Blast Program: blastx-fast
Blast DB: Non-redundant protein sequences (nr v5)
Taxonomy Filter: 2 Bacteria <bacteria>
Filter option: Blast against a subset of taxonomies
Blast Expectation Value (e-Value): 1.0E-3
Number of Blast Hits: 20
Blast Description Annotator: true
Word Size: 6
Low Complexity Filter: true
HSP Length Cutoff: 33
HSP-Hit Coverage: 0
Filter by Description: No filter
Save results as XML2 files: false
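To make the effect of the thresholds above more concrete, the sketch below applies an e-value limit, an HSP length cutoff, an HSP-hit coverage minimum, and a cap on the number of hits to a list of alignments. The `Hsp` record and the exact semantics (coverage computed as HSP length over hit length, ranking by e-value) are assumptions made for illustration; the application may define them differently.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical minimal record for one BLAST alignment."""
    hit_id: str
    evalue: float
    align_length: int  # HSP length in amino acids
    hit_length: int    # full length of the database sequence


def filter_hits(hsps, max_evalue=1e-3, min_hsp_length=33,
                min_hit_coverage=0.0, max_hits=20):
    """Keep alignments that pass all thresholds, best e-value first."""
    kept = [
        h for h in hsps
        if h.evalue <= max_evalue
        and h.align_length >= min_hsp_length
        and 100.0 * h.align_length / h.hit_length >= min_hit_coverage
    ]
    kept.sort(key=lambda h: h.evalue)
    return kept[:max_hits]


example = [  # invented alignments
    Hsp("hit_a", 1e-50, 250, 300),
    Hsp("hit_b", 1e-2, 200, 300),  # fails the e-value limit
    Hsp("hit_c", 1e-10, 20, 300),  # fails the HSP length cutoff
    Hsp("hit_d", 1e-80, 280, 300),
]
print([h.hit_id for h in filter_hits(example)])
```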
InterProScan
FPrintScan: true
HMMPIR: true
HMMPfam: true
HMMTigr: true
ProfileScan: true
HAMAP: true
PatternScan: false
SuperFamily: true
HMMPanther: true
Gene3D: true
Coils: false
CDD: true
SFLD: true
MobiDBLite: true
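The member-database labels above follow older InterProScan naming. As a rough guide for running an equivalent analysis with the standalone InterProScan 5, the sketch below translates the enabled options into a value for its `-appl` option. The name mapping is an assumption based on the usual InterProScan 5 application names and should be checked against the installed version.

```python
# Label used in this section -> assumed InterProScan 5 application name.
APPLICATION_NAMES = {
    "FPrintScan": "PRINTS",
    "HMMPIR": "PIRSF",
    "HMMPfam": "Pfam",
    "HMMTigr": "TIGRFAM",
    "ProfileScan": "ProSiteProfiles",
    "HAMAP": "Hamap",
    "PatternScan": "ProSitePatterns",
    "SuperFamily": "SUPERFAMILY",
    "HMMPanther": "PANTHER",
    "Gene3D": "Gene3D",
    "Coils": "Coils",
    "CDD": "CDD",
    "SFLD": "SFLD",
    "MobiDBLite": "MobiDBLite",
}

# Settings as listed in this step.
SETTINGS = {
    "FPrintScan": True, "HMMPIR": True, "HMMPfam": True, "HMMTigr": True,
    "ProfileScan": True, "HAMAP": True, "PatternScan": False,
    "SuperFamily": True, "HMMPanther": True, "Gene3D": True, "Coils": False,
    "CDD": True, "SFLD": True, "MobiDBLite": True,
}


def appl_value(settings):
    """Comma-separated application list for the enabled analyses."""
    return ",".join(APPLICATION_NAMES[label]
                    for label, enabled in settings.items() if enabled)


print(appl_value(SETTINGS))
```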
Execution Time
~1 hour.