Long Read Prokaryotic Analysis

Introduction

Genome analysis of Bacillus subtilis.

Dataset Description

DNA sequencing data from different Bacillus subtilis experiments. The dataset comprises a long-read sequencing library generated with PacBio technology and a short-read sequencing library generated with Illumina technology.

  • Organism: Bacillus subtilis.

  • Instrument: Illumina HiSeq 1500 and PacBio RS II.

  • Layout: Paired-end (Illumina) and long single reads (PacBio).

Publication

Borriss R, Danchin A, Harwood CR, Médigue C, Rocha EPC, Sekowska A, Vallenet D. Bacillus subtilis, the model Gram-positive bacterium: 20 years of annotation refinement. Microb Biotechnol. 2018 Jan;11(1):3-17. doi: 10.1111/1751-7915.13043. PMID: 29280348; PMCID: PMC5743806.

Abstract

Genome annotation is, nowadays, performed via automatic pipelines that cannot discriminate between right and wrong annotations. Given their importance in increasing the accuracy of the genome annotations of other organisms, it is critical that the annotations of model organisms reflect the current annotation gold standard. The genome of Bacillus subtilis strain 168 was sequenced twenty years ago. Using a combination of inductive, deductive and abductive reasoning, we present a unique, manually curated annotation, essentially based on experimental data. This reveals how this bacterium lives in a plant niche, while carrying a paleome operating system common to Firmicutes and Tenericutes. Dozens of new genomic objects and an extensive literature survey have been included for the sequence available at the INSDC (AccNum AL009126.3). We also propose an extension to Demerec's nomenclature rules that will help investigators connect to this type of curated annotation via the use of common gene names.

Note that each sequencing sample comes from a different experiment. The article cited here describes the latest advances in the annotation of the B. subtilis genome, but it is not related to any of the sequencing libraries in this dataset.

Original Data

Bioinformatic Analysis

1- DNA-Seq de novo Assembly

Application

DNA-Seq de novo Assembly (Flye).

Input

Parameters

  • Input Reads: [PacBio Raw] SRR7498042.fastq.gz

  • Provide Genome Size: false

  • Automatic Minimum Overlap: true

  • Polishing: true

  • Number of Polishing Iterations: 1

  • Plasmids: false

  • Keep Haplotypes: false

  • Trestle: true

  • Assembly Fasta: assembly.fasta

  • Save Graph File: true

  • Graph File: assembly_graph.gfa
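
For readers who want to reproduce this step outside the graphical application, the settings above can be approximated with the Flye command-line tool. The following is a minimal sketch, not the application's actual invocation: the mapping from the listed parameters to Flye flags, the output directory name and the thread count are assumptions, and the --trestle flag is only accepted by some Flye releases. Genome size and minimum overlap are left out because the settings above let Flye choose them.

```python
import shlex


def flye_command(reads, out_dir, iterations=1, trestle=True, threads=4):
    """Build an argv list for a Flye assembly of raw PacBio reads.

    Assumed mapping of the settings listed above:
      Input Reads [PacBio Raw]       -> --pacbio-raw
      Number of Polishing Iterations -> --iterations
      Trestle                        -> --trestle (version dependent)
    """
    cmd = [
        "flye",
        "--pacbio-raw", reads,
        "--out-dir", out_dir,          # hypothetical directory name
        "--iterations", str(iterations),
        "--threads", str(threads),     # assumption, not stated in the text
    ]
    if trestle:
        cmd.append("--trestle")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # on machines where Flye is not installed.
    print(shlex.join(flye_command("SRR7498042.fastq.gz", "flye_out")))
```

Flye writes assembly.fasta and assembly_graph.gfa into the output directory, which matches the file names listed above.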

Execution Time

30-40 minutes.

Output

2- DNA-Seq Alignment

Application

DNA-Seq Alignment (BWA).

Input

Parameters

  • Paired-End: ERR2935851_1.fastq.gz and ERR2935851_2.fastq.gz

  • Upstream Files Pattern: _1

  • Downstream Files Pattern: _2

  • Genome Sequences: assembly.fasta

  • Minimum Seed Length: 19

  • Band Width: 100

  • Z-dropoff: 100

  • Trigger Re-seeding: 1.5

  • Seed Occurrence: 20

  • Skip Seeds: 500

  • Drop Chains: 0.5

  • Discard Chains: 0

  • Mate Rescue Rounds: 50

  • Skip Mate Rescue: false

  • Skip Pairing: false

  • Matching Score: 1

  • Mismatch Penalty: 4

  • Gap Open Penalty (DEL): 6

  • Gap Open Penalty (INS): 6

  • Gap Extension Penalty (DEL): 1

  • Gap Extension Penalty (INS): 1

  • 5'-end Clipping Penalty: 5

  • 3'-end Clipping Penalty: 5

  • Unpaired Read Penalty: 17

  • Minimum Score: 30

  • Split Alignments as Primary: false

  • MapQ of Supp. Alignments: false

  • Output All Alignments: false

  • Soft Clipping for Supp.: false

  • Shorter Split Hits as Secondary: false

  • Sort BAM File: By Coordinates

  • Add Read Group Information: false

  • Output BAMs: alignments
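
The numeric alignment settings above correspond closely to the options of bwa mem. As a rough sketch, assuming the flag mapping given in the comments (it is inferred from the parameter names and is not confirmed by the application), an equivalent command line could be assembled like this. The reference must first be indexed with bwa index, and coordinate sorting would be a separate step (for example with samtools sort); neither is shown here.

```python
import shlex

# Assumed mapping from the settings listed above to bwa mem flags.
BWA_MEM_OPTIONS = {
    "-k": "19",    # Minimum Seed Length
    "-w": "100",   # Band Width
    "-d": "100",   # Z-dropoff
    "-r": "1.5",   # Trigger Re-seeding
    "-y": "20",    # Seed Occurrence
    "-c": "500",   # Skip Seeds
    "-D": "0.5",   # Drop Chains
    "-W": "0",     # Discard Chains
    "-m": "50",    # Mate Rescue Rounds
    "-A": "1",     # Matching Score
    "-B": "4",     # Mismatch Penalty
    "-O": "6,6",   # Gap Open Penalty (DEL, INS)
    "-E": "1,1",   # Gap Extension Penalty (DEL, INS)
    "-L": "5,5",   # 5'- and 3'-end Clipping Penalty
    "-U": "17",    # Unpaired Read Penalty
    "-T": "30",    # Minimum Score
}


def bwa_mem_command(reference, reads_1, reads_2, threads=4):
    """Build an argv list for a paired-end bwa mem run."""
    cmd = ["bwa", "mem", "-t", str(threads)]  # thread count is an assumption
    for flag, value in BWA_MEM_OPTIONS.items():
        cmd += [flag, value]
    # Boolean settings that are false above (-S, -P, -5, -q, -a, -Y, -M)
    # are simply omitted.
    cmd += [reference, reads_1, reads_2]
    return cmd


if __name__ == "__main__":
    print(shlex.join(bwa_mem_command(
        "assembly.fasta", "ERR2935851_1.fastq.gz", "ERR2935851_2.fastq.gz")))
```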

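3- Assembly Polishing

Application

Polishing of the assembly with the Illumina alignments (the parameters below match the options of Pilon).

Parameters

  • Input Fasta: assembly.fasta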

  • Input BAMs: ERR2935851.bam

  • Diploid: false

  • Issues to Fix: snps,indels,gaps,local

  • Duplicates: false

  • IUPAC: false

  • Failed Sequencer Quality: false

  • Default Quality: 10

  • Flank: 10

  • Gap Margin: 100000

  • K-mer Size: 47

  • Minimum Depth: 0.1

  • Unclosed Gaps: 10

  • Minimum Mapping Quality: 0

  • Minimum Base Quality: 0

  • Skip Stray Pairs Identification: false

  • Output FASTA: polished_sequences.fasta

  • Save Changes: true

  • Output Changes: changes.txt
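
The polishing parameters above (from Input Fasta to Output Changes) match the command-line options of Pilon. Assuming Pilon is the underlying tool and that a pilon launcher script is on the PATH (both are assumptions), the settings could be expressed as the following sketch. Pilon derives its output names from the --output prefix, so the files would be named polished_sequences.fasta and polished_sequences.changes rather than changes.txt.

```python
import shlex


def pilon_command(genome, bam, output_prefix):
    """Build an argv list for a Pilon polishing run.

    Boolean settings that are false above (Diploid, Duplicates, IUPAC,
    Failed Sequencer Quality, Skip Stray Pairs Identification) are omitted.
    """
    return [
        "pilon",                           # or: java -jar pilon.jar
        "--genome", genome,                # Input Fasta
        "--frags", bam,                    # Input BAMs (paired-end library)
        "--output", output_prefix,         # prefix for the output files
        "--changes",                       # Save Changes
        "--fix", "snps,indels,gaps,local", # Issues to Fix
        "--defaultqual", "10",             # Default Quality
        "--flank", "10",                   # Flank
        "--gapmargin", "100000",           # Gap Margin
        "--K", "47",                       # K-mer Size
        "--mindepth", "0.1",               # Minimum Depth
        "--mingap", "10",                  # Unclosed Gaps
        "--minmq", "0",                    # Minimum Mapping Quality
        "--minqual", "0",                  # Minimum Base Quality
    ]


if __name__ == "__main__":
    print(shlex.join(pilon_command(
        "assembly.fasta", "ERR2935851.bam", "polished_sequences")))
```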

Execution Time

10-15 minutes.

Output

4- Gene Finding

Application

Prokaryotic Gene Finding by Glimmer.

Input

Parameters

  • Input Sequences: polished_sequences.fasta

  • Select the genetic code: The Bacterial, Archaeal and Plant Plastid Code

  • Minimum gene length: 110

  • Maximum gene overlap: 30

  • Minimum gene score: 30

  • Select the genome shape: Circular

  • Choose ICM use option: Create new ICM model

  • Set advanced ICM parameters: false

  • Save ICM model: false

  • Select the run model: Iterated

  • Define the start codons: false

  • Define the stop codons: false

  • Define GC content: false
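
The gene-finding thresholds above map onto options of the glimmer3 executable. The sketch below builds only the final prediction call; it assumes an ICM model has already been trained (with long-orfs and build-icm, as the scripts shipped with Glimmer do) and uses hypothetical names for the ICM file and the output tag. The iterated run mode, which retrains on a first round of predictions, is not reproduced here.

```python
import shlex


def glimmer3_command(sequences, icm, tag):
    """Build an argv list for a glimmer3 prediction run.

    Assumed mapping of the settings listed above:
      Minimum gene length 110 -> -g 110
      Maximum gene overlap 30 -> -o 30
      Minimum gene score 30   -> -t 30
      Genetic code (table 11) -> -z 11
      Genome shape Circular   -> default (no -l / --linear flag)
    """
    return [
        "glimmer3",
        "-g", "110",
        "-o", "30",
        "-t", "30",
        "-z", "11",
        sequences,   # input FASTA
        icm,         # hypothetical ICM file name
        tag,         # hypothetical prefix for the output files
    ]


if __name__ == "__main__":
    print(shlex.join(glimmer3_command(
        "polished_sequences.fasta", "run.icm", "run")))
```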

Execution Time

About 5 minutes.

Output

5- BLAST & InterProScan

Application

CloudBLAST & InterProScan Annotation.

Input

Parameters

CloudBlast
  • Blast Program: blastx-fast

  • Blast DB: Non-redundant protein sequences (nr v5)

  • Taxonomy Filter: 186817 Bacillaceae

  • Filter option: Blast against a subset of taxonomies

  • Blast Expectation Value (e-Value): 1.0E-3

  • Number of Blast Hits: 20

  • Blast Description Annotator: true

  • Word Size: 6

  • Low Complexity Filter: true

  • HSP Length Cutoff: 33

  • HSP-Hit Coverage: 0

  • Filter by Description: No filter

  • Save results as XML2 files: false

  • Blast Program: blastx-fast

  • Blast DB: Non-redundant protein sequences (nr v5)

  • Taxonomy Filter: 2 Bacteria <bacteria>

  • Filter option: Blast against a subset of taxonomies

  • Blast Expectation Value (e-Value): 1.0E-3

  • Number of Blast Hits: 20

  • Blast Description Annotator: true

  • Word Size: 6

  • Low Complexity Filter: true

  • HSP Length Cutoff: 33

  • HSP-Hit Coverage: 0

  • Filter by Description: No filter

  • Save results as XML2 files: false
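
To make the effect of the hit filters concrete, the sketch below applies an e-value threshold of 1.0E-3, an HSP length cutoff of 33 and a limit of 20 hits to a list of hits for one query. It is an illustration only: the Hit record, the order in which the filters are applied and the ranking by e-value are assumptions, not a description of how CloudBlast works internally.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit of a query sequence."""
    subject: str
    evalue: float
    hsp_length: int


def filter_hits(hits, max_evalue=1e-3, min_hsp_length=33, max_hits=20):
    """Keep hits passing the e-value and HSP length cutoffs, best first."""
    kept = [
        h for h in hits
        if h.evalue <= max_evalue and h.hsp_length >= min_hsp_length
    ]
    kept.sort(key=lambda h: h.evalue)  # lowest e-value first
    return kept[:max_hits]


if __name__ == "__main__":
    example = [
        Hit("protein_a", 1e-50, 210),
        Hit("protein_b", 5e-2, 180),   # fails the e-value threshold
        Hit("protein_c", 1e-10, 20),   # fails the HSP length cutoff
        Hit("protein_d", 1e-4, 33),
    ]
    for hit in filter_hits(example):
        print(hit.subject, hit.evalue, hit.hsp_length)
```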

InterProScan
  • FPrintScan: true

  • HMMPIR: true

  • HMMPfam: true

  • HMMTigr: true

  • ProfileScan: true

  • HAMAP: true

  • PatternScan: false

  • SuperFamily: true

  • HMMPanther: true

  • Gene3D: true

  • Coils: false

  • CDD: true

  • SFLD: true

  • MobiDBLite: true
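
The application names above follow an older InterProScan naming scheme. As a sketch, the enabled entries can be translated into the analysis names used by InterProScan 5 and joined into the value of its -appl option. The name mapping is an assumption based on the member databases each legacy name refers to, and the available analyses vary between InterProScan releases (for example, TIGRFAM has been renamed in recent ones).

```python
# Settings as listed above.
SETTINGS = {
    "FPrintScan": True, "HMMPIR": True, "HMMPfam": True, "HMMTigr": True,
    "ProfileScan": True, "HAMAP": True, "PatternScan": False,
    "SuperFamily": True, "HMMPanther": True, "Gene3D": True,
    "Coils": False, "CDD": True, "SFLD": True, "MobiDBLite": True,
}

# Assumed mapping from the legacy names to InterProScan 5 analysis names.
LEGACY_TO_IPS5 = {
    "FPrintScan": "PRINTS",
    "HMMPIR": "PIRSF",
    "HMMPfam": "Pfam",
    "HMMTigr": "TIGRFAM",
    "ProfileScan": "ProSiteProfiles",
    "HAMAP": "Hamap",
    "PatternScan": "ProSitePatterns",
    "SuperFamily": "SUPERFAMILY",
    "HMMPanther": "PANTHER",
    "Gene3D": "Gene3D",
    "Coils": "Coils",
    "CDD": "CDD",
    "SFLD": "SFLD",
    "MobiDBLite": "MobiDBLite",
}


def appl_value(settings):
    """Return the comma-separated analysis list for the enabled entries."""
    return ",".join(
        LEGACY_TO_IPS5[name] for name, enabled in settings.items() if enabled
    )


if __name__ == "__main__":
    # The result would be passed as: interproscan.sh -appl <value> ...
    print(appl_value(SETTINGS))
```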