Long Reads Transcriptome Analysis with SQANTI3

Introduction

SQANTI3 is a bioinformatics tool for the quality control and filtering of full-length transcripts sequenced with PacBio’s long-read technology. It is designed as the next step after the IsoSeq pipeline. Interest in this tool stems from the usefulness of long-read transcriptome sequencing for describing eukaryotic transcriptomes, where it can replace second-generation sequencing: Illumina short reads cannot span a whole transcript and therefore cannot fully characterize eukaryotic transcriptomes.

Dataset Description

The dataset consists of consensus transcripts in FASTA format, obtained after running IsoSeq3 in OmicsBox. This FASTA file contains the full-length transcriptome of the COLO829T melanoma cell line, obtained with long-read sequencing.

  • Organism: Homo sapiens

  • Instrument: PacBio

  • Layout: PacBio Single Molecule, Real-Time (SMRT) Sequencing

Publication

Tseng, E., Galvin, B., Hon, T., Kloosterman, W. P., & Ashby, M. (2019). Full length transcriptome sequencing of melanoma cell line complements long read sequencing assessment of genomic rearrangements.

Abstract

Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes including the consequences of structural rearrangements. Short-read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read sequencing can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements has made working with cancer samples challenging.

Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio Single Molecule, Real-Time (SMRT) Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complimentary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with 3.0 sequencing chemistry typically delivering >30 Gb per SMRT Cell 1M. By integrating long and short read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.

Original Data

Unprocessed long-read data can be obtained from:

https://downloads-ap.pacbcloud.com/public/dataset/Melanoma2019_IsoSeq/subreads/COLO829T/

However, SQANTI3 takes the IsoSeq output as its input, which can be downloaded from this link.

Bioinformatic Analysis

1- Analysis Step

Application

SQANTI3

Input

Parameters

Quality Control Parameters

  • Ignore Transcript ID Nomenclature: False

  • Min. Length of Reference Transcript: 200

  • Skip ORF Prediction: False

  • Set of Splice Sites: ATAC,GCAG,GTAG
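
As an illustration of how the Set of Splice Sites parameter could be read, the following minimal sketch treats each comma-separated entry as a donor/acceptor dinucleotide pair (for example, GTAG = GT donor + AG acceptor) and labels a junction as canonical when its pair is in the set. The function names `parse_sites` and `is_canonical` are hypothetical and are not part of SQANTI3 or OmicsBox.

```python
def parse_sites(sites: str) -> set:
    """Turn a string such as 'ATAC,GCAG,GTAG' into a set of upper-case pairs."""
    return {s.strip().upper() for s in sites.split(",") if s.strip()}


def is_canonical(donor: str, acceptor: str, canonical: set) -> bool:
    """Return True if the donor+acceptor dinucleotides form a listed pair."""
    return (donor + acceptor).upper() in canonical


# Value taken from the parameter list above.
CANONICAL = parse_sites("ATAC,GCAG,GTAG")

print(is_canonical("GT", "AG", CANONICAL))  # GT-AG junction
print(is_canonical("GT", "TT", CANONICAL))  # GT-TT junction
```

Under this reading, any junction whose dinucleotide pair is not listed would be reported as non-canonical.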

Filtering Parameters

  • Filtering: True

  • Adenine Percentage: 0.6

  • Adenines in a Row: 6

  • Distance to Annotated TTS: 50

  • Minimum Short-Read Coverage: 3

  • Filter Mono Exonic Transcripts: False
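
To make the interplay of these thresholds more concrete, here is a rough sketch of a rule-based filter that uses the values listed above. It is an interpretation based on the parameter names, not the code that SQANTI3 actually runs: the `Isoform` fields, the helper names, and the exact order of the rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds copied from the Filtering Parameters list.
MAX_A_FRACTION = 0.6        # Adenine Percentage
A_RUN_LENGTH = 6            # Adenines in a Row
MAX_DIST_TO_TTS = 50        # Distance to Annotated TTS (bp)
MIN_COVERAGE = 3            # Minimum Short-Read Coverage
FILTER_MONO_EXONIC = False  # Filter Mono Exonic Transcripts


@dataclass
class Isoform:
    """Hypothetical per-isoform summary; field names are illustrative only."""
    is_full_splice_match: bool
    exons: int
    frac_a_downstream: float             # fraction of A's downstream of the TTS (0-1)
    seq_downstream: str                  # genomic sequence right after the TTS
    dist_to_annotated_tts: Optional[int] # bp, None if no annotated TTS is known
    all_junctions_canonical: bool
    min_junction_coverage: int           # lowest short-read coverage over all junctions


def looks_intraprimed(iso: Isoform) -> bool:
    """A-rich downstream window, or a run of A's right after the transcript end."""
    a_rich = iso.frac_a_downstream >= MAX_A_FRACTION
    a_run = iso.seq_downstream[:A_RUN_LENGTH].upper() == "A" * A_RUN_LENGTH
    return a_rich or a_run


def keep(iso: Isoform) -> bool:
    """Return True if the isoform passes this sketch of the filter."""
    if looks_intraprimed(iso):
        # Assumption: only a reference-matching isoform that ends close to an
        # annotated TTS is rescued from the intra-priming rule.
        near_known_end = (
            iso.is_full_splice_match
            and iso.dist_to_annotated_tts is not None
            and abs(iso.dist_to_annotated_tts) <= MAX_DIST_TO_TTS
        )
        if not near_known_end:
            return False
    if iso.is_full_splice_match:
        return True
    if iso.exons == 1:
        return not FILTER_MONO_EXONIC
    if iso.all_junctions_canonical:
        return True
    # Non-canonical junctions need short-read support to be trusted.
    return iso.min_junction_coverage >= MIN_COVERAGE
```

In this sketch, a novel multi-exon isoform with a non-canonical junction is kept only when every junction is covered by at least three short reads, while an A-rich transcript end is tolerated only for an isoform that matches the reference and ends within 50 bp of an annotated TTS.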

Execution Time

Approx. 90 minutes.

Output

Workflow

The long-read transcriptomics submodule takes the subreads or CCS BAM files from PacBio sequencing as input, transforms them into consensus transcripts, and then analyzes and quality-checks the resulting transcriptome.