TY - JOUR
T1 - A high-resolution single-molecule sequencing-based Arabidopsis transcriptome using novel methods of Iso-seq analysis
AU - Zhang, Runxuan
AU - Kuo, Richard
AU - Coulter, Max
AU - Calixto, Cristiane P. G.
AU - Entizne, Juan Carlos
AU - Guo, Wenbin
AU - Marquez, Yamile
AU - Milne, Linda
AU - Riegler, Stefan
AU - Matsui, Akihiro
AU - Tanaka, Maho
AU - Harvey, Sarah
AU - Gao, Yubang
AU - Wießner-Kroh, Theresa
AU - Paniagua, Alejandro
AU - Crespi, Martin
AU - Denby, Katherine
AU - Hur, Asa ben
AU - Huq, Enamul
AU - Jantsch, Michael
AU - Jarmolowski, Artur
AU - Koester, Tino
AU - Laubinger, Sascha
AU - Li, Qingshun Quinn
AU - Gu, Lianfeng
AU - Seki, Motoaki
AU - Staiger, Dorothee
AU - Sunkar, Ramanjulu
AU - Szweykowska-Kulinska, Zofia
AU - Tu, Shih Long
AU - Wachter, Andreas
AU - Waugh, Robbie
AU - Xiong, Liming
AU - Zhang, Xiao Ning
AU - Conesa, Ana
AU - Reddy, Anireddy S. N.
AU - Barta, Andrea
AU - Kalyna, Maria
AU - Brown, John W. S.
N1 - Funding Information:
This work was jointly supported by funding from the Biotechnology and Biological Sciences Research Council (BBSRC) BB/P009751/1 to JB; BB/R014582/1 to RW and RZ; BB/S020160/1 to RZ; BB/S004610/1 (16 ERA-CAPS BARN) to RW; the Scottish Government Rural and Environment Science and Analytical Services division (RESAS) [to RZ, RW, and JB]; the National Science Foundation (MCB-2014408) and the National Institutes of Health (NIH) (GM-114297) to E.H.; S. H. was supported by funding to K.D. from the University of York; the Austrian Science Fund (FWF) SFB F43 to AB and MJ and [P26333] to MK; the French Agence Nationale de la Recherche grant ANR-16-CE12-0032 to MC; the Japan Science and Technology Agency (JST), the Core Research for Evolutionary Science and Technology (CREST; Grant Number JPMJCR13B4) to M.S.; the National Science Foundation (Grant No. DBI1949036 to A.b.H. and A.S.N.R., and Grant No. MCB 2014542 to E.H. and A.S.N.R.); the DOE Office of Science, Office of Biological and Environmental Research (Grant No. DE-SC0010733) to A.S.N.R. and A.b.H.; the Deutsche Forschungsgemeinschaft (DFG) STA653/14-1 and STA653/15-1 to DS; the National Science Foundation grant (IOS-154173) to Q.Q.L.; the German Research Foundation (DFG) WA2167/8-1 to AW and SFB1101/C03 to AW and TWK; the Research Grants Council (RGC) of Hong Kong (GRF 12103020) to LX; NSF grant IOS-1849708 and NSF EPSCoR grant 1826836 to RS; and the Academia Sinica to S.-L. T.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
AB - Background: Accurate and comprehensive annotation of transcript sequences is essential for transcript quantification and differential gene and transcript expression analysis. Single-molecule long-read sequencing technologies provide improved integrity of transcript structures including alternative splicing, and transcription start and polyadenylation sites. However, accuracy is significantly affected by sequencing errors, mRNA degradation, or incomplete cDNA synthesis. Results: We present a new and comprehensive Arabidopsis thaliana Reference Transcript Dataset 3 (AtRTD3). AtRTD3 contains over 169,000 transcripts, twice as many as the best current Arabidopsis transcriptome, and includes over 1500 novel genes. Seventy-eight percent of transcripts are from Iso-seq with accurately defined splice junctions and transcription start and end sites. We develop novel methods to determine splice junctions and transcription start and end sites accurately. Mismatch profiles around splice junctions provide a powerful feature to distinguish correct splice junctions and remove false splice junctions. Stratified approaches identify high-confidence transcription start and end sites and remove fragmentary transcripts due to degradation. AtRTD3 is a major improvement over existing transcriptomes as demonstrated by analysis of an Arabidopsis cold response RNA-seq time-series. AtRTD3 provides higher resolution of transcript expression profiling and identifies cold-induced differential transcription start and polyadenylation site usage. Conclusions: AtRTD3 is currently the most comprehensive Arabidopsis transcriptome. It improves the precision of differential gene and transcript expression, differential alternative splicing, and transcription start/end site usage analysis from RNA-seq data. The novel methods for identifying accurate splice junctions and transcription start/end sites are widely applicable and will improve single-molecule sequencing analysis from any species.
KW - Alternative polyadenylation
KW - Alternative splicing
KW - Arabidopsis
KW - Iso-seq
KW - Reference transcript dataset
KW - Splice junction
KW - Transcription start and end sites
UR - http://www.scopus.com/inward/record.url?scp=85133563650&partnerID=8YFLogxK
U2 - 10.1186/s13059-022-02711-0
DO - 10.1186/s13059-022-02711-0
M3 - Journal article
C2 - 35799267
AN - SCOPUS:85133563650
SN - 1474-7596
VL - 23
JO - Genome Biology
JF - Genome Biology
M1 - 149
ER -