fastq格式与fasta格式相同的一点在于它们都是文本格式,不同之处在于前者提供了测序质量等更多的信息,而后者仅仅能提供序列(前者也提供序列)。

FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Both the sequence letter and quality score are encoded with a single ASCII character for brevity. It was originally developed at the Wellcome Trust Sanger Institute to bundle a FASTA sequence and its quality data, but has recently become the de facto standard for storing the output of high throughput sequencing instruments such as the Illumina Genome Analyzer.[1]

sra格式则为二进制格式,由NCBI Short Read Archive提供,为高通量测序/下一代测序的“原始记录”。由于采用二进制方式进行压缩,该格式占用空间药效,可能限于经费的原因,NCBI SRA数据库现在仅提供了sra数据的下载。不过同时NCBI提供了SRA toolkit来进行sra格式->fastq格式(sam格式等)的转化。sra格式内部(貌似)以XML database的格式存储数据。

The Sequence Read Archive or Short Read Archive is a bioinformatics database and a collaboration between the European Bioinformatics Institute, the National Center for Biotechnology Information, and the DNA Data Bank of Japan. It provides a public repository for the “short reads” generated by High-throughput sequencing.

据笔者目测,一个约700M的sra文档,经fastq-dump转化后的大小在2.4G左右。

fastq格式示例

 

sra格式的内容 - 二进制文件

 

作者简介

Chun-Hui Gao is a Research Associate at Huazhong Agricultural University.

重复使用

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT. The full source is available at https://github.com/yihui/hugo-prose.

欢迎修订

如果您发现本文里含有任何错误(包括错别字和标点符号),欢迎在本站的 GitHub 项目里提交修订意见。

引用本文

如果您使用了本文的内容,请按照以下方式引用:

gaoch (2012). fastq format – sra format. BIO-SPRING. /post/2012/08/21/2012-08-21-fastq-format-sra-format/

BibTeX citation

@misc{
  title = "fastq format – sra format",
  author = "gaoch",
  year = "2012",
  journal = "BIO-SPRING",
  note = "/post/2012/08/21/2012-08-21-fastq-format-sra-format/"
}