SRA format (the format of SRA files)
http://www.ebi.ac.uk/ena/about/sra_format
Read metadata format
Metadata is represented using XML documents. For detailed information about the metadata XMLs, please refer to SRA XML 1.5 metadata format. For examples of how to prepare the XMLs, please refer to Preparing SRA XML metadata. The following metadata objects are used:
Metadata object | Description | Example |
Study | A study groups together data submitted to the archive. Please use the study accession number when citing data submitted into ENA. | ERP000016 |
Submission | A submission contains submission actions to be performed by the archive. A submission can add more objects to the archive, update already submitted objects or make objects publicly available. | ERA000092 |
Sample | A sample contains information about the sequenced samples. Samples are associated with checklists, which define the attributes used to annotate the samples, and experiments or analysis objects. | ERS000081 |
Experiment | An experiment contains information about the sequencing experiments including library and instrument detail. | ERX000398 |
Run | Runs are part of experiments and contain sequencing reads submitted in data files (e.g. BAM or CRAM). Each run can contain all or part of the results for a particular experiment. | ERR003990 |
Analysis | An analysis contains secondary analysis results computed from the primary sequencing results (e.g. VCFs with sequence variations or BAMs with sequence alignments). | ERZ000001 |
DAC | A European Genome-phenome Archive (EGA) data access committee (DAC). Required for authorized access submissions. | EGAC00001000001 |
Policy | A European Genome-phenome Archive (EGA) data access policy. Required for authorized access submissions. | EGAP00001000001 |
Dataset | A European Genome-phenome Archive (EGA) data set. Required for authorized access submissions. | EGAD00001000001 |
Accession number format
Each metadata object is assigned a unique accession number by the archive. The accession numbers can be used to retrieve data and metadata using the EB-Eye search available at the top of all EBI web pages or using the free text search available on the ENA home page. The metadata is then retrieved and displayed through the ENA Browser as in the examples in the above table.
Accession numbers associated with read data start with 'ER' when assigned by EBI, and with 'SR' and 'DR' when assigned by NCBI and DDBJ, respectively. The third letter of the accession number indicates the type of the metadata object. EGA accession numbers start with 'EGA', with the fourth letter indicating the type of the metadata object.
The accession numbers have a fixed number of digits after the letters: six for ENA and eleven for EGA.
Metadata object | Accession prefix | Number of digits | Example |
Submission | ERA, SRA, DRA | 6 | ERA000092 |
Sample | ERS, SRS, DRS | 6 | ERS000081 |
Study | ERP, SRP, DRP | 6 | ERP000016 |
Experiment | ERX, SRX, DRX | 6 | ERX000398 |
Run | ERR, SRR, DRR | 6 | ERR003990 |
Analysis | ERZ, SRZ, DRZ | 6 | ERZ000001 |
EGA Submission | EGA | 11 | EGA00001000001 |
EGA Sample | EGAN | 11 | EGAN00001000001 |
EGA Study | EGAS | 11 | EGAS00001000001 |
EGA Experiment | EGAX | 11 | EGAX00001000001 |
EGA Run | EGAR | 11 | EGAR00001000001 |
EGA Analysis | EGAZ | 11 | EGAZ00001000001 |
EGA DAC | EGAC | 11 | EGAC00001000001 |
EGA Policy | EGAP | 11 | EGAP00001000001 |
EGA Data Set | EGAD | 11 | EGAD00001000001 |
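The prefix and digit rules in the table above can be checked mechanically. The following is a minimal sketch, assuming the digit counts are exactly as stated in this section (six for ENA, NCBI and DDBJ; eleven for EGA). The function name `classify_accession` and the lookup tables are illustrative and not part of any ENA tool.

```python
import re

# First letter of a read-data accession -> assigning archive (from the text above).
SRA_ARCHIVES = {"E": "EBI", "S": "NCBI", "D": "DDBJ"}
# Third letter -> metadata object type (from the table above).
SRA_TYPES = {"A": "Submission", "S": "Sample", "P": "Study",
             "X": "Experiment", "R": "Run", "Z": "Analysis"}
# Fourth letter of an EGA accession -> object type.
# The table lists EGA submissions with the bare prefix 'EGA', so "" maps to them.
EGA_TYPES = {"": "EGA Submission", "N": "EGA Sample", "S": "EGA Study",
             "X": "EGA Experiment", "R": "EGA Run", "Z": "EGA Analysis",
             "C": "EGA DAC", "P": "EGA Policy", "D": "EGA Data Set"}

# Assumption: exactly 6 digits and exactly 11 digits, as this section states.
SRA_RE = re.compile(r"([ESD])R([ASPXRZ])(\d{6})")
EGA_RE = re.compile(r"EGA([NSXRZCPD]?)(\d{11})")


def classify_accession(accession):
    """Return (archive, object type) for an accession, or None if it does not match."""
    m = SRA_RE.fullmatch(accession)
    if m:
        return SRA_ARCHIVES[m.group(1)], SRA_TYPES[m.group(2)]
    m = EGA_RE.fullmatch(accession)
    if m:
        return "EGA", EGA_TYPES[m.group(1)]
    return None


print(classify_accession("ERR003990"))
print(classify_accession("EGAD00001000001"))
```

Accessions that do not follow the table as written, for example ones with a different number of digits, return `None` in this sketch; loosen the digit quantifiers if that is too strict for your data.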
Archive generated fastq file format
Once made public, data submitted to ENA are available for download using FTP and Aspera. Detailed data download instructions are available here. Currently, both submitted data files and archive generated fastq files are made available for download. The naming and format of the generated fastq files are described below.
In general, one fastq file is created for each application read in a run. Please refer to the table below for full details:
Number of application reads | Fastq Files | Description |
1 | <run accession>.fastq.gz | For experiments with single application reads only, all reads will be made available in one fastq file. |
2 | <run accession>_1.fastq.gz <run accession>_2.fastq.gz <run accession>.fastq.gz | For paired experiments with two application reads, the reads will be made available in 1-3 fastq files. If a paired experiment is submitted with both application reads, then the first reads will be in <run accession>_1.fastq.gz, the second reads will be in <run accession>_2.fastq.gz, and any unpaired reads will be in <run accession>.fastq.gz. If a paired experiment is submitted containing only unpaired reads, then only a single file will be created: <run accession>.fastq.gz. |
> 2 | <run accession>_N.fastq.gz | For experiments with more than two application reads (e.g. Complete Genomics or strobed PacBio), one fastq file is created for each application read; however, no empty fastq files are created. |
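The naming rules in the table above can be sketched as a small helper that lists the file names which may exist for a run. The function name `candidate_fastq_files` is hypothetical, and reading `N` as a 1-based read index is an assumption. The result is a list of candidates only: paired runs yield between one and three of these files, and no empty files are created.

```python
def candidate_fastq_files(run_accession, n_application_reads):
    """List the fastq file names that may be generated for a run."""
    if n_application_reads < 1:
        raise ValueError("a run needs at least one application read")
    if n_application_reads == 1:
        # Single layout: all reads in one file.
        return [f"{run_accession}.fastq.gz"]
    if n_application_reads == 2:
        # Paired layout: first reads, second reads, and any unpaired reads.
        return [
            f"{run_accession}_1.fastq.gz",
            f"{run_accession}_2.fastq.gz",
            f"{run_accession}.fastq.gz",
        ]
    # More than two application reads: one file per read.
    # Assumption: N counts application reads starting from 1.
    return [f"{run_accession}_{n}.fastq.gz"
            for n in range(1, n_application_reads + 1)]


print(candidate_fastq_files("ERR005143", 2))
```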
The fastq file format is:
@<run accession>.<spot index> <spot name>[/<read index>]
<bases>
+
<phred qualities, ASCII encoded starting with '!' (33)>
Field | Description |
<run accession> | The Run accession. A spot is identified uniquely by the combination of the Run accession and the spot index. |
<spot index> | A positive integer assigned to the spots in the order in which they appear in the run. A spot is identified uniquely by the combination of the Run accession and the spot index. |
<spot name> | The spot name as it was provided by the submitter. |
<read index> | A positive integer assigned to the application reads in the order in which they appear in the spot: /1 for first application read and /2 for the second application read. |
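The header fields described above can be pulled apart with a regular expression. This is a rough sketch rather than an official parser: the names `ReadHeader` and `parse_header` are made up for illustration, and it assumes the run accession consists of upper-case letters followed by digits. A submitter-provided spot name that itself ends in `/<digits>` cannot be told apart from a read index and would be misread.

```python
import re
from typing import NamedTuple, Optional


class ReadHeader(NamedTuple):
    run_accession: str
    spot_index: int
    spot_name: str
    read_index: Optional[int]  # None when the header has no /<read index> suffix


# @<run accession>.<spot index> <spot name>[/<read index>]
HEADER_RE = re.compile(
    r"@(?P<run>[A-Z]+\d+)\.(?P<spot>\d+) (?P<name>.+?)(?:/(?P<read>\d+))?"
)


def parse_header(line: str) -> ReadHeader:
    """Split the first line of a generated fastq record into its fields."""
    m = HEADER_RE.fullmatch(line.rstrip("\n"))
    if m is None:
        raise ValueError(f"not a recognised fastq header: {line!r}")
    read = m.group("read")
    return ReadHeader(
        run_accession=m.group("run"),
        spot_index=int(m.group("spot")),
        spot_name=m.group("name"),
        read_index=int(read) if read is not None else None,
    )


print(parse_header("@ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/2"))
```

The pair (`run_accession`, `spot_index`) identifies the spot, so both reads of a pair share it and differ only in `read_index`.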
Single layout example:
@ERR000017.1 IL6_554:7:1:249:322
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
??????????????????????????????>>>>>>
Paired example (first read):
@ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
Paired example (second read):
@ERR005143.1 ID49_20708_20H04AAXX_R1:7:1:41:356/2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
SOLiD color example:
The first base is included before the SOLiD colors.
@ERR000451.1 VAB_S0103_20080915_542_14_17_70_F3
T33023230203102103223330020300233001
+
T%245719<.6353&:%0#$1%&%2(--27*%&%,
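The quality line of each record is described above as Phred qualities, ASCII encoded starting with '!' (33). A minimal sketch of decoding such a string follows. The function names are illustrative, and the error-probability conversion uses the standard Phred definition P = 10^(-Q/10), which this page does not state itself.

```python
def decode_phred33(quality):
    """Convert an ASCII quality string (offset 33) to a list of Phred scores."""
    scores = [ord(ch) - 33 for ch in quality]
    if any(q < 0 for q in scores):
        raise ValueError("quality string contains characters below '!'")
    return scores


def error_probability(phred):
    """Standard Phred definition: probability that the base call is wrong."""
    return 10 ** (-phred / 10)


# '!' is the lowest score (0); '?' and '>' appear in the single layout example.
print(decode_phred33("!?>"))
```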