BioPerl has a Genbank –> GFF converter (the script is bp_genbank2gff3.pl), however, this converter generated GFF/GTF file could not work in Bioconductor when using the function maketxdbfromGFF to construct a transcript database.

Currently, I have developed such a converter that may extract information from a NCBI Genbank file and write them to GFF version 2, or GTF format. This script is derived from BioPerl, so you need firstly get the module installed in your system. To check whether BioPerl is ready, just refer to this post: http://bio-spring.top/check-bioperl/ .

Here is the synopsis of this script.

SYNOPSIS

NCBI_genbank2gtf.pl [options] filename

Options:
–help -h display this message
–input -i NCBI GenBank file
–out -o filename of GTF output

Examples:

perl NCBI_genbank2gtf.pl -i seq.gb -o seq.gtf
perl NCBI_genbank2gtf.pl seq.gb > seq.gtf

The codes is as follows:
#!/usr/bin/perl

=pod

=head1 NAME

NCBI_genbank2gtf.pl – Genbank -> Bioconductor friendly GTF

=head1 SYNOPSIS

NCBI_genbank2gtf.pl [options] filename

Options:
–help -h display this message
–input -i NCBI GenBank file
–out -o filename of GTF output

Examples:

perl NCBI_genbank2gtf.pl -i seq.gb -o seq.gtf
perl NCBI_genbank2gtf.pl seq.gb > seq.gtf

=head1 DESCRIPTION

Use this program to generate bioconductor friendly GTF files from NCBI GenBank.
The file may be used in *maketxdbfromGFF* function and should work well.

=head1 AUTHOR

Chun-Hui, Gao (gaoch@thelifesciencecentury.com)

Copyright (c) 2015 www.thelifesciencecentury.com.

=cut

use strict;
use Data::Dumper;
use Pod::Usage;
use Bio::SeqIO;
use Getopt::Long;

my ($help, $genbank_input, $gtf_output);

my $ok= GetOptions( ‘help|h’ => \$help,
‘input|i=s’ => \$genbank_input,
‘out|o=s’ => \$gtf_output );
pod2usage(2) if $help || !$ok;

$genbank_input = shift @ARGV unless ($genbank_input );

open *IN, “< $genbank_input " or die pod2usage(2); my $out; if ($gtf_output) { open $out, “> $gtf_output” or die “Can’t open file $gtf_output:$@\n”;
}
else {
$out = *STDOUT;
}

my %seen;
my $in = Bio::SeqIO->new(-fh => \*IN, -format => “genbank”);
while ( my $seq = $in->next_seq() ) {
my $seq_acc = $seq->accession_number;

# abort if there are no features
warn “$seq_acc has no features, skipping\n” and next
if !$seq->all_SeqFeatures;

# construct a GFF header
# add: get source_type from attributes of source feature? chromosome=X tag
# also combine 1st ft line here with source ft from $seq ..
my $header = gff_header($seq);
print $out $header;
warn “# working on $seq_acc\n”;

for my $feature ( $seq->get_all_SeqFeatures ) {

my $method = $feature->primary_tag;
$seen{$method} ++;
$feature = maptags2gff($feature);
print $out $feature->gff_string(), “\n”;
if ($method eq ‘CDS’){
$feature->primary_tag(“exon”);
$feature->add_tag_value(“exon_number”, 1);
$feature->add_tag_value(“exon_id”, $feature->get_tag_values(“transcript_id”));
$feature = maptags2gff($feature);
print $out $feature->gff_string(), “\n”;
}

}

}

warn “# Count of features in $genbank_input:\n”;
map {warn sprintf(”# %20s:%6d\n", $_, $seen{$_});} keys %seen;

sub gff_header {
my ($seq) = @_;
my $head = sprintf("##gff-version 2\n# sequence-region: %s (1..%d)\n", $seq->accession_number, $seq->length);
$head .= sprintf( “# %s\n”, $seq->desc);
$head .= “# converted by NCBI_genbank2gtf.pl\n”;

#~ print Dumper $acc, $desc, $end;
return $head;

}

sub maptags2gff {
my $f = shift;
my $method = $f->primary_tag;
my %TAG_MAP;
if ($method eq ‘gene’ || $method eq ‘misc_RNA’ || $method eq ‘rRNA’ || $method eq ’tRNA’){
$f->add_tag_value(’transcript_id’,$f->get_tag_values(’locus_tag’));
%TAG_MAP = ( gene => ‘gene_name’,
transcript_id => ’transcript_id’,
locus_tag => ‘gene_id’ );
}
elsif ($method eq ‘CDS’) {
$f->add_tag_value(’transcript_id’,$f->get_tag_values(’locus_tag’));
%TAG_MAP = ( locus_tag => ‘gene_id’,
protein_id => ‘protein_id’,
transcript_id => ’transcript_id’,
gene => ‘gene_name’,
product => ‘product’);

}
elsif ($method eq ’exon’) {

%TAG_MAP = ( gene_id => ‘gene_id’,
protein_id => ‘protein_id’,
transcript_id => ’transcript_id’,
exon_number => ’exon_number’,
exon_id => ’exon_id’,
gene_name => ‘gene_name’);

}
elsif ($method eq ‘source’) {
%TAG_MAP = ( organism => ‘organism’ );

}
else {

}

my @all_tags = $f->get_all_tags();
#~ print Dumper \@all_tags;
for my $tag (@all_tags){

if (exists $TAG_MAP{$tag}){
my $newtag= $TAG_MAP{$tag};
my @v= $f->get_tag_values($tag);
$f->remove_tag($tag);
$f->add_tag_value($newtag,@v);
}
else {
$f->remove_tag($tag);
}
}

return $f;
}

If you would like to access the full document, run:
perldoc NCBI_genbank2gtf.pl

Feedback is welcome.

Here is also a GenBank -> BED converter:

作者简介

Chun-Hui Gao is a Research Associate at Huazhong Agricultural University.

重复使用

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT. The full source is available at https://github.com/yihui/hugo-prose.

欢迎修订

如果您发现本文里含有任何错误(包括错别字和标点符号),欢迎在本站的 GitHub 项目里提交修订意见。

引用本文

如果您使用了本文的内容,请按照以下方式引用:

gaoch (2015). A Genbank –> GTF converter derived from BioPerl. BIO-SPRING. /post/2015/10/14/2015-10-14-a-genbank-gtf-converter-derived-from-bioperl/

BibTeX citation

@misc{
  title = "A Genbank –> GTF converter derived from BioPerl",
  author = "gaoch",
  year = "2015",
  journal = "BIO-SPRING",
  note = "/post/2015/10/14/2015-10-14-a-genbank-gtf-converter-derived-from-bioperl/"
}