BioPerl has a Genbank –> GFF converter (the script is bp_genbank2gff3.pl), however, this converter generated GFF/GTF file could not work in Bioconductor when using the function maketxdbfromGFF to construct a transcript database.
Currently, I have developed such a converter that may extract information from a NCBI Genbank file and write them to GFF version 2, or GTF format. This script is derived from BioPerl, so you need firstly get the module installed in your system. To check whether BioPerl is ready, just refer to this post: http://bio-spring.top/check-bioperl/ .
Here is the synopsis of this script.
SYNOPSIS
NCBI_genbank2gtf.pl [options] filename
Options:
–help -h display this message
–input -i NCBI GenBank file
–out -o filename of GTF output
Examples:
perl NCBI_genbank2gtf.pl -i seq.gb -o seq.gtf
perl NCBI_genbank2gtf.pl seq.gb > seq.gtf
The codes is as follows:
#!/usr/bin/perl
=pod
=head1 NAME
NCBI_genbank2gtf.pl – Genbank -> Bioconductor friendly GTF
=head1 SYNOPSIS
NCBI_genbank2gtf.pl [options] filename
Options:
–help -h display this message
–input -i NCBI GenBank file
–out -o filename of GTF output
Examples:
perl NCBI_genbank2gtf.pl -i seq.gb -o seq.gtf
perl NCBI_genbank2gtf.pl seq.gb > seq.gtf
=head1 DESCRIPTION
Use this program to generate bioconductor friendly GTF files from NCBI
GenBank.
The file may be used in *maketxdbfromGFF* function and should work
well.
=head1 AUTHOR
Chun-Hui, Gao (gaoch@thelifesciencecentury.com)
Copyright (c) 2015 www.thelifesciencecentury.com.
=cut
use strict;
use Data::Dumper;
use Pod::Usage;
use Bio::SeqIO;
use Getopt::Long;
my ($help, $genbank_input, $gtf_output);
my $ok= GetOptions( ‘help|h’ => \$help,
‘input|i=s’ => \$genbank_input,
‘out|o=s’ => \$gtf_output );
pod2usage(2) if $help || !$ok;
$genbank_input = shift @ARGV unless ($genbank_input );
open *IN, “< $genbank_input " or die pod2usage(2); my $out; if
($gtf_output) { open $out, “> $gtf_output” or die “Can’t open file
$gtf_output:$@\n”;
}
else {
$out = *STDOUT;
}
my %seen;
my $in = Bio::SeqIO->new(-fh => \*IN, -format =>
“genbank”);
while ( my $seq = $in->next_seq() ) {
my $seq_acc = $seq->accession_number;
# abort if there are no features
warn “$seq_acc has no features, skipping\n” and next
if !$seq->all_SeqFeatures;
# construct a GFF header
# add: get source_type from attributes of source feature? chromosome=X
tag
# also combine 1st ft line here with source ft from $seq ..
my $header = gff_header($seq);
print $out $header;
warn “# working on $seq_acc\n”;
for my $feature ( $seq->get_all_SeqFeatures ) {
my $method = $feature->primary_tag;
$seen{$method} ++;
$feature = maptags2gff($feature);
print $out $feature->gff_string(), “\n”;
if ($method eq ‘CDS’){
$feature->primary_tag(“exon”);
$feature->add_tag_value(“exon_number”, 1);
$feature->add_tag_value(“exon_id”,
$feature->get_tag_values(“transcript_id”));
$feature = maptags2gff($feature);
print $out $feature->gff_string(), “\n”;
}
}
}
warn “# Count of features in $genbank_input:\n”;
map {warn sprintf(”# %20s:%6d\n", $_, $seen{$_});} keys %seen;
sub gff_header {
my ($seq) = @_;
my $head = sprintf("##gff-version 2\n# sequence-region: %s
(1..%d)\n", $seq->accession_number, $seq->length);
$head .= sprintf( “# %s\n”, $seq->desc);
$head .= “# converted by NCBI_genbank2gtf.pl\n”;
#~ print Dumper $acc, $desc, $end;
return $head;
}
sub maptags2gff {
my $f = shift;
my $method = $f->primary_tag;
my %TAG_MAP;
if ($method eq ‘gene’ || $method eq ‘misc_RNA’ || $method eq ‘rRNA’
|| $method eq ’tRNA’){
$f->add_tag_value(’transcript_id’,$f->get_tag_values(’locus_tag’));
%TAG_MAP = ( gene => ‘gene_name’,
transcript_id => ’transcript_id’,
locus_tag => ‘gene_id’ );
}
elsif ($method eq ‘CDS’) {
$f->add_tag_value(’transcript_id’,$f->get_tag_values(’locus_tag’));
%TAG_MAP = ( locus_tag => ‘gene_id’,
protein_id => ‘protein_id’,
transcript_id => ’transcript_id’,
gene => ‘gene_name’,
product => ‘product’);
}
elsif ($method eq ’exon’) {
%TAG_MAP = ( gene_id => ‘gene_id’,
protein_id => ‘protein_id’,
transcript_id => ’transcript_id’,
exon_number => ’exon_number’,
exon_id => ’exon_id’,
gene_name => ‘gene_name’);
}
elsif ($method eq ‘source’) {
%TAG_MAP = ( organism => ‘organism’ );
}
else {
}
my @all_tags = $f->get_all_tags();
#~ print Dumper \@all_tags;
for my $tag (@all_tags){
if (exists $TAG_MAP{$tag}){
my $newtag= $TAG_MAP{$tag};
my @v= $f->get_tag_values($tag);
$f->remove_tag($tag);
$f->add_tag_value($newtag,@v);
}
else {
$f->remove_tag($tag);
}
}
return $f;
}
If you would like to access the full document, run:
perldoc NCBI_genbank2gtf.pl
Feedback is welcome.