bp_genbank2gff3.pl man page on Fedora

BP_GENBANK2GFF3(1)    User Contributed Perl Documentation   BP_GENBANK2GFF3(1)

NAME
       genbank2gff3.pl -- Genbank->gbrowse-friendly GFF3

SYNOPSIS
	 genbank2gff3.pl [options] filename(s)

	 # process a directory containing GenBank flatfiles
	 perl genbank2gff3.pl --dir path_to_files --zip

	 # process a single file, ignore explicit exons and introns
	 perl genbank2gff3.pl --filter exon --filter intron file.gbk.gz

	 # process a list of files
	 perl genbank2gff3.pl *gbk.gz

	 # process data from URL, with Chado GFF model (-noCDS), and pipe to database loader
	 curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \
	 | perl genbank2gff3.pl -noCDS -in stdin -out stdout \
	 | perl gmod_bulk_load_gff3.pl -dbname mychado -organism fromdata

	   Options:
	       --dir	 -d  path to a directory of GenBank flatfiles
	       --outdir	 -o  location to write GFF files (can be 'stdout' or '-' for pipe)
	       --zip	 -z  compress GFF3 output files with gzip
	       --summary -s  print a summary of the features in each contig
	       --filter	 -x  genbank feature type(s) to ignore
	       --split	 -y  split output to separate GFF and fasta files for
			     each genbank record
	       --nolump	 -n  separate file for each reference sequence
			     (default is to lump all records together into one
			      output file for each input file)
	       --ethresh -e  error threshold for unflattener
			     set this high (>2) to ignore all unflattener errors
	       --[no]CDS -c  Keep CDS-exons, or convert to alternate gene-RNA-protein-exon
			     model. --CDS is default. Use --CDS to keep default GFF gene model,
			     use --noCDS to convert to g-r-p-e.
	       --format	 -f  Input format (SeqIO types): GenBank, Swiss or Uniprot, EMBL work
			     (GenBank is default)
	       --GFF_VERSION 3 is default, 2 and 2.5 and other Bio::Tools::GFF versions available
	       --quiet	     don't talk about what is being processed
	       --typesource  SO sequence type for source (e.g. chromosome; region; contig)
	       --help	 -h  display this message

DESCRIPTION
       This script uses Bio::SeqFeature::Tools::Unflattener and
       Bio::Tools::GFF to convert GenBank flatfiles to GFF3 with gene
       containment hierarchies mapped for optimal display in gbrowse.

       The input files are assumed to be gzipped GenBank flatfiles for refseq
       contigs.	 The files may contain multiple GenBank records.  Either a
       single file or an entire directory can be processed.  By default, the
       DNA sequence is embedded in the GFF, but it can be saved into a separate
       fasta file with the --split (-y) option.
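
       To illustrate what "embedded" means here, the following is a minimal
       sketch, not part of the script itself.  It assumes the embedded
       sequence follows a ##FASTA directive, as in the GFF3 format, and uses
       a hypothetical one-feature sample (the names chrX, gene1 and all file
       names are invented for the example).  It separates annotation from
       sequence with awk, which is roughly the layout that --split is
       described as producing.

```shell
#!/bin/sh
# Sketch only: sample data and file names are hypothetical.
tmp=$(mktemp -d)

# Build a tiny GFF3 file with one feature and an embedded sequence section.
{
    printf '##gff-version 3\n'
    printf 'chrX\tGenBank\tgene\t1\t12\t.\t+\t.\tID=gene1\n'
    printf '##FASTA\n'
    printf '>chrX\n'
    printf 'ACGTACGTACGT\n'
} > "$tmp/sample.gff"

# Lines before the ##FASTA directive are annotation; lines after it are sequence.
awk -v gff="$tmp/features.gff" -v fa="$tmp/sequence.fa" '
    /^##FASTA/ { in_fasta = 1; next }
    in_fasta   { print > fa; next }
               { print > gff }
' "$tmp/sample.gff"

# Count feature rows (non-comment lines) and FASTA headers in the two outputs.
feature_lines=$(grep -vc '^#' "$tmp/features.gff")
sequence_headers=$(grep -c '^>' "$tmp/sequence.fa")
echo "features=$feature_lines sequences=$sequence_headers"

rm -rf "$tmp"
```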

       If an input file contains multiple records, the default behaviour is to
       dump all GFF and sequence to a file of the same name (with .gff
       appended).  Using the 'nolump' option will create a separate file for
       each genbank record.  Using the 'split' option will create separate GFF
       and Fasta files for each genbank record.

   Notes
       'split' and 'nolump' produce many files

       In cases where the input files contain many GenBank records (for
       example, the chromosome files for the mouse genome build), a very large
       number of output files will be produced if the 'split' or 'nolump'
       options are selected.  If you have more than 6000 files, use the
       --long_list option in bp_bulk_load_gff.pl or bp_fast_load_gff.pl to
       load the GFF and/or fasta files.
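
       One way to anticipate the number of output files is to count the
       records in an input file before converting it.  The sketch below
       assumes that each GenBank record begins with a LOCUS line, and takes
       the per-record file counts (one for 'nolump', a GFF plus a fasta file
       for 'split') from the description above.  The sample file and record
       names are hypothetical, and the totals are an estimate only.

```shell
#!/bin/sh
# Sketch only: the flatfile below is a hypothetical two-record stub.
tmp=$(mktemp -d)

# Each GenBank record starts with a LOCUS line and ends with "//".
printf 'LOCUS       REC1\n//\nLOCUS       REC2\n//\n' > "$tmp/sample.gbk"
gzip -c "$tmp/sample.gbk" > "$tmp/sample.gbk.gz"

# Count records without unpacking the file to disk.
records=$(gzip -dc "$tmp/sample.gbk.gz" | grep -c '^LOCUS')

# Estimated output files per the option descriptions:
# nolump = one file per record; split = GFF + fasta per record.
nolump_files=$records
split_files=$((records * 2))
echo "records=$records nolump=$nolump_files split=$split_files"

rm -rf "$tmp"
```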

       Designed for RefSeq

       This script is designed for RefSeq genomic sequence entries.  It may
       work for third-party annotations, but this has not been tested.
       However, as noted below, Uniprot/Swissprot input works, as does EMBL
       and possibly EMBL/Ensembl, if you don't mind some gene model
       unflattener errors (dgg).

       G-R-P-E Gene Model

       Don Gilbert reworked this script to produce GFF3 suited to loading
       into GMOD Chado databases.  Most of the changes, I believe, are
       suited for general use.  One main Chado-specific addition is the
	 --[no]cds2protein  flag

       My preference is to have this flag ON by default (disable with
       --nocds2prot).  For general use it probably should be OFF, enabled
       with --cds2prot.

       This writes GFF with an alternate but useful gene model, instead of
       the consensus model for GFF3:

	 [ gene > mRNA > (exon,CDS,UTR) ]

       This alternate is

	 gene > mRNA > polypeptide > exon

       This means that the only feature with DNA bases is the exon.  The
       others specify only location ranges on a genome.  Exon of course is a
       child of mRNA and protein/peptide.

       The protein/polypeptide feature is an important one, carrying all the
       annotations of the GenBank CDS feature: protein ID, translation, GO
       terms, and Dbxrefs to other proteins.

       UTRs, introns, CDS-exons are all inferred from the primary exon bases
       inside/outside appropriate higher feature ranges.   Other special gene
       model features remain the same.
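
       The containment described above can be pictured with a small GFF3
       fragment.  This is an illustrative sketch only: the IDs, coordinates
       and the exact ID/Parent attributes are hypothetical and are not
       copied from real script output.  It expresses gene > mRNA >
       polypeptide > exon with GFF3 Parent attributes, giving each exon two
       parents (mRNA and polypeptide), and prints each feature with its
       parent.

```shell
#!/bin/sh
# Sketch only: IDs, coordinates and attributes are hypothetical.
tmp=$(mktemp -d)

# One gene with one mRNA, one polypeptide and two exons.
{
    printf '##gff-version 3\n'
    printf 'chrX\tGenBank\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
    printf 'chrX\tGenBank\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
    printf 'chrX\tGenBank\tpolypeptide\t150\t850\t.\t+\t.\tID=pep1;Parent=mrna1\n'
    printf 'chrX\tGenBank\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1,pep1\n'
    printf 'chrX\tGenBank\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1,pep1\n'
} > "$tmp/grpe.gff"

# For every feature row, pull ID and Parent out of column 9 and print
# "type id <- parent" ("-" when the attribute is absent).
hierarchy=$(awk -F'\t' '
    /^#/ { next }
    {
        id = "-"; parent = "-"
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            if (attrs[i] ~ /^ID=/)     id = substr(attrs[i], 4)
            if (attrs[i] ~ /^Parent=/) parent = substr(attrs[i], 8)
        }
        printf "%s %s <- %s\n", $3, id, parent
    }' "$tmp/grpe.gff")
echo "$hierarchy"

# Exons are the only rows that list both the mRNA and the polypeptide as parents.
exon_count=$(printf '%s\n' "$hierarchy" | grep -c '^exon .* <- mrna1,pep1$')

rm -rf "$tmp"
```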

       Several other minor but useful improvements and bugfixes are included:

	 * IO pipes now work:
	   curl ftp://ncbigenomes/... | genbank2gff3 --in stdin --out stdout | gff2chado ...

	 * GenBank main record fields are added to source feature, e.g. organism, date,
	   and the sourcetype, commonly chromosome for genomes, is used.

	 * Gene Model handling for ncRNA and pseudogenes is added.

	 * GFF header is cleaner, more informative.
	   --GFF_VERSION flag allows choice of v2 as well as default v3

	 * GFF ##FASTA inclusion is improved, and
	   CDS translation sequence is moved to FASTA records.

	 * FT -> GFF attribute mapping is improved.

	 * --format choice of SeqIO input formats (GenBank default).
	   Uniprot/Swissprot and EMBL work and produce useful GFF.

	 * SeqFeature::Tools::TypeMapper has a few FT -> SOFA additions
	     and more flexible usage.

TODO
   Are these additions desired?
	* filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY)
	* handle Entrezgene, other non-sequence SeqIO structures (really should change
	   those parsers to produce consistent annotation tags).

   Related bugfixes/tests
       These items from Bioperl mail were tested (sample data generating
       errors), and found corrected:

	From: Ed Green <green <at> eva.mpg.de>
	Subject: genbank2gff3.pl on new human RefSeq
	Date: 2006-03-13 21:22:26 GMT
	  -- unspecified errors (sample data works now).

	From: Eric Just <e-just <at> northwestern.edu>
	Subject: bp_genbank2gff3.pl
	Date: 2007-01-26 17:08:49 GMT
	  -- bug fixed in genbank2gff3 for multi-record handling

       This error is for a /trans_splice gene that is hard to handle, and
       which unflattener/genbank2gff3 does not handle:

	From: Chad Matsalla <chad <at> dieselwurks.com>
	Subject: genbank2gff3.PLS and the unflatenner - Inconsistent   order?
	Date: 2005-07-15 19:51:48 GMT

AUTHOR
       Sheldon McKay (mckays@cshl.edu)

       Copyright (c) 2004 Cold Spring Harbor Laboratory.

   AUTHOR of hacks for GFF2Chado loading
       Don Gilbert (gilbertd@indiana.edu)

perl v5.14.1			  2011-07-22		    BP_GENBANK2GFF3(1)