The Gramene GeneBuilder
The evidence-based genes were built using the Gramene evidence-based gene build pipeline (Liang et al., 2009). The Gramene pipeline extends the Ensembl gene annotation pipeline (Curwen et al., 2004) and uses same-species or cross-species proteins, full-length cDNAs (FLcDNAs), or ESTs as evidence for gene prediction. As of Release 4a.53, the GeneBuilder integrates ab initio gene models predicted by Fgenesh (Salamov and Solovyev, 2000). The following gene products were aligned to maize sequences using Exonerate (Slater and Birney, 2005); the counts are the maximum number of sequences available at the time of the gene build:
- 14,079 FLcDNAs from the Arizona Maize Full-Length cDNA Project
- 36,433 FLcDNAs from Ceres
- 359,942 SwissProt proteins from all species
- 494,444 non-maize plant proteins in Trembl
- 94,734 GenBank proteins from plant species
- 52,177 proteins from rice gene annotations
- 36,338 proteins from sorghum gene annotations
- 1,462,607+537,726 maize ESTs
- 18,181 other mRNAs from maize
- 1,217,859 rice ESTs
- 72,919 other mRNAs from rice
- 2,448,641 ESTs from other monocot species
- 14,015 other mRNAs from other monocot species.
For many genes, multiple spliced transcripts were preserved when backed by high-confidence FLcDNA/EST support (at least 99% sequence identity).
The resulting genes were filtered by a minimum translation length of 50 amino acids (25 for protein-supported genes). Genes that pass the filter include those with a full-length CDS and those with a partial CDS lacking either the start or the stop codon. Partial genes that overlap Fgenesh-predicted gene models were combined with those models to extend the open reading frame (ORF). Partial genes that do not overlap Fgenesh genes remain unchanged.
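The length filter and the partial-gene rule can be sketched in a few lines of Python. This is only an illustration of the rules stated above, not the pipeline's actual code: the `GeneModel` class, its field names, and the helper functions are hypothetical, and only the thresholds (50 and 25 amino acids) come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else here is illustrative.
MIN_TRANSLATION_LENGTH = 50                 # amino acids, default
MIN_TRANSLATION_LENGTH_PROTEIN_SUPPORT = 25  # amino acids, protein-supported genes


@dataclass
class GeneModel:
    """Hypothetical, minimal stand-in for an evidence-based gene model."""
    name: str
    translation_length: int   # length of the translation in amino acids
    protein_supported: bool   # True if the gene is supported by protein evidence
    has_start_codon: bool = True
    has_stop_codon: bool = True


def passes_length_filter(gene: GeneModel) -> bool:
    """Apply the minimum translation length, relaxed for protein-supported genes."""
    if gene.protein_supported:
        threshold = MIN_TRANSLATION_LENGTH_PROTEIN_SUPPORT
    else:
        threshold = MIN_TRANSLATION_LENGTH
    return gene.translation_length >= threshold


def is_partial(gene: GeneModel) -> bool:
    """A gene is partial if its CDS lacks the start codon or the stop codon."""
    return not (gene.has_start_codon and gene.has_stop_codon)


def should_extend_with_fgenesh(gene: GeneModel, overlaps_fgenesh: bool) -> bool:
    """Partial genes overlapping an Fgenesh model are merged with it to extend the ORF."""
    return is_partial(gene) and overlaps_fgenesh
```

For example, a 30-residue gene is kept only if it is protein-supported, and a gene missing its stop codon is extended only when an Fgenesh model overlaps it.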
Following this preliminary gene build, all transcripts and proteins are searched against rice proteins using BLASTX and BLASTP, respectively, and the overlapping Fgenesh proteins are searched against rice proteins using BLASTP. For each evidence-based gene, if the BLASTX P-value is much lower than the BLASTP P-value, the translation frame is recomputed based on the matching frame of the protein. If there are no genes on the opposite strand and the BLASTX match is on the opposite strand from the BLASTP match, the gene strand is flipped and the translation frame is recomputed. If the P-value of the overlapping Fgenesh prediction is much lower than that of the evidence gene, the evidence gene is replaced by the Fgenesh prediction.
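The three refinement rules above can be written as a small decision function. The sketch below is an interpretation, not the pipeline's implementation, and it rests on several assumptions: the text does not say what "much lower" means, so `MUCH_LOWER_FACTOR` is a purely hypothetical cutoff; the order in which the rules are checked is assumed; and the Fgenesh P-value is assumed to be compared against the evidence gene's BLASTP P-value. The function name and the action labels are also invented for illustration.

```python
from typing import List, Optional

# Hypothetical cutoff: the source gives no number for "much lower".
MUCH_LOWER_FACTOR = 1e-10


def much_lower(p: float, q: float, factor: float = MUCH_LOWER_FACTOR) -> bool:
    """Return True if P-value p is smaller than q by at least the given factor."""
    return p < q * factor


def refinement_actions(
    blastx_p: float,
    blastp_p: float,
    blastx_strand: int,               # +1 or -1, strand of the BLASTX match
    blastp_strand: int,               # +1 or -1, strand of the BLASTP match
    has_opposite_strand_gene: bool,
    fgenesh_p: Optional[float] = None,  # None if no Fgenesh model overlaps
) -> List[str]:
    """List the refinement steps for one evidence-based gene (rule order is assumed)."""
    # Rule 3: a much better Fgenesh hit replaces the evidence gene outright.
    if fgenesh_p is not None and much_lower(fgenesh_p, blastp_p):
        return ["replace_with_fgenesh"]

    # Rule 2: strand disagreement with no gene on the opposite strand.
    if not has_opposite_strand_gene and blastx_strand != blastp_strand:
        return ["flip_strand", "recompute_frame"]

    # Rule 1: transcript-level hit is much better than the protein-level hit.
    if much_lower(blastx_p, blastp_p):
        return ["recompute_frame"]

    return []
```

A gene whose BLASTX and BLASTP matches agree in strand and have comparable P-values yields an empty list, meaning the gene is left unchanged.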
References
- Liang C., Mao L., Ware D., Stein L. Evidence-based gene predictions in plant genomes. Genome Research 2009. 19(10):1912-23
- Salamov A., Solovyev V. Ab initio Gene Finding in Drosophila Genomic DNA. Genome Research 2000. 10:516-522
- Slater G. S., Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005. 6:31
- Curwen V., Eyras E., Andrews T. D., Clarke L., Mongin E., Searle S. M., Clamp M. The Ensembl automatic gene annotation system. Genome Research 2004. 14(5):942-50

