Gcorn database
Gene phylogeny using correlation network

Gcorn database family

G. classic

Database                                Organism group
G. plant                                plant
G. protozoa                             protozoa
G. fungi                                fungi
G. invertebrate (analyzing)             invertebrate
G. vertebrate (analyzing)               vertebrate (except for mammalian)
G. mammalian (preparing for analysis)   mammalian

G. taxon

  • G. yeast (analyzing)

G. universal (preparing for analysis)

Gene name         Organism group
rbcL              plant
matK              plant
COX1              eukaryote
16S rRNA          bacteria
cytochrome P450   all

G. transfer (preparing for analysis)

  • Vertical transfer
  • Horizontal transfer

★ Story

Concept

  • With the advance of next-generation sequencing, more & more organisms' genomes have been decoded.
  • Although the number of sequenced organisms is still limited overall, it is already sufficient for a phylogenetic approach within a particular kingdom.
  • The Gcorn database project aims to reveal relationships between speciation & gene divergence based on gene and/or amino acid sequences.
  • The Gcorn classic databases provide information on the dynamics of homology groups along with speciation using various thresholds of homology indices.
  • The project started with kingdoms in eukaryotes; i.e., plants, fungi, protozoa, invertebrates, & vertebrates.
  • After the databases of eukaryotes are published, those of prokaryotes will be planned.
  • Gcorn can be applied not only at the kingdom level, but also to particular genes used for DNA barcoding, & to candidate genes transferred vertically (along with speciation) or horizontally (across speciation).

Database type

  • Gcorn classic: Databases of particular kingdoms such as plants & fungi at the genome level.
  • Gcorn taxon: Databases of a particular taxon such as yeast, including multiple strains or ecotypes.
  • Gcorn universal: Databases of genes used for DNA barcoding, such as rbcL, 16S rDNA, & COI.
  • Gcorn transfer: Databases of genes transferred vertically & horizontally.


★ Pipeline

Data materials

Organism group                     # species   # genes     Release history
Plant                              104         4,385,746   September 2019
Plant                              73          2,785,715   October 2017
Fungi                              249         2,345,743   November 2019
Protozoa                           86          954,157     December 2019
Invertebrate                       146         3,273,518   not yet
Vertebrate (except for Mammalian)  127         4,006,839   not yet
Mammalian                          110         4,490,015   not yet
Archaea                            -           -           not yet
Bacteria                           -           -           not yet
Viral                              -           -           not yet

Pre-data analysis

  • The sequences in each organism group are counted by species to identify the species whose genes are available at the genome level.
  • Such species are selected using a threshold on the number of genes, which depends on the organism group; e.g., 5,000 genes for plants.
  • As of November 2019, the numbers of species & genes selected for Gcorn are listed in the table above.
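
The selection step above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_species`, the input layout (a mapping from gene ID to species name), & the use of "greater than or equal to" for the threshold are all hypothetical, & the toy data are not taken from Gcorn.

```python
from collections import Counter


def select_species(gene_to_species, min_genes):
    """Count genes per species & keep species that reach the threshold.

    gene_to_species: dict mapping a gene ID to its species name (assumed layout).
    min_genes: threshold on the number of genes (e.g., 5,000 for plants).
    """
    counts = Counter(gene_to_species.values())
    # ">=" is an assumption; the source does not say whether the bound is inclusive.
    return {species: n for species, n in counts.items() if n >= min_genes}


# Toy data with a threshold of 2 instead of a genome-scale value.
toy_genes = {"g1": "species_A", "g2": "species_A", "g3": "species_B"}
selected = select_species(toy_genes, min_genes=2)
print(selected)
```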

BLASTP analysis

  • In each organism group, BLASTP analyses are performed between all genes of the species selected through the pre-data analysis.

Calculating homology index

  • Using the result of the BLASTP analysis, homology indices (HI) between all pairs of genes are calculated.
  • HI is calculated as twice the number of amino acids shared by a pair of genes divided by the sum of the amino acids of the pair; i.e., it is equivalent to an F-measure (the harmonic mean of precision & recall) of the pair.
  • The index ranges from zero (dissimilar) to one (equivalent).
  • For example, when the numbers of amino acids of genes A & B are 100 & 140, respectively, & 90 amino acids are shared by these genes, HI is calculated as 90 x 2 / ( 100 + 140 ) = 0.75.
  • HI represents a variation rate in a pair of genes.
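
The HI formula & the worked example above can be checked with a short sketch. The function names are hypothetical, & how the number of shared amino acids is derived from the BLASTP result is not specified here, so it is simply taken as an input.

```python
def homology_index(len_a, len_b, shared):
    """HI = 2 * shared / (len_a + len_b), as described in the text."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("gene lengths must be positive")
    if shared < 0 or shared > min(len_a, len_b):
        raise ValueError("shared amino acids must lie between 0 and the shorter length")
    return 2 * shared / (len_a + len_b)


def f_measure(len_a, len_b, shared):
    """Same quantity written as the harmonic mean of precision & recall."""
    precision = shared / len_a
    recall = shared / len_b
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Worked example from the text: genes of 100 & 140 amino acids sharing 90.
hi = homology_index(100, 140, 90)
print(hi)  # 0.75
```

Both functions give the same value up to floating-point rounding, which is the sense in which HI is equivalent to an F-measure.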

Grouping homologous genes

  • Using various HIs as thresholds, genes are grouped in the manner of a correlation network analysis.
  • Namely, a single gene can be contained in multiple gene groups on the basis of various HIs.
  • Hypothetically, a gene group represents a gene of an ancient organism.
  • In the Gcorn database, therefore, the gene group is named “ancient gene.”
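
One way to picture the grouping is sketched below. It assumes, as a simplification not confirmed by the text, that two genes fall into the same group whenever they are linked by a chain of pairs whose HI is at or above the threshold (connected components of the network). The function name & toy HI values are hypothetical.

```python
def group_genes(genes, pair_hi, threshold):
    """Group genes into connected components of the network HI >= threshold.

    genes: iterable of gene IDs.
    pair_hi: dict mapping (gene_a, gene_b) to the HI of that pair.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), hi in pair_hi.items():
        if hi >= threshold:  # inclusive bound is an assumption
            parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), set()).add(g)
    return sorted(sorted(members) for members in groups.values())


# Toy data: the same gene lands in different groups at different thresholds.
toy_ids = ["a", "b", "c", "d"]
toy_hi = {("a", "b"): 0.9, ("b", "c"): 0.6, ("c", "d"): 0.3}
groups_by_threshold = {t: group_genes(toy_ids, toy_hi, t) for t in (0.8, 0.5, 0.2)}
print(groups_by_threshold)
```

Lowering the threshold merges groups, so gene "a" belongs to a group of two at 0.8, of three at 0.5, & of four at 0.2.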

Constructing phylogenetic tree for gene

  • A phylogenetic tree is constructed for a gene group (ancient gene) that contains a gene of interest & around 20 genes.
  • In the Gcorn database, a phylogenetic tree that contains many genes is not provided because such trees are difficult to grasp & understand.
  • Instead of such trees, Gcorn provides a line chart that represents information on ancient genes (or “gene evolution”; see the next section).

Depicting line chart of gene evolution

  • A line chart represents the numbers of genes (red), species (blue), & families (green) in a lineage of ancient genes (or gene evolution).
  • Namely, this line chart is a digest of a phylogenetic tree (regarding these numbers).
  • The vertical & horizontal axes represent these numbers & HI, respectively.
  • The horizontal axis hypothetically follows the course of evolution, although the speed of evolution (i.e., mutation or variation rate) is not necessarily constant.
  • The line chart is composed of flat segments & slopes to represent the timings of homologous events.
  • A flat red segment represents no homologous event in this evolutionary period (of course, the blue line is also flat).
  • A red slope represents a homologous event: when the blue line also slopes, the period potentially represents an orthologous event; when the blue line is flat, it potentially represents a paralogous event.
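
The reading rules for the line chart can be expressed as a small classifier. This is a sketch only: the function name, the input layout (a list of (HI, number of genes, number of species) points along the horizontal axis), & the toy values are hypothetical.

```python
def classify_segments(points):
    """Label each segment between consecutive chart points.

    points: list of (hi, n_genes, n_species) tuples ordered along the HI axis.
    Returns a list of ((hi_start, hi_end), label) tuples.
    """
    labels = []
    for (hi0, genes0, species0), (hi1, genes1, species1) in zip(points, points[1:]):
        if genes1 == genes0:
            # Flat red line: no homologous event in this period.
            label = "no homologous event"
        elif species1 != species0:
            # Red slope with blue slope.
            label = "potentially orthologous"
        else:
            # Red slope with flat blue line.
            label = "potentially paralogous"
        labels.append(((hi0, hi1), label))
    return labels


# Toy chart data (not taken from Gcorn).
toy_points = [(1.0, 1, 1), (0.9, 1, 1), (0.8, 3, 1), (0.7, 5, 3)]
segments = classify_segments(toy_points)
for span, label in segments:
    print(span, label)
```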

Constructing correlation network between organisms

  • A correlation network between organisms is depicted on the basis of the ratio of homologous genes.
  • In the network, nodes represent organisms; two nodes are connected when their ratio is higher than a threshold determined for each organism group.
  • Groups (or “network modules”) in which nodes are tightly connected to each other represent organism groups that are mutually close, such as members of a family.
  • Red nodes represent organisms that contain genes homologous to a gene of interest, based on an adjustable HI threshold (0.8 by default).
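
The network construction can be sketched as below. How the ratio of homologous genes between two organisms is computed is not defined in the text, so the sketch takes precomputed ratios as input; the function names, input layouts, & toy values are hypothetical, & the inclusive HI bound for red nodes is an assumption.

```python
def build_network(ratios, threshold):
    """Connect two organisms when their ratio of homologous genes exceeds the threshold.

    ratios: dict mapping (organism_a, organism_b) to a ratio (assumed precomputed).
    Returns an adjacency dict: organism -> set of connected organisms.
    """
    adjacency = {}
    for (a, b), ratio in ratios.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if ratio > threshold:  # "higher than a threshold", as in the text
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def red_nodes(hits, hi_threshold=0.8):
    """Organisms holding a gene homologous to the gene of interest.

    hits: iterable of (organism, hi) pairs, where hi is the HI between a gene
    of that organism & the gene of interest.
    """
    return {organism for organism, hi in hits if hi >= hi_threshold}


# Toy data (not taken from Gcorn).
toy_ratios = {("org1", "org2"): 0.7, ("org2", "org3"): 0.2, ("org1", "org3"): 0.5}
network = build_network(toy_ratios, threshold=0.4)
highlighted = red_nodes([("org1", 0.95), ("org2", 0.6), ("org3", 0.8)])
print(network)
print(highlighted)
```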

Depicting taxonomy tree

  • This phylogenetic tree for species is depicted on the basis of information obtained from the NCBI Taxonomy database.
  • Several divergence points are based on authoritative information or articles (see references for each organism group); e.g., the tree for plants is revised according to APG III.
  • Blue & gray nodes represent taxa with & without hyperlinks to the Taxonomy database, respectively.
  • Red nodes represent organisms used in the Gcorn databases & species names in red represent those that contain genes homologous to a gene of interest based on an HI threshold.

Depicting orthology tree

  • This tree contains genes orthologous to a gene of interest based on an HI threshold.
  • When the order of branching in the tree is similar to that in the phylogenetic tree for species, the evolution (or variation) of the gene occurred in a manner similar to speciation.
  • On the other hand, when the order is dissimilar, some evolutionary events may have occurred in a manner different from speciation.

Constructing database

  • The Gcorn databases are constructed using Perl/CGI & HTML scripts.
  • The databases are maintained on a server at Osaka Prefecture University & each database will be updated every one or two years.

★ Special database

Key to analysis of speciation (Kanon)

  • Based on the results of the Gcorn database construction, the Kanon database will provide information on genes that hypothetically showed frequent mutations (or variations) at a particular speciation event.
  • We plan to publish Kanon plant first.