KGDB: Documentation

Human kidney Gene DataBase

Definition of KGDB
Motivations of KGDB
Gene Inclusion Criteria
Data Sources of KGDB
Process of Construction
Analysis of Expression
Data Structure and Format
Web Interface

Definition of KGDB
KGDB represents Kidney Gene DataBase and is a curated and integrated database of genes related to human kidney and kidney diseases.

Motivations for building KGDB

Biomedical literature is growing explosively, and the MEDLINE database is a primary repository for it. The kidney is an intricate organ and a frequent target of disease. Renal disease affects 11% of people in the United States over the age of 65, not including those with diabetes or hypertension. Kidney failure is predominantly a disease of older people. Kidney cancer remains relatively rare, but incidence and mortality rates are reported to be rising steadily across the world.

The development and progression of kidney diseases involve a number of molecular events, both genetic and epigenetic, such as gene amplification, mutation, gross deletion, polymorphism, loss of heterozygosity (LOH), microsatellite instability (MSI), methylation, and hypomethylation. These events in turn involve a large number of genes, which have been documented in bibliographic databases such as MEDLINE in thousands of records.

A fundamental limitation of MEDLINE and similar resources is that the information they contain is not represented in a structured format, so both retrieval effectiveness and precision are poor. For example, a typical question that scientists may ask when searching MEDLINE is: ‘What genes have been found mutated in human renal cancer?’ To answer the question, they may search MEDLINE using the query (‘kidney’ OR ‘renal’) AND ‘cancer’ AND mutation AND human, which returned 1789 records as of July 21, 2003, of which fewer than 20 percent are relevant to the question and many are redundant. Another problem hindering efficient retrieval of gene-related information from literature databases is the non-standardized terminology scientists use for gene names. For example, many aliases have been used in the literature for the CDKN2A gene, commonly known as p16, including ARF, P16, CMM2, INK4, MTS1, TP16, CDK4I, CDKN2, INK4A, p14ARF and p16INK4. Querying MEDLINE with only one of these names will miss relevant records.

In view of these problems, KGDB was constructed to: 1) catalog gene-related facts about the kidney and kidney diseases that have accumulated in the literature over past years and will accumulate in years to come; 2) store the information in a structured format for fast and easy access; 3) annotate the entries to deliver value-added information.

Gene Inclusion Criteria
A gene must satisfy one of the following two criteria to be included in the database: 1) The gene must have been reported in the published literature to be involved in one of the following events: gene amplification, mutation, gross deletion, polymorphism, apoptosis, loss of heterozygosity (LOH), microsatellite instability (MSI), methylation, and hypomethylation. 2) The gene must be specifically expressed in kidney. Evidence for this category comes from the SAGEmap database and the UniGene database hosted by the National Center for Biotechnology Information (NCBI). For EST expression, a UniGene cluster must have at least 2 member ESTs, all of which were derived from kidney libraries; for SAGE expression, a gene defined as kidney specific must have a tag count of more than 1, with all tags derived from kidney libraries. Currently all the genes in this category are UniGene clusters of ESTs.
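
The kidney-specificity test in criterion 2 can be sketched as follows. This is a minimal illustration in Python, not the code used to build KGDB; the function names and the input shapes (a list of tissue labels per EST, a dictionary of tag counts per tissue) are assumptions made for the example.

```python
def is_kidney_specific_est(est_tissues):
    """EST criterion: at least 2 member ESTs, all from kidney libraries.

    est_tissues: list with the source tissue of each member EST of a
    UniGene cluster (hypothetical input shape).
    """
    return len(est_tissues) >= 2 and all(t == "kidney" for t in est_tissues)


def is_kidney_specific_sage(tag_counts_by_tissue):
    """SAGE criterion: tag count of more than 1, all from kidney libraries.

    tag_counts_by_tissue: dict mapping tissue name to the gene's total
    tag count in libraries of that tissue (hypothetical input shape).
    """
    total = sum(tag_counts_by_tissue.values())
    kidney = tag_counts_by_tissue.get("kidney", 0)
    return total > 1 and kidney == total


print(is_kidney_specific_est(["kidney", "kidney"]))        # True
print(is_kidney_specific_est(["kidney", "liver"]))         # False
print(is_kidney_specific_sage({"kidney": 3, "liver": 1}))  # False
```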

Data sources of KGDB
KGDB uses data from the following databases.

· MEDLINE citation database through PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed

· UniGene at http://www.ncbi.nlm.nih.gov/UniGene/

· SAGEmap at http://www.ncbi.nlm.nih.gov/SAGE/

· dbSNP at http://www.ncbi.nlm.nih.gov/SNP/

· LocusLink at http://www.ncbi.nlm.nih.gov/LocusLink/

· Gene Ontology at http://www.geneontology.org

· NCBI’s Gene Expression Omnibus (GEO), a gene expression and hybridization array data repository, at http://www.ncbi.nlm.nih.gov/geo/

Process of Construction
The construction of KGDB is a multi-stage process.

Data retrieval: MEDLINE citation abstracts were retrieved using the Entrez query tool. A typical query consisted of three keywords: “kidney”, “human” and the word for the event. For example, the query for the event of gene mutation in kidney was (("kidney"[MeSH Terms] OR kidney[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word])). MEDLINE records that were of review type or without abstracts were excluded. Genes involved in the event of over-expression were retrieved from the OMIM database using “kidney” as the query word. Data from other databases were retrieved through either FTP or HTTP.
Data extraction: MEDLINE abstracts from each query were carefully read by two scientists to identify true relationships between a gene and the kidney and to extract the gene name, the type of molecular and genetic events, and the type of kidney disease. A list of genes was thus generated for further annotation.
Data annotation: Annotation was performed automatically using programs written in Perl. Information from the other databases, such as alias names, a summary of gene function, gene ontology terms and SNPs, was extracted and added to each gene.
Expression analysis: To provide relative expression levels in all tissues for each gene, expression data were analyzed as stated below.
File generation: KGDB is stored and maintained in a single denormalized flat file, from which front-end web pages are further generated automatically for display.
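
The query pattern quoted in the data retrieval stage can be assembled programmatically. The sketch below is an illustration in Python (the original annotation programs were written in Perl); `term_clause` and `build_query` are hypothetical helper names, and the sketch assumes single-word terms, since the source does not show how multi-word events such as "loss of heterozygosity" were written.

```python
def term_clause(word):
    """Match a word as a MeSH term or as a text word, as in the quoted query."""
    return f'("{word}"[MeSH Terms] OR {word}[Text Word])'


def build_query(*words):
    """AND together one clause per keyword and wrap the whole expression."""
    return "(" + " AND ".join(term_clause(w) for w in words) + ")"


# Reproduces the example query for gene mutation in kidney.
print(build_query("kidney", "mutation"))
```

Calling `build_query("kidney", "mutation")` yields exactly the string quoted in the data retrieval stage; additional keywords can be passed as further arguments.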

Analysis of Expression
For each gene collected in KGDB, levels of expression were analyzed using both SAGE and EST data and pooled by tissue type. For expression derived from ESTs, the number of ESTs for each gene in each library was first normalized to the number of ESTs per million, and then pooled by tissue to obtain the average level of expression in each tissue. When calculating expression from SAGE data, only tag-to-gene mappings defined as reliable by the SAGEmap database were used. For each gene, the tag frequency in each library was also normalized to the number of tags per million. Special measures were taken to deal with multiple tag assignments. If one SAGE tag was mapped to n genes, the tag frequency for each gene in each library was divided by n. If one gene had more than one tag mapped to it, the tag frequency for the gene was the sum of the frequencies of all its tags.
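
The per-library SAGE calculation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the original implementation: the function names and the dictionary inputs are hypothetical, and the tag-to-gene mapping is assumed to contain only the reliable assignments.

```python
from collections import defaultdict


def per_million(count, library_size):
    """Normalize a raw count to a count per million tags (or ESTs)."""
    return count * 1_000_000 / library_size


def sage_gene_frequencies(tag_counts, tag_to_genes, library_size):
    """Per-gene tag frequency (per million) for one SAGE library.

    tag_counts:   dict tag -> raw count in this library
    tag_to_genes: dict tag -> list of genes the tag maps to
    A tag mapped to n genes contributes 1/n of its frequency to each gene;
    a gene with several tags receives the sum over its tags.
    """
    gene_freq = defaultdict(float)
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, [])
        if not genes:
            continue  # unmapped tags are ignored
        share = per_million(count, library_size) / len(genes)
        for gene in genes:
            gene_freq[gene] += share
    return dict(gene_freq)


# Hypothetical library of 2,000,000 tags: tag AAA maps to two genes.
freqs = sage_gene_frequencies(
    {"AAA": 10, "CCC": 4},
    {"AAA": ["G1", "G2"], "CCC": ["G1"]},
    library_size=2_000_000,
)
print(freqs)  # {'G1': 4.5, 'G2': 2.5}
```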

Interpretation of expression
For each gene in KGDB, SAGE and/or EST expression data are given. An example is provided below, and each lettered item is explained in the notes that follow it.

EST (11 ESTs^a, 6 libraries^b)
Tissue	Breadth^c	CPM^d
muscle
kidney
uncharacterized tissue
uterus

SAGE (2858419 tags^a, 53 libraries^b)
Tissue	Breadth^c	CPM^d
ovary
pancreas
kidney
skin
stomach

a. Total ESTs or SAGE tags: Total ESTs or SAGE tags representing this gene in all libraries from all tissues.

b. Total Libraries: Total number of libraries expressing this gene.

c. Breadth: Percentage of libraries expressing this gene out of total libraries in a tissue pool.

d. Tag count per million (CPM): The number of tags from a library that map to the gene is first normalized to a count per million tags, then averaged among the libraries expressing this gene.
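
Notes c and d can be read as the following calculation. The sketch is an illustration in Python rather than the site's own code; `tissue_summary` and its input shape (one `(tissue, cpm)` pair per library, with 0 for libraries that do not express the gene) are assumptions made for the example.

```python
from collections import defaultdict


def tissue_summary(libraries):
    """Return {tissue: (breadth_percent, mean_cpm)} for one gene.

    libraries: list of (tissue, cpm) pairs, one per library in the pool;
    cpm is 0 when the library does not express the gene.
    Breadth is the percentage of a tissue's libraries expressing the gene;
    CPM is averaged over the expressing libraries only.
    """
    by_tissue = defaultdict(list)
    for tissue, cpm in libraries:
        by_tissue[tissue].append(cpm)

    summary = {}
    for tissue, values in by_tissue.items():
        expressing = [v for v in values if v > 0]
        if not expressing:
            continue  # tissues with no expression are not listed
        breadth = 100.0 * len(expressing) / len(values)
        mean_cpm = sum(expressing) / len(expressing)
        summary[tissue] = (breadth, mean_cpm)
    return summary


# Hypothetical values: 2 of 4 kidney libraries express the gene.
libs = [("kidney", 10.0), ("kidney", 0.0), ("kidney", 30.0),
        ("kidney", 0.0), ("muscle", 5.0)]
print(tissue_summary(libs))  # {'kidney': (50.0, 20.0), 'muscle': (100.0, 5.0)}
```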

Data Structure and Format
KGDB is distributed and maintained in a single flat file. The fields of each entry are explained below.

Name: Official gene name as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official name is available, the interim name from LocusLink is used.
Symbol (Optional): Official gene symbol as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official symbol is available, the interim symbol from LocusLink is used.
Aliases (Optional): Other names and symbols used for the gene, from LocusLink.
Gene Products: The name of the product of this transcript.
Category: The types of molecular or genetic event and disease in which the gene is involved, or the type of expression derived from analysis of EST and SAGE expression data.
UniGene (Optional): UniGene ID for the gene.
Reference Sequences: mRNA or protein sequences from RefSeq.
OMIM and SNP (Optional): OMIM ID for the gene and a link to NCBI SNPs for the locus.
Locus (Optional): LocusLink ID, chromosome and cytoband for the gene, including links to the UCSC and Ensembl genome databases.
Summary (Optional): A summary description of the gene, its products, its significance, and mutant phenotypes, from LocusLink.
Gene Ontology (Optional): Gene Ontology terms for the gene, from LocusLink and Gene Ontology.
Expression: Expression information derived from analysis of EST and SAGE data, pooled by tissue type. Details are given under Analysis of Expression.
Evidence (Optional): Supporting references listed by type of molecular event and disease, sorted by year of publication, from PubMed. An "Other key publications related to this gene" entry links to a list of publications related to the gene (from LocusLink).
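
The field list above can be pictured as a record type. The sketch below is a hypothetical in-memory representation in Python; the on-disk layout of the flat file is not described here, so the class name, attribute names and types are all assumptions, and the placeholder values in the example are not real KGDB data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KGDBEntry:
    # Required fields
    name: str
    gene_products: str
    category: List[str]
    reference_sequences: List[str]
    expression: Dict[str, dict]
    # Fields marked "(Optional)" in the list above
    symbol: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    unigene: Optional[str] = None
    omim_and_snp: Optional[str] = None
    locus: Optional[str] = None
    summary: Optional[str] = None
    gene_ontology: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)


# Placeholder values only; not taken from the database.
entry = KGDBEntry(
    name="example gene name",
    gene_products="example product",
    category=["mutation"],
    reference_sequences=[],
    expression={},
    symbol="CDKN2A",
    aliases=["p16", "INK4A"],
)
print(entry.symbol, entry.aliases, entry.unigene)
```

Giving the optional fields defaults of `None` or an empty list lets an entry be created from the required fields alone.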

Web Interface

Search: KGDB uses the free search engine ht://dig, from http://www.htdig.org. Searchable fields include gene name and symbol, aliases, UniGene ID, OMIM ID, and LocusLink ID.
Browse: KGDB content can be browsed by molecular event and by disease.