- Definition of KGDB
- Motivations of KGDB
- Gene Inclusion Criteria
- Data Sources of KGDB
- Process of Construction
- Analysis of Expression
- Data Structure and Format
- Web Interface
Definition of KGDB
KGDB stands for Kidney Gene DataBase. It is a curated, integrated database
of genes related to the human kidney and kidney diseases.
Motivations for building KGDB
The biomedical literature is growing explosively, and the MEDLINE database
is a primary repository for it. The kidney is an intricate organ and a
frequent target of disease.
Renal disease affects 11% of people in the United States over the age of 65,
not including those with diabetes or hypertension. Kidney failure is
predominantly a disease of older people. Kidney cancer remains relatively
rare, but incidence and mortality rates are reported to be rising steadily
across the world. The development and progression of kidney diseases
involve a number of molecular events, both genetic and epigenetic, such as
gene amplification, mutation, gross deletion, polymorphism, loss
of heterozygosity (LOH), microsatellite instability (MSI), methylation, and
hypomethylation. These events in turn involve a large number of genes, which
have been documented in thousands of records in bibliographic databases
such as MEDLINE. A fundamental
limitation of MEDLINE and other similar resources is that the information
they contain is not represented in a structured format. As a result, both
retrieval effectiveness and precision are poor. For example, a typical
question that scientists may ask when searching MEDLINE is: ‘What
genes have been found mutated in human renal cancer?’ To answer
the question, they may search MEDLINE using the query (‘kidney’ OR
‘renal’) AND ‘cancer’ AND mutation AND human, which
returned 1789 records as of July 21, 2003. Fewer than 20 percent of
these records are relevant to the question, and many of them are
redundant. Another problem hindering efficient retrieval of
gene-related information from literature databases is the
non-standardized terminology used for gene names by scientists.
For example, the CDKN2A gene, commonly known as p16, appears in the
literature under many aliases, including ARF, P16, CMM2, INK4, MTS1, TP16,
CDK4I, CDKN2, INK4A, p14ARF and p16INK4.
Querying MEDLINE with only one of these names will miss
relevant records.
To address these problems, KGDB was constructed to: 1) catalog gene-related
facts about the kidney and kidney diseases that have accumulated, and
continue to accumulate, in the literature; 2) store the information in
a structured format for fast and easy access; 3) annotate the genes to
deliver value-added information.
Gene Inclusion Criteria
A gene must satisfy one of the following two criteria to be
included in the database: 1) The gene has been reported in
the published literature to be involved in one of the following
events: gene amplification,
mutation, gross deletion, polymorphism, apoptosis, loss of
heterozygosity (LOH), microsatellite instability (MSI), methylation, or
hypomethylation. 2) The gene is specifically expressed in the kidney.
Evidence for this category comes from the SAGEmap
database and the UniGene database hosted by the National
Center for Biotechnology Information (NCBI). For EST expression,
a UniGene cluster must have at least 2 member ESTs, all of which were
derived from kidney libraries; for SAGE expression, a gene is defined as
kidney specific if it has a tag count of more than 1, with all tags
derived from kidney libraries. Currently all the genes in this category are
UniGene clusters of ESTs.
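The two expression-based thresholds above can be sketched as simple predicates. This is only an illustration of the stated rules, not the code used to build KGDB; the function names, the input shapes and the tissue label "kidney" are assumptions made for the example.

```python
KIDNEY = "kidney"  # assumed tissue label for kidney-derived libraries


def est_kidney_specific(est_tissues: list[str]) -> bool:
    """EST rule: at least 2 member ESTs, all derived from kidney libraries.

    est_tissues holds one tissue label per member EST of a UniGene cluster.
    """
    return len(est_tissues) >= 2 and all(t == KIDNEY for t in est_tissues)


def sage_kidney_specific(tag_counts_by_tissue: dict[str, int]) -> bool:
    """SAGE rule: tag count of more than 1, all tags from kidney libraries.

    tag_counts_by_tissue maps a tissue label to the gene's tag count there.
    """
    observed = {t: c for t, c in tag_counts_by_tissue.items() if c > 0}
    return sum(observed.values()) > 1 and set(observed) == {KIDNEY}
```

For instance, a cluster whose ESTs come from "kidney" and "liver" libraries fails the EST rule, however many kidney ESTs it contains.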
Data sources of KGDB
KGDB uses data from the following databases.
· MEDLINE citation database through PubMed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
· UniGene at http://www.ncbi.nlm.nih.gov/UniGene/
· SAGEmap at http://www.ncbi.nlm.nih.gov/SAGE/
· dbSNP at http://www.ncbi.nlm.nih.gov/SNP/
· LocusLink at http://www.ncbi.nlm.nih.gov/LocusLink/
· Gene Ontology at http://www.geneontology.org
· NCBI’s Gene Expression Omnibus (GEO), a gene expression and hybridization array data repository, at http://www.ncbi.nlm.nih.gov/geo/
Process of Construction
The construction of KGDB is a multi-stage process.
- Data retrieval: MEDLINE citation abstracts were retrieved using the Entrez query tool. A typical query consists of three keywords: “kidney”, “human” and the word for the event. For example, the query for the event of gene mutation in the kidney was (("kidney"[MeSH Terms] OR kidney[Text Word]) AND ("mutation"[MeSH Terms] OR mutation[Text Word])). MEDLINE records that are reviews or that lack abstracts were excluded. Genes involved in over-expression were retrieved from the OMIM database using “kidney” as the query word. Data from other databases were retrieved through either FTP or HTTP.
- Data extraction: MEDLINE abstracts from each query were carefully read by two scientists to identify true relationships between a gene and the kidney and to extract the gene name, the type of molecular and genetic events, and the type of kidney disease. A list of genes was thus generated for further annotation.
- Data annotation: Annotation was performed automatically using programs written in Perl. Information from the other databases, such as alias names, a summary of gene function, gene ontology, and SNPs, was extracted and added to each extracted gene.
- Expression analysis: To provide relative expression levels in all tissues for each gene, expression data were analyzed as described under Analysis of Expression.
- File generation: KGDB is stored and
maintained in a single denormalized flat file, from which front-end web
pages are further generated automatically for display.
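The data retrieval step can be illustrated with a short sketch that assembles a query string in the form quoted above and applies the stated exclusion rule. The function names and the record layout (a dictionary with "publication_types" and "abstract" keys) are hypothetical, and the sketch only builds strings; it does not contact Entrez. It also assumes single-word terms, as in the example query.

```python
def term_clause(term: str) -> str:
    """Match a term either as a MeSH heading or as a free-text word."""
    return f'("{term}"[MeSH Terms] OR {term}[Text Word])'


def build_event_query(organ: str, event: str) -> str:
    """Combine the organ clause and the event clause with AND."""
    return f"({term_clause(organ)} AND {term_clause(event)})"


def keep_record(record: dict) -> bool:
    """Exclude reviews and records without an abstract.

    The keys used here are a hypothetical record layout, not an Entrez format.
    """
    is_review = "Review" in record.get("publication_types", [])
    has_abstract = bool(record.get("abstract", "").strip())
    return has_abstract and not is_review


# Reproduces the mutation query quoted in the data retrieval step.
query = build_event_query("kidney", "mutation")
```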
Analysis of Expression
For each gene
collected in KGDB, levels of expression were analyzed utilizing both SAGE and
EST data and pooled by tissue type.
For expression derived from EST data, the number of ESTs for each gene in
each library was first normalized to the number of ESTs per million and then
pooled by tissue to obtain the average level of expression in each tissue. When
calculating expression from SAGE data, only reliable mappings, as defined by
the SAGEmap database, were used. For each gene, the tag frequency in each library
was also normalized to the number of tags per million. Special measures were
taken to deal with the problem of multiple tag assignments. If one SAGE tag was
mapped to n genes, the tag frequency for each gene in each library was
divided by n. If one gene had more than one tag mapped to it, then the
tag frequency for the gene was the sum of tag frequencies of all tags.
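The SAGE rules above (normalize to tags per million, divide a tag shared by n genes by n, and sum over all tags of a gene) can be sketched for a single library as follows. This is a minimal illustration under assumed input shapes; the function and parameter names are not taken from KGDB.

```python
from collections import defaultdict


def gene_tags_per_million(
    tag_counts: dict[str, int],
    library_size: int,
    tag_to_genes: dict[str, list[str]],
) -> dict[str, float]:
    """Per-gene tag frequency (tags per million) in one SAGE library.

    tag_counts:   raw count of each tag in the library
    library_size: total number of tags sequenced in the library
    tag_to_genes: reliable tag-to-gene mappings (a tag may map to n genes)
    """
    per_gene: dict[str, float] = defaultdict(float)
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, [])
        if not genes:
            continue  # tags without a reliable mapping are ignored
        per_million = count * 1_000_000 / library_size
        share = per_million / len(genes)  # split a shared tag among n genes
        for gene in genes:
            per_gene[gene] += share  # sum over all tags mapped to the gene
    return dict(per_gene)


# Hypothetical library: tag AAA is shared by G1 and G2, tag CCC maps to G1 only.
example = gene_tags_per_million(
    {"AAA": 10, "CCC": 4, "TTT": 7},
    100_000,
    {"AAA": ["G1", "G2"], "CCC": ["G1"]},
)
```

In this made-up library, tag AAA contributes 100 tags per million, split evenly between G1 and G2, and tag CCC adds 40 to G1 alone.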
Interpretation of expression
For each gene in KGDB, SAGE and/or EST expression data are given. The items
reported are explained below.
a. Total ESTs or SAGE tags: Total number of ESTs or SAGE tags representing this gene in all libraries from all tissues.
b. Total Libraries: Total number of libraries expressing this gene.
c. Breadth: Percentage of libraries in a tissue pool that express this gene.
d. Tag count per million (CPM): The number of tags from a library that map to the gene is first normalized to a tag count per million and then averaged among the libraries expressing this gene.
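As a rough illustration of items b to d, the sketch below summarizes one gene within one tissue pool, given its normalized count in every library of that pool. The function name, the input mapping and the convention that a count of 0 means "not expressed" are assumptions made for the example.

```python
def summarize_tissue_pool(cpm_by_library: dict[str, float]) -> dict[str, float]:
    """Summarize one gene in one tissue pool.

    cpm_by_library maps each library in the pool to the gene's normalized
    count per million there (0 when the gene was not observed).
    """
    expressing = [v for v in cpm_by_library.values() if v > 0]
    total = len(cpm_by_library)
    return {
        # b. number of libraries expressing the gene
        "libraries": len(expressing),
        # c. percentage of libraries in the pool that express the gene
        "breadth": 100.0 * len(expressing) / total if total else 0.0,
        # d. mean CPM, averaged only over the expressing libraries
        "cpm": sum(expressing) / len(expressing) if expressing else 0.0,
    }


# Hypothetical kidney pool with four libraries, two of which express the gene.
summary = summarize_tissue_pool(
    {"lib1": 90.0, "lib2": 0.0, "lib3": 30.0, "lib4": 0.0}
)
```

Because the average is taken over expressing libraries only, a gene seen in few libraries can still show a high CPM, so breadth and CPM are best read together.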
Data Structure and Format
KGDB is distributed and maintained in a single flat file. The fields
of each entry are explained below.
- Name: Official gene name as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official name is available, the interim name from LocusLink is used.
- Symbol (Optional): Official gene symbol as assigned by the HUGO Gene Nomenclature Committee (HGNC). If no official symbol is available, the interim symbol from LocusLink is used.
- Aliases (Optional): Other names and symbols used for the gene, from LocusLink.
- Gene Products: The name of the product of this transcript.
- Category: The types of molecular or genetic event and disease in which the gene is involved, or the type of expression derived from analysis of EST and SAGE expression data.
- UniGene (Optional): UniGene ID for the gene.
- Reference Sequences: mRNA or protein sequences from RefSeq.
- OMIM and SNP (Optional): OMIM ID for the gene and a link to NCBI SNPs for the locus.
- Locus (Optional): LocusLink ID, chromosome, and cytoband for the gene, including links to the UCSC and Ensembl genome databases.
- Summary (Optional): A summary description of the gene, its products, its significance, and mutant phenotypes, from LocusLink.
- Gene Ontology (Optional): Gene Ontology terms for the gene, from LocusLink and Gene Ontology.
- Expression: Expression information derived from analysis of EST and SAGE data, pooled by tissue type. Details are given under Analysis of Expression.
- Evidence (Optional): Supporting references listed by type of molecular event and disease, sorted by year of publication, from PubMed. "Other key publications related to this gene" links to a list of publications related to the gene (from LocusLink).
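The field list above maps naturally onto a record type. The sketch below shows one possible in-memory representation; it is not the flat-file format itself, which is not specified here. The class name, attribute names and types are assumptions, and the example values other than the gene name, symbol and aliases are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KGDBEntry:
    # Fields not marked "(Optional)" in the field list
    name: str
    gene_products: str
    category: list[str]
    reference_sequences: list[str]
    expression: dict[str, float]  # assumed shape: tissue -> pooled level
    # Fields marked "(Optional)"
    symbol: Optional[str] = None
    aliases: list[str] = field(default_factory=list)
    unigene_id: Optional[str] = None
    omim_id: Optional[str] = None
    locuslink_id: Optional[str] = None
    summary: Optional[str] = None
    gene_ontology: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)

    def all_names(self) -> set[str]:
        """Lower-cased name, symbol and aliases, for alias-tolerant lookup."""
        names = {self.name, *self.aliases}
        if self.symbol:
            names.add(self.symbol)
        return {n.lower() for n in names}


# Illustrative entry; placeholder strings stand in for data not given here.
entry = KGDBEntry(
    name="cyclin-dependent kinase inhibitor 2A",
    gene_products="(placeholder)",
    category=["mutation"],
    reference_sequences=["(placeholder)"],
    expression={},
    symbol="CDKN2A",
    aliases=["ARF", "P16", "CMM2", "INK4", "MTS1", "TP16",
             "CDK4I", "CDKN2", "INK4A", "p14ARF", "p16INK4"],
)
```

Keeping the aliases on the record means a lookup by any listed name, such as "p16INK4", can resolve to the same entry.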
Web Interface
- Search: KGDB uses the free search engine ht://dig from http://www.htdig.org. Searchable fields include gene name and symbol, aliases, UniGene ID, OMIM ID, and LocusLink ID.
- Browse: KGDB content can be browsed by molecular event and by disease.