Over the past 10 years, the number of published allergy studies has risen year by year. However, the rich scientific information about previously identified allergy-related genes is dispersed across thousands of publications. To address this need, we built the AllerGAtlas database 1.0 (http://biokb.ncpsb.org/AllerGAtlas/), an online database provided by the National Center for Protein Science (Beijing). It contains a comprehensive list of allergy-related genes with their corresponding literature evidence, derived by text mining and manual curation.
AllerGAtlas provides three query approaches: search by gene name, search by nucleotide sequence and search by protein sequence.
To query by gene name, the user enters a gene name in the "Gene Symbol" search box and sees a drop-down list of auto-completed gene symbols present in AllerGAtlas. After the user selects one of them and clicks the 'Search' button, the search engine returns a table showing the queried gene, related diseases and the supporting literature evidence. Clicking the 'Reset' button deletes all current search terms.
Users can also search for a gene by nucleotide sequence or protein sequence. The sequence identity score from BLAST is listed in parentheses after the description. Users can then select the matched gene symbol and click “continue” to go to the result page.
By clicking the hyperlinked gene symbol of an individual allergy-related gene in the 'Gene' column, users can see well-annotated information on the gene and cross-references to external databases (i.e. Ensembl, NCBI Gene, UniProtKB, neXtProt and Antibodypedia).
Users can also type words into the text box, or select a term from its drop-down list, to filter the listed genes or diseases by substring match. In addition, clicking the small triangle in a column header re-sorts the table in ascending or descending order by that column.
Clicking the number of evidence abstracts or sentences displays a table containing the gene, the disease, the PubMed ID, the evidence sentence and the manual validation information. Users can click on the evidence to see the original abstract with the key words, i.e. gene symbols and disease terms, highlighted.
The "Browse & Download" page offers three ways to browse detailed information on allergy-related genes: by gene, by disease and by biomarker. All information on allergy-related genes and their supporting literature evidence can be downloaded for further analysis.
To obtain a complete list of publications for allergy-related genes, we performed a comprehensive search for allergy-related literature abstracts in PubMed. Gene-nomenclature recognition and extraction of human allergy-related gene candidates from these abstracts were performed by our self-developed ontology-based bio-entity recognizer, which achieves a precision, recall and F-measure of 0.810, 0.883 and 0.845, respectively, against the CRAFT corpus for gene/protein recognition based on Protein Ontology (PR), and is on par with current state-of-the-art biomedical annotation systems such as BeCAS.
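The three reported scores can be cross-checked against each other. The sketch below assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall); the function name `f_measure` is illustrative and not part of AllerGAtlas.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported against the CRAFT corpus.
score = f_measure(0.810, 0.883)
print(round(score, 3))
```

Under that assumption, the result rounds to the reported value of 0.845.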
A list of human allergy-related genes, together with their related diseases and evidence from PubMed abstracts, was compiled in the following three steps.
First, PubMed abstracts and sentences containing any of the keywords ‘allergy’, ‘allergic’, ‘anaphylaxis’, ‘allergic reaction’, ‘hypersensitivity’ or ‘atopic’, or their lexical variants, were collected. This yielded 112,979 abstracts and 242,066 sentences.
Second, human genes that co-occurred with the allergy-related keywords at the single-sentence level were recognized and extracted from these sentences by our bio-entity recognizer based on Protein Ontology.
Third, all 3,150 candidates from 27,033 original abstracts and 42,975 sentences were manually curated by our experts, and 1,195 genes were finally selected as human allergy-related genes.
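The first two steps above can be sketched as a keyword filter followed by a sentence-level gene lookup. This is a minimal illustration, not the authors' recognizer: the regular expression is an assumed approximation of the "lexical variants", the sentence splitter is naive, and `GENE_LEXICON` is a hypothetical toy stand-in for the Protein Ontology based recognizer.

```python
import re

# Keyword stems come from the text; the suffix handling is an assumption.
ALLERGY_PATTERN = re.compile(
    r"\b(?:allerg(?:y|ies|ic)|anaphyla(?:xis|ctic)"
    r"|hypersensitiv(?:ity|e)|atop(?:ic|y))\b",
    re.IGNORECASE,
)

# Hypothetical toy lexicon standing in for ontology-based recognition.
GENE_LEXICON = {"IL4", "IL13", "FCER1A"}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def candidate_genes(abstracts):
    """Map each gene symbol to the (pmid, sentence) pairs where it
    co-occurs with an allergy keyword in the same sentence."""
    hits = {}
    for pmid, text in abstracts.items():
        for sentence in split_sentences(text):
            if not ALLERGY_PATTERN.search(sentence):
                continue  # step 1: keep only allergy-related sentences
            tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
            for gene in sorted(GENE_LEXICON & tokens):  # step 2: gene lookup
                hits.setdefault(gene, []).append((pmid, sentence))
    return hits


example = {"1": "IL4 is elevated in allergic asthma. IL13 was also measured."}
print(candidate_genes(example))
```

Only the first sentence of the example is kept, because the second mentions a gene but no allergy keyword. The candidates produced this way would still need the manual curation described in the third step.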
Disease terms were identified from PubMed abstracts by the bio-entity recognizer based on Human Disease Ontology (DO). Associations between allergy-related genes/proteins and human disease terms were obtained from sentence-level co-occurrence. Furthermore, the biomarker roles of certain genes/proteins were recognized and marked as well. All extracted allergy-related genes/proteins, human disease terms and biomarker roles were loaded into a MySQL database.
Three rounds of strict manual curation of the candidate allergy-related genes were carried out to guarantee data quality. First, two experienced researchers independently checked all candidate genes and supporting evidence. Second, the selected genes and their corresponding evidence were submitted to an internal review, in which each selected gene name was manually reviewed by a team of three experts. Third, all co-authors were asked to randomly check allergy-related genes on our website to make sure that all genes stored in our database are highly credible allergy-related genes. In the end, we obtained 1,195 genes and 577 associated diseases. All evidence sentences regarded as validated evidence have been manually curated and are shown at the top of each allergy-related gene page.
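Sentence-level co-occurrence can be illustrated with a short sketch that counts how often each gene/disease pair appears in the same sentence. The term lists here are hypothetical toy inputs, and simple whole-word matching stands in for the ontology-based recognizer; the database schema is not shown because the source does not describe it.

```python
import re
from collections import Counter
from itertools import product


def mentions(term, sentence):
    """True if the term occurs as a whole word or phrase (case-insensitive)."""
    return re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE) is not None


def cooccurrence_counts(sentences, genes, diseases):
    """Count the sentences in which each (gene, disease) pair co-occurs."""
    counts = Counter()
    for sentence in sentences:
        found_genes = [g for g in genes if mentions(g, sentence)]
        found_diseases = [d for d in diseases if mentions(d, sentence)]
        # Every gene/disease combination in the sentence is one association.
        counts.update(product(found_genes, found_diseases))
    return counts


sentences = [
    "IL4 polymorphisms are associated with asthma.",
    "IL4 and IL13 are linked to atopic dermatitis and asthma.",
    "No gene is named here, only asthma.",
]
counts = cooccurrence_counts(sentences, ["IL4", "IL13"], ["asthma", "atopic dermatitis"])
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A sentence that names a disease but no gene, like the third one, contributes no association.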
Login is required only when users perform manual curation. Users can log in by clicking here.
We will monitor and validate community curation feedback before inclusion in the database.
Here we offer a trial account for the curation function (Account: "test@test.com", Password: "test").
Note: All the other functions of AllerGAtlas including data retrieval, browsing and downloading do not require any login or registration.
Because it is difficult to validate the large number of sentences in which allergy-related genes and human disease terms co-occur, we leave this part of the manual validation to logged-in users. We added a manual curation function to the sentences, with which users can give feedback by simply clicking the "Yes" or "No" button to confirm or reject an evidence sentence. In "Yes(1)", the number in parentheses is how many times the sentence has been marked as "right", while in "No(0)" it is how many times the sentence has been marked as "error".
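The "Yes(n)" and "No(n)" labels can be thought of as two counters per evidence sentence. The following sketch is a hypothetical model of that tally, not the site's actual implementation; the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class EvidenceVotes:
    """Running tally of curation feedback for one evidence sentence."""
    yes: int = 0  # times the sentence was marked as "right"
    no: int = 0   # times the sentence was marked as "error"

    def vote(self, confirmed: bool) -> None:
        """Record one click on the "Yes" (True) or "No" (False) button."""
        if confirmed:
            self.yes += 1
        else:
            self.no += 1

    def labels(self):
        """Return the button labels in the form shown on the page."""
        return f"Yes({self.yes})", f"No({self.no})"


votes = EvidenceVotes()
votes.vote(True)
print(votes.labels())
```

After a single confirming click, the labels read "Yes(1)" and "No(0)", matching the example in the text.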
A biomarker is a characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
"Disease section" means a classification of the human diseases associated with allergy-related genes. All disease terms were extracted from PubMed abstracts by our bio-entity recognizer based on Human Disease Ontology.
"Validated evidence" is a sentence that passed three strict rounds of manual curation and was selected to represent the correlation between the gene and allergy.
The sentences or terms that match key words for genes, key words for allergy, or disease terms based on Human Disease Ontology are extracted from PubMed abstracts by our bio-entity recognizer and highlighted in our database.
Gene names are the formal gene symbols and their synonyms provided by RefSeq.
Disease terms are the standardized disease names and their common synonyms provided by Human Disease Ontology.
Yes, our website provides a "Feedback" feature, with which users can manually submit new genes to our database; the database will be updated periodically in the future. We also added a manual curation function to the evidence sentences of disease terms, with which users can give feedback by simply clicking the "Yes" or "No" button. In addition, users can email us with further questions or about potential collaborations.