CECMAtlas is the first database that stores a list of cancer-related extracellular matrix (C-ECM) genes collected by literature mining and manual curation. The database currently contains 225 human C-ECM genes, 1387 associated diseases and 12 biological processes.
CECMAtlas provides three ways to query: by gene name, by cancer name and by biological process.
Users can input a gene name in the "Gene Name" search box and a drop-down menu will provide auto-completed genes stored in the database. Selecting one of them and clicking the "Search" button will lead to the queried results containing a table showing the C-ECM genes, their associated diseases as well as the supporting evidence. All current query terms will be deleted if users click the "Reset" button.
Users can input a disease name in the "Cancer Term" search box, and a drop-down menu will provide auto-completed diseases stored in the database. Selecting one of them and clicking the "Search" button will lead to the queried results containing a table showing the C-ECM genes, associated diseases, biological processes, and supporting evidence.
Users can input a biological process in the "Biological Process" search box, and a drop-down menu will provide auto-completed biological processes stored in the database. Selecting one of them and clicking the "Search" button will lead to the queried results containing a table showing the C-ECM genes, associated diseases, biological processes, and supporting evidence.
Clicking on the hyperlink of the gene symbol in the "Gene" column will lead to the annotation information of the gene, validated sentences for biological processes, and the cross-references to external databases (i.e. Ensembl, NCBI Gene, UniProtKB and neXtProt).
When the user clicks the small triangle in the header of a column, the results in the table are re-sorted in ascending or descending order by that column. The user can also type words into the text box, or select a term from its drop-down list, to filter the listed genes, diseases and biological processes by sub-string match.
When the user clicks the number of evidence abstracts or sentences, a table containing the gene, the disease, the PubMed ID, the evidence sentence and the manual validation information will be displayed. The user can click on a sentence to see the original abstract, in which the keywords (gene and disease names) are highlighted.
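The sorting and sub-string filtering described above can be sketched as follows. This is only an illustration of the behaviour, not the code behind the website: the function names, the row layout and the placeholder values are hypothetical, and case-insensitive matching is an assumption.

```python
def filter_rows(rows, column, text):
    """Keep rows whose value in `column` contains `text` (case-insensitive, assumed)."""
    needle = text.lower()
    return [row for row in rows if needle in row[column].lower()]


def sort_rows(rows, column, descending=False):
    """Return rows sorted by `column`, ascending unless `descending` is True."""
    return sorted(rows, key=lambda row: row[column].lower(), reverse=descending)


# Placeholder rows; these are not real database records.
rows = [
    {"gene": "GENEB", "disease": "lung cancer"},
    {"gene": "GENEA", "disease": "breast cancer"},
    {"gene": "GENEC", "disease": "melanoma"},
]

cancer_rows = filter_rows(rows, "disease", "cancer")   # drops the melanoma row
sorted_rows = sort_rows(cancer_rows, "gene")           # GENEA first, then GENEB
```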
Gene names are the official gene symbols and their synonyms provided by RefSeq.
Cancer terms are the standardized cancer names and their common synonyms provided by Human Disease Ontology.
In CECMAtlas, we collected a final gene list related to cancer terms, and we provide the summary information about the genes and cancer terms as below:
Biological processes are the standardized biological process terms provided by Gene Ontology.
The "Browse & Download" page offers four ways to browse detailed information on C-ECM genes: browse by gene, browse by disease, browse by biomarker and browse by cancer-related biological process. All the information on C-ECM genes and their supporting literature evidence can be downloaded for further analysis.
Our data collection relied on text mining of C-ECM-related PubMed abstracts. We searched PubMed for relevant literature abstracts, then recognized bio-entities and extracted C-ECM candidate genes with our self-developed ontology-based bio-entity recognizer. For gene/protein recognition based on Protein Ontology (PR), our recognition tool achieves a recall, precision and F-measure of 0.883, 0.810 and 0.845, respectively, against the CRAFT corpus. This is comparable to current state-of-the-art biomedical annotation systems such as BeCAS.
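As a quick consistency check on the figures above, the F-measure can be recomputed from the reported precision and recall. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall); the text does not state which variant was used, and the function name is our own.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above: precision 0.810, recall 0.883.
f1 = f_measure(precision=0.810, recall=0.883)
print(round(f1, 3))  # prints 0.845
```

Under that assumption, the recomputed value agrees with the reported F-measure of 0.845.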
We used the following three steps to collect data. First, we collected 2636 cancer entries from Disease Ontology under the parent node “disease of cellular proliferation” (DOID:14566) and 429 ECM genes associated with the GO term “extracellular matrix” (GO:0031012) from Gene Ontology. Second, 325 candidate C-ECM genes were recognized and extracted from PubMed abstracts in which they co-occurred with the cancer entries and bioprocess keywords at the single-sentence level. That is to say, a gene occurs together with a particular bioprocess keyword and at least one of the cancer entries in a single sentence. Third, all 325 candidates were manually curated by our experts, and 225 genes were finally identified as human C-ECM genes.
To ensure data quality, three rounds of strict manual curation were carried out to identify C-ECM genes. First, all candidate C-ECM genes and their supporting evidence were checked independently by two experienced researchers. Second, after the selected genes and their corresponding evidence were submitted for internal review, a panel of three experts manually reviewed each of the selected gene names. Third, all co-authors were asked to randomly check C-ECM genes on our website to make sure that all genes stored in our database are highly credible C-ECM genes. Finally, we obtained 225 genes and 1386 associated diseases. All the evidence sentences that are regarded as validated evidence have been manually curated and are shown at the top of each C-ECM gene page.
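The sentence-level co-occurrence rule in the second step can be sketched as follows. This is a simplified illustration and not our ontology-based recognizer: it uses a naive sentence splitter and plain whole-word matching, and the function name, gene names and example text are hypothetical placeholders.

```python
import re


def cooccurring_genes(abstract, genes, cancers, processes):
    """Return genes that share a sentence with a cancer term and a bioprocess keyword."""
    hits = set()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        lowered = sentence.lower()

        def mentions(term):
            # Case-insensitive whole-word match of the term in this sentence.
            return re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered) is not None

        # A gene qualifies only if the same sentence also names at least one
        # cancer entry and at least one bioprocess keyword.
        if any(mentions(c) for c in cancers) and any(mentions(p) for p in processes):
            hits.update(g for g in genes if mentions(g))
    return hits


# Made-up text with placeholder gene names, for illustration only.
abstract = (
    "GENEA promotes cell migration in breast cancer. "
    "GENEB is expressed in many tissues. "
    "GENEC was studied in melanoma."
)
candidates = cooccurring_genes(
    abstract,
    genes=["GENEA", "GENEB", "GENEC"],
    cancers=["breast cancer", "melanoma"],
    processes=["cell migration"],
)
```

Here only GENEA is returned: GENEB shares no sentence with a cancer term, and GENEC appears with a cancer term but without a bioprocess keyword.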
Login is required only when users perform manual curation. Users can log in by clicking here. We will monitor and validate community curation feedback before including it in the database.
Here we offer a trial account for the curation function (Account: "test@test.com", Password: "test"). Note: All the other functions of CECMAtlas including data retrieval, browsing and downloading do not require any login or registration.