PolySearch Documentation

Help

  1. To Begin
  2. Association Search (Example: Disease-Gene)
  3. Checking a Search
  4. Custom Thesaurus Search
  5. SNP/PCR Analysis Pipeline

  1. To Begin: First choose a "Given" search type. For example, if you want to find information about a disease, choose "Disease" in the "Given" drop-down menu. Then, in the "Find" drop-down menu, choose what you want to find out about the given. For example, given "Disease" you can choose one of the "Disease-Gene", "Disease-Drug" and "Disease-Metabolite" associations. Finally, click the "Go" button.
    [Figures: How to choose a Given search type; How to choose a Find search type]

  2. Association Search (Example: Disease-Gene):
    Enter your query in the top textbox and click the "Submit" button (this will use the default settings).
    [Figure: Disease-Gene basic query]

    Custom filter words: PolySearch retrieves PubMed abstracts that contain the query word in the abstract title or in the abstract itself. Custom filter words can further filter out unwanted abstracts. For example, the query "Cohen Syndrome" with custom filter words "gene; protein" means PolySearch will retrieve the abstracts that have "Cohen Syndrome" in the abstract title or abstract itself AND that have either "gene" OR "protein" in their PubMed records. For advanced filtering, create your own filter by following this example: "metabolite[all]+OR+metabolites[all]+AND+glucose[tiab]" (i.e. use +OR+ and +AND+ to combine filter words, use [] to indicate the PubMed record field to which each filter word applies, and do not use semicolons).
    [Figure: Using a custom filter]

    PolySearch Only PubMed: PolySearch searches through PubMed as well as OMIM, SwissProt, DrugBank and HMDB. If you are only interested in PubMed (or you want to do a screening search using PubMed only), choose "Yes". This bypasses the searches in OMIM, SwissProt, DrugBank and HMDB so the results can be computed faster. If you choose "No", PolySearch will search through PubMed, OMIM, SwissProt, DrugBank and HMDB.
    [Figure: Selecting database search sources]

    Minimum number of citations/references: PolySearch only returns and displays results that have at least the minimum number of citations. For example, if the minimum number of citations is "2", PolySearch will report that "Cohen Syndrome" is associated with "COH1" if and only if two or more abstracts support that association.
    [Figure: How to choose the minimum number of citations]

  3. Checking a Search: While you are waiting for a search to complete, the result page refreshes every 10 seconds until PolySearch is done. The screen shows the type of search, the query, the filter words used and the job ID. In addition, progress messages (e.g. "Processing PubMed ... Done" for PubMed, SwissProt, OMIM and HMDB) show PolySearch's progress, along with time estimates for each step.
    [Figure: Checking search progress]

    When the search is done, a hyperlink will appear; click it to read PolySearch's result.
    [Figure: How to know your search is complete]

  4. Custom Thesaurus Search: On the PolySearch main page, choose the "Given Text Word, Find Associated Text Word" search. Then enter queries, filter words and text words in a "semicolon space" ("; ") delimited format, as shown in the figure below.
    [Figure: How to search with a custom dictionary]

  5. SNP/PCR Analysis Pipeline: Here we combine the "Given Gene, Find SNP" and "Given SNP, Find PCR Primer" tools to extract GSTT1 SNPs and input them into PCR Primer Design. We first input our gene on the "Given Gene, Find SNP" page:
    [Figure: Inputting a gene to search for SNPs]

    This returns a list of SNPs for GSTT1. The SNPs rs2234953, rs2266633, and rs2266637 seem important because they lie inside exons and have associated allele frequencies.
    [Figure: Sample result output showing matching SNPs for GSTT1]

    We then take these SNPs and input them into the "Given SNP, Find PCR Primer" tool:
    [Figure: Using the SNPs to perform PCR Primer Design]

    This provides us with PCR primer information for GSTT1:
    [Figure: Sample result output showing PCR analysis for GSTT1]
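
The advanced filter syntax described under "Custom filter words" in step 2 can be assembled programmatically. The following is a minimal sketch in Python; the helper name `build_filter` and its argument layout are hypothetical and are not part of PolySearch. It only reproduces the string format given in the example above (field tags in square brackets, terms joined by +OR+ and +AND+, no semicolons).

```python
from typing import List, Sequence, Tuple


def build_filter(terms: Sequence[Tuple[str, str]], operators: Sequence[str]) -> str:
    """Join (word, field) pairs into an advanced filter string.

    `terms` is a sequence of (filter word, PubMed field tag) pairs and
    `operators` holds the "AND"/"OR" connective placed between consecutive
    terms. Both names are illustrative assumptions, not a PolySearch API.
    """
    if not terms:
        raise ValueError("at least one term is required")
    if len(operators) != len(terms) - 1:
        raise ValueError("need exactly one operator between consecutive terms")

    first_word, first_field = terms[0]
    parts: List[str] = [f"{first_word}[{first_field}]"]
    for op, (word, field) in zip(operators, terms[1:]):
        if op not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {op}")
        parts.append(f"+{op}+{word}[{field}]")
    return "".join(parts)


example = build_filter(
    [("metabolite", "all"), ("metabolites", "all"), ("glucose", "tiab")],
    ["OR", "AND"],
)
print(example)
```

The resulting string can be pasted into the custom filter textbox in place of a semicolon-separated word list.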

Algorithms

PolySearch consists of seven basic components: 1) a web-based user interface for constructing queries; 2) a collection of internal and external biomedical databases; 3) a collection of biomedical synonyms (custom thesauruses); 4) a general text search engine for extracting data from heterogeneous databases; 5) a schema for selecting, ranking and integrating content; 6) a display tool for displaying and synopsizing results and 7) a PCR primer designing tool to facilitate SNP and mutation studies. An outline of PolySearch's general design is given in Figure 1.

Figure 1: PolySearch system overview showing the resources that PolySearch uses and the features found in PolySearch.


Sentence scoring, ranking and integration: Some of the more important rules in PolySearch's pattern recognition system are as follows:

  • For the main pattern "Query Word-Association Word-Thesaurus Word", PolySearch searches for compact patterns first. If a compact pattern cannot be found, then PolySearch searches for general patterns. If a general pattern cannot be found, then PolySearch searches for relaxed patterns.
  • Compact patterns:
    • The query word and the association word must be within 5 words (tokens) of each other.
    • A "Query Word-Association Word-Thesaurus Word" pattern must be established (i.e. all three types of words are present) within 10 words (tokens) of the query word.
    • A stop word such as "that", "which", "whereas" or "no" cannot be in a "Query Word-Association Word-Thesaurus Word" pattern.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, any thesaurus words that come after that phrase can also meet the pattern recognition criteria.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, if another association word or stop word is seen, the pattern resets.
  • General patterns:
    • All relevant words must be within 40 words (tokens) of each other.
    • A "Query Word-Association Word-Thesaurus Word" pattern must be established (i.e. all three types of words are present) within 15 words (tokens) of the query word.
    • A stop word such as "that", "which", "whereas" or "no" cannot be in a "Query Word-Association Word-Thesaurus Word" pattern.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, any thesaurus words that come after that phrase can also meet the pattern recognition criteria.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, if another association word or stop word is seen, the pattern resets.
  • Relaxed patterns:
    • All relevant words must be within 45 words (tokens) of each other.
    • The query word and the association word must be within 30 words (tokens) of each other.
    • A "Query Word-Association Word-Thesaurus Word" pattern must be established (i.e. all three types of words are present) within 40 words (tokens) of the query word.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, any thesaurus words that come after that phrase can also meet the pattern recognition criteria.
    • Once a "Query Word-Association Word-Thesaurus Word" pattern is established, if another association word is seen, the pattern resets.
  • For the "Association Word-Query Word-Thesaurus Word" pattern (mainly for Gene/Protein searches), the association word must have a suffix of -ate, -fer, -ment, -ing, -ion, -lex, -es, or -ions. In addition, all three words must be within 10 words (tokens) of each other.

  • For the "Query Word-Thesaurus Word-Association Word" pattern (mainly for Gene/Protein searches), the association word must be one of "complex", "complexes", "inhibitor", "inhibitors", "interaction", or "interactions". In addition, all three words must be within 8 words (tokens) of each other.
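
The compact, general and relaxed fallback for the main "Query Word-Association Word-Thesaurus Word" pattern can be sketched in Python as follows. This is a simplified illustration, not PolySearch's actual implementation. It assumes that "within N words (tokens)" means an index difference of at most N, that every word is a single token, and that only the first occurrence of the query word is considered. The reset rules and the rule about trailing thesaurus words are omitted. All names (`Tier`, `match_tier`, and so on) are hypothetical; the thresholds are copied from the rules above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set

STOP_WORDS = {"that", "which", "whereas", "no"}


@dataclass(frozen=True)
class Tier:
    name: str
    max_query_assoc_gap: Optional[int]  # query word to association word
    max_from_query: int                 # whole pattern, measured from the query word
    max_overall_span: Optional[int]     # all relevant words, from each other
    stop_words_block: bool              # does a stop word inside the pattern break it?


# Tried in this order: compact first, then general, then relaxed.
TIERS = (
    Tier("compact", 5, 10, None, True),
    Tier("general", None, 15, 40, True),
    Tier("relaxed", 30, 40, 45, False),
)


def _fits(tier: Tier, words: Sequence[str], q: int, a: int, t: int) -> bool:
    """Check one query/association/thesaurus position triple against a tier."""
    if tier.max_query_assoc_gap is not None and a - q > tier.max_query_assoc_gap:
        return False
    if t - q > tier.max_from_query:
        return False
    if tier.max_overall_span is not None and t - q > tier.max_overall_span:
        return False
    if tier.stop_words_block and any(w in STOP_WORDS for w in words[q + 1:t]):
        return False
    return True


def match_tier(tokens: Sequence[str], query: str,
               association_words: Set[str],
               thesaurus_words: Set[str]) -> Optional[str]:
    """Return the name of the strictest tier the sentence satisfies, or None.

    `association_words` and `thesaurus_words` are expected in lower case.
    """
    words = [tok.lower() for tok in tokens]
    query = query.lower()
    if query not in words:
        return None
    q = words.index(query)
    for tier in TIERS:
        for a in range(q + 1, len(words)):
            if words[a] not in association_words:
                continue
            for t in range(a + 1, len(words)):
                if words[t] in thesaurus_words and _fits(tier, words, q, a, t):
                    return tier.name
    return None


print(match_tier("Obesity is associated with the gene FTO".split(),
                 "obesity", {"associated"}, {"fto"}))
print(match_tier("Obesity which is associated with FTO".split(),
                 "obesity", {"associated"}, {"fto"}))
```

In the second sentence the stop word "which" blocks the compact and general tiers, so only the relaxed tier (which has no stop-word rule) matches.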

Databases

A distinctive feature of PolySearch is its integration of multiple databases containing both text and sequence data. Currently PolySearch can search and extract data from more than a dozen biomedical databases including PubMed, OMIM, SwissProt, DrugBank, the Human Metabolome Database (HMDB), the Human Protein Reference Database (HPRD), the Genetic Association Database (GAD), HapMap, Entrez SNP (dbSNP), CGAP SNP500cancer Database, and the Human Genome Mutation Database (HGMD). Many of these databases (PubMed, OMIM, etc.) are housed externally and queried through various custom CGI tools written in Perl, while others (DrugBank, HMDB and the SNP databases) are housed internally to accelerate PolySearch's query process. Below is a short description of each database.

  • PubMed: A service of the U.S. National Library of Medicine that includes over 16 million abstracts and paper titles from life science journals dating back to the 1950s.
  • OMIM (Online Mendelian Inheritance in Man): A catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins University, and developed for the World Wide Web by the NCBI (the National Center for Biotechnology Information).
  • GAD (Genetic Association Database): An archive of human genetic association studies of complex diseases and disorders.
  • SwissProt: A curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
  • HPRD (Human Protein Reference Database): A centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.
  • DrugBank: A unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.
  • HMDB: A freely available electronic database containing detailed information about small molecule metabolites found in the human body.
  • HapMap: A freely available resource that contains information pertaining to the haplotype map of the human genome. The HapMap database describes the common patterns of human DNA sequence variation.
  • Entrez SNP (dbSNP): A central repository for both single base nucleotide substitutions (SNPs) and short deletion and insertion polymorphisms in the human genome.
  • CGAP SNP500cancer Database: A part of the Cancer Genome Anatomy Project that is specifically designed to contain data on the genetic variation in genes important in cancer.
  • Human Genome Mutation Database (HGMD): A database comprising various types of mutation within the coding regions, splicing and regulatory regions of human nuclear genes causing inherited disease.

Evaluations

A text mining tool is only useful if it gives accurate results and extensive coverage in less time than what could be performed using alternative (i.e. non-computational) or competing computational methods. To evaluate PolySearch's performance, we used several different tests or methods. These included:

  • Evaluation #1: a comparison of features and capabilities between PolySearch and other biomedical text mining tools;
  • Evaluation #2: an evaluation of PolySearch's ability to identify genes and protein names within different sentences or abstracts;
  • Evaluation #3: a comparative evaluation of PolySearch's ability to identify protein-protein interactions;
  • Evaluation #4: an evaluation of PolySearch's ability to identify disease/gene associations;
  • Evaluation #5: an evaluation of PolySearch's ability to identify drug/drug-target associations;
  • Evaluation #6: an evaluation of PolySearch's ability to identify metabolite/enzyme associations;
  • Evaluation #7: several real-life assessments relating to its capacity to facilitate or accelerate database annotations; and
  • Evaluation #8: an evaluation of PolySearch's performance on a protein-protein interaction corpus (SPIES).

Evaluation #1:
In the first assessment, PolySearch was compared to seven other well known biomedical text mining tools, namely Entrez, MedMiner, MedGene, LitMiner, ALIBABA, IHOP and EBIMed. Comparisons included the types of searches supported, the extent of hyperlinking, the presence of access restrictions, the capacity for text and sentence highlighting, the presence of word co-occurrence for scoring, support for keywords/association words or pattern recognition, and the number of database integrations for each tool.

As seen in Table 1, Entrez offers the most extensive database and search coverage as well as the broadest hyperlinking capabilities. However, Entrez is more of an information retrieval system than a text mining system and so it lacks the ranking, scoring and sentence highlighting capabilities of other text mining tools. In contrast, MedMiner provides key highlighting capability and organizes the highlighted sentences into twelve general categories, which shortens the time required to gather relevant information from the selected texts. However, MedMiner searches are mainly limited to one-to-one searches (e.g. "one gene" to "one drug" search) and this limits the general utility of MedMiner. Both MedGene and LitMiner provide the capability to perform "given X find all Y" types of searches and both of them provide statistical rankings for the associations they found. Nevertheless, both MedGene and LitMiner lack the ability to perform text and sentence highlighting, making it difficult to verify the associations that MedGene and LitMiner found. ALIBABA, IHOP, EBIMed, and PolySearch all have the ability to rank the associations they found and supply both text and sentence highlighting for quick verification of the associations. ALIBABA treats PubMed abstracts and associations as a graph and provides a graphical interface to display the associations it found. While this approach may be useful for a small number of abstracts, for larger numbers of abstracts, the graph becomes almost unusable due to the over-abundance of information. IHOP, while great for identifying Gene/Protein to Gene/Protein interactions, lacks support for many other search types and this limits IHOP's usefulness in other areas of biomedical research. Unlike IHOP, EBIMed provides a means of analysis that is independent of the initial keyword query and is more flexible with the types of searches it allows.
However, EBIMed uses a pure word co-occurrence approach to assess associations and so it tends to lack the accuracy of systems that use both keywords and pattern matching (such as ALIBABA, IHOP and PolySearch). PolySearch, being the most recent addition, combines some of the best features from each of the other tools. In addition, PolySearch appears to be unique in terms of the diversity of its search and text ranking possibilities, its ability to perform extensive query synonym expansion using its different thesauruses, its PolySearch Relevancy Index (PRI) scoring display for immediate visual indications on the strength of association, its SNP search functionalities, and its ability to text mine additional databases such as OMIM, SwissProt, DrugBank, HMDB, HPRD and GAD.

Table 1: Feature comparison of various biomedical text mining tools.

Feature | Entrez | MedMiner | MedGene | LitMiner | ALIBABA | IHOP | EBIMed | PolySearch
Type of search supported | Literature, Disease, Gene, Structure, Taxonomy, SNP, Compound, etc. | Gene, Drug, Text Word | Gene, Disease | Gene, Disease, Compounds, Tissues/Organs | Gene, Disease, Drug, Tissues/Organs, Cells, Species | Gene | Gene, Cellular Compartment, Biological Process, Molecular Function, Drug, Species | Gene, Disease, Drug, Metabolite, Tissues/Organs, Subcellular Localization, Text Word
Extensive hyperlinking | Most Extensive | Less Extensive | Less Extensive | Less Extensive | Less Extensive | More Extensive | More Extensive | More Extensive
Access restrictions | None | None | Registration | None | None | None | None | None
Text and sentence highlighting | No | Yes | No | No | Yes | Yes | Yes | Yes
Co-occurrence scoring scheme | None | None | Abstract level | Abstract level | Sentence level | Sentence level | Sentence level | Sentence level
Use of keywords for association words | None | Predefined keywords | None | None | Predefined keywords | Predefined keywords | None | Predefined & custom association words
Sentence pattern recognition | No | No | No | No | Yes | Yes | No | Yes
Thesaurus query synonym expansion | Yes, limited | Yes, limited | Yes, limited | None | None | Yes, for genes only | None | Yes, extensive
Databases | PubMed, OMIM, Gene, MMDB, Taxonomy, dbSNP, PubChem, etc. | PubMed, GeneCards | PubMed | PubMed | PubMed | PubMed, HPRD, IntAct | PubMed | PubMed, OMIM, SwissProt, DrugBank, HMDB, HPRD, GAD, HapMap, dbSNP, CGAP, HGMD

Evaluation #2:
For the second assessment, we tested PolySearch's ability to identify gene and protein names within different sentences or abstracts. To do this, we used the dataset that IHOP used for evaluating their gene synonym identification for human genes [36]. The dataset contains 181 sentences from various PubMed abstracts with an average of about 2-3 gene names per sentence (the names include symbols, standard names, abbreviations and synonyms). We manually identified the correct gene and protein names from the dataset and used this collection as our gold standard to compare to PolySearch's gene synonym identification for the dataset. Table 2 shows PolySearch's precision, recall and f-measure in this evaluation as compared to IHOP.

Table 2: Precision, recall and f-measure on gene synonym identification for PolySearch and IHOP.

IHOP PolySearch
Precision (%) 87.1 90.1
Recall (%) 81.8 84.4
F-measure (%) 85.3 87.6

Evaluation #3:
For the third assessment, we compared PolySearch against EBIMed, IHOP, and HPRD. Precision, recall, and f-measure are shown in the following table.

Table 3: Precision, recall, and f-measure for protein-protein interaction evaluation among the different tools.

Measure | HPRD | EBIMed | IHOP | PolySearch R1 >= 1 | PolySearch + HPRD R1 >= 1
TP, FN, FP | TP=31, FN=99-31=68, FP=0 | TP=23, FN=99-23=76, FP=83-23=60 | TP=39, FN=99-39=60, FP=80-39=41 | TP=64, FN=99-64=35, FP=86-64=22 | TP=82, FN=99-82=17, FP=104-82=22
Precision | 31/31 = 100% | 23/83 = 27.7% | 39/80 = 48.8% | 64/86 = 74.4% | 82/104 = 78.8%
Recall | 31/99 = 31.3% | 23/99 = 23.2% | 39/99 = 39.4% | 64/99 = 64.6% | 82/99 = 82.8%
F-measure | 47.7 (±25.7)% | 25.3 (±22.4)% | 43.6 (±11.6)% | 69.2 (±10.0)% | 80.8 (±6.8)%
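
The precision, recall and f-measure rows in the evaluation tables follow from the TP, FN and FP counts. As a minimal sketch (the function name is ours, and we assume the f-measure is the harmonic mean of precision and recall, which is consistent with the tabulated values), the PolySearch R1 >= 1 column of Table 3 can be recomputed as shown below. The ± ranges reported in the tables are not reproduced here because their derivation is not described in this section.

```python
from typing import Tuple


def precision_recall_f(tp: int, fn: int, fp: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f-measure) as fractions between 0 and 1."""
    if tp <= 0:
        raise ValueError("at least one true positive is required")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall (assumed definition of f-measure).
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts for PolySearch R1 >= 1 in Table 3: TP=64, FN=35, FP=22.
p, r, f = precision_recall_f(tp=64, fn=35, fp=22)
print(f"precision={p:.1%} recall={r:.1%} f-measure={f:.1%}")
```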

Evaluation #4:
For the fourth assessment, we evaluated "Given Disease Find Associated Gene" queries. The following table shows the performance of the different text mining tools.

Table 4: "Given Disease Find Associated Gene": precision, recall and f-measure for GAD, LitMiner, EBIMed, PolySearch R2 >= 1, PolySearch R1 >= 1, PolySearch with PubMed + OMIM, and PolySearch with PubMed + OMIM + GAD.

Measure | GAD | LitMiner | EBIMed | PolySearch R2 >= 1 | PolySearch R1 >= 1 | PolySearch + OMIM R1 >= 1 | PolySearch + OMIM + GAD R1 >= 1
TP, FN, FP | TP=21, FN=132-21=111, FP=0 | TP=4, FN=132-4=128, FP=5-4=1 | TP=102, FN=132-102=30, FP=177-102=75 | TP=119, FN=132-119=13, FP=251-119=132 | TP=93, FN=132-93=39, FP=133-93=40 | TP=101, FN=132-101=31, FP=143-101=42 | TP=113, FN=132-113=19, FP=156-113=43
Precision | 21/21 = 100% | 4/5 = 80% | 102/177 = 57.6% | 119/251 = 47.4% | 93/133 = 69.9% | 101/143 = 70.6% | 113/156 = 72.4%
Recall | 21/132 = 15.9% | 4/132 = 3.0% | 102/132 = 77.3% | 119/132 = 90.2% | 93/132 = 70.5% | 101/132 = 76.5% | 113/132 = 85.6%
F-measure | 27.5 (±23.0)% | 5.8 (±13.5)% | 66.0 (±10.3)% | 62.1 (±16.6)% | 70.2 (±17.5)% | 73.5 (±9.3)% | 78.5 (±10.3)%

Evaluation #5:
For the fifth assessment, we evaluated "Given Drug Find Associated Gene" queries. The intent of this query is to find all genes that are affected or acted on by a drug. For a text mining system to be useful, it must perform well with large amounts of data such as that used in this assessment. For this assessment, we compared PolySearch's results to EBIMed, LitMiner, and a manually curated database on drug-protein interactions called DrugBank. DrugBank is one of the largest and most comprehensive drug and drug target databases available. In particular, it contains extensive information about drug and gene/protein associations (i.e. drug metabolizing enzymes and drug targets). For this assessment, ten drugs were randomly chosen from DrugBank for analysis (Table 5a); the results are shown in Table 5b.

Table 5a: The DrugBank IDs and common names for the ten drugs randomly chosen from DrugBank for evaluating "Given Drug Find Associated Gene" queries.

DrugBank ID Common Name
APRD00028 Tramadol
APRD00108 Pefloxacin
APRD00128 Tizanidine
APRD00136 Quinidine
APRD00294 Bumetanide
APRD00319 Fenfluramine
APRD00454 Cisapride
APRD00600 Famciclovir
APRD00706 Nizatidine
APRD00761 Dicumarol

Table 5b: "Given Drug Find Associated Gene": comparing DrugBank, LitMiner, EBIMed, PolySearch with PubMed, and PolySearch with PubMed + DrugBank.

Measure | DrugBank | LitMiner | EBIMed | PolySearch (R1 >= 1) | PolySearch + DrugBank (R1 >= 1)
TP, FN, FP | TP=19, FN=227-19=208, FP=0 | TP=24, FN=227-24=203, FP=41-24=17 | TP=118, FN=227-118=109, FP=186-118=68 | TP=220, FN=227-220=7, FP=358-220=138 | TP=223, FN=227-223=4, FP=363-223=140
Precision | 19/19 = 100% | 24/41 = 58.5% | 118/186 = 63.4% | 220/358 = 61.5% | 223/363 = 61.4%
Recall | 19/227 = 8.4% | 24/227 = 10.6% | 118/227 = 52.0% | 220/227 = 96.9% | 223/227 = 98.2%
F-measure | 15.4 (±13.7)% | 17.9 (±11.8)% | 57.1 (±17.7)% | 75.2 (±9.5)% | 75.6 (±9.2)%

To assess PolySearch's performance, PolySearch's "Given Drug Find Associated Gene" query was run for each of the ten drugs using their common names as well as their synonyms (which were automatically generated by PolySearch). We used PolySearch in two modes. In the first mode, the search was limited to PubMed abstracts only; in the second mode, we turned on PolySearch's access to DrugBank to see if this would help to improve performance. The default PolySearch settings were used in this assessment. The association word list used in this assessment consists of the most likely protein interaction words. We only looked at the results that PolySearch's R1 system returned and then tried to map the results that EBIMed and LitMiner returned to the results that PolySearch's R1 system returned or the results derived from DrugBank. Also, based on previous observations, we chose to ignore gene names that were three letters or less. All the PolySearch extracted drug-gene associations that satisfied the previously mentioned criteria were manually verified by reading the abstracts and checking appropriate databases. All manually verified drug-gene associations, including the pre-existing drug-gene associations in DrugBank, were combined to derive a list of gold standard drug-gene associations. This list was used to tabulate the performance measures in Table 5b.

As Table 5b shows, PolySearch was able to identify significantly more drug-gene interactions (and potential gene targets) than what is provided by DrugBank. On average, PolySearch found 20.9 new drug-gene associations for each of the ten drugs. The reason for this discrepancy lies in the fact that the drug targets in DrugBank are typically primary drug targets, meaning they are responsible for the therapeutic effects of many drugs, while PolySearch identified secondary drug targets in addition to primary drug targets. These secondary drug targets, which could be responsible for the side effects of drugs, can be just as important as the primary drug targets. Inclusion of these secondary drug targets into DrugBank would likely improve the coverage and utility of this database.

Table 5b also shows that PolySearch outperformed both EBIMed and LitMiner in this task. It is worth mentioning that LitMiner precomputes its results and even though this has the advantage of providing the results almost instantaneously, the precomputed results contain a shorter list of genes. Furthermore, precomputed results for some of the drugs were not available. As a result, LitMiner's performance suffered. However, even if we only looked at the set of drugs for which LitMiner's precomputed results were available, LitMiner would still perform the worst. For PolySearch with access to DrugBank turned on, the performance improved slightly over PolySearch with PubMed only and R1 >= 1. In this case, we were able to extract most of the drug-gene associations with PubMed alone. As a result, the recall improved only slightly with the DrugBank integration turned on. While PolySearch did find more drug-gene interactions, higher precision is still desirable. Below we took a closer look at the precision scores for different R1 cut-off values and the precision scores for all R1 sentences found in the ten queries.

Table 5c: Precision for drug-gene associations of the ten "Given Drug Find Associated Gene" queries at different R1 scores and precision for all R1 sentences of the ten queries.

Measure | R1 >= 1 | R1 >= 2 | R1 >= 3 | All R1 sentences
TP, FP | TP=220, FP=358-220=138 | TP=120, FP=164-120=44 | TP=90, FP=105-90=15 | TP=1148, FP=1283-1148=135
Precision | 220/358 = 61.5% | 120/164 = 73.2% | 90/105 = 85.7% | 1148/1283 = 89.5%

Table 5c shows that with an R1 score >= 1, 61.5% of the extracted associations are accurate; with an R1 score >= 2, 73.2% are accurate; and with an R1 score >= 3, 85.7% are accurate. It is interesting to see that R1 >= 3 achieved high precision while still finding new drug-gene associations (~ 7 new associations per drug). It seems that R1 >= 3 can be used as a reasonable cut-off score for automatic information extraction, as it achieves high precision while maintaining good coverage.

In total, there were 1283 R1 sentences found in the ten "Given Drug Find Associated Gene" queries. Assuming that all R1 sentences for true associations are relevant, then the R1 sentences achieved a precision of 89.5% as 1148 of the 1283 R1 sentences were judged to be relevant. To compare this to a baseline consider this: if a user had to read all 9816 sentences in the 954 abstracts that mention the 1148 relevant R1 sentences, this would equate to a baseline precision of 11.7%. Furthermore, assuming it takes 30 seconds for a skilled individual to process an abstract this would translate to 8 hours (954 * 30 seconds) of continuous reading. The time taken by PolySearch for not only identifying the 1283 R1 sentences in the 954 abstracts but also searching through the 14,000 total extracted abstracts was about 20 minutes. This shows that using PolySearch is significantly faster than using PubMed and manually reading abstracts.
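
The baseline comparison above is simple arithmetic on the reported counts. The short sketch below (the function name and return layout are ours) recomputes the R1 sentence precision, the read-everything baseline precision and the manual reading time from the figures quoted in the paragraph.

```python
from typing import Dict


def baseline_comparison(relevant_sentences: int, r1_sentences: int,
                        all_sentences: int, abstracts: int,
                        seconds_per_abstract: int) -> Dict[str, float]:
    """Compare R1 sentence precision with reading every sentence manually."""
    return {
        # Relevant sentences among those PolySearch flagged as R1.
        "r1_precision": relevant_sentences / r1_sentences,
        # Relevant sentences among all sentences a reader would go through.
        "baseline_precision": relevant_sentences / all_sentences,
        # Manual reading time for the abstracts, in hours.
        "reading_hours": abstracts * seconds_per_abstract / 3600,
    }


result = baseline_comparison(
    relevant_sentences=1148,
    r1_sentences=1283,
    all_sentences=9816,
    abstracts=954,
    seconds_per_abstract=30,
)
for name, value in result.items():
    print(name, round(value, 3))
```

With these inputs the reading time comes to just under 8 hours, against roughly 20 minutes reported for PolySearch.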

Overall, with this assessment we demonstrated that PolySearch can serve as an automatic information extraction system for large amounts of biomedical data. We also demonstrated that PolySearch outperformed other text mining tools and that it can extract more data than is contained in a manually curated database. The precision that PolySearch achieved is significantly better than that of using PubMed and manually reading abstracts. By varying the R1 cut-off scores, PolySearch can be an information extraction system with high recall and moderate precision (R1 >= 1) or a system with high precision and moderate recall (R1 >= 3) that is still capable of finding new associations not found in a high quality manually curated database.


Evaluation #6:
In the sixth assessment, we evaluated "Given Metabolite Find Associated Gene" queries. With this assessment, we investigated how PolySearch performs with another type of automatic information extraction task. For this assessment, we compared PolySearch, EBIMed, and LitMiner to the Human Metabolome Database (HMDB). The HMDB is a database containing detailed chemical, biological and clinical information about small molecule metabolites found in the human body. The HMDB also contains metabolic enzyme data for each of the metabolites. These metabolite/metabolic enzyme associations are the ones that we are interested in and what we wish to compare to PolySearch's results for its "Given Metabolite Find Associated Gene" queries. To make this assessment, the following ten metabolites were randomly chosen from the HMDB:

Table 6a: The HMDB IDs and common names for the ten metabolites randomly chosen from HMDB for evaluating Given Metabolite Find Associated Gene queries.

HMDB ID Common Name
HMDB00210 Pantothenic acid
HMDB00721 Glycyl-L-proline
HMDB00835 N-Acetyl-a-D-galactosamine
HMDB01059 1D-Myo-inositol 1,3,4,5-tetrakisphosphate
HMDB01175 Malonyl-CoA
HMDB01381 Prostaglandin H2
HMDB01413 Citicoline
HMDB01489 Ribose 1-phosphate
HMDB01550 S-Formylglutathione
HMDB02037 12-Hydroxyeicosatetraenoic acid

PolySearch's "Given Metabolite Find Associated Gene" query was run for each of the ten metabolites using their common names as well as their synonyms. This time PubMed, OMIM, and HMDB were used as data sources for the search. Other search settings were the same as mentioned in the Drug-Gene assessment. The average number of abstracts identified per query was 880. A list of gold standard metabolite-gene associations for these ten metabolites was compiled by manually verifying PolySearch metabolite-gene associations for R1 >= 1 and by manually compiling data from the HMDB. Table 6b shows a comparison between HMDB, EBIMed, LitMiner, PolySearch with PubMed alone, PolySearch with PubMed and OMIM, and PolySearch with PubMed, OMIM and HMDB.

Table 6b: Given Metabolite Find Associated Gene: precision, recall and f-measure for HMDB, LitMiner, EBIMed, PolySearch with PubMed, PolySearch with PubMed + OMIM, and PolySearch with PubMed + OMIM + HMDB:

Measure | HMDB | LitMiner | EBIMed | PolySearch (R1 >= 1) | PolySearch + OMIM (R1 >= 1) | PolySearch + OMIM + HMDB (R1 >= 1)
TP, FN, FP | TP=26, FN=183-26=157, FP=0 | TP=5, FN=183-5=178, FP=0 | TP=83, FN=183-83=100, FP=111-83=28 | TP=166, FN=183-166=17, FP=263-166=97 | TP=170, FN=183-170=13, FP=267-170=97 | TP=183, FN=183-183=0, FP=284-183=101
Precision | 26/26 = 100% | 5/5 = 100% | 83/111 = 74.8% | 166/263 = 63.1% | 170/267 = 63.7% | 183/284 = 64.4%
Recall | 26/183 = 14.2% | 5/183 = 2.7% | 83/183 = 45.4% | 166/183 = 90.7% | 170/183 = 92.9% | 183/183 = 100%
F-measure | 24.9 (±17.6)% | 5.3 (±9.3)% | 56.5 (±28.4)% | 74.4 (±8.7)% | 75.6 (±8.1)% | 78.4 (±7.6)%

Overall, PolySearch appears to be just as effective in automatic information extraction for metabolite-gene associations as it is for drug-gene associations. On average, PolySearch found 15.7 new metabolite-gene associations for each of the ten metabolites. Table 6b also shows that the performance of PolySearch using PubMed + OMIM + HMDB is the best. This assessment demonstrates again that mining of high quality, manually curated databases can help improve both the sensitivity and specificity of biomedical information extraction.

Next, we took a closer look at the precision scores for different R1 score cut-offs and the precision scores for all R1 sentences to see if high precision with moderate recall could be achieved.

Table 6c: Precision for metabolite-gene associations of the ten "Given Metabolite Find Associated Gene" queries at different R1 scores and precision for all R1 sentences of the ten queries.

Measure | R1 >= 1 | R1 >= 2 | R1 >= 3 | All R1 sentences
TP, FP | TP=166, FP=263-166=97 | TP=85, FP=105-85=20 | TP=54, FP=60-54=6 | TP=828, FP=926-828=98
Precision | 166/263 = 63.1% | 85/105 = 81.0% | 54/60 = 90.0% | 828/926 = 89.4%

An R1 cut-off score of >= 3 achieved a precision of 90% while still finding new metabolite-gene associations (~ 4 new associations per metabolite). This suggests that PolySearch can be tuned to achieve high precision and moderate recall, which means that it could be used for automated text mining. As noted in Table 6c, PolySearch achieved 89.4% precision using R1 sentences. Compared to the PubMed baseline, manually retrieving the 828 relevant R1 sentences from the 740 abstracts (7287 sentences) equates to a precision of 11.4%.


Evaluation #7
For the seventh assessment, PolySearch was compared with the speed/coverage performance of a researcher tasked with finding all metabolites known to be in human cerebrospinal fluid and obtaining their concentration values. The individual (a senior undergraduate, now in medical school) was given 6 weeks and several primary references (PubMed, the Merck Manual, Google Scholar, Wikipedia) to assist with his search. The student was encouraged to access and read through abstracts, complete journal articles and clinical chemistry reference books to obtain the necessary data. Additionally, suggestions were provided throughout to improve his search methodology. After the student had completed his 6-week search project, the number of metabolite concentration values he had identified using manual methods was just 47 (in 42 days). PolySearch was then run using the "Given Text Word Find Metabolites" query, with the text words being "CSF" and "cerebrospinal fluid", and the resulting list was given to the student to help his search. At the end of the student's search project (roughly 9 weeks later), a total of 308 concentration values were reported by the student, with 70% of these being obtained through PolySearch. The student indicated that PolySearch had helped him tremendously. While it may be argued that such an assessment lacks the scientific rigour found in other evaluations, we believe these results are perhaps the most realistic in terms of demonstrating the potential time-savings and the breadth of coverage that are possible with a robust text-mining system.


Evaluation #8
For the eighth evaluation, the dataset used was the SPIES corpus for protein-protein interactions [22], which contains 963 sentences and 1436 interactions. Some examples of interactions from the dataset are shown below (each interaction has four tab-delimited parts: the type of interaction, the first participant, the second participant and the sentence itself):

interaction     Skp1    Fwd2    A coimmunoprecipitation assay has revealed the in vivo interaction between Skp1 and Fwd2 through the F-box domain.
interact        Apg3p   Apg5p   A cross-linking experiment revealed that Apg3p interacts with the endogenous Apg12p/Apg5p conjugate.
interact        Apg3p   Apg12p  A cross-linking experiment revealed that Apg3p interacts with the endogenous Apg12p/Apg5p conjugate.
					

Since PolySearch queries take the form "Given X Find Associated Y", the first participant is used as the value for X. Additionally, in order to develop and test PolySearch's rule-based pattern recognition system independently of the order of the participants, we expanded the corpus so that, for each interaction and its sentence, either participant can be used as the given X. For example:

interaction     Skp1    Fwd2    A coimmunoprecipitation assay has revealed the in vivo interaction between Skp1 and Fwd2 through the F-box domain.
					

becomes:

interaction     Skp1    Fwd2    A coimmunoprecipitation assay has revealed the in vivo interaction between Skp1 and Fwd2 through the F-box domain.
interaction     Fwd2    Skp1    A coimmunoprecipitation assay has revealed the in vivo interaction between Skp1 and Fwd2 through the F-box domain.
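
The expansion step can be sketched as follows. This is an illustrative reconstruction, assuming each record is a single line with four tab-delimited fields as described above; `expand_corpus` is a hypothetical helper, not the script actually used to prepare the corpus.

```python
def expand_corpus(lines):
    """Yield each tab-delimited record twice: as given and with participants swapped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Fields: interaction type, first participant, second participant, sentence.
        kind, first, second, sentence = line.split("\t", 3)
        yield (kind, first, second, sentence)
        if first != second:
            # Swapped copy, so the second participant can also serve as the given X.
            yield (kind, second, first, sentence)


sentence = (
    "A coimmunoprecipitation assay has revealed the in vivo interaction "
    "between Skp1 and Fwd2 through the F-box domain."
)
record = "\t".join(["interaction", "Skp1", "Fwd2", sentence])

for kind, x, y, text in expand_corpus([record]):
    print(kind, x, y, text, sep="\t")
```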
					

With the corpus expanded, the first 200 sentences of the corpus were used as a training set and the rest of the sentences were used as the test set. For association words, we first manually compiled a list of the most likely protein-protein interaction words and then assembled a complete list using sentence analysis of unrelated protein-protein interaction texts. One method for the sentence analysis and extraction of association words was described in the previous chapter. The default association word list for all possible PolySearch queries can be found at http://wishart.biology.ualberta.ca/polysearch/help/association_word_list.htm. When given a protein X and a sentence, PolySearch attempts to find the protein-protein interaction(s) described in the sentence for the given protein X. Table 8 shows the performance of PolySearch's rule-based pattern recognition system on the expanded corpus (i.e. the cut-off score is R1 >= 1).
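
The training/test split described above can be sketched as follows. This is a minimal illustration, assuming the split is made on distinct sentences in corpus order, so that both orderings of an expanded interaction fall in the same set; the exact procedure is not specified here, and `split_by_sentence` is a hypothetical helper.

```python
def split_by_sentence(records, n_train=200):
    """Split (type, X, Y, sentence) records by order of first sentence appearance."""
    seen = {}  # sentence -> index of its first appearance in the corpus
    train, test = [], []
    for record in records:
        sentence = record[3]
        index = seen.setdefault(sentence, len(seen))
        # Records from the first n_train distinct sentences go to the training set.
        (train if index < n_train else test).append(record)
    return train, test


# Toy example with three sentences and a training size of two sentences.
records = [
    ("interaction", "Skp1", "Fwd2", "sentence one"),
    ("interaction", "Fwd2", "Skp1", "sentence one"),
    ("interact", "Apg3p", "Apg12p", "sentence two"),
    ("bind", "A", "B", "sentence three"),
]
train, test = split_by_sentence(records, n_train=2)
print(len(train), "training records;", len(test), "test records")
```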

Table 8: Precision, recall and f-measure on a corpus of protein-protein interaction using PolySearch's rule based pattern recognition system (when X is given) and using a Naive Bayes classifier available from Weka.

Metric        | PolySearch R1 >= 1 (training set) | PolySearch R1 >= 1 (test set) | PolySearch R2, R3 >= 1 Baseline (test set) | Naive Bayes (test set)
Precision (%) | 70                                | 71.1                          | 49.1                                       | 49.4
Recall (%)    | 75.7                              | 71.8                          | 84.4                                       | 84.4
F-measure (%) | 72.7 (±5.2)                       | 71.5 (±6.5)                   | 62.1 (±4.8)                                | 62.3 (±4.6)

2008 © Polyomx research group