BioMinT: Biological Text Mining
EU FP5 Quality of Life Project
Abstract
Genome research has spawned unprecedented volumes of data, but characterisation of DNA and protein sequences has not kept pace with the rate of data acquisition. To anyone trying to know more about a given sequence, the worldwide collection of abstracts and papers remains the ultimate information source. The goal of the BioMinT project is to develop a generic text mining tool that (1) interprets diverse types of query, (2) retrieves relevant documents from the biological literature, (3) extracts the required information, and (4) outputs the result as a database slot filler or as a structured report. The BioMinT tool will thus operate in two modes. As a curator's assistant, it will be validated on SWISS-PROT and PRINTS; as a researcher's assistant, its reports will be submitted to the scrutiny of biologists in academia and industry. The project will be conducted by an interdisciplinary team from biology, computational linguistics, and data/text mining.
Objectives
The overall objective is to develop a generic text mining tool which performs information retrieval and extraction in one of two modes. As a curator's assistant, it outputs a query result as a database field filler according to a prespecified template; as a researcher's assistant, it generates a structured report in readable prose. The underlying technological challenges include: the investigation of information retrieval and extraction based on semantic content; customization of natural language processing techniques to biological texts; and application of relational data/text mining techniques. The business objective is the commercialization of this tool; the target market includes biotech and pharmaceutical companies that maintain databases or otherwise depend on efficient and reliable information retrieval and extraction.
Description of work
We will develop a generic text mining tool to support the professional activities of biology researchers and database annotators. The core of this system will consist of information retrieval (IR) techniques for identifying documents that are relevant to a given query and information extraction (IE) techniques for discovering the required answers. We will take a strictly problem-oriented, rapid prototyping approach to system development. All design decisions will be based on input from those who will use the final product in their daily work: (1) curators of SWISS-PROT and PRINTS, and (2) biology researchers from partner institutions and interested companies.
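The two-stage core described above (retrieve relevant documents, then extract answers and emit them either as a database field filler or as a readable report) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the protein names, the term-overlap ranking, the hand-written extraction pattern, and all function names are hypothetical, and none of it represents the BioMinT system, which is planned around relational learning rather than fixed patterns.

```python
import re
from collections import Counter

# Toy corpus standing in for literature abstracts (invented for illustration).
DOCS = {
    "doc1": "Protein ABC1 is located in the mitochondrion and binds ATP.",
    "doc2": "The kinase XYZ2 is located in the nucleus.",
    "doc3": "Weather patterns affect crop yield.",
}

# Hand-written pattern used as a stand-in for a learned extractor.
LOCATION = re.compile(r"(\w+) is located in the (\w+)")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, docs, top_k=2):
    """IR step: rank documents by the number of terms shared with the query."""
    q = Counter(tokenize(query))
    scores = {}
    for doc_id, text in docs.items():
        d = Counter(tokenize(text))
        scores[doc_id] = sum(min(q[t], d[t]) for t in q)
    # Sort by descending score, breaking ties by document id.
    ranked = sorted(scores, key=lambda k: (-scores[k], k))
    return [doc_id for doc_id in ranked if scores[doc_id] > 0][:top_k]


def extract(doc_ids, docs):
    """IE step: collect (protein, location) facts from the retrieved documents."""
    facts = []
    for doc_id in doc_ids:
        for m in LOCATION.finditer(docs[doc_id]):
            facts.append(
                {"protein": m.group(1), "location": m.group(2), "source": doc_id}
            )
    return facts


def as_slot_filler(facts):
    """Curator mode: map each protein to the value of a 'location' field."""
    return {f["protein"]: f["location"] for f in facts}


def as_report(facts):
    """Researcher mode: render the same facts as readable sentences."""
    return "\n".join(
        f"{f['protein']} is reported in the {f['location']} (source: {f['source']})."
        for f in facts
    )


hits = retrieve("protein located in the mitochondrion", DOCS)
facts = extract(hits, DOCS)
print(as_slot_filler(facts))
print(as_report(facts))
```

The sketch keeps retrieval and extraction as separate functions, so the same extracted facts can feed either output mode.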
Initially, IR and IE technologies will be developed independently based on training and evaluation data provided by the user partners (biologists). The first prototype (month 12) will feature a graphical user interface and will already be able to produce simple reports and annotate selected SWISS-PROT and PRINTS fields. Subsequently, the IR and IE technologies will be gradually integrated into a coherent system. The power of our IR and IE techniques will come from their seamless integration of biochemical background knowledge and natural language processing techniques, which is enabled by the use of state-of-the-art relational learning algorithms. After the completion of the second prototype (month 24), the system will be extended with an automated update module that regularly checks existing database annotations against the recent relevant literature and suggests additional annotations to the database curator. It will also be generalized to other applications based on input from the End User Club, which will be formed at project kickoff. The final BioMinT text mining system (month 36) will thus be useful for a wide variety of bio-databases and other applications that require automated information extraction from the scientific literature.
The project is funded by the European Commission as BioMinT, contract no. QLRI-CT-2002-02770, under the RTD programme "Quality of Life and Management of Living Resources".