EU FP5 Quality of Life Project
BioMinT: Biological Text Mining
Contents
Project Partners
Presentations on the Project
Luc Dehaspe: Great Expectations: A To-Do List for the Biologist's in Silico Research Assistant. Data and Text Mining for Bioinformatics, Cavtat, 22 September 2003.
Short project description
Abstract
Genome research has spawned unprecedented volumes of data, but characterisation of DNA and protein sequences has not kept pace with the rate of data acquisition. For anyone trying to learn more about a given sequence, the worldwide collection of abstracts and papers remains the ultimate information source. The goal of the BioMinT project is to develop a generic text mining tool that (1) interprets diverse types of query, (2) retrieves relevant documents from the biological literature, (3) extracts the required information, and (4) outputs the result as a database slot filler or as a structured report. The BioMinT tool will thus operate in two modes. As a curator's assistant, it will be validated on SWISS-PROT and PRINTS; as a researcher's assistant, its reports will be submitted to the scrutiny of biologists in academia and industry. The project will be conducted by an interdisciplinary team from biology, computational linguistics, and data/text mining.
Objectives
The overall objective is to develop a generic text mining tool that performs information retrieval and extraction in one of two modes. As a curator's assistant, it outputs a query result as a database field filler according to a prespecified template; as a researcher's assistant, it generates a structured report in readable prose. The underlying technological challenges include the investigation of information retrieval and extraction based on semantic content, the customization of natural language processing techniques to biological texts, and the application of relational data/text mining techniques. The business objective is the commercialization of this tool; the target market includes biotech and pharmaceutical companies that maintain databases or otherwise depend on efficient and reliable information retrieval and extraction.
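The two output modes can be illustrated with a minimal sketch. It assumes the extraction step has already produced a flat dictionary of facts; the template field names, the `ProtX` entry, and both function names are hypothetical and do not reflect the actual SWISS-PROT or PRINTS schemas or the BioMinT implementation.

```python
from typing import Dict, List, Optional

# Hypothetical template: field names are illustrative only,
# not the real SWISS-PROT or PRINTS field set.
TEMPLATE_FIELDS: List[str] = ["protein", "function", "subcellular_location"]


def as_slot_filler(facts: Dict[str, str],
                   fields: List[str] = TEMPLATE_FIELDS) -> Dict[str, Optional[str]]:
    """Curator mode: keep only the template fields; unfilled slots stay None."""
    return {field: facts.get(field) for field in fields}


def as_report(facts: Dict[str, str]) -> str:
    """Researcher mode: render the same facts as short readable prose."""
    protein = facts.get("protein", "The protein")
    sentences = []
    if "function" in facts:
        sentences.append(f"{protein} is reported to act as {facts['function']}.")
    if "subcellular_location" in facts:
        sentences.append(f"It is found in the {facts['subcellular_location']}.")
    return " ".join(sentences)


# Toy facts, invented for the example.
facts = {"protein": "ProtX", "function": "a kinase", "source": "toy abstract"}
slots = as_slot_filler(facts)
report = as_report(facts)
```

Both functions consume the same extracted facts, so in this sketch the choice of mode only affects the final rendering step.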
Description of work
We will develop a generic text mining tool to support the professional activities of biology researchers and database annotators. The core of this system will consist of information retrieval (IR) techniques for identifying documents that are relevant to a given query and information extraction (IE) techniques for discovering the required answers. We will take a strictly problem-oriented, rapid prototyping approach to system development. All design decisions will be based on input from those who will use the final product in their daily work: (1) curators of SWISS-PROT and PRINTS, and (2) biology researchers from partner institutions and interested companies.
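As a rough illustration of how an IR step and an IE step fit together, the sketch below ranks toy documents by query-term overlap and then pulls entity pairs out of the retrieved text with a regular expression. These are deliberately simple stand-ins, not the relational learning methods the project proposes; the corpus, the protein names, and the `retrieve` and `extract` functions are all hypothetical.

```python
import re
from typing import List, Tuple


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy IR step: rank documents by how many query terms they contain."""
    terms = set(query.lower().split())
    scored = []
    for doc in documents:
        words = set(re.findall(r"\w+", doc.lower()))
        score = len(terms & words)
        if score > 0:
            scored.append((score, doc))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


# Hand-written pattern for sentences of the form "X binds (to) Y".
BINDS = re.compile(r"(\w+) binds (?:to )?(\w+)")


def extract(documents: List[str]) -> List[Tuple[str, str]]:
    """Toy IE step: collect (subject, partner) pairs from matching sentences."""
    pairs = []
    for doc in documents:
        pairs.extend(BINDS.findall(doc))
    return pairs


# Invented corpus for the example.
corpus = [
    "ProtA binds to ProtB in the nucleus.",
    "ProtC is expressed in liver tissue.",
    "Weather was mild in September.",
]
hits = retrieve("ProtA binding nucleus", corpus)
pairs = extract(hits)
```

Keeping retrieval and extraction as separate functions mirrors the plan to develop the two technologies independently before integrating them.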
Initially, IR and IE technologies will be developed independently based on training and evaluation data provided by the user partners (biologists). The first prototype (month 12) will feature a graphical user interface and will already be able to produce simple reports and annotate selected SWISS-PROT and PRINTS fields. Subsequently, the IR and IE technologies will be gradually integrated into a coherent system. The power of our IR and IE techniques will come from their seamless integration of biochemical background knowledge and natural language processing techniques, which is enabled by the use of state-of-the-art relational learning algorithms. After the completion of the second prototype (month 24), the system will be extended with an automated update module that regularly checks existing database annotations against the recent relevant literature and suggests additional annotations to the database curator. It will also be generalized to other applications based on input from the End User Club, which will be formed at project kickoff. The final BioMinT text mining system (month 36) will thus be useful for a wide variety of bio-databases and other applications that require automated information extraction from the scientific literature.
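The core check behind the automated update module described above can be sketched as a set comparison: annotations extracted from recent literature that are not yet in the database become suggestions for the curator. The entry identifiers, annotation strings, and the `suggest_annotations` function are hypothetical, and the sketch assumes annotations can be compared as exact strings, which a real system would likely need to relax.

```python
from typing import Dict, Set


def suggest_annotations(existing: Dict[str, Set[str]],
                        extracted: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Per entry, return annotations found in the literature but absent from the database."""
    suggestions: Dict[str, Set[str]] = {}
    for entry_id, found in extracted.items():
        missing = found - existing.get(entry_id, set())
        if missing:
            suggestions[entry_id] = missing
    return suggestions


# Invented data: current database annotations and newly extracted ones.
existing = {"ENTRY1": {"kinase"}, "ENTRY2": {"membrane"}}
extracted = {
    "ENTRY1": {"kinase", "nuclear"},
    "ENTRY2": {"membrane"},
    "ENTRY3": {"transporter"},
}
suggestions = suggest_annotations(existing, extracted)
```

The function only proposes additions and never modifies `existing`, in keeping with the description of the module as suggesting annotations to the curator rather than applying them.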