BioMinT: Biological Text Mining

Abstract

Genome research has generated unprecedented volumes of data, but the characterization of DNA and protein sequences has not kept pace with the rate of data acquisition. For anyone trying to learn more about a given sequence, the worldwide collection of abstracts and papers remains the ultimate information source. The aim of the BioMinT project is to develop a generic text mining tool that (1) interprets different types of query, (2) retrieves relevant papers from the biological literature, (3) extracts the required information, and (4) outputs the results as a database slot filler or as a structured report. The BioMinT tool will thus operate in two modes. As a curator’s assistant, it will be validated on SWISS-PROT and PRINTS; as a researcher’s assistant, its reports will be submitted to the scrutiny of biologists in academia and industry. The project will be conducted by an interdisciplinary team from biology, computational linguistics, and data/text mining.

The Main Aims

The overall aim is to develop a generic text mining tool that performs information retrieval and extraction in one of two modes. As a curator’s assistant, it outputs a query result as a database field filler according to a prespecified template; as a researcher’s assistant, it generates a structured report in readable prose. The underlying technological challenges include the investigation of semantic, content-based information retrieval and extraction; the customization of natural language processing techniques to biological texts; and the application of relational data/text mining techniques. The business objective is the commercialization of this tool; the target market comprises biotech and pharmaceutical companies that maintain databases or otherwise depend on effective and reliable information retrieval and extraction.
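The two output modes described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ExtractedFact` structure, the function names, the field names, and the toy facts are hypothetical and are not taken from the BioMinT design.

```python
from dataclasses import dataclass


@dataclass
class ExtractedFact:
    """One piece of extracted information (hypothetical structure)."""
    field: str   # template field the fact belongs to (illustrative name)
    value: str   # extracted text
    source: str  # identifier of the paper the fact came from


def as_slot_filler(facts, template_fields):
    """Curator's-assistant mode: fill a prespecified template.

    Fields with no extracted fact stay None so the curator can see the gaps.
    """
    slots = {name: None for name in template_fields}
    for fact in facts:
        if fact.field in slots and slots[fact.field] is None:
            slots[fact.field] = fact.value
    return slots


def as_report(facts):
    """Researcher's-assistant mode: render the same facts as readable lines."""
    lines = [
        f"{f.field.capitalize()}: {f.value} (source: {f.source})" for f in facts
    ]
    return "\n".join(lines)


# Made-up facts, for illustration only.
facts = [
    ExtractedFact("FUNCTION", "binds ATP", "paper-1"),
    ExtractedFact("SUBUNIT", "forms a homodimer", "paper-2"),
]
slots = as_slot_filler(facts, ["FUNCTION", "SUBUNIT", "TISSUE"])
report = as_report(facts)
```

The point of the sketch is that both modes can share one extraction result and differ only in the final rendering step; whether the actual tool is organized this way is not stated in the project description.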

Description of Work

We will develop a generic text mining tool to support the professional activities of biology researchers and database annotators. The core of this system will comprise information retrieval (IR) technologies for identifying papers relevant to a given query and information extraction (IE) technologies for finding the required answers. We will take a strictly problem-oriented, rapid-prototyping approach to system development. All design decisions will be based on input from those who will use the final product in their daily work: (1) curators of SWISS-PROT and PRINTS, and (2) biology researchers from partner institutions and interested companies.

First, IR and IE techniques will be developed independently, based on training and evaluation data provided by the biologists. The first prototype (month 12) will feature a graphical user interface and will already be able to produce simple reports and annotate selected SWISS-PROT and PRINTS fields. Subsequently, the IR and IE techniques will be gradually integrated into a coherent system. The power of our IR and IE technologies will come from their flexible integration of biochemical background knowledge and natural language processing technologies, made possible by innovative relational learning algorithms. After the completion of the second prototype (month 24), the system will be extended with an automated update module that systematically checks existing database annotations against the recent relevant literature and proposes supplementary annotations to the database curator. It will also be generalized to other applications based on input from the End User Club, which will be formed at project kickoff. The final BioMinT text mining system (month 36) will thus be useful for a wide variety of bio-databases and other applications that require automated information extraction from the scientific literature.
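The automated update check described above, which compares existing annotations with the recent literature and offers additions to the curator, might look roughly like the following Python sketch. The function name, the data layout, and the toy data are assumptions made for illustration; the project description does not specify how the check is implemented.

```python
from datetime import date


def propose_updates(annotations, extracted, last_checked):
    """Return candidate annotations found in papers published after last_checked.

    annotations: dict mapping field name -> set of existing annotation strings
    extracted:   list of (field, value, publication_date) tuples
    """
    proposals = {}
    for field, value, published in extracted:
        if published <= last_checked:
            continue  # paper was already covered by an earlier check
        if value in annotations.get(field, set()):
            continue  # the database already holds this annotation
        proposals.setdefault(field, []).append(value)
    return proposals


# Made-up data, for illustration only.
existing = {"FUNCTION": {"binds ATP"}}
new_facts = [
    ("FUNCTION", "binds ATP", date(2003, 5, 1)),
    ("FUNCTION", "hydrolyses ATP", date(2003, 6, 1)),
    ("SUBUNIT", "forms a homodimer", date(2001, 1, 1)),
]
proposals = propose_updates(existing, new_facts, date(2002, 1, 1))
```

In this sketch the proposals are only returned, never written back, which mirrors the description: the system offers supplementary annotations and the curator decides.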
