BioMinT analysis & design
Version: v1.1
Date: 16/07/03
Authors: Andre Vandecandelaere, Kristof Van Belleghem
1 EXECUTIVE OVERVIEW
The goal of the BioMinT project is to develop a text mining tool for information retrieval and information extraction from the biological literature. The resulting tool is intended for use by:
- annotators of protein entries in the SwissProt database (Swiss Institute of Bioinformatics - SIB);
- annotators of protein fingerprint entries in the PRINTS database (School of Biological Sciences, University of Manchester);
- researchers in academia and industry performing literature searches.
The tool will be constructed out of a few major modules for: information retrieval, information extraction, natural language understanding, data mining, domain background knowledge support, data import, data export, (external) program support, and user interactions. These modules will be developed by the different BioMinT partners. Hence, the issue of the communication between the modules has been a matter of prime concern at the current stage of development.
This document proposes an overall architecture for the system without pinning it down, at this stage, to a particular deployment. For each of the intended uses, the interactions of the system with the users, on the one hand, and with external repositories and applications, on the other, have been considered. A design is proposed for each of the modules mentioned above, except for the retrieval and extraction modules, which are designed by OEFAI and UNIGE. The emphasis has been on the interfaces of the modules rather than on their internal workings. A glossary of types has been compiled. An object-oriented approach to analysis and design, using the Unified Modeling Language (UML), has been adopted; however, room has been left for development according to different programming paradigms. The design has been taken to the stage where first prototypes can be implemented without running into major difficulties caused by unclear communication between partners.
2 INTRODUCTION
2.1 Situation
The analysis and design of the text mining tool described here were performed to meet the deliverables of work package 2: ARCHITECTURAL DESIGN AND SPECIFICATION (cf. technical annex of the proposal). In addition to the overall architecture of the system, the following deliverables have been met:
D2.1: design of the annotator interaction module.
D2.2: design of the data import module.
D2.3: design of the data export module.
D2.4: design of the domain knowledge base module.
D2.5: design of the data mining module.
D2.6: design of the natural language understanding module.
D2.7: design of the program interface module.
The following requirement documents have been considered during the process:
- Technical annex of the BioMinT proposal QLRT-2001-02770
- An outline of Swiss-Prot annotation (SIB)
- Analysis of PRINTS annotation and BioMinT requirements (UMAN)
- BioMinT end user group requirements questionnaire (PharmaDM)
- BioMinT use cases (PharmaDM, Appendix C)
- Additional information (e-mail communication between BioMinT partners)
The artefacts presented here document the overall architecture of the system (package diagram), its static structure (class diagram) and its dynamic behaviour (interaction diagrams) at design level. The functional design of the system has been left open at this stage. For now, the main focus has been on the internal interfaces within the system that will have to be supported by the modules that are developed by the different partners. It is left to the responsible partners to research and design the optimal solutions (algorithms) for the different tasks that the system has to perform.
The glossary explains the different concepts, classes and design elements that feature in the diagrams.
2.2 Approach
Earlier in the project, it was decided to develop the system in an iterative, incremental manner, with an emphasis on concrete modules that will be refined and extended later in the project. Therefore, rather than performing a full-fledged formal analysis, a pragmatic approach to the analysis and design of the system was adopted, based on the methodology described by Larman [1998]. The artefacts have been produced using Poseidon for UML, Community Edition 1.5 (Gentleware), and comply with the UML conventions as far as this CASE tool allows.
NOTE: Poseidon is limited in its expressiveness with regard to the UML. We have tried to adhere to the UML standards as much as possible, but the reader should be aware that the diagrams may sometimes deviate from what would normally be expected.
In short, the analysis and design of the system went as follows: Based on the information in the requirement documents, a conceptual model of the problem domain was constructed. The widely used three-layer system architecture was adopted, thus distinguishing between a presentation layer (GUIs), a domain layer (different modules where the actual work happens) and a data layer (modules for data import and export, file systems, system databases). Next, the system events, i.e. the messages from the outside world that the system will have to handle, were identified from the use case scenarios as described in the use case report. Taking the architecture and the conceptual model into account, the handling of each system event by the system has been elaborated. This has resulted in an extended class diagram and several sequence diagrams that document the interactions between the different elements in handling each system event.
Several design decisions have already been incorporated in the models, including the use of well-known design patterns [GoF, 1995]. Hence, the shown diagrams are to be considered design class diagrams and design interaction diagrams.
3 DISCUSSION
First, the overall organisation of the system is discussed. Next, the design of the different modules that correspond to the above deliverables is detailed. The emphasis has been on the interfaces between the modules rather than on the intricate details of each module. Functional aspects of the design have largely been ignored at this stage, but will be incorporated as the research into the required algorithms progresses.
3.1 Overall architectural design and specification
3.1.1 Conceptual model
A conceptual model of the problem domain has been constructed by identifying concepts that appear in the requirement documents and applying additional domain knowledge. The model shown here (Figure 1) is restricted to the most prominent concepts and the relations between them. It is not exhaustive, but it contains the concepts that are most likely to be accounted for in the system. It can be used in a first attempt at identifying the types that will be part of the eventual system.

Figure 1: A conceptual model of the problem domain.
3.1.2 System architecture
The overall organisation of the system is given by the package diagram below (Figure 2).

Figure 2: Overall organisation of the system.
Three layers are distinguished:
1. The presentation layer (SwissProt annotation editor, PRINTS annotation editor, research assistant, model builder) consists of the elements that allow for the communication between human actors and the system i.e. the Graphical User Interfaces (GUIs).
2. The domain layer contains the modules that implement the core functionality of the system, such as retrieval, extraction, natural language understanding (nlu) and the applications that can be launched from within the system (program interface). These modules are grouped together in an information expert package. A "domain knowledge" support module is part of the domain layer as well. The "program interface" supports interactions of the system with external applications. The modules within "information expert" depend on one another (not indicated); for example, the extraction module may call on the data mining module.
3. The data layer consists of the data import and data export modules, grouped together in the data package. File systems or databases that are part of the system (not shown) belong to this layer as well. Databases and other repositories such as SwissProt, PubMed and PRINTS are not part of the system and are considered actors. This layer supports communication with these external repositories.
At this stage no commitment to a particular deployment of the system has been made. The above organisation allows for a stand-alone, a client-server or a web-enabled implementation.
3.1.3 System events
System events are the interactions of the actors with the system. The system has to handle these events to produce a desired response or side effect. Given an accurate description of what is part of the system and what is not (i.e. what belongs to the system and what is an actor), these interactions can be identified in several ways. The approach adopted here was to start from the real use case scenarios.
The system events that occur are illustrated by means of system sequence diagrams, one for each use case except "Retrieve external information", which is incorporated in the shown diagrams where appropriate.
CAUTION: Owing to limitations of the CASE tool, object symbols have been used for the actors instead of the proper UML actor symbol. Objects with the same name in the eventual design have a different meaning and act as role controllers in the system (Figure 12).

Figure 3: The "start up" system events

Figure 4: The "annotate protein" system events

Figure 5: The "annotate protein fingerprint" system events

Figure 6: The "annotate protein family" system events

Figure 7: The "annotate protein super family" system events

Figure 8: The "annotate protein domain" system events

Figure 9: The "gather information from the literature" system events

Figure 10: The "incorporate new model" system events

Figure 11: The "administrate user" system events
Once the main system events had been identified, ways for the system to handle them were considered, taking the overall architecture and the conceptual model into account. Design decisions were incorporated from an early stage onwards. This exercise resulted in a first design class diagram and several interaction diagrams (see subsections 3.1.4 and 3.1.5).
3.1.4 Static design
The design class diagram is shown below (Figure 12). Several new elements have been introduced to connect the main modules in the system. The introduction of many of these elements has been motivated by concerns such as low coupling between elements, high cohesion of the elements, extendibility of the system and flexibility with regard to both deployment of the system and future changes within the modules.

Figure 12: Design class diagram: overview
Most system events originate from the GUI elements in the presentation layer. These events are channeled towards dedicated types (AnnotatorGUI, ResearcherGUI, ModelBuilderGUI, AdministratorGUI) that establish the connection with the appropriate role controllers in the domain layer (Annotator, Researcher, ModelBuilder, Administrator). Similarly, the connections between the domain layer and the data layer are materialised by dedicated objects (EntryManager, RetrievalManager, KnowledgeManager, DataImportManager, DataExportManager). Hence, the coupling between the different layers is kept loose, allowing for easier future changes of elements in the different layers. Furthermore, the published properties and methods of these role controllers make for an application programming interface (API) of the BioMinT tool for other applications.
The specific role of the controller objects is to convert system events into proper Requests that are understood by the internal elements of the system, and to delegate these requests to the appropriate elements. (Note that a Request is an internal representation of a user request. A Query, in contrast, is a representation of a Request that complies with the specifications of the external repository to which the request needs to be addressed.) A distinction is made between straightforward loading of entries from a repository (any type of structured data in a repository, whether simple or complex, is considered an Entry) and gathering information by means of retrieval requests, extraction requests, data mining requests or other types of application request. The former type of request is delegated to an EntryManager; the latter is delegated to an InformationExpert.
The InformationExpert is the facade of the information expert package, in which the retrieval, extraction, data mining, nlu and program interface modules are grouped together. Its role is to know which Approach is to be used for a particular information request. The right Approach for dealing with the Task implied by a Request can be looked up in an ApproachCatalogue. The Approach is resolved using a BlackBoard mechanism, which gives rise to one or more specific Operations that can invoke retrievals, extractions, data mining operations, etc. If particular Specifications need to be provided, they are formulated by the Operation.
The functionality offered by each of the internal modules should be accessed via the proper interfaces for these modules (Retriever, Extractor, DataMiner, Application) to ensure easier modification of these modules in the future. In all cases, background knowledge from the domain knowledge module can be used via its interface, DomainKnowledgeExpert.
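The delegation rule described above can be illustrated with a minimal Python sketch. It is not part of the design itself: the `kind` and `parameters` fields, the `handle` and `on_system_event` method names and the string results are all assumptions made for illustration; only the class names Request, EntryManager, InformationExpert and Annotator come from the design class diagram.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Internal representation of a user request (field names are assumed)."""
    kind: str                                   # e.g. "load", "retrieval", "extraction"
    parameters: dict = field(default_factory=dict)


class EntryManager:
    """Handles straightforward loading of entries from a repository."""

    def handle(self, request: Request) -> str:
        return f"EntryManager loads {request.parameters}"


class InformationExpert:
    """Facade of the information expert package."""

    def handle(self, request: Request) -> str:
        return f"InformationExpert resolves {request.kind} request"


class Annotator:
    """Role controller: turns system events into Requests and delegates them."""

    def __init__(self, entry_manager: EntryManager, expert: InformationExpert):
        self._entry_manager = entry_manager
        self._expert = expert

    def on_system_event(self, kind: str, **parameters) -> str:
        request = Request(kind, parameters)
        # Plain loading goes to the EntryManager; everything else is an
        # information request for the InformationExpert.
        if request.kind == "load":
            return self._entry_manager.handle(request)
        return self._expert.handle(request)


annotator = Annotator(EntryManager(), InformationExpert())
print(annotator.on_system_event("load", source="PRINTS"))
print(annotator.on_system_event("extraction", relation="gene influences disease"))
```

Because the GUI only ever talks to the role controller, either manager could be replaced without touching the presentation layer, which is the loose coupling the design aims for.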
Three types of data import and export are distinguished: (1) import of data from and export of data to a repository; (2) import and export of an ontology; and (3) import or export of the data generated by an internal or external application. If information or data needs to be collected from or sent to an external repository, an internal Request needs to be transformed into a Query or Instruction that is meaningful to that information source. The knowledge of which queries, instructions and formats are meaningful for a given repository is concentrated in a corresponding InformationSource object (e.g. a PubMed object for PubMed) that performs the appropriate transformation. An InformationSourceCatalogue can be consulted to find the InformationSource of interest. The actual import or export of Records is done by Importer and Exporter objects. Records are the data structures that are received from or need to be sent to the repository. They may need to be transformed to the internally used Entry structure by a Converter. A similar pattern is used for loading or exporting ontologies and data from an external application, using specialised objects of type OntologySourceCatalogue, OntologySource, DataSetSourceCatalogue and DataSetSource.
3.1.5 Dynamic design
For each system event shown in Figures 3 - 11, ways of handling the event were explored, taking the overall architecture and the conceptual model into account. This resulted in the series of interaction diagrams shown below (Figures 13 - 26). Similar interactions often occur in the handling of different system events; a distinction can therefore be made between frequently used "redundant" paths and the interaction pathways that are specific to a particular system event.
Redundant interaction paths have been designed for:
- resolving information requests (retrievals, extractions, applications, ...);
- retrieving entries (importing data);
- saving entries (exporting data);
- validating information;
- validating entries.
These paths are illustrated in Figures 13 - 17. Figures 18 - 26 show how the system events are handled using these redundant paths.
NOTE: In cases when the type of the object has the same name as an actor, this object is an internal representation of the actor that contains the required intelligence about that actor (i.e. it is a role controller).

Figure 13: "Resolve information request" sequence diagram.
A Request is the internal representation of a user request as forwarded by the GUI. It is generated by the role controller and sent to the EntryManager (in case of a straightforward retrieval request, see Figure 14) or to the InformationExpert (in case of a more complex information request). The InformationExpert determines the task(s) that correspond(s) to that request and looks up the appropriate Approach for the task(s). Using an agenda mechanism involving a BlackBoard and a Collection of Operations, the Approach object develops the request into one or more Operations that need to be performed. Depending on whether the Operation is a Retrieval, an Extraction, a DataMining operation or an ApplicationRun, the appropriate module is activated via the Retriever, the Extractor, the DataMiner or the Application interface, respectively.
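The last step, routing each Operation to the matching module interface, might be sketched in Python as follows. The stub classes, the `run` and `perform` method names and the `argument` field are assumptions for illustration; the real module interfaces are to be defined by the responsible partners.

```python
class Operation:
    """A unit of work produced by an Approach; `argument` is an assumed field."""

    def __init__(self, argument):
        self.argument = argument


class Retrieval(Operation):
    pass


class Extraction(Operation):
    pass


class DataMining(Operation):
    pass


class ApplicationRun(Operation):
    pass


# Stub module interfaces standing in for the real modules.
class Retriever:
    def run(self, argument):
        return f"retrieved documents for {argument!r}"


class Extractor:
    def run(self, argument):
        return f"extracted facts for {argument!r}"


class DataMiner:
    def run(self, argument):
        return f"mined patterns for {argument!r}"


class Application:
    def run(self, argument):
        return f"ran external program on {argument!r}"


class InformationExpert:
    """Activates the module interface that matches each Operation type."""

    def __init__(self):
        self._interfaces = {
            Retrieval: Retriever(),
            Extraction: Extractor(),
            DataMining: DataMiner(),
            ApplicationRun: Application(),
        }

    def perform(self, operations):
        return [self._interfaces[type(op)].run(op.argument) for op in operations]


expert = InformationExpert()
results = expert.perform([Retrieval("januskinase"), Extraction("januskinase")])
print(results)
```

Keeping the mapping in one table means that a new kind of Operation only requires a new entry, not a change to the callers.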
The interactions in case of gathering information, incorporating a new model or user administration are similar with Annotator replaced by resp. Researcher, ModelBuilder and Administrator.

Figure 14: "Retrieve entries" sequence diagram.
A Request to load one or more specific entries from a specified repository is delegated by the Annotator to the EntryManager. The EntryManager forwards load Entry requests to the DataImportManager. Using a source key provided by the Request, the DataImportManager looks up the corresponding InformationSource object in the InformationSourceCatalogue. The InformationSource transforms the Request into a Query that complies with the format expected by the external repository and sends that Query to the Importer. The Importer is the object that communicates with the external repository and collects the Records that the repository returns. It returns the received collection to the InformationSource, which uses a Converter to turn the Records into Entry objects for internal use.
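A minimal Python sketch of this import path is given below. The dictionary that stands in for the external repository, the "KEY=VALUE;..." record layout, the request keys `source_key` and `entry_ids`, and all method names are hypothetical; no real repository format is implied.

```python
class Importer:
    """Communicates with the external repository (a dict stands in for it here)."""

    def __init__(self, repository):
        self._repository = repository

    def fetch(self, query):
        # Collect the Records the repository returns for the requested ids.
        return [self._repository[key] for key in query["ids"] if key in self._repository]


class Converter:
    """Turns a raw Record into an Entry; the record layout is made up."""

    def to_entry(self, record):
        return dict(pair.split("=", 1) for pair in record.split(";"))


class InformationSource:
    """Knows how to phrase a Query for one particular repository."""

    def __init__(self, importer, converter):
        self._importer = importer
        self._converter = converter

    def to_query(self, request):
        return {"ids": list(request["entry_ids"])}

    def load(self, request):
        records = self._importer.fetch(self.to_query(request))
        return [self._converter.to_entry(record) for record in records]


class DataImportManager:
    """Looks up the InformationSource by source key and delegates to it."""

    def __init__(self, catalogue):
        self._catalogue = catalogue             # source key -> InformationSource

    def load_entries(self, request):
        return self._catalogue[request["source_key"]].load(request)


fake_repository = {"E1": "ID=E1;TITLE=example one", "E2": "ID=E2;TITLE=example two"}
catalogue = {"demo": InformationSource(Importer(fake_repository), Converter())}
manager = DataImportManager(catalogue)
entries = manager.load_entries({"source_key": "demo", "entry_ids": ["E2", "missing"]})
print(entries)
```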

Figure 15: "Save entries" sequence diagram.
As with the import of entries, saving an Entry involves a Converter that transforms the Entry into a Record recognised by the repository. Whenever an Entry needs to be saved, the "save" message has to be sent to the DataExportManager. If there are several formats in which the Entry can be stored (e.g. as a prePRINTS entry that includes the validation information or as a cleaned PRINTS entry), the desired format needs to be specified by means of a format key. The DataExportManager looks up the required InformationSource object in the InformationSourceCatalogue and delegates the saving task to it. After the Entry has been converted, an Instruction is formulated and sent to the Exporter for submission to the repository.
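The role of the format key can be sketched in Python as follows. The two converter functions, the format keys "full" and "cleaned", the record layout and the shape of the Instruction are all invented for this illustration and do not reflect the actual prePRINTS or PRINTS formats.

```python
class Exporter:
    """Submits Instructions to the repository (here it only records them)."""

    def __init__(self):
        self.submitted = []

    def submit(self, instruction):
        self.submitted.append(instruction)


def to_full_record(entry):
    """Hypothetical format that keeps the validation flag of every field."""
    return ";".join(
        f"{name}={value}[{'V' if validated else 'U'}]"
        for name, (value, validated) in entry.items()
    )


def to_cleaned_record(entry):
    """Hypothetical format without validation information."""
    return ";".join(f"{name}={value}" for name, (value, _) in entry.items())


class InformationSource:
    """Converts an Entry to a Record and formulates the Instruction."""

    def __init__(self, exporter, converters):
        self._exporter = exporter
        self._converters = converters           # format key -> converter function

    def save(self, entry, format_key):
        record = self._converters[format_key](entry)
        self._exporter.submit({"action": "store", "record": record})


class DataExportManager:
    """Looks up the InformationSource and delegates the saving task to it."""

    def __init__(self, catalogue):
        self._catalogue = catalogue             # source key -> InformationSource

    def save(self, entry, source_key, format_key):
        self._catalogue[source_key].save(entry, format_key)


exporter = Exporter()
source = InformationSource(exporter, {"full": to_full_record, "cleaned": to_cleaned_record})
manager = DataExportManager({"demo": source})
entry = {"TITLE": ("example", True), "NOTE": ("draft", False)}
manager.save(entry, "demo", "full")
manager.save(entry, "demo", "cleaned")
print(exporter.submitted)
```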

Figure 16: "Validate information" sequence diagram.
Whenever a field in the entry is validated by the actor, the validated information is forwarded via the GUI by the relevant role controller to the appropriate Entry. This information may already have been known to the Entry, or it may have been provided explicitly by the actor. What validation means depends on the type of Entry: it can mean setting an internal flag for the field, replacing the existing information with the provided validated information, etc.
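One possible reading of field validation is sketched below in Python: the Entry either just flags the field or first replaces its value with the information supplied by the actor. The `Field` helper class and the `validate` signature are assumptions; other Entry types may define validation differently.

```python
class Field:
    """A single field of an entry with its validation flag (assumed structure)."""

    def __init__(self, value):
        self.value = value
        self.validated = False


class Entry:
    def __init__(self, **fields):
        self._fields = {name: Field(value) for name, value in fields.items()}

    def validate(self, name, value=None):
        """Flag a field as validated, optionally replacing its contents first."""
        field = self._fields[name]
        if value is not None:
            field.value = value                 # actor supplied the validated text
        field.validated = True

    def field(self, name):
        return self._fields[name]


entry = Entry(title="draft title", function="unknown")
entry.validate("title")                          # accept the existing information
entry.validate("function", "corrected text")     # replace, then flag
print(entry.field("function").value, entry.field("function").validated)
```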

Figure 17: "Validate entry" sequence diagram.
When the actor has validated the Entry that is being annotated, the Annotator instructs the EntryManager to validate the Entry using the provided format key. A Validator extracts all the validated information from the Entry to produce a "cleaned" Entry, which is subsequently submitted to the DataExportManager for storage in the appropriate repository (cf. Figure 15). Note: in addition to the cleaned entry, the original "full" entry is stored separately (not shown).
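The following Python sketch illustrates how a Validator might produce the cleaned Entry while the full entry is stored as well. The entry layout (field name mapped to a value and a validation flag), the "full" format key and the `export` callback that stands in for the DataExportManager are assumptions made for this example.

```python
class Validator:
    """Extracts the validated information from an entry."""

    def clean(self, entry):
        return {name: value for name, (value, validated) in entry.items() if validated}


class EntryManager:
    def __init__(self, validator, export):
        self._validator = validator
        self._export = export                   # stand-in for the DataExportManager

    def validate_entry(self, entry, format_key):
        cleaned = self._validator.clean(entry)
        # The original "full" entry is stored separately from the cleaned one.
        self._export(entry, "full")
        self._export(cleaned, format_key)
        return cleaned


stored = []
manager = EntryManager(Validator(), lambda entry, key: stored.append((key, entry)))
entry = {"TITLE": ("example", True), "NOTE": ("draft", False)}
cleaned = manager.validate_entry(entry, "cleaned")
print(cleaned)
```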
The above redundant interaction paths appear in the handling of nearly all system events. The actual handling of the system events is detailed in Figures 18-26.

Figure 18: "Start up" sequence diagram.

Figure 19: "Annotate protein" sequence diagram.

Figure 20: "Annotate protein fingerprint" sequence diagram.

Figure 21: "Annotate protein family" sequence diagram.

Figure 22: "Annotate protein superfamily" sequence diagram.

Figure 23: "Annotate protein domain" sequence diagram.

Figure 24: "Gather information from the literature" sequence diagram.

Figure 25: "Incorporate new model" sequence diagram.

Figure 26: "Administrate user" sequence diagram.
3.1.6 Functional design
Most of the algorithms required to make the above schemes work are straightforward, with the exception of the use of Approach objects by the InformationExpert. This aspect lies at the heart of the system, since it determines the interplay between the major modules (retrieval, extraction, program interface, data mining). The use of Approach objects is described in detail next.
3.1.6.1 Motivation
The Approach concept is key to making the BioMinT tool easy to extend. The underlying idea is that many tasks can be performed, some of them in various ways, using the available modules (in particular "information retrieval", "information extraction", and "background knowledge") as building blocks. For instance, finding synonyms for a given protein name can be done by launching a background knowledge query to a protein ontology. Alternatively, synonyms may be found by retrieving articles concerning the protein from a document repository and then extracting synonym-indicating templates. Yet another way may be to retrieve an entry for the protein from a particular web site (e.g. HUGO) and look in the right field of the entry for synonyms.
An Approach represents one way to perform a specific task, as a sequence of calls to the available building blocks. Whenever a certain task is required, the task is looked up in an ApproachCatalogue, which should return descriptions of one or more approaches suited to fulfill the task.
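As a rough illustration, a catalogue of this kind could be as simple as a mapping from task names to lists of approach descriptions. The Python sketch below registers the three ways of finding synonyms mentioned above; the `register` and `lookup` method names and the free-text descriptions are assumptions, not part of the design.

```python
class ApproachCatalogue:
    """Maps a task name to the approach descriptions that can fulfil it."""

    def __init__(self):
        self._approaches = {}

    def register(self, task, description):
        self._approaches.setdefault(task, []).append(description)

    def lookup(self, task):
        # Return a copy so callers cannot modify the catalogue by accident.
        return list(self._approaches.get(task, []))


catalogue = ApproachCatalogue()
catalogue.register("find synonyms", "query a protein ontology (background knowledge)")
catalogue.register("find synonyms", "retrieve articles, then extract synonym templates")
catalogue.register("find synonyms", "retrieve the HUGO entry, then select its synonyms field")

for description in catalogue.lookup("find synonyms"):
    print(description)
```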
3.1.6.2 Approach Representation
In order to make the tool easily extendable, it is important that this ApproachCatalogue can be easily manipulated and that new Approaches can be added without a need for much coding. An obvious way to achieve this is to describe approaches in a short, declarative, human-writable format.
Such a format could look somewhat like this:
<subtask name 1>;<input 1.1>,...<input 1.n_1>;<output 1.1>,...<output 1.m_1>
...
<subtask name p>;<input p.1>,...<input p.n_p>;<output p.1>,...<output p.m_p>
where the number of input and output parameters in each subtask description should, of course, correspond to the number of parameters that the subtask expects. Moreover, each input parameter must refer either to a parameter of the global task (special parameter names can be used to reference these) or to an output parameter of an earlier subtask (using the same name). Hence, one way of retrieving synonyms could be described as:
retrieve; "HUGO", ARG1; X
select; X, "SYNONYMS"; ANSWER
where ARG1 refers to the protein name input to the task, X is an intermediate result used to pass on information (in this case a HUGO entry) from the first subtask to the next, and ANSWER is the final result.
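To show that such descriptions are easy to process, here is a minimal Python sketch of a parser with a reference check. It assumes that literals are written in double quotes, that task parameters are named ARG1, ARG2, ..., and that literals contain neither ";" nor ","; the `Subtask` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    inputs: list
    outputs: list


def parse_approach(text):
    """Parse lines of the form '<name>; <inputs>; <outputs>' into Subtasks."""
    subtasks = []
    for line in text.strip().splitlines():
        name, inputs, outputs = (part.strip() for part in line.split(";"))
        subtasks.append(Subtask(
            name,
            [p.strip() for p in inputs.split(",")],
            [p.strip() for p in outputs.split(",")],
        ))
    return subtasks


def check_references(subtasks):
    """Each input must be a literal, a task argument or an earlier output."""
    known = set()
    for subtask in subtasks:
        for parameter in subtask.inputs:
            is_literal = parameter.startswith('"') and parameter.endswith('"')
            is_task_argument = parameter.startswith("ARG")
            if not (is_literal or is_task_argument or parameter in known):
                raise ValueError(f"{subtask.name}: unknown input {parameter}")
        known.update(subtask.outputs)


description = '''
retrieve; "HUGO", ARG1; X
select; X, "SYNONYMS"; ANSWER
'''
approach = parse_approach(description)
check_references(approach)
print(approach)
```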
3.1.6.3 Approach Execution
Executing descriptions of the above form can be done using the blackboard methodology. The BlackBoard is a repository where intermediate results can be stored. Initially, the input parameters of the task are put on the BlackBoard. Next, each subtask takes the input parameters it needs from the BlackBoard, processes them, and puts its own results on the BlackBoard. Eventually, the final results of the task should appear on the BlackBoard, and an answer can be returned.
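The blackboard procedure can be sketched in a few lines of Python. In this illustration the blackboard is a plain dictionary, each building block returns a tuple of outputs, and the "HUGO" data is invented placeholder content; the function names and the convention that the final result is called ANSWER follow the example above but are otherwise assumptions.

```python
def run_approach(subtasks, arguments, building_blocks):
    """Execute subtasks in order, passing values through a blackboard dict."""
    # Initially, the input parameters of the task are put on the blackboard.
    blackboard = {f"ARG{i}": value for i, value in enumerate(arguments, start=1)}
    for name, inputs, outputs in subtasks:
        # Quoted parameters are literals; all others are read from the blackboard.
        values = [p[1:-1] if p.startswith('"') else blackboard[p] for p in inputs]
        results = building_blocks[name](*values)
        blackboard.update(zip(outputs, results))
    return blackboard["ANSWER"]


# Invented placeholder data; not taken from the real HUGO repository.
FAKE_HUGO = {"gene-x": {"SYMBOL": "GX", "SYNONYMS": ["gx-1", "gx-alpha"]}}


def retrieve(source, key):
    repositories = {"HUGO": FAKE_HUGO}
    return (repositories[source][key],)


def select(entry, field):
    return (entry[field],)


approach = [
    ("retrieve", ['"HUGO"', "ARG1"], ["X"]),
    ("select", ["X", '"SYNONYMS"'], ["ANSWER"]),
]
answer = run_approach(approach, ["gene-x"], {"retrieve": retrieve, "select": select})
print(answer)
```

Since subtasks only communicate through named values on the blackboard, a new building block can be added by registering one more function.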
Note: During early development, the blackboard mechanism may be replaced by loose coupling at the file/command level.
This concludes the overall design of the system. In the next sections, the main modules in the system are elaborated in further detail.
3.2 The annotator interaction module.
The interactions between the system and the annotators (and, for that matter, all other human actors) happen via graphical user interfaces (GUIs). Figures 27 - 37 show a first proposal for GUIs based on the interactions in the system sequence diagrams and earlier proposals by various partners (indicated in the text where appropriate).
The images shown indicate what the main GUIs might look like and are not detailed prescriptions. Alternative GUIs and extensions are still possible. The main points to observe in this respect are: (1) to make sure that the essential interaction needs revealed by the system sequence diagrams are supported, and (2) to keep the coupling between the presentation elements and the domain elements in the system sufficiently loose that existing GUIs can easily be replaced by other ones and new GUIs can readily be integrated into the system.
The GUIs shown were built using the Sun ONE Studio Java integrated development environment (IDE).
3.2.1 Starting BioMinT
After launching the BioMinT application, the user will be presented with the screen shown in Figure 27 where she/he needs to enter her/his login id and password.

Figure 27: Log-in window.
If the login was successful, the blank BioMinT GUI appears (Figure 28). It consists of a frame with a menu bar. A task can be selected from the "Tasks" menu in the menu bar. Only the tasks for which the user has the appropriate rights are highlighted.

Figure 28: Empty BioMinT window.
3.2.2 Retrieving and extracting information from the literature
Since BioMinT is about text mining of free-style biological text documents, information retrieval and information extraction from the literature are the core functionalities offered by the system. These functionalities will almost always be called upon, whether during the annotation of proteins or protein fingerprints by annotators, or, for a literature search by researchers and bioinformaticians (after selecting the "search literature" task).
The following phases are distinguished during this text mining process:
- formulation of a retrieval query for the retrieval of documents related to a subject of interest;
- display of retrieved documents and (optional) selection of documents for further extraction;
- formulation of an extraction query for the extraction of specific information from text or text fragments;
- display of extraction results.
The GUIs for these operations are discussed in the next paragraphs.
3.2.2.1 Specifying the retrieval query.
The GUI elements for specifying a retrieval query could look like Figures 29 - 30. Figure 29 shows the general panel for specifying a general retrieval query and launching the search; Figure 30 shows the SwissProt assistant for protein-related retrieval queries.
Figure 29: Composing the retrieval query (taken over from AlphaDMax™ - PharmaDM).

Figure 30: SwissProt assistance GUI for a protein-related retrieval query (taken over from the prototype by Pavel Dobrokhotov - SIB).
Retrieval queries can be specified in a number of ways. First, the document repository to search in is specified in the "search" field (Figure 29). Next, the subject of interest can be typed in the "for" text field. Clicking the "Go" button activates the retrieval. Clicking "Clear" empties the field for a new subject; "Stop" interrupts the search. If the subject is a protein, the subject can be specified with the assistance of SwissProt (Figure 30). The accession number of the protein can be specified in the "SwissProt AC" field. Clicking "From ExPASy" activates a search for the official name and synonyms of the protein, including the names of the genes. These names are listed in a table and can be selected for inclusion in the query. Alternatively, a gene name can be entered in the "Gene name" field and HUGO can be searched for synonyms. The query can be further refined by specifying modifiers in the "Search modifiers" field and setting the period in the "Years from ... to" fields. The number of retrieved documents can be limited by setting the "Maximum number of hits".
3.2.2.2 Displaying retrieval results and selection of documents for extraction.
The documents that are returned by the document repository are shown in the "retrieval" panel (Figure 31). The display is similar to the display used by AlphaDMax™ (PharmaDM). Basically, the documents are ranked according to the order in which they were returned and the weights attributed to the modifiers and filters that have been applied. Only the author list and document title are shown, but the entire document can be shown by clicking the "+" button in front of it (clicking the "-" button reduces the amount of detail shown). A checkbox in front of the article can be checked if one wants to use the document for subsequent extraction.

Figure 31: GUI for showing retrieval results.
3.2.2.3 Specifying the extraction query.
In order to extract the information that resides within a text document or text fragment, one or more specific extraction queries need to be formulated. Figure 32 shows one way in which this might be done, using a query template for binary relations (unary and other types of query have their own adjusted templates).

Figure 32: GUI for composing extraction queries (after AlphaDMax™ - PharmaDM).
This template enables the user to specify a relation of interest that needs to be extracted from the text (e.g. "gene influences disease", "januskinase present in location", etc.). The template allows for specification of the expected syntactic relationship between the terms involved in the relationship. The type of each term can be specified in the combobox underneath the term. If types are determined by selection of a suitable ontology, the level of query expansion up and down the associated taxonomy can be set using the "+" and "-" levels in the corresponding comboboxes.
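Query expansion along a taxonomy can be illustrated with a short Python sketch. The toy taxonomy, the child-to-parent representation and the `expand` function with its `up` and `down` parameters are assumptions for illustration; how the "+" and "-" levels in the GUI map onto these directions is not specified here.

```python
def expand(term, parents, up=0, down=0):
    """Return the term plus its ancestors (up levels) and descendants (down levels)."""
    expanded = {term}
    # Walk up the taxonomy, one parent per level.
    current = term
    for _ in range(up):
        current = parents.get(current)
        if current is None:
            break
        expanded.add(current)
    # Walk down the taxonomy, collecting all children per level.
    frontier = {term}
    for _ in range(down):
        frontier = {child for child, parent in parents.items() if parent in frontier}
        expanded |= frontier
    return expanded


# Toy taxonomy given as child -> parent; for illustration only.
PARENTS = {
    "enzyme": "protein",
    "kinase": "enzyme",
    "tyrosine kinase": "kinase",
    "januskinase": "tyrosine kinase",
}

print(sorted(expand("kinase", PARENTS, up=1, down=1)))
```

A query on "kinase" expanded one level in each direction would then also match sentences that mention "enzyme" or "tyrosine kinase".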
The extraction query formulation GUI is incorporated in the "queries" panel (see Figure 33).
3.2.2.4 Displaying the extraction results.
The results of the extraction are displayed in the "extraction" panel. This panel is similar to the equivalent panel in AlphaDMax™ (PharmaDM). It lists the sentences that match the specified query. (In this particular case, the matches have been colour-coded depending on the type of match, i.e. by NLU or by co-occurrence.) Each sentence is scored depending on the quality of the match. The sentences from the same article are kept together. Articles are ranked according to their relevance with regard to the query, as measured by the summed score of their matching sentences. If the user wants to incorporate a sentence in the annotation report, she/he ticks the checkbox in front of the sentence.
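The ranking rule (group sentences per article, rank articles by summed sentence score) can be sketched in Python as follows. The tuple layout of a match and the integer scores are assumptions; how sentence scores are actually computed is left to the extraction module.

```python
from collections import defaultdict


def rank_articles(matches):
    """Group (article_id, sentence, score) matches and rank articles by total score."""
    sentences = defaultdict(list)
    totals = defaultdict(int)
    for article_id, sentence, score in matches:
        sentences[article_id].append(sentence)   # sentences of one article stay together
        totals[article_id] += score
    ranked_ids = sorted(totals, key=lambda article_id: totals[article_id], reverse=True)
    return [(article_id, totals[article_id], sentences[article_id]) for article_id in ranked_ids]


matches = [
    ("article A", "first matching sentence", 2),
    ("article B", "second matching sentence", 5),
    ("article A", "third matching sentence", 4),
]
for article_id, total, article_sentences in rank_articles(matches):
    print(article_id, total, article_sentences)
```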

Figure 33: GUI for displaying extraction results.
3.2.3 Annotating proteins and protein fingerprints
If the "annotate protein" or "annotate protein fingerprint" task has been selected, two empty tabbed panels appear, labelled "entry" and "annotation". The two panels are two different views on an Entry of interest ("Observer" pattern [GoF, 1995]). One view focuses on the entry in terms of the physical structure of the underlying record; the other view focuses on the biological contents of the entry (cf. figure 9 in "Analysis of PRINTS annotation and BioMinT requirements" - UMAN). In the case of the "annotate protein" task, the observed entry is a SwissProtEntry; in the case of "annotate protein fingerprint", the view is on a PRINTSEntry. The information shown in both views is identical but arranged differently.
A new entry can be created with the "new" option in the "file" menu; an existing entry can be loaded using the "load" function in the same menu. (Alternatively, these functions can be selected from the toolbar.) The "load" function opens a view on the relevant repository of existing entries (not shown), from which an entry can be selected with the mouse or whose identifier can be typed in explicitly. Figure 34 shows the entry view on the PRINTSEntry for januskinase. Figure 35 shows an empty annotation view.
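The Observer arrangement of the two views can be sketched in Python as follows. The line codes "TI" and "FN", their mapping to biological categories and all method names are made up for this illustration; they do not correspond to the real SwissProt or PRINTS record formats.

```python
class Entry:
    """Subject: holds the entry lines and notifies attached views on change."""

    def __init__(self, lines):
        self._lines = dict(lines)
        self._views = []

    def attach(self, view):
        self._views.append(view)
        view.update(self)

    def set_line(self, code, text):
        self._lines[code] = text
        for view in self._views:
            view.update(self)

    def lines(self):
        return dict(self._lines)


class EntryView:
    """Shows the entry line by line, following the record structure."""

    def update(self, entry):
        self.rendered = [f"{code}: {text}" for code, text in entry.lines().items()]


class AnnotationView:
    """Shows the same information grouped by biological category."""

    CATEGORY_OF = {"TI": "Name", "FN": "Function"}   # made-up mapping

    def update(self, entry):
        self.rendered = {
            self.CATEGORY_OF.get(code, "Other"): text
            for code, text in entry.lines().items()
        }


entry = Entry({"TI": "example title"})
entry_view, annotation_view = EntryView(), AnnotationView()
entry.attach(entry_view)
entry.attach(annotation_view)
entry.set_line("FN", "example function")
print(entry_view.rendered)
print(annotation_view.rendered)
```

Both views are refreshed from the same Entry whenever it changes, so they always show identical information in different arrangements.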

Figure 34: Annotate protein fingerprint GUI: entry view.
Each line in the entry view reflects the information stored in the corresponding line of the underlying entry (record). In addition, each line contains three buttons: (1) clicking the "Annotate" button triggers the system to annotate the field; (2) clicking the "Evidence" button opens a panel with the supporting evidence and references for the annotated contents; and (3) clicking the "Validate" button validates the information in this field. The buttons at the bottom of the panel are for complete automatic annotation of the entire entry ("Annotate All"), validation of the entry ("Validate Entry") or storage of the unvalidated information in an entry ("Save Entry"). The user can also delete selected information ("Delete") or cancel an operation ("Cancel").

Figure 35: Annotate protein fingerprint GUI: annotation view (after figure 9 in "Analysis of PRINTS annotation and BioMinT requirements" - UMAN).
The annotation view highlights the biology of the annotated information. Three main sections are distinguished: (1) an "Information Categories" panel, (2) an "Annotation" panel and (3) a control panel that contains buttons for the manipulation of annotated information. The "Information Categories" panel shows the biologically relevant concepts that are dealt with by the entry. Selecting a category brings up the annotated information for that category, together with supporting statements and references. Each piece of information can be selected by checking a checkbox (not shown). The selected information can be annotated once more, validated or deleted using the "Annotate", "Validate" and "Delete" buttons, respectively, in the "Selection" part of the control panel. The entire entry can be annotated, validated, saved or deleted using the appropriate button in the "Entry" part of the control panel.
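The relation between the two views and the observed Entry can be illustrated with a minimal Python sketch of the "Observer" pattern. It is not the BioMinT implementation: the class layout, the method names, the field labels "XA" and "XB" and the category names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List


class Entry:
    """Observed subject: notifies every attached view when a field changes."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier
        self.fields: Dict[str, str] = {}
        self._observers: List[Callable[["Entry"], None]] = []

    def attach(self, observer: Callable[["Entry"], None]) -> None:
        self._observers.append(observer)

    def set_field(self, label: str, value: str) -> None:
        self.fields[label] = value
        for observer in self._observers:
            observer(self)


class EntryView:
    """Record-oriented view: one line per field, in record order."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def update(self, entry: Entry) -> None:
        self.lines = [f"{label}  {value}" for label, value in entry.fields.items()]


class AnnotationView:
    """Biology-oriented view: the same values grouped by information category."""

    # Hypothetical mapping from field label to information category.
    CATEGORY_OF = {"XA": "Category A", "XB": "Category B"}

    def __init__(self) -> None:
        self.by_category: Dict[str, List[str]] = {}

    def update(self, entry: Entry) -> None:
        grouped: Dict[str, List[str]] = {}
        for label, value in entry.fields.items():
            category = self.CATEGORY_OF.get(label, "Other")
            grouped.setdefault(category, []).append(value)
        self.by_category = grouped


# Both views observe the same entry and are refreshed on every change.
entry = Entry("example-entry")
entry_view, annotation_view = EntryView(), AnnotationView()
entry.attach(entry_view.update)
entry.attach(annotation_view.update)
entry.set_field("XA", "first value")
entry.set_field("XB", "second value")
```

Both views hold the same information after each change; only the arrangement differs, which is the property the two tabbed panels rely on.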
3.2.4 User administration
When the "administrate user" task has been selected, the user administration panels appear (Figures 36-37). In the add/modify panel, the user administrator can enter the name of a new user, a login id and a password and tick the rights this user has. Clicking the "Add" button registers the new user in the system. Alternatively, the login id, the password or the rights of an existing user can be modified. When the administrator enters either the user name or the login id and clicks the "Find" button, the system displays the current settings for this user. The administrator can alter these settings and register the change by clicking the "Apply Change" button.

Figure 36: User administration: add new user(s) or modify user details.

Figure 37: User administration: remove existing user(s).
When the "remove" tabbed panel is selected, the administrator can either enter a user name or login id and click the "Find" button to get the details of that user, or click the "FindAll" button to produce a list of all existing users. The administrator can select the user(s) from the list. Clicking the "Remove" button removes the selected users from the system. In all cases, the administrator can undo selections or specifications by clicking the "Cancel" button.
3.2.5 Model incorporation
When the "incorporate new model" task has been selected, the following panel appears (Figure 38).

Figure 38: GUI for incorporating a model.
The first section in the "model" panel deals with the selection of a component that needs to be added or replaced. The "component" combobox lists the available components; a new component can be specified as well. If required, a learning method can be specified in the "method" combobox. Clicking the "Select" button selects the component. The second section is about training and validating the selected component. If a new model needs to be trained, the file containing the training data can be specified in the "training data" field. It can also be selected from the explorer that appears after clicking the "Browse" button (not shown). Similarly, for the validation of the newly trained model, the location of the test data set is specified in the "test data" field. To start the training, click "Train"; to validate the resulting model, click "Validate". The final section is about the acceptance of the selected or newly trained model. The model builder can either accept it ("Accept") or reject it ("Reject"). (An “Unvalidated” option may be provided as well for models that have been built but not yet validated.)
3.3 The data import module.
The data import module is responsible for providing the domain elements of the system with the data or information they need from internal or external repositories. The DataImportManager is the interface of this module. To avoid intricate coupling of the modules and classes within the system, all retrieval requests need to be sent to this interface.
3.3.1 Static design
The classes that make up the data import module are shown in dark green in Figure 39. For clarity, the relations with relevant classes from the other packages are indicated as well (colour-coded).

Figure 39: Static structure of the data import module.
Three types of import are distinguished:
Data/information from an (external) repository (InformationSource).
Ontologies such as Medical Subject Headings (MeSH), Gene Ontology (GO), International Classification of Disease (ICD), etc.
Data generated by (external) applications such as BLAST, ILP, etc.
The general principle is that for each type of import there is a proper Catalogue (InformationSourceCatalogue, OntologySourceCatalogue, DataSetSourceCatalogue) available from which the appropriate Source (InformationSource, OntologySource, DataSetSource) can be obtained. The Source represents the (external) source (i.e. a repository, an ontology or an application) and contains the knowledge of how that source needs to be called. (Here the term "ontology" is used in a broad sense, ranging from a simple lexicon to full-bodied ontologies including domain vocabularies, graph structure, constraints and all.) All of these elements are also involved in data export; hence, they have been grouped together in the “data” package. The data import module is then restricted to the classes that deal with the technicalities of the import as represented by Importer. There are suitable Importer classes for import of Records, Ontologies and data generated by an (external) application (ApplicationData). The implementation of the Importer interface is realised according to the Bridge pattern [GoF].
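The Catalogue, Source and Importer roles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the method names (register, get, retrieve, fetch, import_records), the in-memory importer and the toy repository are hypothetical, and the sketch shows the catalogue lookup and the delegation to an Importer without reproducing the Bridge structure of Figure 39.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


class Importer(ABC):
    """Handles the technicalities of an import operation."""

    @abstractmethod
    def fetch(self, location: str, query: str) -> List[str]:
        ...


class InMemoryRecordImporter(Importer):
    """Hypothetical importer reading records from a dict, not a real repository."""

    def __init__(self, store: Dict[str, List[str]]) -> None:
        self._store = store

    def fetch(self, location: str, query: str) -> List[str]:
        records = self._store.get(location, [])
        return [r for r in records if query.lower() in r.lower()]


@dataclass
class InformationSource:
    """Represents one (external) source and knows how it needs to be called."""

    name: str
    location: str
    importer: Importer

    def retrieve(self, query: str) -> List[str]:
        return self.importer.fetch(self.location, query)


class InformationSourceCatalogue:
    """Returns the appropriate InformationSource for a given name."""

    def __init__(self) -> None:
        self._sources: Dict[str, InformationSource] = {}

    def register(self, source: InformationSource) -> None:
        self._sources[source.name] = source

    def get(self, name: str) -> InformationSource:
        return self._sources[name]


class DataImportManager:
    """Single entry point of the module: all retrieval requests go through here."""

    def __init__(self, catalogue: InformationSourceCatalogue) -> None:
        self._catalogue = catalogue

    def import_records(self, source_name: str, query: str) -> List[str]:
        return self._catalogue.get(source_name).retrieve(query)


# Toy data standing in for an external repository (invented records).
store = {"toy-repository": ["R1 trypsin", "R2 tubulin", "R3 trypsin inhibitor"]}
catalogue = InformationSourceCatalogue()
catalogue.register(
    InformationSource("ToyRepository", "toy-repository", InMemoryRecordImporter(store))
)
manager = DataImportManager(catalogue)
records = manager.import_records("ToyRepository", "trypsin")
```

Callers only see the DataImportManager; replacing the in-memory importer by one that talks to a real repository would not change the calling code.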
3.3.2 Dynamic design
The interactions for importing data from an InformationSource are shown in Figure 14. The interactions for importing ontology information or data that originate from applications are similar using corresponding Catalogue, Source and Importer objects.
3.4 The data export module.
3.4.1 Static design
The class structure of the data export module is similar to that of the data import module. Instead of dedicated Importer classes, the data export module contains dedicated Exporter classes for the export of data or information from an Entry (after transformation to a Record by the Converter), an Ontology or an Application (ApplicationData). Again, the Bridge pattern is used to implement the Exporter interface, and the relations with important types from the “data” and other packages are shown.

Figure 40: Static structure of the data export module.
3.4.2 Dynamic design
The interactions for exporting data to an InformationSource are already shown in Figure 15. The interactions for exporting ontology information or data that originate from applications are similar using corresponding Catalogue, Source and Exporter objects.
3.5 The domain knowledge base module.
The domain knowledge module contains elements that are useful to any element of the system that requires domain-specific knowledge. It offers special support for the use of ontologies, which are currently popular for representing domain knowledge in automated systems.
3.5.1 Static design
The types that make up the domain knowledge base module are shown in Figure 41.

Figure 41: Static structure of the domain knowledge base module.
The support offered by this module can be requested via the DomainKnowledgeExpert, the interface for this module. The interface is implemented using the “Bridge” pattern [GoF]. The central elements of the module are DomainKnowledgeComponent, DomainType and DomainRelation (cf. Appendix A: Glossary). In principle, all useful domain knowledge can be composed from these elements by means of the “Composite” pattern [GoF]. Thus, an Ontology is defined as a specialised DomainKnowledgeComponent that is itself composed of the DomainKnowledgeComponents DomainType (e.g. protein, gene, biological process, disease, ...) and DomainRelation (e.g. influence, part-of, ...). DomainRelations are defined as subsets of the product set of a given number of DomainTypes. In general, a DomainType contains many domain-related concepts (DomainConcepts, e.g. trypsin, mitosis, ...). Given that a single concept is often represented in the literature by several terms (e.g. acetylcholine receptor, cholinoceptive site, cholinoceptor, ACh receptor etc. for cholinergic receptor), a DomainConcept is seen here as composed of zero or more DomainTerms (zero, for example, when an artificial node is incorporated in an ontology that does not correspond to a meaning in the problem domain but helps to keep the graph well organised). DomainTerms can be subject to particular DomainTermConstraints (e.g. “influenced” can never feature as active form of “influence”). DomainConcepts can be associated with one another by a DomainRelationship (e.g. colchicine inhibits mitosis); in other words, a DomainRelationship can be seen as an element of a DomainRelation (in relational data model terms: a DomainRelationship is like a tuple in a relation). For easy retrieval of DomainConcepts, DomainTerms and DomainRelationships, DomainTypes and DomainRelations can be equipped with a TermIndex and a RelationIndex, respectively.
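The composition described above can be made concrete with a small Python sketch. It is an assumption-laden illustration rather than the module's design: the attribute and method names are invented, the TermIndex is reduced to a dictionary, and the "compound" type is added only to host the colchicine example. The point it shows is that a DomainRelationship is accepted only if it is a tuple from the product set of the relation's DomainTypes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class DomainConcept:
    """One concept, represented in the literature by zero or more terms."""

    name: str
    terms: Tuple[str, ...] = ()


@dataclass
class DomainType:
    """A set of concepts of the same type, with a simple term index."""

    name: str
    concepts: Dict[str, DomainConcept] = field(default_factory=dict)
    term_index: Dict[str, str] = field(default_factory=dict)  # term -> concept name

    def add(self, concept: DomainConcept) -> None:
        self.concepts[concept.name] = concept
        for term in concept.terms:
            self.term_index[term.lower()] = concept.name

    def find(self, term: str) -> Optional[DomainConcept]:
        name = self.term_index.get(term.lower())
        return self.concepts[name] if name is not None else None


@dataclass
class DomainRelation:
    """A subset of the product set of the given DomainTypes."""

    name: str
    types: Tuple[DomainType, ...]
    relationships: Set[Tuple[str, ...]] = field(default_factory=set)

    def add(self, *concept_names: str) -> None:
        if len(concept_names) != len(self.types):
            raise ValueError("wrong number of concepts for this relation")
        for concept_name, domain_type in zip(concept_names, self.types):
            if concept_name not in domain_type.concepts:
                raise ValueError(f"{concept_name!r} is not a {domain_type.name}")
        self.relationships.add(tuple(concept_names))


proteins = DomainType("protein")
proteins.add(DomainConcept(
    "cholinergic receptor",
    ("acetylcholine receptor", "cholinoceptor", "ACh receptor"),
))
compounds = DomainType("compound")  # hypothetical type, for illustration only
compounds.add(DomainConcept("colchicine", ("colchicine",)))
processes = DomainType("biological process")
processes.add(DomainConcept("mitosis", ("mitosis",)))

inhibits = DomainRelation("inhibits", (compounds, processes))
inhibits.add("colchicine", "mitosis")  # a DomainRelationship: one tuple of the relation
```

Calling inhibits.add("mitosis", "colchicine") raises a ValueError in this sketch, because the tuple does not belong to the product set compound x biological process.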
3.5.2 Dynamic design
An example of how the elements in the domain knowledge module interact is shown in Figure 42. This figure illustrates how the request “get the related terms for the given terms under a specified relation” to the domain knowledge base module is processed.

Figure 42: Dynamic structure of the domain knowledge base module.
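As a rough illustration of this request, the lookup can be read as three steps: map the given terms to concepts, follow the specified relation, and return the terms of the related concepts. The Python sketch below assumes plain dictionaries in place of the TermIndex and RelationIndex; the function name, the data layout and the placeholder concept "compound-x" with its "binds" relationship are invented for the example and do not come from Figure 42.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Stand-in for a TermIndex: lower-cased term -> concept name.
TERM_INDEX: Dict[str, str] = {
    "colchicine": "colchicine",
    "mitosis": "mitosis",
    "compound x": "compound-x",  # invented placeholder concept
    "acetylcholine receptor": "cholinergic receptor",
    "cholinoceptor": "cholinergic receptor",
    "ach receptor": "cholinergic receptor",
}

# Terms known for each concept.
CONCEPT_TERMS: Dict[str, List[str]] = {
    "colchicine": ["colchicine"],
    "mitosis": ["mitosis"],
    "compound-x": ["compound X"],
    "cholinergic receptor": ["acetylcholine receptor", "cholinoceptor", "ACh receptor"],
}

# Stand-in for a RelationIndex: relation name -> set of (source, target) tuples.
RELATIONSHIPS: Dict[str, Set[Tuple[str, str]]] = {
    "inhibits": {("colchicine", "mitosis")},
    "binds": {("compound-x", "cholinergic receptor")},  # invented relationship
}


def related_terms(terms: Iterable[str], relation: str) -> List[str]:
    """Return the terms of all concepts related to the given terms."""
    # Step 1: terms -> concepts (unknown terms are ignored).
    concepts = {TERM_INDEX[t.lower()] for t in terms if t.lower() in TERM_INDEX}
    # Step 2: follow the specified relation from those concepts.
    targets = {
        target
        for source, target in RELATIONSHIPS.get(relation, set())
        if source in concepts
    }
    # Step 3: concepts -> terms, sorted for a stable result.
    return sorted(term for concept in targets for term in CONCEPT_TERMS.get(concept, []))


inhibited = related_terms(["Colchicine"], "inhibits")
bound = related_terms(["compound X"], "binds")
```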
3.6 The data mining module.
The data mining module provides functionality for finding the patterns in retrieved or extracted information. This module hosts elements that are required for data mining (incl. relational data mining). Other data mining tools can be accessed via the program interface. Also available are elements for model building and validation of data mining results.
3.6.1 Static design
The classes in the data mining module are shown in Figure 43.

Figure 43: Static structure of the data mining module.
The DataMiningExpert is the facade of the data mining module. All data mining requests need to be sent to this element, including the DataMiningSpecifications for the task. In essence, three types of DataMiningTask can be asked from the module: (1) a ClassificationTask for classifying the Instances of a new DataSet using a specified Model available from the ModelCatalogue; (2) a ModelBuildingTask for the construction of new Models using some specified LearningMethod, a TrainingSet and a BackgroundKnowledgeTheory; and (3) a ValidationTask by which an existing or newly trained Model can be validated (a SimpleValidationTask as well as a CrossValidationTask can be performed) using a TestSet of data. The DataSet is normally retrieved by the data module (DataImportManager), together with the required BackgroundKnowledgeTheory. If the Instances of the DataSet have been classified, the DataSet comes with a Classification and the ClassificationDefinition to which it conforms. The class in which an Instance falls according to the Model or an annotator is given by the InstanceClassification. The Classification of the DataSet is the collection of the InstanceClassifications of its Instances. The data in the DataSet should conform to a well-defined format as given by the DataFormat. This format consists of AttributeSignatures and/or PredicateSignatures depending on the case. The Instances are described by an InstanceDescription composed of Attributes and/or PredicateInstances that also comply with the specified DataFormat. Unlike Attributes, PredicateInstances can be associated with one another by sharing variables. Finally, if required, input and output data can be filtered by a DataFilter.
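To make the three task types more tangible, the following Python sketch expresses each of them as a plain function. It is a simplification under explicit assumptions: a Model is just a callable, the LearningMethod is a trivial majority-class learner chosen only to keep the example short, the round-robin fold assignment is one arbitrary way of splitting the data, and none of the function names come from the design.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[float, ...]
Model = Callable[[Instance], str]
LearningMethod = Callable[[Sequence[Instance], Sequence[str]], Model]


def majority_class_method(instances: Sequence[Instance], labels: Sequence[str]) -> Model:
    """ModelBuildingTask with a toy LearningMethod: always predict the commonest class."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda instance: most_common


def classify(model: Model, instances: Sequence[Instance]) -> List[str]:
    """ClassificationTask: one InstanceClassification per Instance."""
    return [model(instance) for instance in instances]


def simple_validation(model: Model, test_x: Sequence[Instance], test_y: Sequence[str]) -> float:
    """SimpleValidationTask: accuracy of the model on a TestSet."""
    correct = sum(1 for x, y in zip(test_x, test_y) if model(x) == y)
    return correct / len(test_y)


def cross_validation(
    method: LearningMethod,
    instances: Sequence[Instance],
    labels: Sequence[str],
    folds: int,
) -> List[float]:
    """CrossValidationTask: train and validate once per fold (round-robin split)."""
    scores = []
    for k in range(folds):
        test_idx = set(range(k, len(instances), folds))
        train_x = [x for i, x in enumerate(instances) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        test_x = [instances[i] for i in sorted(test_idx)]
        test_y = [labels[i] for i in sorted(test_idx)]
        scores.append(simple_validation(method(train_x, train_y), test_x, test_y))
    return scores


# Invented toy DataSet: six one-attribute Instances with their classes.
instances = [(float(i),) for i in range(6)]
labels = ["a", "a", "a", "a", "a", "b"]

model = majority_class_method(instances, labels)
predictions = classify(model, instances[:2])
accuracy = simple_validation(model, instances, labels)
scores = cross_validation(majority_class_method, instances, labels, folds=3)
```

In the design, these calls would be routed through the DataMiningExpert together with the DataMiningSpecifications; the sketch leaves the facade out to keep the three tasks visible.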
3.6.2 Dynamic design
The main interactions within the data mining module are indicated in Figure 44. The interactions are self-explanatory. Notice that the DataMiningExpert is normally addressed by the DataMining Operation. The result that is returned depends on the particular task and can be a validation message, a newly trained Model or a classified data set. The details of the interactions that are invoked during the ValidationTask and the ModelBuildingTask depend on the particular task and the used method.

Figure 44: Dynamic structure of the data mining module.
3.7 The natural language understanding module.
The natural language understanding module makes the NLU functionality as provided by the CNTS (Centrum voor Nederlandse TaalStudie - UIA) available to the other modules in the system. In other words, the NLU functionality developed by the CNTS such as part-of-speech tagging, chunking and parsing is considered an external application, and our NLU module implements an interface to interact with it.
3.7.1 Static design
The class diagram of the NLU module is shown below (Figure 45). The module is essentially organised around the main types of NLU tasks as provided by the NLU application of the CNTS based on the Tilburg Memory-Based Learner (TiMBL). In addition, ContextSettings such as the task parameters (NLUTaskParameter), the lexicon (NLULexicon) and the used model (NLUModel) can be set or modified by means of a ContextUpdateTask.

Figure 45: Static structure of the natural language understanding module.
An NLUTask can be atomic, such as the recognition of entities in the text (EntityRecognition), part-of-speech tagging (POSTaggingTask), chunking (ChunkingTask) or shallow syntax parsing (ShallowParsingTask), or it can be a task composed of atomic tasks (ComposedTask) using the “Composite” pattern [GoF]. The input text and the text that results from these tasks are represented as the corresponding TextItem. Once again, the module has a facade, the NLUManager.
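The way a ComposedTask chains atomic tasks can be sketched as follows. The two toy tasks are hypothetical stand-ins written for this example; they bear no relation to the actual NLU functionality of the CNTS application, and the run method and the labels attribute are assumed names.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class TextItem:
    """Text plus the (token, label) pairs produced so far."""

    text: str
    labels: List[Tuple[str, str]] = field(default_factory=list)


class NLUTask(ABC):
    """Every task takes a TextItem and returns a TextItem."""

    @abstractmethod
    def run(self, item: TextItem) -> TextItem:
        ...


class ToyTokenisingTask(NLUTask):
    """Hypothetical atomic task: split on whitespace, no labels yet."""

    def run(self, item: TextItem) -> TextItem:
        return TextItem(item.text, [(token, "") for token in item.text.split()])


class ToyTaggingTask(NLUTask):
    """Hypothetical atomic task: label digit-only tokens NUM, everything else WORD."""

    def run(self, item: TextItem) -> TextItem:
        tagged = [(tok, "NUM" if tok.isdigit() else "WORD") for tok, _ in item.labels]
        return TextItem(item.text, tagged)


class ComposedTask(NLUTask):
    """Composite: runs its subtasks in sequence; subtasks may be composed themselves."""

    def __init__(self, subtasks: Iterable[NLUTask]) -> None:
        self._subtasks = list(subtasks)

    def run(self, item: TextItem) -> TextItem:
        for task in self._subtasks:
            item = task.run(item)
        return item


pipeline = ComposedTask([ToyTokenisingTask(), ToyTaggingTask()])
result = pipeline.run(TextItem("kinase 2"))
```

Because a ComposedTask is itself an NLUTask, it can be nested inside another ComposedTask without the caller noticing the difference.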
3.7.2 Dynamic design
Figure 46 shows how requests to the NLU module are resolved. The diagram is largely self-explanatory, with the exception of the following two points. First, in the case of an NLUTask, the NLUSpecifications object that is formulated by the object that calls the NLUManager contains a TextItem with the text that needs to be processed. Second, the modifications that are specified for a ContextUpdateTask are objects of type NLUTaskParameter, NLULexicon or NLUModel that can be interrogated by the ContextSettings object upon update. In practice, these objects could be represented in full, or they could be merely filenames passed to the NLU application.

Figure 46: Dynamic structure of the natural language understanding module.
3.8 The program interface module.
The program interface module contains the functionality for launching (external) applications that may generate additional useful information.
3.8.1 Static design
Figure 47 shows the class diagram of the module. Again, the module is fitted with a facade, ApplicationExpert, via which the functionality offered by this package can be requested. The particular application that needs to be launched is specified by the ApplicationRunSpecifications. For each application that can be launched from the BioMinT tool there is an ApplicationProxy stored in the ApplicationCatalogue. The ApplicationProxy consists of three elements: (1) an ApplicationInputGenerator that retrieves the input data that may be required for the application; (2) an ApplicationBroker that does the actual application call and deals with the technicalities that might be involved (input data and parameters are provided by an ApplicationInputSpecs object generated by the ApplicationInputGenerator); and (3) an ApplicationOutputInterpreter that handles the output generated by the application (ApplicationOutput) and produces an ApplicationRunAnswer.

Figure 47: Static structure of the program interface module.
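The division of labour inside an ApplicationProxy can be illustrated with a short Python sketch. All specifics are assumptions: the three parts are modelled as plain callables, the "application" is a function running in the same process rather than an external program, and the parameter name "sequence" and the output format are invented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ApplicationInputSpecs:
    arguments: List[str]


@dataclass
class ApplicationRunAnswer:
    application: str
    results: List[str]


class ApplicationProxy:
    """Combines input generation, the actual call and output interpretation."""

    def __init__(
        self,
        name: str,
        input_generator: Callable[[Dict[str, str]], ApplicationInputSpecs],
        broker: Callable[[ApplicationInputSpecs], str],
        output_interpreter: Callable[[str], List[str]],
    ) -> None:
        self.name = name
        self._input_generator = input_generator
        self._broker = broker
        self._output_interpreter = output_interpreter

    def run(self, parameters: Dict[str, str]) -> ApplicationRunAnswer:
        specs = self._input_generator(parameters)   # (1) prepare the input
        raw_output = self._broker(specs)            # (2) call the application
        return ApplicationRunAnswer(                # (3) interpret the output
            self.name, self._output_interpreter(raw_output)
        )


class ApplicationExpert:
    """Facade: looks up the proxy in a catalogue (here a dict) and runs it."""

    def __init__(self, catalogue: Dict[str, ApplicationProxy]) -> None:
        self._catalogue = catalogue

    def run(self, application_name: str, parameters: Dict[str, str]) -> ApplicationRunAnswer:
        return self._catalogue[application_name].run(parameters)


# Toy stand-ins for the three parts of a proxy (all invented).
def generate_input(parameters: Dict[str, str]) -> ApplicationInputSpecs:
    return ApplicationInputSpecs([parameters["sequence"].upper()])


def toy_broker(specs: ApplicationInputSpecs) -> str:
    # A real broker would launch an external program; this one fakes its output.
    return "# toy output\n" + "\n".join(specs.arguments)


def interpret_output(raw_output: str) -> List[str]:
    return [line for line in raw_output.splitlines() if line and not line.startswith("#")]


expert = ApplicationExpert(
    {"toy": ApplicationProxy("toy", generate_input, toy_broker, interpret_output)}
)
answer = expert.run("toy", {"sequence": "mkv"})
```

Keeping the broker separate from the input generator and the output interpreter means that a change in how an application is launched does not affect how its input is prepared or how its output is read.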
3.8.2 Dynamic design
The process of calling an (external) application is illustrated in Figure 48.

Figure 48: Dynamic structure of the program interface module.
4 CONCLUSION
By the nature of BioMinT as an interdisciplinary project, the text mining tool that will come out of it consists of a few major components that are developed by different scientific partners embedded in a larger framework. In such systems it is of paramount importance that the components be able to seamlessly interact with one another. The main emphasis in the above design has therefore been on the place of each component within the system and the interactions between components. The design of the overall system and its main components has now been taken to the stage where the implementation of a first integrated prototype is possible without major difficulties caused by unclear communication between the involved partners.
References.
Larman, C. (1998) Applying UML and Patterns. An introduction to object-oriented analysis and design. Prentice Hall PTR.
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (GoF) (1995) Design Patterns. Elements of Reusable Object-Oriented Software. Addison-Wesley.
Appendix A: Glossary
Annotator: A person that uses the system to annotate entries (e.g. entries for proteins and other domain concepts). Role controller for the “Annotate Protein” and the “Annotate Protein Fingerprint” use cases.
AnnotatorGUI: A GUI element that communicates with the Annotator controller.
Annotation: The process of annotating entries in a structured information source (e.g. SwissProt or PRINTS).
Application: A program that generates specific information or output data from input data. Examples: Blast, FastA, ... An application is internal when it is incorporated as part of the system, or external when it is not part of the system. Application is the interface of the program interface module.
Application Broker: takes care of the actual call to a specific external application, once input has been prepared.
Application Catalogue: A catalogue of available Applications. Given an application name, the catalogue returns an Application Proxy object that can be used to communicate with the external application.
Application Data: Input data for external applications
Application Data Exporter : An exporter for application data
Application Expert : controller of the program interface module
Application Input Generator: converts input parameters from the BioMinT tool into the right format for an external application; may produce input files if required by the external application.
Application Output Interpreter: converts results from an external application run (possibly taken from generated output files) into the format expected by the BioMinT tool.
Application Proxy : an object taking care of all tasks in calling a specific external application and interpreting its results; consists of an Application Input Generator, an Application Broker and an Application Output Interpreter.
Application Run Request : an object encapsulating the name and input parameters for a call to an external application; processed by the application expert
Approach: Approach objects contain the intelligence of how to obtain particular information. They know which information sources to query or which applications to launch as well as the sequence in which information needs to be gathered. Step-wise gathered information is turned into an answer to the request.
Approach Catalogue: A catalogue of available 'Approach'es. Can be consulted by any object that needs to meet an information request.
Attribute: One characteristic of an instance in the data mining module.
Attribute Signature: The specification of one attribute in the data format description of a data set.
Background Knowledge Theory: In some learning methods, a theory to be used together with a set of training examples to build a new model.
Blast: Basic Local Alignment Search Tool. A specialised application for the search of local sequence alignments. Usually external to the system.
Classification Definition: Specification of the valid classes for a given data set.
Composed Task: an NLU task which is composed of a sequence of more primitive NLU tasks.
Context Settings: In the NLU module, a set of settings determining the global context (used model, lexicon, parameters) in which NLU tasks are executed
Context Update Task: In the NLU module, a task modifying the context settings (used model, lexicon, parameters) in which NLU tasks are executed
Converter: Converts entries into records and vice versa, conforming to a specified format.
Cross-Validation Task: Validation task consisting of cross-validation of a number of different splits in training and test data.
data export module: A module that contains the functionality for data export to internal or external information sources. Deals with the technicalities of export operations.
Data Export Manager: The interface of the data export module. Functions as a controller for data export instructions.
Data Filter: A class for modifying data sets used in learning tasks
Data Format Description: A definition of the attributes (with their range of values) and the predicates valid in the description of instances of a particular data set
data import module: A module that contains the functionality for data import from internal or external information sources. Deals with the technicalities of import operations.
Data Import Manager: The interface of the data import module. Functions as a controller for data import calls.
Data Miner: The interface of the data mining module.
Data Mining: A specialised operation. The activity of finding patterns in data.
data mining module: A module that contains the functionality for data mining.
data module: A module for data import and data export.
Data Set: A set of instances used for either training or testing a model for a particular task
Data Set Importer: An importer for data sets. Various formats may be supported by subclasses.
Data Set Source: A source from which a data set can be obtained.
Data Set Source Catalogue: A catalogue of available data set sources. Upon request, the catalogue returns an appropriate reference to a data set source.
Document: a specialised entry whose primary information is in the form of free-style text.
Domain Concept: A representation of one domain concept (e.g. GPCR).
Domain Knowledge: The knowledge about a specific domain, in particular, the life sciences. The interface of the domain knowledge module.
Domain Knowledge Catalogue: A catalogue of available ontologies (domain knowledge components). Upon request, the catalogue returns an appropriate reference to an ontology.
Domain Knowledge Component (also : Ontology) : either a primitive Domain Type (e.g. Proteins) or a Domain Relation (e.g. part-of).
Domain Knowledge Expert: Controller of the Background Knowledge module.
domain knowledge module: A utility module from which particular domain knowledge may be obtained.
Domain Relation: A conceptual relation (e.g. influences, part-of) between a certain number of domain types (e.g. between proteins and diseases).
Domain Relationship: A specific instance of a domain relation between an appropriate number of specific domain concepts of the appropriate types (e.g. tubulin and cancer).
Domain Term: A term describing a specific domain concept under certain conditions (represented by term constraints)
Domain Term Constraint: A condition for a domain term to be a valid description of a domain concept.
Domain Type: A set of domain concepts of the same type (e.g. proteins)
Entry: An internal representation of data in data storage (database, file, ...). An entry is a structured record that contains information on a domain subject. An entry has named fields that each contain corresponding information in an obvious (e.g. number) or hidden (e.g. free text) manner. Unlike a record, an entry contains facilities to keep track of changes in content and validations.
Entry Exporter: An exporter for entries.
Entry Manager: A controller for the retrieval and export of entries from the data layer.
Exporter: Realises export (output) instructions and handles the technicalities of export operations.
Extraction: A specialised operation. A process whereby relevant information is extracted from the relevant field of an entry in response to a query.
extraction module: The module that contains the functionality for the extraction of information from data (incl. text).
Extractor: The interface of the extraction module. Handles extraction requests.
FastA: A specialised application for the search of local sequence alignments. Usually external to the system.
Field: a line of information in an entry. In SwissProt and PRINTS entries, a field consists of a two-letter label (name) and contents (value).
Format: The format in which an entry is stored. Format objects contain the knowledge about the format in which an entry is exported or imported.
Format Converter: converts entries into records satisfying a specific format
ILP Exporter: An ontology exporter which exports an ontology as a background theory file for learning tasks.
Importer: Realises import instructions and handles the technicalities of import operations.
Information: The answer to a request for information obtained by retrieval or extraction.
Information Expert: An expert that knows how to obtain requested information. Also, the interface of the information module. The expert consults an ApproachCatalogue in order to know how a particular request is to be handled.
Information Expert Implementer: Implements the Information Expert interface (Bridge).
Information Generator: Returns useful information from input data. The information is produced by a specialised application (e.g. Blast, FastA search, ...) or a dedicated information extractor (e.g. text extractor) from the input data.
information module: A module that groups all the modules in the system that deal with obtaining information i.e. the retrieval and extraction modules, the natural language understanding module and the program interface module.
Information Source: A (external) source from where information can be obtained. Information source objects represent their model and know: (1) how to translate an information request from the system into a query for the (external) source (e.g. turning a request for abstracts on a given protein into a PubMed query); (2) what are valid instructions (e.g. storing a record in PRINTS or SwissProt); and, (3) what is (are) the allowed format(s) (e.g. the format of PRINTS or SwissProt record).
Information Source Catalogue: A catalogue of available information sources. Upon request, the catalogue returns an appropriate reference to an information source.
Information Storage: A repository of information such as PubMed, SwissProt, PRINTS, a specialised website, a database, etc.
InfoSource Query: Turns an information request to a specific information source into the query language of that source.
Instance: One specific element of a data set used in the data mining module
Instance Classification: The class assignment of a particular instance, either intrinsic (as defined in a training data set) or as defined by a specific model.
Instance Description: The set of properties defining the characteristics (attribute values and/or relational theory) of an instance.
Instruction: Instructions to a source of information (e.g. store a record) that may result in a persistent change in the source. Valid instructions for each type of information source are known by InformationSource objects. Instructions are processed by means of appropriate Importer and Exporter objects.
Knowledge Manager: A controller for the retrieval and export of domain knowledge from the data layer.
Labelling: the process of marking up free-style text or text fragments with labels (e.g. syntactic labels, domain labels, ...).
Learning Method: a specific strategy used for building models from given training data.
learning module: A utility module that assists in the learning and validation of models from training data.
Lexicon Exporter: An ontology exporter which exports an ontology as a lexicon file for NLU tasks.
Model: A model, generated from training data, which can be used to classify new data for a particular task (information retrieval, information extraction, NLU tasks).
Model Builder: A person that trains the system by means of examples, building models that can be used to perform specific subtasks in information retrieval/extraction. Role controller for the “Incorporate New Model” use case.
ModelBuilderGUI: A GUI element that communicates with the ModelBuilder controller.
Model Building Task: The task of building a trained model for a particular task, based on a training set of examples and possibly a background knowledge theory.
model builder: A GUI module for building and/or incorporating a trained model into the system.
NLU module: the module responsible for all Natural Language Understanding tasks
NLU-related task: an NLU task or a supporting task for NLU (context update task)
NLU task: a task requiring Natural Language Understanding (tagging, shallow parsing, concept recognition, etc... each of these tasks corresponds to a subclass of NLU Task); each task takes a text item as input and returns a text item as result
Ontology: (see Domain Knowledge Component). An ontology is a structured representation of domain knowledge that can be used by intelligent computer systems. It is used here in a very general meaning covering lexicons, thesauri, "classical" ontologies (consisting of a domain vocabulary, a graph structure, constraints, ...), etc.
Ontology Exporter: An exporter for ontologies. Subclasses support various formats for different purposes (see e.g. lexicon exporter or ILP exporter).
Ontology Importer: An importer for ontologies. Subclasses support various ontology formats, e.g. MeSH, EC, ICD, GO.
Ontology Source: A source for loading a specific ontology. The ontology source knows where to find its ontology data and how to import them (which importer to use).
Ontology Source Catalogue: A catalogue of ontology sources. Upon request, the catalogue returns an appropriate reference to an ontology source.
Operation: An operation that can be performed or ordered by the system (e.g. retrieval, extraction, selection, ...). The approach for solving a task may involve one or more operations.
PRECIS: A specialised application for the generation of protein fingerprint annotation reports (naked fingerprints, pre-PRINTS entries). Currently external to the system.
Predicate Signature: the signature of one predicate which is part of the data format of a specific data set.
PRINTS: A database of protein fingerprint information maintained by the School of Biological Sciences at the University of Manchester (UK). PRINTS objects are specialised InformationSource objects that represent PRINTS.
PRINTS Annotator: A specialised annotator that uses the system to annotate PRINTS entries.
PRINTS annotation editor: A GUI module for the annotation of PRINTS entries.
PRINTS Entry: A specialised entry that contains protein fingerprint information. The annotated information of the entry is stored in a record in the PRINTS database.
PRINTS Exporter: A specialised exporter for export to PRINTS and pre-PRINTS (and temporary storage).
PRINTS Importer: A specialised importer for import from PRINTS and pre-PRINTS (and temporary storage).
program interface module: A module that contains the functionality for interfacing with internal and external applications.
Protein Fingerprint: The set of properties that characterise a family or super-family of proteins, or a protein domain. Protein fingerprint information is stored in the PRINTS database. See also PRINTS Entry.
PubMed: A database of abstracts from the biomedical literature maintained by the US National Library of Medicine. PubMed objects are specialised InformationSource objects that represent PubMed.
PubMed Abstract: A specialised Entry that represents a PubMed abstract.
Query: A query for information from a client (e.g. annotator, researcher, ...) to a supplier (e.g. an information source, ...). A query is expected to return useful information. More specifically, Query objects are the translation of a user's request as handled in the BioMinT tool into a query that conforms to the format that is expected by the queried information source.
Reception: Handles log-in events. Depending on the user rights of the logged-in user, Reception instantiates the appropriate role controller objects.
Record: A representation of data/information as physically stored. A record reflects the situation in the database or file system and is returned or stored by the InformationSource. A record has format and contents. Format information is held by InformationSource; content by Entry.
Relational Data: the set of predicate instances defining the relational part of a data instance.
Relation Index: in a domain relation, an index mapping a set of input concepts to the corresponding output concept
Request (also: User Request): A request for information from a user (annotator, researcher) to the system. A Request typically contains the specifications, and possibly a choice of information source(s), that are needed to turn the request into a particular retrieval query (document retrieval, synonym retrieval, ...), extraction query or application call by the system.
research assistant editor: A GUI module for gathering information from the literature for research use.
Researcher: A life-science researcher who uses the system for obtaining information about one or more domain concepts (e.g. a protein). Role controller for the “Gather Information From Literature” use case.
ResearcherGUI: A GUI element that communicates with the Researcher controller.
Retrieval: A specialised operation. The process of obtaining the relevant information from a repository for a given query.
Retriever: The interface of the retrieval module. Handles information retrieval requests.
retrieval module: A module that contains the functionality for information retrieval.
Retrieval Manager: A controller object between the retrieval module in the domain layer and the data layer. Communicates queries from the retrieval module to the DataImportManager (and DataExportManager). Note: This is a redundant class. Everything in data storage is an entry in the system; hence, the RetrievalManager plays the same role as EntryManager w.r.t. stored data/information.
Selection: A specialised operation. The process of obtaining the relevant field from an entry in response to a query.
Simple Validation Task: A validation task where a model's quality is tested on one specific test set.
Structured Knowledge: domain knowledge that is organised according to a conceptual scheme or well-defined data structure.
SwissProt: A database of protein information maintained by the Swiss Institute of Bioinformatics. SwissProt objects are specialised InformationSource objects that represent SwissProt.
SwissProt Annotator: A specialised annotator that uses the system to annotate SwissProt entries.
swissprot annotation editor: A GUI module for the annotation of SwissProt entries.
SwissProt Entry: An entry in SwissProt that contains annotated information on a protein.
SwissProt Exporter: A specialised exporter for export to SwissProt and temporary storage.
SwissProt Importer: A specialised importer for import from SwissProt and temporary storage.
Task: A task that can be chosen by the actor.
Term Index: an index mapping terms of a certain domain type to their corresponding concepts
Text Item: a piece of text, possibly augmented with labels (POS tags, entity markers, ...), used as input or output of NLU tasks
NLU Lexicon: a lexicon used by the external NLU application (part of the NLU Model?)
NLU Model: an object representing the model used by the external NLU application
NLU Task Parameter: one of the parameters used by the external NLU application
User: A registered user of the BioMinT tool. User objects know the user rights of each user.
User Catalogue: A catalogue of registered users and their rights (annotator, researcher, model builder). Upon request, the catalogue returns an appropriate reference to a User object.
Validation Task: The task of investigating the quality of a trained model by comparing the classifications that the model proposes for certain data instances against the real classifications of these instances.
Validator: Separates the validated from the unvalidated information, i.e. extracts the validated information from an Entry into a new “clean” Entry.
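As an illustration of the Validator entry, the following is a minimal sketch in Python. It is not the BioMinT design itself: the Field class, the validated flag, the method name extract_validated and the identifier "P12345" are all assumptions introduced for the example. The sketch only shows the idea stated in the glossary, namely that the validated information of an Entry is copied into a new “clean” Entry while the original Entry is left unchanged.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """One piece of information in an entry (hypothetical structure)."""
    name: str
    value: str
    validated: bool = False


@dataclass
class Entry:
    """Minimal stand-in for the glossary's Entry: an identifier plus fields."""
    identifier: str
    fields: List[Field] = field(default_factory=list)


class Validator:
    """Separates validated from unvalidated information (assumed interface)."""

    def extract_validated(self, entry: Entry) -> Entry:
        # Copy only the validated fields into a new "clean" Entry;
        # the entry passed in is not modified.
        clean_fields = [
            Field(f.name, f.value, True) for f in entry.fields if f.validated
        ]
        return Entry(entry.identifier, clean_fields)


# Hypothetical draft entry: one validated field, one still unvalidated.
draft = Entry("P12345", [
    Field("function", "binds ATP", validated=True),
    Field("subcellular location", "nucleus (proposed by extraction)"),
])
clean = Validator().extract_validated(draft)
print([f.name for f in clean.fields])
```

Returning a new Entry rather than filtering in place keeps the unvalidated proposals available to the annotator, which matches the glossary's wording of extracting "into a new" Entry.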
Appendix B: List of Figures
Figure 1: A conceptual model of the problem domain.
Figure 2: Overall organisation of the system.
Figure 3: The "start up" system events.
Figure 4: The "annotate protein" system events.
Figure 5: The "annotate protein fingerprint" system events.
Figure 6: The "annotate protein family" system events.
Figure 7: The "annotate protein super family" system events.
Figure 8: The "annotate protein domain" system events.
Figure 9: The "gather information from the literature" system events.
Figure 10: The "incorporate new model" system events.
Figure 11: The "administrate user" system events.
Figure 12: Design class diagram: overview.
Figure 13: "Resolve information request" sequence diagram.
Figure 14: "Retrieve entries" sequence diagram.
Figure 15: "Save entries" sequence diagram.
Figure 16: "Validate information" sequence diagram.
Figure 17: "Validate entry" sequence diagram.
Figure 18: "Start up" sequence diagram.
Figure 19: "Annotate protein" sequence diagram.
Figure 20: "Annotate protein fingerprint" sequence diagram.
Figure 21: "Annotate protein family" sequence diagram.
Figure 22: "Annotate protein superfamily" sequence diagram.
Figure 23: "Annotate protein domain" sequence diagram.
Figure 24: "Gather information from the literature" sequence diagram.
Figure 25: "Incorporate new model" sequence diagram.
Figure 26: "Administrate user" sequence diagram.
Figure 27: Log-in window.
Figure 28: Empty BioMinT window.
Figure 29: Composing the retrieval query.
Figure 30: SwissProt assistance GUI for a protein-related retrieval query.
Figure 31: GUI for showing retrieval results.
Figure 32: GUI for composing extraction queries.
Figure 33: GUI for displaying extraction results.
Figure 34: Annotate protein fingerprint GUI: entry view.
Figure 35: Annotate protein fingerprint GUI: annotation view.
Figure 36: User administration: add new user(s) or modify user details.
Figure 37: User administration: remove existing user(s).
Figure 38: GUI for incorporating a model.
Figure 39: Static structure of the data import module.
Figure 40: Static structure of the data export module.
Figure 41: Static structure of the domain knowledge base module.
Figure 42: Dynamic structure of the domain knowledge base module.
Figure 43: Static structure of the data mining module.
Figure 44: Dynamic structure of the data mining module.
Figure 45: Static structure of the natural language understanding module.
Figure 46: Dynamic structure of the natural language understanding module.
Figure 47: Static structure of the program interface module.
Figure 48: Dynamic structure of the program interface module.