|
The use of DMax Assistant™ does not require any understanding of the data analysis technology used. This section is merely intended for those interested in what is going on under the hood.The analysis is performed in two steps:
In Step 1, building blocks of background knowledge are combined to form hypotheses.
- Step 1: generation of individual hypotheses that explain high/low observations
- Step 2: combination of the hypotheses generated in Step 1 to produce a model that predicts observations
The user can determine the list of building blocks as well as the relations that can be used to describe larger concepts in terms of those building blocks. While doing so, the user defines a search space of candidate hypotheses.
When looking for the next hypothesis, DMax Assistant™ will scan that search space starting with the simplest candidates, i.e., those consisting of a single building block. For each candidate, a score (essentially, a p-value) is computed based on how well it explains so far unexplained high/low observations. The best candidates of size one will become the seeds from which slightly more complex hypotheses are grown. To grow a hypothesis, it is combined with another building block using either a simple conjunction (cf. mainstream data mining tools) or a relationship from the domain specific background knowledge.
Again, all candidates -now compositions of size two- are evaluated (i.e., p-values assigned) and the best ones become the seeds for level three. This procedure is repeated until no more valid seeds for the next level are found, or until the user-defined maximal search depth is reached (see Section Create hypotheses for information on the advanced parameters).
Finally, all candidates found at the different levels are ranked and the best one is kept. This hypothesis is then validated on a separate test set, and added to the report.
In Step 2, the hypotheses found in Step 1 are associated with binary features. Such a feature has value one if and only if the corresponding hypothesis applies. A given data set can then be transformed to a bit matrix with examples in the rows and hypothesis-based binary features in the columns. Such a representation is compatible with a large number of approaches to predictive modeling, both for regression (i.e., predicting numbers) and classification (predicting categories). Whatever the method used at this stage, our two-step approach has the advantage that:
- the binary features are not taken from a fixed, predefined, or even secret set of descriptors, but generated
- specifically for your dataset;
- to distinguish between high/low values and/or categories that you observed;
- using your terminology, and the background knowledge components that you selected;
- for each predicted example, you can drill down to the level of the hypotheses that apply (i.e., the 1's in the bit vector) to
- understand the rationale behind the prediction;
- verify the reliability of each applicable hypothesis;
- further drill down to the supporting examples for each applicable hypothesis
|
© 2002-2007 PharmaDM, NV. All rights reserved.