The DMax Chemistry Assistant™ Tutorial

Session: The DMax Assistant™ product family

Statistics

The Statistics panel consists of two parts. It looks different for numerical and categorical targets, but the main structure of the panel is the same. We first explain the numerical case. More information on statistics for categorical targets is given below.

To the left, you find numbers related to statistical hypothesis testing. Note that these numbers were obtained from a validation set, i.e., a portion of the data that was set aside initially and thus not used to generate the hypothesis. The table contains two columns of numbers:

  1. The first column relates to the overall population, i.e., the total validation set.
  2. The second column relates to the subset of examples covered by the hypothesis.
In the screen shot above, the overall validation set (typically 1/3 of the total data set) contains 14121 cases, and the average target value is -4.54. The subpopulation covered by the hypothesis contains 20 cases, i.e., 0.14% of the validation set. The average target value within this subset is -6.3, i.e., 1.76 below the average of the overall population. These numbers already indicate that the average target value of the subset deviates (by -1.76) from what you would expect given the overall population.
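
As a quick check of the arithmetic, the coverage and the deviation quoted above can be recomputed from the four numbers in the table. The following is a minimal Python sketch; the variable names are chosen for illustration only and are not part of the DMax product.

```python
# Figures quoted in the tutorial text for the numerical example.
validation_size = 14121      # cases in the overall validation set
overall_mean = -4.54         # average target value, overall population
subset_size = 20             # cases covered by the hypothesis
subset_mean = -6.3           # average target value, subset

# Share of the validation set covered by the hypothesis, in percent.
coverage_percent = 100 * subset_size / validation_size

# Deviation of the subset average from the overall average.
deviation = subset_mean - overall_mean

print(f"coverage:  {coverage_percent:.2f}%")   # coverage:  0.14%
print(f"deviation: {deviation:.2f}")           # deviation: -1.76
```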

The degree of unexpectedness of a particular deviation can be quantified with the p-value. The larger the subpopulation and the more its average target value deviates, the lower the p-value. In the above screen shot, the probability that an average at least as deviating is observed by chance in a subset of size 20 is 8.44E-9.
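
The tutorial does not state which test produces the p-value in the numerical case. As an illustration only, the sketch below uses a two-sided normal approximation (a z-test against the spread of the overall population), which is one common way to obtain such a number. The function name `subset_mean_p_value` and the data are hypothetical, and the sketch does not attempt to reproduce the 8.44E-9 from the screen shot, because the standard deviation of the validation set is not given.

```python
import math
import statistics


def subset_mean_p_value(population, subset):
    """Two-sided p-value for the subset mean under a normal approximation.

    Assumption: the subset is compared against a random draw of the same
    size from the population, whose spread is the population standard
    deviation. This is an illustrative z-test, not the product's method.
    """
    pop_mean = statistics.fmean(population)
    pop_sd = statistics.pstdev(population)
    # The standard error of the mean shrinks with the square root of the size.
    standard_error = pop_sd / math.sqrt(len(subset))
    z = (statistics.fmean(subset) - pop_mean) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail mass of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Synthetic validation set: target values spread evenly between -10 and 10.
population = [x / 10 for x in range(-100, 101)]

small_subset = [-9.0, -8.5, -8.0]
large_subset = small_subset * 2   # same average, twice as many cases

p_small = subset_mean_p_value(population, small_subset)
p_large = subset_mean_p_value(population, large_subset)
print(p_small, p_large)
```

With the same average, the larger subset yields the lower p-value, which matches the rule stated above.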

We use the following ranges to link p-values to a significance level:

Hence, the hypothesis in the screen shot is assigned the significance level 'very high'.

The histogram to the right shows the distributions of the target value in the overall population (grey) and in the subset covered by the hypothesis (colored).

In both cases, the line below the X-axis marks the average. For strong hypotheses, you will see that the colored (i.e., subset) marker indeed deviates from the grey one, and that the colored histogram is skewed toward either low (red) or high (green) values.

By dragging the mouse over the histogram with the left button held down, you can highlight adjacent colored bins.

A yellow line surrounds the highlighted bins. By pressing the Select button in the histogram header, you can select the examples from the highlighted bins in the bottom panel. To remove the highlighting, just right-click in the histogram (and press the Select button again to select all examples).

For categorical targets, the statistics panel looks like this:

The left panel is roughly equivalent to the numerical case, except that percentages of a particular category are shown instead of averages. For instance, in the example above, 21.3% of the examples in the total population (of size 76) belong to class 'weak'. In the subgroup (of size 43) covered by the hypothesis, however, only 9.3% of the examples belong to class 'weak', i.e., 12 percentage points less than in the overall set. This difference, in combination with the size of the subgroup, results in a p-value (based on a binomial test) of 0.03.
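
The binomial test mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the count of 4 'weak' examples is inferred from 9.3% of 43, the test is taken to be one-sided (lower tail), and the helper name `binomial_lower_tail` is hypothetical.

```python
import math


def binomial_lower_tail(k, n, p):
    """Probability of observing at most k successes in n trials at rate p."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k + 1)
    )


subgroup_size = 43
overall_weak_rate = 0.213                          # 21.3% 'weak' overall
# Assumption: 9.3% of 43 examples corresponds to 4 'weak' examples.
weak_in_subgroup = round(0.093 * subgroup_size)

p_value = binomial_lower_tail(weak_in_subgroup, subgroup_size, overall_weak_rate)
print(f"{p_value:.2f}")   # 0.03
```

Under these assumptions the result rounds to the 0.03 quoted in the text; a two-sided variant of the test would give a different value.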

The right panel contains two overlapping pie charts. The outer pie chart shows the class distribution in the overall set, with about 20% of the examples belonging to class 'weak', shown in blue. The inner pie chart shows the class distribution in the subgroup covered by the hypothesis; the blue 'weak' segment is clearly smaller in the subgroup.

You can click on other segments in the pie charts to align and compare alternative classes.



© 2002-2007 PharmaDM, NV. All rights reserved.