The Statistics panel consists of two parts. It looks different for numerical and categorical targets, but the main structure of the panel is the same. We first explain the numerical case. More information on statistics for categorical targets is given below.
To the left, you find some numbers related to statistical hypothesis testing. Note that these numbers were obtained from a validation set, i.e., a portion of the data that was set aside initially and thus not used to generate the hypothesis. The table contains two columns of numbers:
- the first column relates to the overall population, i.e., the total validation set;
- the second column relates to the subset of examples covered by the hypothesis.
In the screenshot above, the overall validation set (typically 1/3 of the total data set) contains 14121 cases, and the average target value is -4.54. The subpopulation covered by the hypothesis contains 20 cases, i.e., 0.14% of the validation set. The average target value within this subset is -6.3, i.e., 1.76 below the average in the overall population. These numbers already indicate that the average target value of the subset deviates (by -1.76) from what you would expect based on the overall population.
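The coverage and deviation figures quoted for the screenshot can be re-derived with a few lines of arithmetic. The sketch below uses only the numbers mentioned in the text; the variable names are illustrative and not part of the tool.

```python
# Numbers quoted in the text for the numerical-target screenshot.
total_cases = 14121      # size of the validation set
subset_cases = 20        # cases covered by the hypothesis
total_average = -4.54    # average target value in the validation set
subset_average = -6.3    # average target value in the subset

# Share of the validation set covered by the hypothesis, in percent.
coverage_pct = 100.0 * subset_cases / total_cases

# Difference between the subset average and the overall average.
deviation = subset_average - total_average

print(f"coverage:  {coverage_pct:.2f}%")
print(f"deviation: {deviation:.2f}")
```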
The degree of unexpectedness of a particular deviation can be quantified with the p-value. The larger the subpopulation and the more its average target value deviates, the lower the p-value. In the screenshot above, the probability of observing an average that deviates at least this much in a subset of size 20 is 8.44E-9.
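How subset size and deviation drive the p-value can be illustrated with a rough sketch. The text does not say which test the tool applies to numerical targets, so the two-sided z-test below and the standard deviation `sigma` are assumptions made purely for illustration. The sketch is not expected to reproduce the 8.44E-9 from the screenshot.

```python
import math

def z_test_p_value(deviation, subset_size, sigma):
    """Two-sided p-value for a subset mean that deviates from the overall mean.

    Assumption: normal approximation with a known standard deviation `sigma`.
    This is an illustrative stand-in, not the tool's documented method.
    """
    z = abs(deviation) * math.sqrt(subset_size) / sigma
    return math.erfc(z / math.sqrt(2))  # same as 2 * (1 - Phi(z))

sigma = 1.5  # hypothetical value, not taken from the screenshot

print(z_test_p_value(-1.76, 20, sigma))  # deviation and size quoted in the text
print(z_test_p_value(-1.76, 5, sigma))   # smaller subset: higher p-value
print(z_test_p_value(-0.50, 20, sigma))  # smaller deviation: higher p-value
```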
We use the following ranges to link p-values to a significance level:
- p-value > 0.5: significance level very low;
- 0.5 > p-value > 0.05: significance level low;
- 0.05 > p-value > 0.005: significance level medium;
- 0.005 > p-value > 0.0005: significance level high;
- 0.0005 > p-value: significance level very high.
Hence, the hypothesis in the screenshot (p-value 8.44E-9) gets significance level very high.
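The ranges above amount to a simple threshold lookup, sketched below. The function name is hypothetical. The ranges use strict inequalities and do not say where a value exactly on a boundary belongs, so assigning it to the more significant level is an assumption of this sketch.

```python
def significance_level(p_value):
    """Map a p-value to the significance level used in the Statistics panel.

    Assumption: a p-value exactly on a boundary (e.g. 0.05) is assigned to
    the more significant of the two neighbouring levels.
    """
    if p_value > 0.5:
        return "very low"
    if p_value > 0.05:
        return "low"
    if p_value > 0.005:
        return "medium"
    if p_value > 0.0005:
        return "high"
    return "very high"

print(significance_level(8.44e-9))  # p-value from the numerical screenshot
```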
The histogram to the right shows the distributions of the target value:
- in the total data set (so not only in the validation set, cf. the numbers to the left): this distribution is drawn as a grey line;
- in the subset of the total data set that is covered by the hypothesis: this distribution is shown in shades of red (below the overall average) and green (above the overall average).
In both cases, the line below the X-axis indicates the average. For strong hypotheses, you will see that the colored (i.e., subset) marker indeed deviates from the grey one, and that the colored histogram is biased towards either low (red) or high (green) values.
By dragging the mouse over the histogram with the left button held down, you can highlight adjacent colored bins.
A yellow line surrounds the highlighted bins. By pressing the Select button in the histogram header, you can select the examples from the highlighted bins in the bottom panel. To remove the highlighting, right-click in the histogram (and press the Select button again to select all examples).
For categorical targets, the statistics panel looks like this:
The left panel is roughly equivalent to the numerical case, except that percentages of a particular category are shown instead of averages. For instance, in the example above, 21.3% of the total population (of size 76) belongs to class 'weak'. In the subgroup covered by the hypothesis (of size 43), however, only 9.3% of the examples belong to class 'weak', i.e., 12 percentage points less than in the overall set. This difference, in combination with the size of the subgroup, results in a p-value (based on a binomial test) of 0.03.
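The binomial test mentioned here can be sketched with the Python standard library. Two assumptions go beyond the text: the count of 4 'weak' examples is inferred from 9.3% of 43, and the test is taken to be one-sided (the probability of seeing this few 'weak' examples or fewer). The function name is illustrative.

```python
import math

def binomial_lower_tail(successes, trials, prob):
    """P(X <= successes) for X ~ Binomial(trials, prob)."""
    return sum(
        math.comb(trials, k) * prob**k * (1.0 - prob) ** (trials - k)
        for k in range(successes + 1)
    )

subgroup_size = 43
weak_in_subgroup = 4   # assumption: inferred from 9.3% of 43
overall_rate = 0.213   # share of 'weak' in the overall population

p_value = binomial_lower_tail(weak_in_subgroup, subgroup_size, overall_rate)
print(f"p-value: {p_value:.3f}")
```

Under these assumptions the result comes out at roughly 0.03, in line with the value shown in the panel.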
The right panel contains two overlapping pie charts. The outer pie chart shows the class distribution in the overall set, with about 20% of the examples belonging to class 'weak', shown in blue. The inner pie chart shows the class distribution in the subgroup covered by the hypothesis: the blue 'weak' segment is clearly smaller in the subgroup.
You can click on other segments in the pie charts to align and compare alternative classes.
© 2002-2007 PharmaDM, NV. All rights reserved.