Exploratory pattern discovery normally carries with it a high risk of finding spurious patterns. These are patterns that appear to capture interesting associations, but do so only due to chance. The use of statistical tests with corrections for multiple comparisons can guard against this risk. However, the better such measures guard against the risk of finding spurious patterns, the greater the risk that they will discard sound patterns. Because of the large number of potential patterns that Magnum Opus must consider during pattern discovery, adequate control of the risk of finding spurious patterns would entail discarding most patterns that were considered.
Magnum Opus provides a unique facility for statistically sound exploratory data mining. It requires two sets of data: the exploratory data and the holdout data. The exploratory data are used to find a set of rules or itemsets that satisfy the search settings selected by the user. The holdout data are then used to evaluate the risk that these patterns are spurious. Because only a limited set of patterns is evaluated on the holdout data, the correction for multiple comparisons is far less severe, so it is possible to provide strong control over the risk of accepting spurious patterns without undue risk of discarding sound patterns.
The appropriate criteria for assessing whether or not a rule or itemset should be considered spurious will vary from application to application. Magnum Opus provides a suite of holdout evaluation tests from which the user can select. All selected tests are applied to each pattern and any pattern that fails any test is identified as having failed holdout evaluation.
The user also specifies a significance level to be used with the holdout tests. This level is adjusted using a Bonferroni adjustment, which divides the significance level by the number of patterns to be tested. The resulting critical value is applied to all holdout statistical tests. This ensures that the probability of any spurious pattern being accepted is no higher than the user-specified significance level.
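The adjustment described above can be sketched in a few lines of Python. This is only an illustration, not Magnum Opus code: the function names are hypothetical, and the sketch assumes that a test is passed when its p-value is at or below the adjusted critical value.

```python
def bonferroni_critical_value(significance_level: float, num_patterns: int) -> float:
    """Divide the user-specified significance level by the number of patterns tested."""
    if num_patterns < 1:
        raise ValueError("num_patterns must be at least 1")
    return significance_level / num_patterns


def passes_holdout(p_values, critical_value: float) -> bool:
    """A pattern passes only if every selected test rejects its null hypothesis.

    Assumption: a test rejects its null hypothesis when p <= critical_value.
    """
    return all(p <= critical_value for p in p_values)


# Hypothetical usage: 100 patterns found on the exploratory data, tested at 0.05.
critical = bonferroni_critical_value(0.05, 100)
pattern_ok = passes_holdout([0.0001, 0.0004], critical)
```

With a significance level of 0.05 and 100 patterns, each individual test is held to a critical value of 0.0005. Applying the same adjustment to every pattern considered during the exploratory search would give a far smaller critical value, which is why restricting the tests to the patterns found on the exploratory data helps.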
The following holdout evaluation tests are supported for rules.
| Test | Null Hypothesis | Statistical technique |
| Minimum Coverage | Coverage ≤ Min Coverage | Binomial sign test |
| Minimum Support | Support ≤ Min Support | Binomial sign test |
| Minimum Strength | Strength ≤ Min Strength | Binomial sign test |
| Minimum Lift | Lift ≤ Min Lift | Binomial sign test |
| Minimum Leverage | Leverage ≤ Min Leverage | Binomial sign test |
| Positive correlation | Support ≤ Coverage × RHS_Coverage | Fisher exact test |
| Improvement over generalizations | Strength ≤ the maximum Strength of any generalization of the current rule | Fisher exact test |
| Partial with respect to specializations | There exists another rule GLHS -> RHS in the set of best rules, that has not been rejected by holdout evaluation, that is a specialization of the current rule, and such that the LHS and RHS of the current rule are conditionally independent given the negation of GLHS. | Fisher exact test |
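To give a feel for the two statistical techniques named in the table, the sketch below computes one-sided p-values from holdout counts. It is not Magnum Opus's implementation. The function names and example counts are hypothetical, and the "binomial sign test" is assumed here to be a one-sided exact binomial test of an observed proportion against a minimum value.

```python
from math import comb, exp, lgamma, log


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0), computed in log space.

    Assumed use: k holdout cases satisfy the rule out of n, and p0 is the
    user's minimum (e.g. Min Support as a proportion). A small value is
    evidence against the null hypothesis "true proportion <= p0".
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    log_p, log_q = log(p0), log(1.0 - p0)
    total = 0.0
    for i in range(max(k, 0), n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += exp(log_term)
    return min(total, 1.0)


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test for positive association in a 2x2 table.

    a: cases with LHS and RHS      b: cases with LHS but not RHS
    c: cases with RHS but not LHS  d: cases with neither
    Returns the hypergeometric probability of seeing at least `a` joint cases.
    """
    n_lhs, n_rhs, total = a + b, a + c, a + b + c + d
    numerator = sum(
        comb(n_rhs, i) * comb(total - n_rhs, n_lhs - i)
        for i in range(a, min(n_lhs, n_rhs) + 1)
    )
    return numerator / comb(total, n_lhs)


# Hypothetical holdout counts for a rule LHS -> RHS over 1000 cases.
p_support = binomial_upper_tail(k=30, n=1000, p0=0.01)      # Minimum Support
p_correlation = fisher_exact_greater(a=30, b=20, c=70, d=880)  # Positive correlation
```

Each p-value would then be compared with the Bonferroni-adjusted critical value rather than with the raw significance level.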
The following holdout evaluation tests are supported for itemsets.
| Test | Null Hypothesis | Statistical technique |
| Minimum Coverage | Coverage ≤ Min Coverage | Binomial sign test |
| Minimum Leverage | Leverage ≤ Min Leverage | Binomial sign test |
| Improvement over generalizations | Coverage ≤ the maximum of coverage(A) × coverage(B) for any partition of the current itemset into two subsets A and B. | Fisher exact test |
| Self-sufficient | Coverage ≤ the maximum of coverage(A) × coverage(B) for any partition of the current itemset into two subsets A and B within the set of cases not covered by the difference between the current itemset and any of its productive supersets. | Fisher exact test |
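The quantity compared against in the Improvement over generalizations test for itemsets, the maximum of coverage(A) × coverage(B) over all two-way partitions, can be illustrated as follows. This is a minimal sketch under stated assumptions: coverage is taken to be the proportion of cases containing every item, cases are represented as Python sets, and all function names are hypothetical. It shows only the quantity in the null hypothesis, not the Fisher exact test itself.

```python
from itertools import combinations


def coverage(itemset, cases) -> float:
    """Proportion of cases that contain every item in the itemset (assumed definition)."""
    items = set(itemset)
    return sum(1 for case in cases if items <= case) / len(cases)


def two_way_partitions(itemset):
    """Yield each split of the itemset into two non-empty subsets A and B exactly once."""
    items = sorted(itemset)
    if len(items) < 2:
        raise ValueError("an itemset needs at least two items to be partitioned")
    first, rest = items[0], items[1:]
    # Keeping `first` in A avoids yielding both (A, B) and (B, A).
    for size in range(len(rest)):  # B always keeps at least one item
        for extra in combinations(rest, size):
            a = {first, *extra}
            yield a, set(items) - a


def max_partition_product(itemset, cases) -> float:
    """Largest coverage(A) * coverage(B) over all two-way partitions of the itemset."""
    return max(coverage(a, cases) * coverage(b, cases)
               for a, b in two_way_partitions(itemset))


# Hypothetical holdout cases.
cases = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
observed = coverage({"a", "b"}, cases)
expected_if_independent = max_partition_product({"a", "b"}, cases)
```

An itemset of k items has 2^(k-1) - 1 such partitions, so the enumeration is practical only for itemsets of modest size.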
Magnum Opus prints rules and itemsets that fail one or more holdout tests after those that pass the holdout evaluation process. A summary line shows the adjusted critical value. Each rule or itemset is then listed. Following each rule, a summary line presents the following information:
For itemsets, the summary line contains the holdout coverage only.
The holdout evaluation summary line is followed by a further line summarizing the result of each test that the rule or itemset failed.
Note that setting the significance level used with the Filter-out Insignificant mode to a very low value is likely to decrease the number of rules or itemsets that holdout evaluation rejects, because they are instead rejected during the exploratory search. However, doing so is also likely to result in many potentially valuable rules or itemsets being rejected before they are even subjected to holdout evaluation.
If a single data set is to be sampled to form the exploratory and holdout sets, we recommend using a 50% sample so as to create exploratory and holdout sets of equal size. Reducing the amount of exploratory data is likely to lead to interesting rules or itemsets being overlooked. Reducing the amount of holdout data increases the risk of rejecting non-spurious rules and itemsets.
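If the split is prepared outside Magnum Opus, a random 50% sample might be drawn along the following lines. This is a sketch only: the function name, the `seed` parameter, and the representation of the data as a list of cases are assumptions made for illustration.

```python
import random


def split_exploratory_holdout(cases, exploratory_fraction=0.5, seed=0):
    """Randomly split cases into (exploratory, holdout) lists.

    A fixed seed makes the split reproducible; each case lands in exactly one set.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * exploratory_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage with ten numbered cases.
exploratory, holdout = split_exploratory_holdout(range(10))
```

Shuffling before cutting matters when the original file is ordered, for example by date, since otherwise the two sets could differ systematically.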
If holdout evaluation is to be used, the holdout data must be specified during data import. See Data Import.
For more information on Holdout Evaluation see Discovering Significant Patterns.
| © G I WEBB & ASSOCIATES 1999-2007 | Last updated October 2007 |