This page illustrates GritBot's ability to find possible anomalies in data. Like all good [ro]bots, GritBot does its job without direction -- the user does not have to specify the nature of anomalies or how to find them. This is illustrated with data from several application areas. Times are for a Pentium D 925 PC.
A possible anomaly exists if a case's value for one attribute is surprising when compared with corresponding values from a subset of cases. Such an anomaly is reported in the following pattern:
case identification: [significance]
    anomalous value (N cases, reason)
      condition 1
      condition 2
      . . .
      condition K
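Here the conditions, if any, define the subset of N cases with which the value is compared; with no conditions, every case with a known value of the attribute is relevant.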
Attribute                    Assay 1     Assay 2     Assay 3       .....
age                          32          63          19
sex                          F           M           M
on thyroxine                 t           f           f
query on thyroxine           f           f           f
on antithyroid medication    f           f           f
sick                         f           f           f
pregnant                     t           N/A         N/A
thyroid surgery              f           f           f
I131 treatment               f           f           f
query hypothyroid            f           f           t
query hyperthyroid           t           f           f
lithium                      f           f           f
tumor                        f           f           f
goitre                       f           f           f
hypopituitary                f           f           f
psych                        f           f           f
TSH                          0.025       108         9
T3                           3.7         .4          2.2
TT4                          139         14          117
T4U                          1.34        .98         -
FTI                          104         14          -
referral source              other       SVI         other
diagnosis                    negative    primary     compensated
                                         hypothyr    hypothyr
This application's data and test files contain 3772 cases in total. GritBot takes 0.6 seconds on a Pentium D 925 PC to identify four possible anomalies:
data case 1365: (label 861) [0.002]
    age = 455 (3771 cases, mean 52, 99.97% <= 94)

test case 373: (label 769) [0.007]
    T3 = 7.6 (1476 cases, mean 2.03, 99.93% <= 4.2)
      TT4 > 45 and <= 155 [120]
      T4U > 0.9 and <= 1.2 [1.04]

data case 2224: (label 1562) [0.014]
    age = 75 (53 cases, mean 32, 96% <= 42)
      pregnant = t

data case 1610: (label 3023) [0.016]
    age = 73 (53 cases, mean 32, 96% <= 42)
      pregnant = t
The first possible anomaly concerns case number 1365 in this application's data file. There are no third or subsequent lines, so all 3771 cases with known values of the attribute age are relevant. This case has a patient age of 455 (!), whereas 99.97% of the cases -- all cases except this one -- have age values not exceeding 94.
The last two possible anomalies concern the 53 patients noted as being pregnant. Two of them are aged in their seventies whereas the average age of pregnant women is 32 and all the others are no older than 42. Note that the value of either "age" or "pregnant" could be faulty for each case, and there is no way to decide which is the culprit.
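The percentages in these reports can be cross-checked against the case counts with a little arithmetic. The sketch below is only an illustration: the helper name `percent_within_limit` is an assumption, and how GritBot itself rounds the figure is not stated here.

```python
def percent_within_limit(n_cases, n_exceptions):
    """Percentage of the relevant cases that satisfy the reported limit."""
    return 100.0 * (n_cases - n_exceptions) / n_cases


# 3771 cases with a known age, of which one (age 455) exceeds 94.
print(f"{percent_within_limit(3771, 1):.2f}%")   # 99.97%

# 53 pregnant patients, of whom two (ages 75 and 73) are older than 42.
print(f"{percent_within_limit(53, 2):.0f}%")     # 96%
```

Both results agree with the reported figures: one exception among 3771 cases and two exceptions among 53.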
The second possible anomaly focuses on the thyroid assay T3. Expert endocrinological knowledge would be needed to judge whether or not this T3 value is truly anomalous: the others clearly are!
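Because every report follows the same pattern, its fields are easy to pull apart for further processing. The following is a minimal sketch rather than anything supplied with GritBot: the regular expression, the function name `parse_entry` and the field names are assumptions based only on the entries shown on this page, and the condition lines are ignored.

```python
import re

# Hypothetical pattern inferred from the report entries on this page.
ENTRY = re.compile(
    r"(?:(?P<source>data|test)\s+)?case\s+(?P<case>\d+):\s*"
    r"(?:\(label\s+(?P<label>[^)]*)\)\s*)?"
    r"\[(?P<significance>[0-9.]+)\]\s*"
    r"(?P<attribute>.+?)\s*=\s*(?P<value>\S+)\s*"
    r"\((?P<n_cases>\d+)\s+cases,\s*(?P<reason>[^)]*)\)"
)


def parse_entry(text):
    """Return the fields of one report entry as a dict, or None if no match."""
    match = ENTRY.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["case"] = int(fields["case"])
    fields["significance"] = float(fields["significance"])
    fields["n_cases"] = int(fields["n_cases"])
    return fields


entry = parse_entry(
    "data case 1365: (label 861) [0.002]\n"
    "    age = 455 (3771 cases, mean 52, 99.97% <= 94)"
)
print(entry["attribute"], entry["value"], entry["n_cases"])
```

The anomalous value is kept as a string because reports can flag nominal values as well as numbers.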
data case 7110: [0.000]
    sex = Female (19716 cases, 99.99% `Male')
      relationship = Husband

data case 576: [0.002]
    sex = Male (2331 cases, 99.87% `Female')
      relationship = Wife

test case 5954: [0.002]
    class = >50K (4449 cases, 99.96% `<=50K')
      age <= 21 [20]
      education-num <= 12 [8]
      marital-status = Never-married
      capital-gain <= 7000 [0]

data case 15377: [0.004]
    class = <=50K (1357 cases, 99.71% `>50K')
      age > 36 and <= 59 [55]
      capital-gain > 7000 [34095]
The first two of these are obviously data errors and illustrate GritBot's ability to find suspicious nominal (discrete) values as well as odd-looking numerical values.
The training and test files contain a total of 5000 cases, each described by 21 attributes. GritBot analyzes them in 2.4 seconds and finds two possible anomalies:
test case 1570: [0.001]
    voice mail plan = yes (3678 cases, 99.97% `no')
      number vmail messages <= 0 [0]

data case 15: [0.016]
    class = 0 (75 cases, 99% `1')
      total day minutes <= 135 [120.7]
      number customer service calls > 3 [4]
The first highlights someone paying for a voice mail plan who has received no voice mail messages. The second describes a non-churning customer who is a light user but has numerous service calls.
test case 4190: (label AL2562/Al8 Dy Fe4/MN12 Th/tI26) [0.006]
    Magnetic = neg (352 cases, 99.4% `pos')
      Fe > 3 [4]
      Group = tI26
GritBot has found a subset of 352 cases, most of them magnetic, among which this non-magnetic case stands out. Since only 7% of the cases in the entire dataset are noted as being magnetic, this potential anomaly is indeed interesting.
GritBot requires 0.5 seconds to find 19 possible anomalies. Most of these are clear errors since the highlighted cases violate common-sense constraints, e.g. by the weight of a part being greater than the whole weight, or the maximum dimension (length) being less than other dimensions (e.g. diameter). Note that we did not tell GritBot about these constraints -- the anomalies were apparent in the data themselves. Here are a few examples:
data case 2628: [0.000]
    Shucked weight = 0.495 (37 cases, mean 0.0476, 95% <= 0.059)
      Whole weight > 0.105 and <= 0.1197 [0.1055]

data case 2052: [0.001]
    Height = 1.13 (4177 cases, mean 0.139, 99.95% <= 0.25)

data case 1211: [0.004]
    Length = 0.185 (392 cases, mean 0.47, 99.7% >= 0.425)
      Diameter > 0.347 and <= 0.377 [0.375]

data case 1258: [0.007]
    Height = 0 (154 cases, mean 0.11, 99.4% >= 0.08)
      Whole weight > 0.343 and <= 0.4328 [0.428]
      Shell weight > 0.1003 and <= 0.133 [0.115]

data case 648: [0.017]
    Whole weight = 0.777 (58 cases, mean 0.5108, 98% <= 0.578)
      Shucked weight <= 0.2238 [0.216]
      Viscera weight > 0.113 [0.13]
      Shell weight > 0.1257 and <= 0.191 [0.17]
      Rings <= 10 [9]
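For contrast with GritBot's undirected approach, the two common-sense constraints mentioned above could be written by hand roughly as follows. This is an illustrative sketch only: the function name `violated_constraints` and the dictionary representation of a case are assumptions and not part of GritBot, while the attribute names and values are taken from the report entries for cases 2628 and 1211.

```python
def violated_constraints(case):
    """Return descriptions of the hand-written checks that a case fails.

    `case` is a hypothetical dict mapping attribute names to numbers;
    a check is skipped when one of its attributes is missing.
    """
    problems = []

    shucked, whole = case.get("Shucked weight"), case.get("Whole weight")
    if shucked is not None and whole is not None and shucked > whole:
        problems.append("part heavier than whole")

    length, diameter = case.get("Length"), case.get("Diameter")
    if length is not None and diameter is not None and length < diameter:
        problems.append("length less than diameter")

    return problems


# Values reported by GritBot for two of the cases listed above.
case_2628 = {"Shucked weight": 0.495, "Whole weight": 0.1055}
case_1211 = {"Length": 0.185, "Diameter": 0.375}

print(violated_constraints(case_2628))
print(violated_constraints(case_1211))
```

Checks of this kind catch only the constraints someone thought to write down, whereas the anomalies reported above emerged from the data without any such rules being supplied.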
GritBot finds two possible anomalies in these cases (in 1.4 seconds):
case 550: [0.009]
    A30 = C (657 cases, 99.8% `G')
      A34 = G
      class = EI

case 839: [0.010]
    A28 = T (606 cases, 99.8% `A')
      A27 = C
      class = IE
There are 657 exon-intron junction cases that have G in position A34; all of them except case 550 also have a G in position A30. Similarly, among the 606 intron-exon junction cases in which the residue in position A27 is C, all except case 839 have A in position A28.
In less than 0.1 seconds, GritBot identifies three possible anomalies in this well-studied data:
case 78: [0.009]
    leafspot-size = gt-1/8 (221 cases, 99.5% `N/A')
      leafspots-marg = N/A

case 614: [0.012]
    stem = norm (191 cases, 99.5% `abnorm')
      stem-cankers = above-sec-nde

case 558: [0.015]
    seed = abnorm (405 cases, 99.8% `norm')
      fruiting-bodies = absent
      mold-growth = absent
      seed-discolor = absent
The first case has large leafspots but their margin is shown as not applicable; the second notes the presence of stem cankers, but the stem is stated to be normal.
All the examples above were run using GritBot's default option settings. GritBot provides mechanisms
After a dataset has been analyzed by GritBot, the regularities that it discovered can be saved and used to inspect new data. Furthermore, the types of potential anomalies identified in new data can be quite different from those found in the original data!
If you would like to learn more about how to use GritBot, please see the tutorial.
| © RULEQUEST RESEARCH 2007 | Last updated April 2007 |