This page should give you a feel for the kinds of results achievable with C5.0 and See5. For brevity, we'll now refer just to See5 (Release 2.05), the Windows implementation, with the understanding that C5.0 can generate the same results.
So what makes See5 different? One short answer is its attention to the issue of comprehensibility. At RuleQuest, we believe that a data mining system should find patterns that provide insight in addition to supporting accurate predictions. In line with this approach, See5 emphasizes rule-based classifiers because they are easier to understand -- each rule can be examined and validated separately, without having to consider it in the context of the classifier as a whole.
You'll also notice that See5 is fast -- each of these examples requires only a few seconds on a 2.4GHz Core 2 Duo. (See5 can also generate decision trees, useful in situations where classifiers must be constructed even more quickly.)
Without more ado, let's jump straight in ...
Each case describes over 120 properties of a substance, such as the number of atoms of each element that it contains, the number of atoms belonging to each periodic table family, density, crystal structure group, and so on. From 1,750 cases that are labeled pos and a further 22,891 labeled neg to indicate whether or not they are magnetic, See5 takes 0.7 seconds to construct a theory consisting of 18 rules. Magnetic substances are described by six rules:
Rule 1: (367/6, lift 13.8)
Family 6 <= 0
Family 4' <= 0
Group = tI26
-> class pos [0.981]
Rule 2: (213/4, lift 13.8)
Family 3 > 0
Family 10' > 1
Group in {hP6, tP68}
-> class pos [0.977]
Rule 3: (981/27, lift 13.7)
Family 3 > 0
Family 6 <= 0
Family 4' <= 0
Group in {hP38, hR19, tI26, hH6, hH38, tP88, tP92}
-> class pos [0.972]
Rule 4: (1493/52, lift 13.6)
B <= 2
Family 3 > 0
Family 4 <= 4
Family 6 <= 0
Family 1' <= 0
Family 3' <= 0
Family 4' <= 0
Group in {hP6, hP38, hR19, tI26, tP68, hH6, hH38, tP88, tP92}
-> class pos [0.965]
Rule 5: (1555/59, lift 13.5)
Si <= 0
Ga <= 0
In <= 0
Tl <= 0
Family 5 <= 0
Family 6 <= 0
Family 4' <= 0
Group in {hP6, hP38, hR19, tI26, tP68, hH6, hH38, tP88, tP92}
-> class pos [0.961]
Rule 6: (46/2, lift 13.2)
B > 2
Family 3 > 10
Group in {hP6, tI26, tP68}
-> class pos [0.938]
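Each rule is characterized by the statistics (N/E, lift L), where N is the number of training cases covered by the rule, E (if shown) is the number of those cases that do not belong to the class the rule predicts, and L is the rule's estimated accuracy divided by the relative frequency of the predicted class in the training data. The value in square brackets after the class is the rule's estimated accuracy (its confidence).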
The first rule for non-magnetic substances is also interesting:
Rule 7: (2254, lift 1.1)
Family 3 <= 0
Family 5 > 0
-> class neg [1.000]
This simple rule, paraphrased as "a substance containing no elements of Family 3 but at least one element of Family 5 is not magnetic", covers over two thousand training cases with no exceptions.
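The bracketed value and the lift printed with each rule can be checked by hand. The sketch below assumes that the bracketed confidence is the Laplace-style ratio (N - E + 1) / (N + 2) and that lift is this confidence divided by the predicted class's share of the training cases; both formulas are assumptions that happen to reproduce the figures shown above, and the function names are illustrative only.

```python
def rule_confidence(n_covered, n_errors=0):
    """Assumed Laplace-style estimate: (N - E + 1) / (N + 2)."""
    return (n_covered - n_errors + 1) / (n_covered + 2)


def rule_lift(confidence, class_count, total_count):
    """Confidence divided by the class's relative frequency in the training data."""
    return confidence / (class_count / total_count)


# Training set sizes quoted in the text: 1,750 pos and 22,891 neg cases.
N_POS, N_NEG = 1750, 22891
N_TOTAL = N_POS + N_NEG

# Rule 1: (367/6, lift 13.8) -> class pos [0.981]
conf_1 = rule_confidence(367, 6)
lift_1 = rule_lift(conf_1, N_POS, N_TOTAL)

# Rule 7: (2254, lift 1.1) -> class neg [1.000]
conf_7 = rule_confidence(2254)
lift_7 = rule_lift(conf_7, N_NEG, N_TOTAL)

print(f"Rule 1: confidence {conf_1:.3f}, lift {lift_1:.1f}")
print(f"Rule 7: confidence {conf_7:.3f}, lift {lift_7:.1f}")
```

Under these assumptions the computed values round to the printed ones. This also shows why a rule for the rare pos class can have a lift near 14 while an exception-free rule for the common neg class has a lift of only about 1.1.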
The same rules were used to classify a further 10,000 unseen cases (that is, substances that were not available to See5 when the theory was constructed). The result -- 99.7% accuracy on these test cases!
This application demonstrates that See5 can speedily process large, high-dimensional datasets to yield interpretable, accurate models.
The second application uses data extracted from the US Census Bureau Database and available from the UCI KDD Archive. Nearly 200,000 individuals are described by seven numeric and 33 nominal attributes, and the task is to predict whether the individual's income is above or below $50,000.
The cases are split into training and test sets of essentially equal size. From the 99,762 cases in the former, See5 requires 5.9 seconds to produce a classifier consisting of 83 rules, 32 for low-income individuals and 51 for those with high income. Here are some examples:
Rule 1: (19425/1, lift 1.1)
family members under 18 = Both parents present
-> class - 50000 [1.000]
Rule 6: (1603/8, lift 1.1)
education = 5th or 6th grade
capital losses <= 1876
-> class - 50000 [0.994]
Rule 33: (32, lift 15.7)
dividends from stocks > 33000
weeks worked in year > 47
-> class 50000+ [0.971]
Rule 42: (132/12, lift 14.6)
education = Prof school degree (MD DDS DVM LLB JD)
dividends from stocks > 991
weeks worked in year > 47
-> class 50000+ [0.903]
The 83-rule classifier constructed by See5 correctly categorizes 95% of the 99,761 unseen test cases.
The third case study uses data from the data archive at MLC++ that concerns telecommunications churn. These data are artificial but are claimed to be "based on claims similar to real world." Each case is described by 16 numeric and three nominal attributes and the data are divided into a training set of 3,333 cases and a separate test set containing 1,667. See5 takes only one-tenth of a second to find a 19-rule classifier from the training data, 7 rules for no churn and 12 for churn. Here again are some examples:
Rule 1: (2221/60, lift 1.1)
international plan = no
total day minutes <= 223.2
number customer service calls <= 3
-> class 0 [0.973]
Rule 3: (1972/87, lift 1.1)
total day minutes <= 264.4
total intl minutes <= 13.1
total intl calls > 2
number customer service calls <= 3
-> class 0 [0.955]
Rule 8: (60, lift 6.8)
international plan = yes
total intl calls <= 2
-> class 1 [0.984]
Rule 11: (32, lift 6.7)
total day minutes <= 120.5
number customer service calls > 3
-> class 1 [0.971]
This classifier has an accuracy of 95% on the unseen test cases.
The next example uses innovative data from Nick Kushmerick. There are 3,279 cases, each describing an image within an anchor tag in an HTML document. About 14% of these anchored images are banner advertisements, and the goal is to generate rules that predict whether an image is an ad. (Kushmerick's system AdEater uses this prediction to eliminate advertisement images and so speed up page downloading.)
This dataset is very high-dimensional -- there are 1,558 attributes, about half the number of cases! These features include three numbers -- image height, width, and aspect ratio -- together with boolean features representing the presence or absence of phrases in the image caption, its alt tag, and the anchor, image, and base URLs. For example, the attribute ancurl*http+www has the value 1 if the URL referred to in the anchor contains http followed by www (ignoring punctuation). More than a quarter of the cases have unknown values for one or more of the attributes.
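To make the phrase attributes concrete, here is a minimal sketch of how a feature such as ancurl*http+www might be computed from an anchor URL. The tokenization (split on anything that is not a letter or digit) is an assumption about what "ignoring punctuation" means, and the function name and example URLs are hypothetical; the original dataset may have been prepared differently.

```python
import re


def phrase_feature(url, phrase):
    """Return 1 if the '+'-separated terms of `phrase` occur as
    consecutive tokens in `url` (punctuation ignored), else 0."""
    terms = phrase.lower().split("+")
    # Assumed tokenization: runs of letters/digits, everything else is a separator.
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    for i in range(len(tokens) - len(terms) + 1):
        if tokens[i:i + len(terms)] == terms:
            return 1
    return 0


# Hypothetical anchor URLs, for illustration only.
with_www = phrase_feature("http://www.example.com/banner.gif", "http+www")
without_www = phrase_feature("https://example.com/www", "http+www")
single_term = phrase_feature("http://ads.example.com/img", "ads")
print(with_www, without_www, single_term)
```

Each of the boolean attributes is then one such test applied to one source of text (caption, alt tag, or one of the URLs), which is how a few thousand images end up with well over a thousand attributes.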
In 1.3 seconds See5 extracts 15 rules from these data, 14 describing ads and one for non-ads:
Rule 1: (65, lift 7.0)
ancurl*http+www = 1
ancurl*click = 0
-> class ad [0.985]
Rule 2: (39, lift 7.0)
ancurl*adclick = 1
-> class ad [0.976]
Rule 3: (103/2, lift 6.9)
url*ads = 0
ancurl*click = 1
-> class ad [0.971]
Rule 4: (20, lift 6.8)
ancurl*n+a = 1
-> class ad [0.955]
Rule 5: (152/6, lift 6.8)
url*ads = 1
-> class ad [0.955]
Rule 6: (16, lift 6.7)
ancurl*plx = 1
-> class ad [0.944]
Rule 7: (15, lift 6.7)
url*doubleclick.net = 1
-> class ad [0.941]
Rule 8: (14, lift 6.7)
alt*visit+our = 1
-> class ad [0.938]
Rule 9: (13, lift 6.7)
origurl*home.netscape.com = 1
-> class ad [0.933]
Rule 10: (110/8, lift 6.6)
alt*click+here = 1
alt*to = 0
-> class ad [0.920]
Rule 11: (10, lift 6.5)
origurl*zdnet.com = 1
-> class ad [0.917]
Rule 12: (9, lift 6.5)
origurl*jun = 1
-> class ad [0.909]
Rule 13: (51/4, lift 6.5)
ancurl*ad = 1
-> class ad [0.906]
Rule 14: (8, lift 6.4)
width > 196
url*images+home = 1
-> class ad [0.900]
Rule 15: (3127/313, lift 1.0)
url*ads = 0
-> class nonad [0.900]
On a ten-fold cross-validation, See5's rulesets correctly classify 97% of unseen cases, once again showing that simple theories can be useful too.
The data contain 4,601 cases described by 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters in the email. See5 needs 0.2 seconds to discover 21 rules that describe the spam email and nine rules for non-spam, e.g.:
Rule 2: (92, lift 2.5)
word_freq_font > 0.12
word_freq_edu <= 0.09
char_freq_; <= 0.895
-> class spam [0.989]
Rule 7: (53, lift 2.5)
word_freq_remove <= 0
word_freq_business > 0.5
char_freq_! > 0.824
-> class spam [0.982]
Rule 22: (737, lift 1.6)
word_freq_000 <= 0.25
word_freq_money <= 0.03
word_freq_george > 0.01
char_freq_! <= 0.378
-> class nospam [0.999]
Rule 25: (208, lift 1.6)
word_freq_meeting > 0.71
-> class nospam [0.995]
The first rule above isolates 92 of the 1,813 spam emails in the data on the strength of the relatively frequent appearance of the word font and the relatively infrequent appearance of edu and the character ";".
A ten-fold cross-validation on this dataset shows a quite useful accuracy of 94% on unseen cases.
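The three kinds of attribute used in this example (word percentages, character percentages, and upper case run lengths) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the dataset: words are assumed to be runs of letters and digits, word frequencies are taken as a percentage of all words, character frequencies as a percentage of all characters, and the capital_run_length_* names are illustrative.

```python
import re


def spam_features(text, words=("font", "edu"), chars=(";", "!")):
    """Compute a few spam-style numeric attributes for one email body."""
    # Assumed definition of a word: a maximal run of letters and digits.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    feats = {}
    for w in words:
        share = tokens.count(w) / len(tokens) if tokens else 0.0
        feats[f"word_freq_{w}"] = 100.0 * share
    for c in chars:
        share = text.count(c) / len(text) if text else 0.0
        feats[f"char_freq_{c}"] = 100.0 * share
    # Lengths of maximal runs of upper case letters.
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    feats["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    feats["capital_run_length_longest"] = max(runs, default=0)
    feats["capital_run_length_total"] = sum(runs)
    return feats


example = spam_features("FREE font font!")
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With attributes on this scale, a condition such as word_freq_font > 0.12 reads as "the word font makes up more than 0.12% of the words in the email".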
The data for this last example come from an assay screening service related to thyroid function and concern one aspect (hypothyroidism) of thyroid diagnosis. The attributes are a mixture of measured and calculated values and information obtained from the referring physician. There are four classes: negative, primary hypothyroid, secondary hypothyroid, and compensated hypothyroid. Let's show a few examples:
Attribute                   Assay 1     Assay 2               Assay 3                   ...
age                         32          63                    19
sex                         F           M                     M
on thyroxine                t           f                     f
query on thyroxine          f           f                     f
on antithyroid medication   f           f                     f
sick                        f           f                     f
pregnant                    t           N/A                   N/A
thyroid surgery             f           f                     f
I131 treatment              f           f                     f
query hypothyroid           f           f                     t
query hyperthyroid          t           f                     f
lithium                     f           f                     f
tumor                       f           f                     f
goitre                      f           f                     f
hypopituitary               f           f                     f
psych                       f           f                     f
TSH                         0.025       108                   9
T3                          3.7         .4                    2.2
TT4                         139         14                    117
T4U                         1.34        .98                   -
FTI                         104         14                    -
referral source             other       SVI                   other
diagnosis                   negative    primary hypothyroid   compensated hypothyroid
See5 processes 2,772 such cases in less than one-tenth of a second, giving seven rules for three of the classes. (The cases for the fourth class were too few in number to justify any rules.)
Rule 1: (31, lift 42.7)
thyroid surgery = f
TSH > 6
TT4 <= 37
-> class primary [0.970]
Rule 2: (63/6, lift 39.3)
TSH > 6
FTI <= 65
-> class primary [0.892]
Rule 3: (270/116, lift 10.3)
TSH > 6
-> class compensated [0.570]
Rule 4: (2225/2, lift 1.1)
TSH <= 6
-> class negative [0.999]
Rule 5: (296, lift 1.1)
on thyroxine = t
FTI > 65
-> class negative [0.997]
Rule 6: (240, lift 1.1)
TT4 > 153
-> class negative [0.996]
Rule 7: (29, lift 1.1)
thyroid surgery = t
FTI > 65
-> class negative [0.968]
Despite their simplicity, these rules are remarkably accurate -- their error rate on a further 1,000 unseen test cases is only half of one percent!
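As an illustration of how such a ruleset might be applied, the sketch below encodes the seven rules above and runs them on the three sample assays. Several choices here are assumptions rather than a description of See5 itself: when more than one rule fires, each rule votes for its class with its confidence and the class with the largest total wins; a condition on a missing value is treated as not satisfied; and negative is used as the default class when no rule fires.

```python
import operator

OPS = {"==": operator.eq, ">": operator.gt, "<=": operator.le}

# (predicted class, confidence, conditions), transcribed from Rules 1-7.
RULES = [
    ("primary", 0.970, [("thyroid surgery", "==", "f"), ("TSH", ">", 6), ("TT4", "<=", 37)]),
    ("primary", 0.892, [("TSH", ">", 6), ("FTI", "<=", 65)]),
    ("compensated", 0.570, [("TSH", ">", 6)]),
    ("negative", 0.999, [("TSH", "<=", 6)]),
    ("negative", 0.997, [("on thyroxine", "==", "t"), ("FTI", ">", 65)]),
    ("negative", 0.996, [("TT4", ">", 153)]),
    ("negative", 0.968, [("thyroid surgery", "==", "t"), ("FTI", ">", 65)]),
]


def satisfied(case, condition):
    """True if the case meets the condition; missing values never match (assumption)."""
    attribute, op, value = condition
    actual = case.get(attribute)
    return actual is not None and OPS[op](actual, value)


def classify(case, default="negative"):
    """Sum the confidences of all firing rules per class and pick the largest (assumption)."""
    votes = {}
    for predicted, confidence, conditions in RULES:
        if all(satisfied(case, c) for c in conditions):
            votes[predicted] = votes.get(predicted, 0.0) + confidence
    return max(votes, key=votes.get) if votes else default


# Only the attributes that the rules test are copied from the sample assays.
assay_1 = {"thyroid surgery": "f", "on thyroxine": "t", "TSH": 0.025, "TT4": 139, "FTI": 104}
assay_2 = {"thyroid surgery": "f", "on thyroxine": "f", "TSH": 108, "TT4": 14, "FTI": 14}
assay_3 = {"thyroid surgery": "f", "on thyroxine": "f", "TSH": 9, "TT4": 117, "FTI": None}

for name, case in [("Assay 1", assay_1), ("Assay 2", assay_2), ("Assay 3", assay_3)]:
    print(name, "->", classify(case))
```

Under these assumptions the three assays come out as negative, primary, and compensated respectively, matching the diagnoses listed for them. Note that Assay 2 satisfies Rules 1, 2, and 3 at once, which is why some way of resolving conflicts between rules is needed.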
The emphasis on rule-based classifiers is only one aspect of See5/C5.0 (albeit an important one); the system offers other powerful facilities as well.
If you would like to learn more about See5/C5.0 and how to use it effectively, please see the tutorial.
Some published applications that use See5/C5.0 might also be of interest.
| © RULEQUEST RESEARCH 2007 | Last updated November 2007 |