This page should give you a feel for the kinds of results achievable with See5 (the Windows version) and C5.0 (the Linux version).
So what makes See5/C5.0 different? One short answer is its attention to the issue of comprehensibility. At RuleQuest, we believe that a data mining system should find patterns that provide insight in addition to supporting accurate predictions. In line with this approach, See5/C5.0 emphasizes rule-based classifiers because they are easier to understand -- each rule can be examined and validated separately, without having to consider it in the context of the classifier as a whole.
You'll also notice that See5/C5.0 is fast -- each of these examples requires at most a few seconds. (The times shown here are for C5.0 on an Intel Core 2 Quad Q9550 PC running Fedora Core 9.) See5/C5.0 can also generate decision trees, useful in situations where classifiers must be constructed even more quickly.
Without more ado, let's jump straight in ...
Each case describes over 120 properties of a substance, such as the number of atoms of each element that it contains, the number of atoms belonging to each periodic table family, density, crystal structure group, and so on. From 1,750 cases that are labeled pos and a further 22,891 labeled neg to indicate whether or not they are magnetic, See5/C5.0 takes 0.5 seconds to construct a theory consisting of 17 rules. Magnetic substances are described by three rules:
Rule 1: (680/1, lift 14.0)
    Family 3 > 0
    Family 3 <= 3
    Family 6 <= 0
    Group in {hP38, hR19, tI26}
    -> class pos [0.997]
Rule 2: (1186/9, lift 14.0)
    B <= 2
    Al <= 0
    Si <= 0
    Family 3 > 0
    Family 4 <= 4
    Family 5 <= 0
    Family 6 <= 0
    Family 3' <= 0
    Family 4' <= 0
    Group in {hP6, hP38, hR19, tI26, tP68, hH6, hH38, tP88, tP92}
    -> class pos [0.992]
Rule 3: (2249/499, lift 11.0)
    Group in {hP6, hP38, hR19, tI26, tP68, hH6, hH38, tP88, tP92}
    -> class pos [0.778]
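Each rule is characterized by the statistics (N/E, lift L), where N is the number of training cases covered by the rule, E (if shown) is the number of those cases that do not belong to the class predicted by the rule, and L is the estimated accuracy of the rule divided by the relative frequency of the predicted class in the training data.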
The first rule for non-magnetic substances is also interesting:
Rule 4: (2254, lift 1.1)
    Family 3 <= 0
    Family 5 > 0
    -> class neg [1.000]
This simple rule, paraphrased as "a substance containing no elements of Family 3 but at least one element of Family 5 is not magnetic", covers over two thousand training cases with no exceptions.
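The figures printed with each rule can be checked by hand. The sketch below assumes, following the description in the See5/C5.0 tutorial, that N is the number of training cases a rule covers, E is the number of those cases not belonging to the predicted class, the bracketed value is the Laplace ratio (N - E + 1) / (N + 2), and lift is that value divided by the relative frequency of the predicted class in the training data. The function names are illustrative only.

```python
def laplace_confidence(n, e=0):
    """Laplace estimate of a rule's accuracy: (N - E + 1) / (N + 2)."""
    return (n - e + 1) / (n + 2)


def lift(confidence, class_cases, total_cases):
    """Estimated accuracy divided by the predicted class's relative frequency."""
    return confidence / (class_cases / total_cases)


POS, NEG = 1750, 22891      # training cases per class, as quoted in the text
TOTAL = POS + NEG

# name -> (N, E, number of training cases in the predicted class)
rules = {
    "Rule 1": (680, 1, POS),
    "Rule 2": (1186, 9, POS),
    "Rule 3": (2249, 499, POS),
    "Rule 4": (2254, 0, NEG),
}

for name, (n, e, class_cases) in rules.items():
    conf = laplace_confidence(n, e)
    print(f"{name}: confidence {conf:.3f}, lift {lift(conf, class_cases, TOTAL):.1f}")
```

Under these assumptions the computed values agree with the confidences and lifts shown for Rules 1 to 4.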
The same rules were used to classify a further 10,000 unseen cases (that is, substances that were not available to See5/C5.0 when the theory was constructed). The result: better than 99% accuracy on these test cases!
This application demonstrates that See5/C5.0 can speedily process large, high-dimensional datasets to yield interpretable, accurate models.
The second application uses data extracted from the US Census Bureau Database and available from the UCI KDD Archive. Nearly 200,000 individuals are described by seven numeric and 33 nominal attributes, and the task is to predict whether the individual's income is above or below $50,000.
The cases are split into equal-sized training and test sets. From the 99,762 cases in the former, See5/C5.0 requires 4.3 seconds to produce a classifier consisting of 84 rules, 33 for low-income individuals and 51 for those with high income. Here are some examples:
Rule 1: (19425/1, lift 1.1)
    family members under 18 = Both parents present
    -> class - 50000 [1.000]
Rule 6: (1603/8, lift 1.1)
    education = 5th or 6th grade
    capital losses <= 1876
    -> class - 50000 [0.994]
Rule 34: (32, lift 15.7)
    dividends from stocks > 33000
    weeks worked in year > 47
    -> class 50000+ [0.971]
Rule 42: (132/12, lift 14.6)
    education = Prof school degree (MD DDS DVM LLB JD)
    dividends from stocks > 991
    weeks worked in year > 47
    -> class 50000+ [0.903]
The 84-rule classifier constructed by See5/C5.0 correctly categorizes 95% of the 99,761 unseen test cases.
The third case study uses data on telecommunications churn from the data archive at MLC++. These data are artificial but are claimed to be "based on claims similar to real world." Each case is described by 16 numeric and three nominal attributes, and the data are divided into a training set of 3,333 cases and a separate test set containing 1,667. See5/C5.0 takes only one-tenth of a second to find a 19-rule classifier from the training data, 9 rules for no churn and 10 for churn. Here again are some examples:
Rule 1: (2221/60, lift 1.1)
    international plan = no
    total day minutes <= 223.2
    number customer service calls <= 3
    -> class 0 [0.973]
Rule 5: (1972/87, lift 1.1)
    total day minutes <= 264.4
    total intl minutes <= 13.1
    total intl calls > 2
    number customer service calls <= 3
    -> class 0 [0.955]
Rule 10: (60, lift 6.8)
    international plan = yes
    total intl calls <= 2
    -> class 1 [0.984]
Rule 12: (32, lift 6.7)
    total day minutes <= 120.5
    number customer service calls > 3
    -> class 1 [0.971]
This classifier has an accuracy of 95% on the unseen test cases.
The next example uses innovative data from Nick Kushmerick. There are 3279 cases, each describing an image within an anchor tag in an HTML document. About 14% of these anchored images are banner advertisements, and the goal is to generate rules that predict whether an image is an ad. (Kushmerick's system AdEater uses this prediction to eliminate advertisement images and so speed up page downloading.)
This dataset is very high-dimensional -- there are 1558 attributes, about half the number of cases! These features include three numbers -- image height, width, and aspect ratio -- together with boolean features representing the presence or absence of phrases in the image caption, its alt tag, and the anchor, image, and base URLs. For example, the attribute ancurl*http+www has the value 1 if the URL referred to in the anchor contains http followed by www (ignoring punctuation). More than a quarter of the cases have unknown values for one or more of the attributes.
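To make the boolean phrase attributes concrete, here is a minimal sketch of how a value such as ancurl*http+www might be derived from an anchor URL. It assumes that "ignoring punctuation" means the URL is lower-cased and split into alphanumeric terms, and that the terms of the phrase must then appear next to each other; the function name phrase_feature and the sample URLs are hypothetical and are not taken from the original dataset's tooling.

```python
import re


def phrase_feature(url, phrase):
    """Return 1 if the '+'-separated terms of `phrase` occur consecutively in `url`.

    Assumption: punctuation is ignored by splitting the lower-cased URL into
    alphanumeric terms and matching the phrase against adjacent terms.
    """
    terms = re.findall(r"[a-z0-9]+", url.lower())
    wanted = phrase.lower().split("+")
    k = len(wanted)
    return int(any(terms[i:i + k] == wanted for i in range(len(terms) - k + 1)))


print(phrase_feature("http://www.example.com/click?id=7", "http+www"))  # 1
print(phrase_feature("http://ads.example.com/banner.gif", "http+www"))  # 0
print(phrase_feature("http://www.example.com/click?id=7", "click"))     # 1
```

With one such feature per phrase and per source (caption, alt tag, and each kind of URL), the attribute count grows quickly, which is consistent with the 1558 attributes reported here.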
In one second See5/C5.0 extracts 16 rules from these data, 15 describing ads and one for non-ads. Some examples are:
Rule 1: (65, lift 7.0)
    ancurl*http+www = 1
    ancurl*click = 0
    -> class ad [0.985]
Rule 4: (24, lift 6.9)
    url*ad+gif = 0
    ancurl*http+www = 0
    ancurl*ad = 1
    -> class ad [0.962]
Rule 15: (3127/313, lift 1.0)
    url*ads = 0
    -> class nonad [0.900]
On a ten-fold cross-validation, See5/C5.0's rulesets correctly classify 97% of unseen cases, once again showing that simple theories can be useful too.
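In a ten-fold cross-validation the cases are divided into ten blocks; a classifier is built from nine blocks and tested on the remaining one, so every case is classified exactly once by a classifier that never saw it. The sketch below shows only the splitting step. The shuffling scheme and the name k_fold_indices are assumptions made for illustration, not a description of how See5/C5.0 forms its blocks.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    order = list(range(n_cases))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]    # k blocks of near-equal size
    for i, test in enumerate(folds):
        # Training indices are all cases outside the held-out block.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(3279))            # 3279 cases, as in this dataset
print(len(splits), len(splits[0][1]), len(splits[-1][1]))
```

The reported accuracy would then be the proportion of held-out cases classified correctly across all ten blocks.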
The data contain 4,601 cases described by 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters in the email. See5/C5.0 needs 0.1 seconds to discover 22 rules that describe the spam email and seven rules for non-spam, e.g.:
Rule 2: (92, lift 2.5)
    word_freq_font > 0.12
    word_freq_edu <= 0.09
    char_freq_; <= 0.895
    -> class spam [0.989]
Rule 6: (53, lift 2.5)
    word_freq_remove <= 0
    word_freq_business > 0.5
    char_freq_! > 0.824
    -> class spam [0.982]
Rule 23: (737, lift 1.6)
    word_freq_000 <= 0.25
    word_freq_money <= 0.03
    word_freq_george > 0.01
    char_freq_! <= 0.378
    -> class nospam [0.999]
Rule 26: (208, lift 1.6)
    word_freq_meeting > 0.71
    -> class nospam [0.995]
The first rule above isolates 92 of the 1813 spam emails in the data on the strength of relatively frequent appearance of the word font but relatively infrequent appearance of edu or the character ";".
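For readers who want to see what such attributes look like in practice, the sketch below computes a few of them for a single message. It assumes that word frequencies are percentages of all words, that character frequencies are percentages of all characters, and that a run of upper case letters is an uninterrupted sequence of A-Z. The capital_run_length_* names are assumed (they do not appear in the text above), and the sample message is invented.

```python
import re


def spam_features(text, words=("font", "edu"), chars=(";", "!")):
    """Compute a few attributes of the kind described above for one email."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [t.lower() for t in tokens]
    features = {}
    for w in words:     # percentage of words in the email equal to w
        features[f"word_freq_{w}"] = 100.0 * lowered.count(w) / max(len(tokens), 1)
    for c in chars:     # percentage of characters in the email equal to c
        features[f"char_freq_{c}"] = 100.0 * text.count(c) / max(len(text), 1)
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]    # upper case runs
    features["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    features["capital_run_length_longest"] = max(runs, default=0)
    features["capital_run_length_total"] = sum(runs)
    return features


print(spam_features("FREE font offer! Change your font NOW!"))
```

A rule such as Rule 2 then reduces to three threshold comparisons on the resulting dictionary.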
A ten-fold cross-validation on this dataset shows a quite useful accuracy of 94% on unseen cases.
The data for this last example come from an assay screening service related to thyroid function and concern one aspect (hypothyroidism) of thyroid diagnosis. The attributes are a mixture of measured and calculated values and information obtained from the referring physician. There are four classes: negative, primary hypothyroid, secondary hypothyroid, and compensated hypothyroid. Let's show a few examples:
Attribute                   Assay 1     Assay 2               Assay 3                  .....
age                         32          63                    19
sex                         F           M                     M
on thyroxine                t           f                     f
query on thyroxine          f           f                     f
on antithyroid medication   f           f                     f
sick                        f           f                     f
pregnant                    t           N/A                   N/A
thyroid surgery             f           f                     f
I131 treatment              f           f                     f
query hypothyroid           f           f                     t
query hyperthyroid          t           f                     f
lithium                     f           f                     f
tumor                       f           f                     f
goitre                      f           f                     f
hypopituitary               f           f                     f
psych                       f           f                     f
TSH                         0.025       108                   9
T3                          3.7         0.4                   2.2
TT4                         139         14                    117
T4U                         1.34        0.98                  -
FTI                         104         14                    -
referral source             other       SVI                   other
diagnosis                   negative    primary hypothyroid   compensated hypothyroid
See5/C5.0 processes 2,772 such cases in less than one-tenth of a second, giving seven rules for three of the classes. (The cases for the fourth class were too few in number to justify any rules.)
Rule 1: (31, lift 42.7)
    thyroid surgery = f
    TSH > 6
    TT4 <= 37
    -> class primary [0.970]
Rule 2: (63/6, lift 39.3)
    TSH > 6
    FTI <= 65
    -> class primary [0.892]
Rule 3: (270/116, lift 10.3)
    TSH > 6
    -> class compensated [0.570]
Rule 4: (2225/2, lift 1.1)
    TSH <= 6
    -> class negative [0.999]
Rule 5: (296, lift 1.1)
    on thyroxine = t
    FTI > 65
    -> class negative [0.997]
Rule 6: (240, lift 1.1)
    TT4 > 153
    -> class negative [0.996]
Rule 7: (29, lift 1.1)
    thyroid surgery = t
    FTI > 65
    -> class negative [0.968]
Despite their simplicity, these rules are remarkably accurate -- their error rate on a further 1000 unseen test cases is only half of one percent!
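Because each rule stands alone, applying the ruleset takes only a few lines of code. The sketch below encodes the seven rules and classifies the three example assays shown earlier. It rests on several assumptions: every rule whose conditions are all satisfied votes for its class with a weight equal to its confidence and the class with the largest total wins (the scheme described in the See5/C5.0 tutorial); an unknown value (shown as "-" in the table, None here) satisfies no condition; and negative is used as the default class when no rule applies. The helper names are illustrative.

```python
def gt(threshold):
    return lambda value: value > threshold


def le(threshold):
    return lambda value: value <= threshold


def eq(target):
    return lambda value: value == target


# (conditions, predicted class, confidence) for Rules 1-7 above
RULES = [
    ({"thyroid surgery": eq("f"), "TSH": gt(6), "TT4": le(37)}, "primary", 0.970),
    ({"TSH": gt(6), "FTI": le(65)}, "primary", 0.892),
    ({"TSH": gt(6)}, "compensated", 0.570),
    ({"TSH": le(6)}, "negative", 0.999),
    ({"on thyroxine": eq("t"), "FTI": gt(65)}, "negative", 0.997),
    ({"TT4": gt(153)}, "negative", 0.996),
    ({"thyroid surgery": eq("t"), "FTI": gt(65)}, "negative", 0.968),
]


def classify(case, default="negative"):
    """Sum the confidences of all matching rules per class; highest total wins."""
    votes = {}
    for conditions, label, confidence in RULES:
        # Assumption: an unknown (None) value never satisfies a condition.
        if all(case.get(a) is not None and test(case[a])
               for a, test in conditions.items()):
            votes[label] = votes.get(label, 0.0) + confidence
    return max(votes, key=votes.get) if votes else default


assays = [
    {"thyroid surgery": "f", "on thyroxine": "t", "TSH": 0.025, "TT4": 139, "FTI": 104},
    {"thyroid surgery": "f", "on thyroxine": "f", "TSH": 108, "TT4": 14, "FTI": 14},
    {"thyroid surgery": "f", "on thyroxine": "f", "TSH": 9, "TT4": 117, "FTI": None},
]
for case in assays:
    print(classify(case))
```

Under these assumptions the three assays come out as negative, primary, and compensated, matching the diagnoses recorded for them in the table.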
The emphasis on rule-based classifiers is only one aspect of See5/C5.0 (albeit an important one). Other powerful facilities include:
If you would like to learn more about See5/C5.0 and how to use it effectively, please see the tutorial.
Some published applications that use See5/C5.0 might also be of interest.
| © RULEQUEST RESEARCH 2009 | Last updated November 2009 |