Sample Applications Using See5/C5.0

This page should give you a feel for the kinds of results achievable with See5 (the Windows version) and C5.0 (the Linux version).

So what makes See5/C5.0 different? One short answer is its attention to the issue of comprehensibility. At RuleQuest, we believe that a data mining system should find patterns that provide insight in addition to supporting accurate predictions. In line with this approach, See5/C5.0 emphasizes rule-based classifiers because they are easier to understand -- each rule can be examined and validated separately, without having to consider it in the context of the classifier as a whole.

You'll also notice that See5/C5.0 is fast -- each of these examples requires at most a few seconds. (The times shown here are for C5.0 on a 3.4GHz Intel Core i7 PC running CentOS 6.) See5/C5.0 can also generate decision trees, useful in situations where classifiers must be constructed even more quickly.

Without further ado, let's jump straight in ...

Profiling High Income Earners from Census Data

The first application uses data extracted from the US Census Bureau Database and available from the UCI KDD Archive. Nearly 200,000 individuals are described by seven numeric and 33 nominal attributes, and the task is to predict whether the individual's income is above or below $50,000.

The cases are split into equal-sized training and test sets. From the 99,762 cases in the former, See5/C5.0 requires 2.7 seconds to produce a classifier consisting of 84 rules, 32 for low-income individuals and 52 for those with high income.

Each rule is characterized by the statistics (N/E, lift L), where N is the number of training cases covered by the rule and E (if shown) is the number of them that do not belong to the rule's class. The rule's accuracy is estimated by the Laplace ratio (N-E+1)/(N+2), shown in square brackets after the predicted class, and the lift L is that estimate divided by the relative frequency of the predicted class in the training data.

Rules are not intended to be mutually exclusive -- each training case is covered by 4.7 rules on average. Here are some examples:

Rule 1: (19425/1, lift 1.1)
    family members under 18 = Both parents present
    ->  class - 50000  [1.000]

Rule 6: (1603/8, lift 1.1)
    education = 5th or 6th grade
    capital losses <= 1876
    ->  class - 50000  [0.994]

Rule 33: (32, lift 15.7)
    dividends from stocks > 33000
    weeks worked in year > 47
    ->  class 50000+  [0.971]

Rule 42: (132/12, lift 14.6)
    education = Prof school degree (MD DDS DVM LLB JD)
    dividends from stocks > 991
    weeks worked in year > 47
    ->  class 50000+  [0.903]

The 84-rule classifier constructed by See5/C5.0 correctly categorizes 95% of the 99,761 unseen test cases.
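The per-rule statistics can be reproduced with a few lines of arithmetic. The sketch below computes the Laplace accuracy estimate and lift for Rule 33; the 6.2% prior for the high-income class is not stated on this page but is inferred from the listed lifts, so treat it as an illustrative assumption.

```python
def laplace_accuracy(n, e):
    """Laplace-corrected accuracy estimate for a rule covering n
    training cases with e errors: (n - e + 1) / (n + 2)."""
    return (n - e + 1) / (n + 2)

def lift(n, e, class_prior):
    """Lift = estimated rule accuracy / relative frequency of the
    predicted class in the training data."""
    return laplace_accuracy(n, e) / class_prior

# Rule 33 above covers 32 cases with no errors; roughly 6.2% of the
# training cases are high earners (prior inferred from the lifts shown).
print(round(laplace_accuracy(32, 0), 3))    # 0.971, as in [0.971]
print(round(lift(32, 0, 0.062), 1))         # 15.7, as in "lift 15.7"
```

The same arithmetic checks out for Rule 1: (19425-1+1)/(19425+2) rounds to the listed 1.000.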

Assessing Churn Risk

The second case study uses data from the data archive at MLC++ that concerns telecommunications churn. These data are artificial but are claimed to be "based on claims similar to real world." Each case is described by 16 numeric and three nominal attributes and the data are divided into a training set of 3,333 cases and a separate test set containing 1,667 cases. See5/C5.0 takes only one-tenth of a second to find a 19-rule classifier from the training data, 9 rules for no churn (class 0) and 10 for churn (class 1). Here again are some examples:

Rule 1: (2221/60, lift 1.1)
    international plan = no
    total day minutes <= 223.2
    number customer service calls <= 3
    ->  class 0  [0.973]

Rule 5: (1972/87, lift 1.1)
    total day minutes <= 264.4
    total intl minutes <= 13.1
    total intl calls > 2
    number customer service calls <= 3
    ->  class 0  [0.955]

Rule 10: (60, lift 6.8)
    international plan = yes
    total intl calls <= 2
    ->  class 1  [0.984]

Rule 12: (32, lift 6.7)
    total day minutes <= 120.5
    number customer service calls > 3
    ->  class 1  [0.971]

This classifier has an accuracy of 95% on the unseen test cases.

Detecting Advertisements on the Web

The next example uses innovative data from Nick Kushmerick. There are 3279 cases, each describing an image within an anchor tag in an HTML document. About 14% of these anchored images are banner advertisements, and the goal is to generate rules that predict whether an image is an ad. (Kushmerick's system AdEater uses this prediction to eliminate advertisement images and so speed up page downloading.)

This dataset is very high-dimensional -- there are 1558 attributes, about half the number of cases! These features include three numbers -- image height, width, and aspect ratio -- together with boolean features representing the presence or absence of phrases in the image caption, its alt tag, and the anchor, image, and base URLs. For example, the attribute ancurl*http+www has the value 1 if the URL referred to in the anchor contains http followed by www (ignoring punctuation). More than a quarter of the cases have unknown values for one or more of the attributes.
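A feature such as ancurl*http+www can be derived mechanically from a URL. The sketch below is a guess at the construction, assuming punctuation is stripped and the phrase's tokens must appear consecutively; Kushmerick's actual tokenization may differ in detail.

```python
import re

def phrase_features(anchor_url, phrases):
    """Binary anchor-URL features of the form ancurl*PHRASE: 1 if the
    URL's alphanumeric tokens contain the phrase's tokens in a row
    (punctuation ignored), else 0.  Illustrative only -- the dataset's
    exact tokenization rules are not documented on this page."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", anchor_url.lower()) if t]
    feats = {}
    for phrase in phrases:
        want = phrase.split("+")
        hit = any(tokens[i:i + len(want)] == want
                  for i in range(len(tokens) - len(want) + 1))
        feats["ancurl*" + phrase] = int(hit)
    return feats

print(phrase_features("http://www.example.com/banner",
                      ["http+www", "click", "ad"]))
# {'ancurl*http+www': 1, 'ancurl*click': 0, 'ancurl*ad': 0}
```

With these values, the example URL would satisfy the conditions of Rule 1 below.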

In 0.7 seconds See5/C5.0 extracts 16 rules from these data, 15 describing ads and one for non-ads. Some examples are:

Rule 1: (65, lift 7.0)
    ancurl*http+www = 1
    ancurl*click = 0
    ->  class ad  [0.985]

Rule 4: (24, lift 6.9)
    url*ad+gif = 0
    ancurl*http+www = 0
    ancurl*ad = 1
    ->  class ad  [0.962]

Rule 16: (3127/313, lift 1.0)
    url*ads = 0
    ->  class nonad  [0.900]

On a ten-fold cross-validation, See5/C5.0's rulesets correctly classify 97% of unseen cases, showing once again that simple theories can be useful.

Identifying Spam

The next illustration has a similar flavor. These days it seems as though we are all inundated with unsolicited email ("spam") that takes time to wade through and eliminate. The goal of this application is to learn a classifier that can differentiate spam from useful mail, using data from the UCI Machine Learning Repository.

The data contain 4,601 cases described by 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper-case letters, and the total number of such letters in the email. See5/C5.0 needs 0.1 seconds to find 29 rules: 22 describing spam and seven describing non-spam, e.g.:

Rule 2: (92, lift 2.5)
    word_freq_font > 0.12
    word_freq_edu <= 0.09
    char_freq_; <= 0.895
    ->  class spam  [0.989]

Rule 6: (53, lift 2.5)
    word_freq_remove <= 0
    word_freq_business > 0.5
    char_freq_! > 0.824
    ->  class spam  [0.982]

Rule 23: (737, lift 1.6)
    word_freq_000 <= 0.25
    word_freq_money <= 0.03
    word_freq_george > 0.01
    char_freq_! <= 0.378
    ->  class nospam  [0.999]

Rule 26: (208, lift 1.6)
    word_freq_meeting > 0.71
    ->  class nospam  [0.995]

The first rule above isolates 92 of the 1813 spam emails in the data on the strength of relatively frequent appearance of the word font but relatively infrequent appearance of edu or the character ";".
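The attribute values themselves are straightforward to compute from raw text. The sketch below follows the UCI documentation's definitions (word_freq as a percentage of words, char_freq as a percentage of characters) with a simplified tokenizer, so the numbers are illustrative rather than an exact replication of the dataset.

```python
import re

def spam_features(text, words=("font", "edu", "remove"), chars=(";", "!")):
    """Sketch of the spambase attributes: word_freq_W is
    100 * occurrences of word W / total words, char_freq_C is
    100 * occurrences of character C / total characters, and the
    capital-run attributes summarize runs of upper-case letters.
    Tokenization here is simplified for illustration."""
    tokens = re.findall(r"[A-Za-z0-9$#]+", text.lower())
    feats = {}
    for w in words:
        feats["word_freq_" + w] = 100.0 * tokens.count(w) / max(len(tokens), 1)
    for c in chars:
        feats["char_freq_" + c] = 100.0 * text.count(c) / max(len(text), 1)
    runs = re.findall(r"[A-Z]+", text)
    feats["capital_run_length_longest"] = max((len(r) for r in runs), default=0)
    return feats

print(spam_features("Change your FONT now! New font offers!"))
```

For this toy message, word_freq_font is about 28.6 (2 of 7 words), which would satisfy the word_freq_font condition of Rule 2 above.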

A ten-fold cross-validation on this dataset shows a quite useful accuracy of 94% on unseen cases.

Diagnosing Hypothyroidism

The data for this last example come from an assay screening service related to thyroid function and concern one aspect (hypothyroidism) of thyroid diagnosis. The attributes are a mixture of measured and calculated values and information obtained from the referring physician. There are four classes: negative, primary hypothyroid, secondary hypothyroid, and compensated hypothyroid. Here are three example cases:

Attribute               Assay 1     Assay 2     Assay 3    .....

age                          32          63          19
sex                           F           M           M
on thyroxine                  t           f           f
query on thyroxine            f           f           f
on antithyroid medication     f           f           f
sick                          f           f           f
pregnant                      t         N/A         N/A
thyroid surgery               f           f           f
I131 treatment                f           f           f
query hypothyroid             f           f           t
query hyperthyroid            t           f           f
lithium                       f           f           f
tumor                         f           f           f
goitre                        f           f           f
hypopituitary                 f           f           f
psych                         f           f           f
TSH                       0.025         108           9
T3                          3.7          .4         2.2
TT4                         139          14         117
T4U                        1.34         .98           -
FTI                         104          14           -
referral source           other         SVI       other
diagnosis              negative     primary compensated
                                hypothyroid hypothyroid

See5/C5.0 processes 2,772 such cases in less than one-tenth of a second, giving seven rules for three of the classes. (The cases for the fourth class were too few in number to justify any rules.)

Rule 1: (31, lift 42.7)
    thyroid surgery = f
    TSH > 6
    TT4 <= 37
    ->  class primary  [0.970]

Rule 2: (63/6, lift 39.3)
    TSH > 6
    FTI <= 65
    ->  class primary  [0.892]

Rule 3: (270/116, lift 10.3)
    TSH > 6
    ->  class compensated  [0.570]

Rule 4: (2225/2, lift 1.1)
    TSH <= 6
    ->  class negative  [0.999]

Rule 5: (296, lift 1.1)
    on thyroxine = t
    FTI > 65
    ->  class negative  [0.997]

Rule 6: (240, lift 1.1)
    TT4 > 153
    ->  class negative  [0.996]

Rule 7: (29, lift 1.1)
    thyroid surgery = t
    FTI > 65
    ->  class negative  [0.968]
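Tracing the assay cases from the table above through these rules is a useful sanity check. The sketch below encodes Rules 1-7 directly and resolves conflicts by summing the confidences of the rules that fire for each class; this is a simplification of See5/C5.0's actual voting scheme, and treating the missing FTI value as failing any test on it is likewise an assumption.

```python
# Rules 1-7 above, as (condition, class, confidence) triples.
# A missing value (None) makes a condition fail -- a simplification.
def gt(v, t): return v is not None and v > t
def le(v, t): return v is not None and v <= t

RULES = [
    (lambda c: c["surgery"] == "f" and gt(c["TSH"], 6) and le(c["TT4"], 37),
     "primary", 0.970),                                          # Rule 1
    (lambda c: gt(c["TSH"], 6) and le(c["FTI"], 65), "primary", 0.892),
    (lambda c: gt(c["TSH"], 6), "compensated", 0.570),           # Rule 3
    (lambda c: le(c["TSH"], 6), "negative", 0.999),              # Rule 4
    (lambda c: c["thyroxine"] == "t" and gt(c["FTI"], 65), "negative", 0.997),
    (lambda c: gt(c["TT4"], 153), "negative", 0.996),            # Rule 6
    (lambda c: c["surgery"] == "t" and gt(c["FTI"], 65), "negative", 0.968),
]

def classify(case):
    """Sum confidences of firing rules per class; simplified voting."""
    votes = {}
    for cond, cls, conf in RULES:
        if cond(case):
            votes[cls] = votes.get(cls, 0.0) + conf
    return max(votes, key=votes.get) if votes else "negative"

assay2 = {"surgery": "f", "thyroxine": "f", "TSH": 108, "TT4": 14, "FTI": 14}
assay3 = {"surgery": "f", "thyroxine": "f", "TSH": 9, "TT4": 117, "FTI": None}
print(classify(assay2), classify(assay3))  # primary compensated
```

Assay 2 fires Rules 1, 2, and 3, and the two primary rules outweigh the compensated one; Assay 3 fires only Rule 3. Both outcomes match the diagnoses in the table.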

Despite their simplicity, these rules are remarkably accurate -- their error rate on a further 1000 unseen test cases is only half of one percent!

Now Read On ...

The emphasis on rule-based classifiers is only one aspect of See5/C5.0 (albeit an important one). Other powerful facilities include adaptive boosting, variable misclassification costs, attribute winnowing, and built-in sampling and cross-validation.

If you would like to learn more about See5/C5.0 and how to use it effectively, please see the tutorial.

Some published applications that use See5/C5.0 might also be of interest.

© RULEQUEST RESEARCH 2013 Last updated March 2013
