Sample Applications Using See5/C5.0

This page should give you a feel for the kinds of results achievable with C5.0 and See5. For brevity, we'll now refer just to See5 (Release 2.05), the Windows implementation, with the understanding that C5.0 can generate the same results.

So what makes See5 different? One short answer is its attention to the issue of comprehensibility. At RuleQuest, we believe that a data mining system should find patterns that provide insight in addition to supporting accurate predictions. In line with this approach, See5 emphasizes rule-based classifiers because they are easier to understand -- each rule can be examined and validated separately, without having to consider it in the context of the classifier as a whole.

You'll also notice that See5 is fast -- each of these examples requires only a few seconds on a 2.4GHz Core 2 Duo. (See5 can also generate decision trees, useful in situations where classifiers must be constructed even more quickly.)

Without more ado, let's jump straight in ...

Predicting Magnetic Properties of Crystals

This case study was carried out in collaboration with Dr John Rodgers of National Research Council Canada. The aim is to develop rules that predict whether a substance is magnetic or not; this would expedite the search for new synthetic compounds with desirable magnetic behavior.

Each case describes over 120 properties of a substance, such as the number of atoms of each element that it contains, the number of atoms belonging to each periodic table family, density, crystal structure group, and so on. The training data comprise 1,750 cases labeled pos (magnetic) and a further 22,891 labeled neg (not magnetic). From these, See5 takes 0.7 seconds to construct a theory consisting of 18 rules. Magnetic substances are described by six rules:

Rule 1: (367/6, lift 13.8)
    Family 6 <= 0
    Family 4' <= 0
    Group = tI26
    ->  class pos  [0.981]

Rule 2: (213/4, lift 13.8)
    Family 3 > 0
    Family 10' > 1
    Group in {hP6, tP68}
    ->  class pos  [0.977]

Rule 3: (981/27, lift 13.7)
    Family 3 > 0
    Family 6 <= 0
    Family 4' <= 0
    Group in {hP38, hR19, tI26, hH6, hH38, tP88, tP92}
    ->  class pos  [0.972]

Rule 4: (1493/52, lift 13.6)
    B <= 2
    Family 3 > 0
    Family 4 <= 4
    Family 6 <= 0
    Family 1' <= 0
    Family 3' <= 0
    Family 4' <= 0
    Group in {hP6, hP38, hR19, tI26, tP68, hH6, hH38, tP88, tP92}
    ->  class pos  [0.965]

Rule 5: (1555/59, lift 13.5)
    Si <= 0
    Ga <= 0
    In <= 0
    Tl <= 0
    Family 5 <= 0
    Family 6 <= 0
    Family 4' <= 0
    Group in {hP6, hP38, hR19, tI26, tP68, hH6, hH38, tP88, tP92}
    ->  class pos  [0.961]

Rule 6: (46/2, lift 13.2)
    B > 2
    Family 3 > 10
    Group in {hP6, tI26, tP68}
    ->  class pos  [0.938]

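Each rule is characterized by the statistics (N/E, lift L), where N is the number of training cases covered by the rule, E (if shown) is the number of those cases that do not belong to the class the rule predicts, and L is the rule's estimated accuracy divided by the relative frequency of the predicted class in the training data.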

Rules are not intended to be mutually exclusive -- on average, each pos case is covered by approximately three rules.
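
The bracketed confidence and the lift printed with each rule can be reproduced from the number of cases covered (N) and the number of exceptions (E). The following is a minimal sketch, assuming, as the See5 tutorial describes, that the confidence is the Laplace ratio (N - E + 1) / (N + 2) and that lift is this value divided by the relative frequency of the predicted class in the training data. The function names are illustrative and are not part of See5.

```python
def laplace_confidence(n_covered: int, n_errors: int = 0) -> float:
    """Laplace ratio (N - E + 1) / (N + 2) for a rule covering N cases with E errors."""
    return (n_covered - n_errors + 1) / (n_covered + 2)


def lift(confidence: float, class_cases: int, total_cases: int) -> float:
    """Confidence divided by the relative frequency of the predicted class."""
    return confidence / (class_cases / total_cases)


# Class counts quoted in the text for the magnetic-properties data.
POS_CASES, NEG_CASES = 1750, 22891
TOTAL_CASES = POS_CASES + NEG_CASES

# Rule 1: (367/6, lift 13.8) -> class pos [0.981]
rule1 = laplace_confidence(367, 6)
print(round(rule1, 3), round(lift(rule1, POS_CASES, TOTAL_CASES), 1))

# Rule 6: (46/2, lift 13.2) -> class pos [0.938]
rule6 = laplace_confidence(46, 2)
print(round(rule6, 3), round(lift(rule6, POS_CASES, TOTAL_CASES), 1))
```

Under these assumptions the computed values agree with the figures shown for Rules 1 and 6, which suggests the same calculation can be used to sanity-check the other rules on this page.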

The first rule for non-magnetic substances is also interesting:

Rule 7: (2254, lift 1.1)
    Family 3 <= 0
    Family 5 > 0
    ->  class neg  [1.000]

This simple rule, paraphrased as "a substance containing no elements of Family 3 but at least one element of Family 5 is not magnetic", covers over two thousand training cases with no exceptions.

The same rules were used to classify a further 10,000 unseen cases (that is, substances that were not available to See5 when the theory was constructed). The result -- 99.7% accuracy on these test cases!

This application demonstrates that See5 can speedily process large, high-dimensional datasets to yield interpretable, accurate models.


Profiling High Income Earners from Census Data

The second application uses data extracted from the US Census Bureau Database and available from the UCI KDD Archive. Nearly 200,000 individuals are described by seven numeric and 33 nominal attributes, and the task is to predict whether the individual's income is above or below $50,000.

The cases are split into equal-sized training and test sets. From the 99,762 cases in the former, See5 requires 5.9 seconds to produce a classifier consisting of 83 rules, 32 for low-income individuals and 51 for those with high income. Here are some examples:

Rule 1: (19425/1, lift 1.1)
    family members under 18 = Both parents present
    ->  class - 50000  [1.000]

Rule 6: (1603/8, lift 1.1)
    education = 5th or 6th grade
    capital losses <= 1876
    ->  class - 50000  [0.994]

Rule 33: (32, lift 15.7)
    dividends from stocks > 33000
    weeks worked in year > 47
    ->  class 50000+  [0.971]

Rule 42: (132/12, lift 14.6)
    education = Prof school degree (MD DDS DVM LLB JD)
    dividends from stocks > 991
    weeks worked in year > 47
    ->  class 50000+  [0.903]

The 83-rule classifier constructed by See5 correctly categorizes 95% of the 99,761 unseen test cases.


Assessing Churn Risk

The third case study uses data from the data archive at MLC++ that concerns telecommunications churn. These data are artificial but are claimed to be "based on claims similar to real world." Each case is described by 16 numeric and three nominal attributes and the data are divided into a training set of 3,333 cases and a separate test set containing 1,667. See5 takes only one-tenth of a second to find a 19-rule classifier from the training data, 7 rules for no churn and 12 for churn. Here again are some examples:

Rule 1: (2221/60, lift 1.1)
    international plan = no
    total day minutes <= 223.2
    number customer service calls <= 3
    ->  class 0  [0.973]

Rule 3: (1972/87, lift 1.1)
    total day minutes <= 264.4
    total intl minutes <= 13.1
    total intl calls > 2
    number customer service calls <= 3
    ->  class 0  [0.955]

Rule 8: (60, lift 6.8)
    international plan = yes
    total intl calls <= 2
    ->  class 1  [0.984]

Rule 11: (32, lift 6.7)
    total day minutes <= 120.5
    number customer service calls > 3
    ->  class 1  [0.971]

This classifier has an accuracy of 95% on the unseen test cases.


Detecting Advertisements on the Web

The next example uses innovative data from Nick Kushmerick. There are 3,279 cases, each describing an image within an anchor tag in an HTML document. About 14% of these anchored images are banner advertisements, and the goal is to generate rules that predict whether an image is an ad. (Kushmerick's system AdEater uses this prediction to eliminate advertisement images and so speed up page downloading.)

This dataset is very high-dimensional -- there are 1558 attributes, about half the number of cases! These features include three numbers -- image height, width, and aspect ratio -- together with boolean features representing the presence or absence of phrases in the image caption, its alt tag, and the anchor, image, and base URLs. For example, the attribute ancurl*http+www has the value 1 if the URL referred to in the anchor contains http followed by www (ignoring punctuation). More than a quarter of the cases have unknown values for one or more of the attributes.
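
To make the boolean phrase attributes concrete, here is a small sketch of how a feature such as ancurl*http+www might be computed. It is only one plausible reading of "ignoring punctuation": the text is split into lower-case alphanumeric tokens and the feature is 1 when the phrase's terms occur as consecutive tokens. The function name, the tokenization, and the sample URLs are assumptions made for illustration; the encoding actually used to build the dataset is not described here.

```python
import re

TOKEN = re.compile(r"[a-z0-9]+")


def phrase_feature(text: str, phrase: str) -> int:
    """Return 1 if the terms of `phrase` occur as consecutive tokens of `text`, else 0."""
    tokens = TOKEN.findall(text.lower())
    terms = TOKEN.findall(phrase.lower())  # "http+www" -> ["http", "www"]
    for start in range(len(tokens) - len(terms) + 1):
        if tokens[start:start + len(terms)] == terms:
            return 1
    return 0


# Hypothetical anchor URLs, used only to exercise the function.
print(phrase_feature("http://www.example.com/click?id=3", "http+www"))
print(phrase_feature("http://ads.example.com/banner", "http+www"))
```

With the first URL the tokens begin http, www, so the feature is set; in the second, http is followed by ads, so it is not.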

In 1.3 seconds See5 extracts 15 rules from these data, 14 describing ads and one for non-ads:

Rule 1: (65, lift 7.0)
    ancurl*http+www = 1
    ancurl*click = 0
    ->  class ad  [0.985]

Rule 2: (39, lift 7.0)
    ancurl*adclick = 1
    ->  class ad  [0.976]

Rule 3: (103/2, lift 6.9)
    url*ads = 0
    ancurl*click = 1
    ->  class ad  [0.971]

Rule 4: (20, lift 6.8)
    ancurl*n+a = 1
    ->  class ad  [0.955]

Rule 5: (152/6, lift 6.8)
    url*ads = 1
    ->  class ad  [0.955]

Rule 6: (16, lift 6.7)
    ancurl*plx = 1
    ->  class ad  [0.944]

Rule 7: (15, lift 6.7)
    url*doubleclick.net = 1
    ->  class ad  [0.941]

Rule 8: (14, lift 6.7)
    alt*visit+our = 1
    ->  class ad  [0.938]

Rule 9: (13, lift 6.7)
    origurl*home.netscape.com = 1
    ->  class ad  [0.933]

Rule 10: (110/8, lift 6.6)
    alt*click+here = 1
    alt*to = 0
    ->  class ad  [0.920]

Rule 11: (10, lift 6.5)
    origurl*zdnet.com = 1
    ->  class ad  [0.917]

Rule 12: (9, lift 6.5)
    origurl*jun = 1
    ->  class ad  [0.909]

Rule 13: (51/4, lift 6.5)
    ancurl*ad = 1
    ->  class ad  [0.906]

Rule 14: (8, lift 6.4)
    width > 196
    url*images+home = 1
    ->  class ad  [0.900]

Rule 15: (3127/313, lift 1.0)
    url*ads = 0
    ->  class nonad  [0.900]

On a ten-fold cross-validation, See5's rulesets correctly classify 97% of unseen cases, once again showing that simple theories can be useful too.
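
For readers unfamiliar with the procedure, ten-fold cross-validation can be sketched as follows. This is a generic illustration and not See5's own code: the cases are shuffled and dealt into ten blocks, and each block in turn is held out for testing while a classifier is built from the other nine, so every case is tested exactly once as an unseen case. The function name and the fixed random seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_cases: int, k: int = 10, seed: int = 0):
    """Yield (train, test) lists of case indices for each of k folds."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k blocks of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# The advertisement data contain 3,279 cases.
splits = list(k_fold_indices(3279))
for train, test in splits[:2]:
    print(len(train), len(test))
```

The reported accuracy is then the proportion of held-out cases classified correctly, accumulated over all ten folds.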


Identifying Spam

The next illustration has a similar flavor. These days it seems as though we are all inundated with unsolicited email ("spam") that takes time to wade through and eliminate. The goal of this application is to learn a classifier that can differentiate spam from useful mail, using data from the UCI Machine Learning Repository.

The data contain 4,601 cases described by 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters in the email. See5 needs 0.2 seconds to discover 21 rules that describe the spam email and nine rules for non-spam, e.g.:

Rule 2: (92, lift 2.5)
    word_freq_font > 0.12
    word_freq_edu <= 0.09
    char_freq_; <= 0.895
    ->  class spam  [0.989]

Rule 7: (53, lift 2.5)
    word_freq_remove <= 0
    word_freq_business > 0.5
    char_freq_! > 0.824
    ->  class spam  [0.982]

Rule 22: (737, lift 1.6)
    word_freq_000 <= 0.25
    word_freq_money <= 0.03
    word_freq_george > 0.01
    char_freq_! <= 0.378
    ->  class nospam  [0.999]

Rule 25: (208, lift 1.6)
    word_freq_meeting > 0.71
    ->  class nospam  [0.995]

The first rule above isolates 92 of the 1,813 spam emails in the data on the strength of the relatively frequent appearance of the word font but relatively infrequent appearance of edu or the character ";".
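
To make attributes such as word_freq_font and char_freq_; concrete, here is a minimal sketch. It assumes the definitions given in the UCI documentation for this dataset: a word frequency is 100 times the number of occurrences of the word divided by the total number of words, and a character frequency is 100 times the number of occurrences of the character divided by the total number of characters. The function names, the tokenization, and the sample message are illustrative only.

```python
import re


def word_freq(email: str, word: str) -> float:
    """Percentage of words in the email that match `word` (case-insensitive)."""
    words = re.findall(r"[a-z0-9]+", email.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)


def char_freq(email: str, char: str) -> float:
    """Percentage of characters in the email that match `char`."""
    if not email:
        return 0.0
    return 100.0 * email.count(char) / len(email)


def capital_runs(email: str) -> tuple:
    """Average, maximum, and total length of runs of upper-case letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", email)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


# Hypothetical message, used only to exercise the functions.
message = "FREE font offer! Click NOW"
print(word_freq(message, "font"))
print(char_freq(message, "!"))
print(capital_runs(message))
```

A rule condition such as word_freq_font > 0.12 is then a simple threshold test on the first of these values.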

A ten-fold cross-validation on this dataset shows a quite useful accuracy of 94% on unseen cases.


Diagnosing Hypothyroidism

The data for this last example come from an assay screening service related to thyroid function and concern one aspect (hypothyroidism) of thyroid diagnosis. The attributes are a mixture of measured and calculated values and information obtained from the referring physician. There are four classes: negative, primary hypothyroid, secondary hypothyroid, and compensated hypothyroid. Let's show a few examples:

Attribute               Assay 1     Assay 2     Assay 3    .....

age                          32          63          19
sex                           F           M           M
on thyroxine                  t           f           f
query on thyroxine            f           f           f
on antithyroid medication     f           f           f
sick                          f           f           f
pregnant                      t         N/A         N/A
thyroid surgery               f           f           f
I131 treatment                f           f           f
query hypothyroid             f           f           t
query hyperthyroid            t           f           f
lithium                       f           f           f
tumor                         f           f           f
goitre                        f           f           f
hypopituitary                 f           f           f
psych                         f           f           f
TSH                       0.025         108           9
T3                          3.7          .4         2.2
TT4                         139          14         117
T4U                        1.34         .98           -
FTI                         104          14           -
referral source           other         SVI       other
diagnosis              negative     primary compensated
                                   hypothyr    hypothyr

See5 processes 2,772 such cases in less than one-tenth of a second, giving seven rules for three of the classes. (The cases for the fourth class were too few in number to justify any rules.)

Rule 1: (31, lift 42.7)
    thyroid surgery = f
    TSH > 6
    TT4 <= 37
    ->  class primary  [0.970]

Rule 2: (63/6, lift 39.3)
    TSH > 6
    FTI <= 65
    ->  class primary  [0.892]

Rule 3: (270/116, lift 10.3)
    TSH > 6
    ->  class compensated  [0.570]

Rule 4: (2225/2, lift 1.1)
    TSH <= 6
    ->  class negative  [0.999]

Rule 5: (296, lift 1.1)
    on thyroxine = t
    FTI > 65
    ->  class negative  [0.997]

Rule 6: (240, lift 1.1)
    TT4 > 153
    ->  class negative  [0.996]

Rule 7: (29, lift 1.1)
    thyroid surgery = t
    FTI > 65
    ->  class negative  [0.968]

Despite their simplicity, these rules are remarkably accurate -- their error rate on a further 1000 unseen test cases is only half of one percent!
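
To see how the seven rules work together, the sketch below applies them to the three sample assays shown earlier. It makes three assumptions:

- When several rules match, each votes for its class with a weight equal to its confidence and the class with the highest total wins. This is how the See5 tutorial describes ruleset classification.
- As a simplification, a condition on an unknown value (shown as "-" in the table) is treated as not satisfied.
- A hypothetical default class of negative is used when no rule matches.

None of the names below are part of See5.

```python
from collections import defaultdict


def gt(case, attr, threshold):
    """True if the attribute is known and greater than the threshold."""
    value = case.get(attr)
    return value is not None and value > threshold


def le(case, attr, threshold):
    """True if the attribute is known and at most the threshold."""
    value = case.get(attr)
    return value is not None and value <= threshold


# (predicted class, confidence, condition) for Rules 1 to 7 above.
RULES = [
    ("primary", 0.970, lambda c: c["thyroid surgery"] == "f" and gt(c, "TSH", 6) and le(c, "TT4", 37)),
    ("primary", 0.892, lambda c: gt(c, "TSH", 6) and le(c, "FTI", 65)),
    ("compensated", 0.570, lambda c: gt(c, "TSH", 6)),
    ("negative", 0.999, lambda c: le(c, "TSH", 6)),
    ("negative", 0.997, lambda c: c["on thyroxine"] == "t" and gt(c, "FTI", 65)),
    ("negative", 0.996, lambda c: gt(c, "TT4", 153)),
    ("negative", 0.968, lambda c: c["thyroid surgery"] == "t" and gt(c, "FTI", 65)),
]


def classify(case, default="negative"):
    """Sum the confidences of all matching rules per class and return the winner."""
    votes = defaultdict(float)
    for label, confidence, condition in RULES:
        if condition(case):
            votes[label] += confidence
    return max(votes, key=votes.get) if votes else default


# Only the attributes used by the rules are copied from the sample table.
assay1 = {"on thyroxine": "t", "thyroid surgery": "f", "TSH": 0.025, "TT4": 139, "FTI": 104}
assay2 = {"on thyroxine": "f", "thyroid surgery": "f", "TSH": 108, "TT4": 14, "FTI": 14}
assay3 = {"on thyroxine": "f", "thyroid surgery": "f", "TSH": 9, "TT4": 117, "FTI": None}

for assay in (assay1, assay2, assay3):
    print(classify(assay))
```

Assay 2 shows the voting at work: Rules 1 and 2 both vote for primary and together outweigh the single vote of Rule 3 for compensated. Under these assumptions all three predictions agree with the diagnoses listed in the table.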


Now Read On ...

The emphasis on rule-based classifiers is only one aspect of See5/C5.0 (albeit an important one). Other powerful facilities include:

If you would like to learn more about See5/C5.0 and how to use it effectively, please see the tutorial.

Some published applications that use See5/C5.0 might also be of interest.

© RULEQUEST RESEARCH 2007 Last updated November 2007

