Sample Applications Using See5/C5.0
- Profiling High Income Earners from Census Data
- Assessing Churn Risk
- Detecting Advertisements on the Web
- Identifying Spam
- Diagnosing Hypothyroidism
- Now Read On ...
This page should give you a feel for the kinds of results achievable with See5 (the Windows version) and C5.0 (the Linux version).
So what makes See5/C5.0 different? One short answer is its attention to the issue of comprehensibility. At RuleQuest, we believe that a data mining system should find patterns that provide insight in addition to supporting accurate predictions. In line with this approach, See5/C5.0 emphasizes rule-based classifiers because they are easier to understand -- each rule can be examined and validated separately, without having to consider it in the context of the classifier as a whole.
You'll also notice that See5/C5.0 is fast -- each of these examples requires at most a few seconds. (The times shown here are for C5.0 on a 3.4GHz Intel Core i7 PC running CentOS 6.) See5/C5.0 can also generate decision trees, useful in situations where classifiers must be constructed even more quickly.
Without further ado, let's jump straight in ...
The first application uses data extracted from the US Census Bureau Database and available from the UCI KDD Archive. Nearly 200,000 individuals are described by seven numeric and 33 nominal attributes, and the task is to predict whether the individual's income is above or below $50,000.
The cases are split into equal-sized training and test sets. From the 99,762 cases in the former, See5/C5.0 requires 2.7 seconds to produce a classifier consisting of 84 rules, 32 for low-income individuals and 52 for those with high income.
Each rule is characterized by the statistics (N/E, lift L) where
- N is the number of training cases covered by the rule,
- E (if shown) is the number of them that do not belong to the rule's class, and
- L is the estimated accuracy of the rule (the figure in square brackets, e.g. [0.903]) divided by the prior probability of the rule's class.
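The bracketed accuracy figures are consistent with a Laplace-style estimate (N - E + 1)/(N + 2); the following minimal sketch, written under that assumption (the formula itself is not stated on this page), reproduces the figures for the rules shown in this example:

```python
def rule_accuracy(n, e=0):
    # Laplace-corrected accuracy estimate: (N - E + 1) / (N + 2).
    # For Rule 42 below, (132 - 12 + 1) / (132 + 2) ~= 0.903.
    return (n - e + 1) / (n + 2)

def rule_lift(n, e, class_prior):
    # Lift = estimated rule accuracy / prior probability of the rule's class.
    return rule_accuracy(n, e) / class_prior

print(round(rule_accuracy(132, 12), 3))   # 0.903  (Rule 42)
print(round(rule_accuracy(19425, 1), 3))  # 1.0    (Rule 1)
print(round(rule_accuracy(32), 3))        # 0.971  (Rule 33)
```

Dividing Rule 42's accuracy (0.903) by its lift (14.6) suggests a prior of roughly 6% for the high-income class, which is consistent with the class being rare in the training data.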
Rules are not intended to be mutually exclusive -- each training case is covered by 4.7 rules on average. Here are some examples:
    Rule 1: (19425/1, lift 1.1)
        family members under 18 = Both parents present
        -> class - 50000 [1.000]

    Rule 6: (1603/8, lift 1.1)
        education = 5th or 6th grade
        capital losses <= 1876
        -> class - 50000 [0.994]

    Rule 33: (32, lift 15.7)
        dividends from stocks > 33000
        weeks worked in year > 47
        -> class 50000+ [0.971]

    Rule 42: (132/12, lift 14.6)
        education = Prof school degree (MD DDS DVM LLB JD)
        dividends from stocks > 991
        weeks worked in year > 47
        -> class 50000+ [0.903]

The 84-rule classifier constructed by See5/C5.0 correctly categorizes 95% of the 99,761 unseen test cases.
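Because the rules are not mutually exclusive, several rules can fire on the same case and their conclusions may conflict. One common resolution scheme, sketched below with a hypothetical three-rule set, lets each matching rule vote with its confidence; this is an illustration of the idea, not See5/C5.0's exact procedure:

```python
from collections import defaultdict

# Hypothetical mini-ruleset: (condition, class, confidence).
# Attribute names echo the census example but the rules are made up.
rules = [
    (lambda c: c["weeks_worked"] > 47 and c["dividends"] > 33000, "50000+", 0.971),
    (lambda c: c["education"] == "5th or 6th grade", "- 50000", 0.994),
    (lambda c: c["capital_losses"] <= 1876, "- 50000", 0.60),
]

def classify(case, rules, default="- 50000"):
    votes = defaultdict(float)
    for cond, cls, conf in rules:
        if cond(case):
            votes[cls] += conf  # each matching rule votes with its confidence
    # Highest total wins; fall back to a default class if nothing fires.
    return max(votes, key=votes.get) if votes else default

case = {"weeks_worked": 52, "dividends": 40000,
        "education": "Bachelors", "capital_losses": 0}
print(classify(case, rules))  # 50000+
```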
The second case study uses data from the data archive at MLC++ that concerns telecommunications churn. These data are artificial but are claimed to be "based on claims similar to real world." Each case is described by 16 numeric and three nominal attributes and the data are divided into a training set of 3,333 cases and a separate test set containing 1,667 cases. See5/C5.0 takes only one-tenth of a second to find a 19-rule classifier from the training data, 9 rules for no churn (class 0) and 10 for churn (class 1). Here again are some examples:
    Rule 1: (2221/60, lift 1.1)
        international plan = no
        total day minutes <= 223.2
        number customer service calls <= 3
        -> class 0 [0.973]

    Rule 5: (1972/87, lift 1.1)
        total day minutes <= 264.4
        total intl minutes <= 13.1
        total intl calls > 2
        number customer service calls <= 3
        -> class 0 [0.955]

    Rule 10: (60, lift 6.8)
        international plan = yes
        total intl calls <= 2
        -> class 1 [0.984]

    Rule 12: (32, lift 6.7)
        total day minutes <= 120.5
        number customer service calls > 3
        -> class 1 [0.971]
This classifier has an accuracy of 95% on the unseen test cases.
The next example uses innovative data from Nick Kushmerick. There are 3279 cases, each describing an image within an anchor tag in an HTML document. About 14% of these anchored images are banner advertisements, and the goal is to generate rules that predict whether an image is an ad. (Kushmerick's system AdEater uses this prediction to eliminate advertisement images and so speed up page downloading.)
This dataset is very high-dimensional -- there are 1558 attributes, about half the number of cases! These features include three numbers -- image height, width, and aspect ratio -- together with boolean features representing the presence or absence of phrases in the image caption, its alt tag, and the anchor, image, and base URLs. For example, the attribute ancurl*http+www has the value 1 if the URL referred to in the anchor contains http followed by www (ignoring punctuation). More than a quarter of the cases have unknown values for one or more of the attributes.
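A feature such as ancurl*http+www can be reconstructed roughly as follows; the function name and the exact tokenization are illustrative assumptions, since the page only describes the feature informally:

```python
import re

def phrase_feature(url, *tokens):
    # 1 if the tokens appear in order in the URL, ignoring punctuation;
    # a rough reconstruction of boolean features such as ancurl*http+www.
    stripped = re.sub(r"[^a-z0-9]", "", url.lower())
    pos = 0
    for tok in tokens:
        pos = stripped.find(tok, pos)
        if pos < 0:
            return 0
        pos += len(tok)
    return 1

print(phrase_feature("http://www.example.com/banner.gif", "http", "www"))  # 1
print(phrase_feature("/local/images/logo.png", "http", "www"))             # 0
```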
In 0.7 seconds See5/C5.0 extracts 16 rules from these data, 15 describing ads and one for non-ads. Some examples are:
    Rule 1: (65, lift 7.0)
        ancurl*http+www = 1
        ancurl*click = 0
        -> class ad [0.985]

    Rule 4: (24, lift 6.9)
        url*ad+gif = 0
        ancurl*http+www = 0
        ancurl*ad = 1
        -> class ad [0.962]

    Rule 16: (3127/313, lift 1.0)
        url*ads = 0
        -> class nonad [0.900]

On a ten-fold cross-validation, See5/C5.0's rulesets correctly classify 97% of unseen cases, once again showing that simple theories can be useful too.
The data for the next application, identifying spam email, come from the UCI Machine Learning Repository.
The data contain 4,601 cases described by 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters in the email. See5/C5.0 needs 0.1 seconds to discover 22 rules that describe the spam email and seven rules for non-spam, e.g.:
    Rule 2: (92, lift 2.5)
        word_freq_font > 0.12
        word_freq_edu <= 0.09
        char_freq_; <= 0.895
        -> class spam [0.989]

    Rule 6: (53, lift 2.5)
        word_freq_remove <= 0
        word_freq_business > 0.5
        char_freq_! > 0.824
        -> class spam [0.982]

    Rule 23: (737, lift 1.6)
        word_freq_000 <= 0.25
        word_freq_money <= 0.03
        word_freq_george > 0.01
        char_freq_! <= 0.378
        -> class nospam [0.999]

    Rule 26: (208, lift 1.6)
        word_freq_meeting > 0.71
        -> class nospam [0.995]
The first rule above isolates 92 of the 1813 spam emails in the data on the strength of relatively frequent appearance of the word font but relatively infrequent appearance of edu or the character ";".
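The word_freq and char_freq attributes are percentages: the share of words (or characters) in a message that match a given token. A rough reconstruction, assuming that definition (the helper names are illustrative):

```python
import re

def word_freq(text, word):
    # Percentage of words in the message equal to `word`.
    words = re.findall(r"[A-Za-z]+", text.lower())
    return 100.0 * words.count(word.lower()) / len(words) if words else 0.0

def char_freq(text, ch):
    # Percentage of characters in the message equal to `ch`.
    return 100.0 * text.count(ch) / len(text) if text else 0.0

msg = "Send money now! Free money!"
print(word_freq(msg, "money"))  # 40.0 (2 of 5 words)
```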
A ten-fold cross-validation on this dataset shows a quite useful accuracy of 94% on unseen cases.
The data for this last example come from an assay screening service related to thyroid function and concern one aspect (hypothyroidism) of thyroid diagnosis. The attributes are a mixture of measured and calculated values and information obtained from the referring physician. There are four classes: negative, primary hypothyroid, secondary hypothyroid, and compensated hypothyroid. Let's show a few examples:
    Attribute                    Assay 1    Assay 2     Assay 3     .....
    age                          32         63          19
    sex                          F          M           M
    on thyroxine                 t          f           f
    query on thyroxine           f          f           f
    on antithyroid medication    f          f           f
    sick                         f          f           f
    pregnant                     t          N/A         N/A
    thyroid surgery              f          f           f
    I131 treatment               f          f           f
    query hypothyroid            f          f           t
    query hyperthyroid           t          f           f
    lithium                      f          f           f
    tumor                        f          f           f
    goitre                       f          f           f
    hypopituitary                f          f           f
    psych                        f          f           f
    TSH                          0.025      108         9
    T3                           3.7        .4          2.2
    TT4                          139        14          117
    T4U                          1.34       .98         -
    FTI                          104        14          -
    referral source              other      SVI         other
    diagnosis                    negative   primary     compensated
                                            hypothyr    hypothyr
See5/C5.0 processes 2,772 such cases in less than one-tenth of a second, giving seven rules for three of the classes. (The cases for the fourth class were too few in number to justify any rules.)
    Rule 1: (31, lift 42.7)
        thyroid surgery = f
        TSH > 6
        TT4 <= 37
        -> class primary [0.970]

    Rule 2: (63/6, lift 39.3)
        TSH > 6
        FTI <= 65
        -> class primary [0.892]

    Rule 3: (270/116, lift 10.3)
        TSH > 6
        -> class compensated [0.570]

    Rule 4: (2225/2, lift 1.1)
        TSH <= 6
        -> class negative [0.999]

    Rule 5: (296, lift 1.1)
        on thyroxine = t
        FTI > 65
        -> class negative [0.997]

    Rule 6: (240, lift 1.1)
        TT4 > 153
        -> class negative [0.996]

    Rule 7: (29, lift 1.1)
        thyroid surgery = t
        FTI > 65
        -> class negative [0.968]
Despite their simplicity, these rules are remarkably accurate -- their error rate on a further 1000 unseen test cases is only half of one percent!
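The seven rules are simple enough to transcribe directly. The sketch below tries them in descending order of confidence and takes the first match; this first-match scheme is a simplifying assumption that approximates, but is not identical to, the confidence-weighted voting See5/C5.0 applies when rules overlap:

```python
def classify_thyroid(case):
    # Rules tried in descending order of confidence; first match wins.
    tsh, tt4 = case["TSH"], case["TT4"]
    fti = case.get("FTI")  # may be missing, as for Assay 3 above
    if tsh <= 6:
        return "negative"                      # Rule 4 [0.999]
    if case.get("on thyroxine") == "t" and fti is not None and fti > 65:
        return "negative"                      # Rule 5 [0.997]
    if tt4 > 153:
        return "negative"                      # Rule 6 [0.996]
    if case.get("thyroid surgery") == "f" and tt4 <= 37:
        return "primary"                       # Rule 1 [0.970]
    if case.get("thyroid surgery") == "t" and fti is not None and fti > 65:
        return "negative"                      # Rule 7 [0.968]
    if fti is not None and fti <= 65:
        return "primary"                       # Rule 2 [0.892]
    return "compensated"                       # Rule 3 [0.570]

# Assay 3 from the table: TSH 9, TT4 117, FTI missing
print(classify_thyroid({"TSH": 9, "TT4": 117,
                        "thyroid surgery": "f"}))  # compensated
```

Running the three assays from the table through this transcription reproduces the listed diagnoses.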
The emphasis on rule-based classifiers is only one aspect of See5/C5.0 (albeit an important one). Other powerful facilities include:
- boosting, a technique for constructing multiple classifiers to improve predictive accuracy;
- differential misclassification costs, allowing some mistakes to be identified as more important than others;
- case weights, when an application needs to specify the importance of each case;
- winnowing, which ignores less relevant attributes and estimates the relative importance of those remaining; and
- support for cross-validation trials and sampling.
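A k-fold cross-validation, as used in the advertisement and spam examples above, holds out each case exactly once so that accuracy is always measured on unseen data. A minimal sketch of the fold construction follows; the function name and interface are illustrative, not See5/C5.0's:

```python
import random

def cross_validation_folds(cases, k=10, seed=1):
    # Shuffle once, then deal cases into k roughly equal folds.
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        # Fold i is held out for testing; the rest form the training set,
        # so every case is used as unseen test data exactly once.
        test = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, test

for train, test in cross_validation_folds(range(100)):
    assert len(train) + len(test) == 100
```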
If you would like to learn more about See5/C5.0 and how to use it effectively, please see the tutorial.
Some published applications that use See5/C5.0 might also be of interest.
© RuleQuest Research 2013 | Last updated March 2013