Is See5/C5.0 Better Than C4.5?

C4.5 is a widely-used free data mining tool that is descended from an earlier system called ID3 and is followed in turn by See5/C5.0. To demonstrate the advances in this new generation, we will compare See5/C5.0 Release 2.09 with C4.5 Release 8 using three sizable datasets:

Since C4.5 is a Unix-based system, results for the Linux version C5.0 are presented to facilitate comparison. Both were compiled using the Intel C compiler with the same optimization settings. Times are elapsed seconds for an unloaded Intel Core i7 (3.4GHz) with 4GB of RAM running 64-bit Fedora Core.

So, let's see how C5.0 stacks up against C4.5.

Rulesets: often more accurate, much faster, and much less memory

Results for sleep Results for income Results for forest
sleep income forest

These graphs show the accuracy on unseen test cases, number of rules produced, and construction time for the three datasets. Results for C5.0 are shown in blue.

Both C4.5 and C5.0 can produce classifiers expressed either as decision trees or rulesets. In many applications, rulesets are preferred because they are simpler and easier to understand than decision trees, but C4.5's ruleset methods are slow and memory-hungry. C5.0 embodies new algorithms for generating rulesets, and the improvement is substantial.

Decision trees: faster, smaller

Results for sleep Results for income Results for forest
sleep income forest

For all three datasets, C4.5 and C5.0 produce trees with similar predictive accuracies (although C5.0's is marginally better for the sleep application). The major differences are the tree sizes and computation times; C5.0's trees are noticeably smaller and C5.0 is faster by factors of 8, 3, and 22 respectively.

Boosting: !!!

Error rates on unseen test cases
Results for trees Results for rules
boosted decision trees boosted rulesets

Based on the research of Freund and Schapire, this is an exciting new development that has no counterpart in C4.5. Boosting is a technique for generating and combining multiple classifiers to improve predictive accuracy.

The graphs above show what happens in 10-trial boosting where ten separate decision trees or rulesets are combined to make predictions. The error rate on unseen cases is reduced for all three datasets, substantially so in the case of forest for which the error rate of boosted classifiers is about half that of the corresponding C4.5 classifier. Unfortunately, boosting doesn't always help -- when the training cases are noisy, boosting can actually reduce classification accuracy. C5.0 uses a novel variant of boosting that is less affected by noise, thereby partly overcoming this limitation.

C5.0 supports boosting with any number of trials. Naturally, it takes longer to produce boosted classifiers, but the results can justify the additional computation! Boosting should always be tried when peak predictive accuracy is required, especially when unboosted classifiers are already quite accurate.

New functionality

C5.0 incorporates several new facilities such as variable misclassification costs. In C4.5, all errors are treated as equal, but in practical applications some classification errors are more serious than others. C5.0 allows a separate cost to be defined for each predicted/actual class pair; if this option is used, C5.0 then constructs classifiers to minimize expected misclassification costs rather than error rates.

The cases themselves may also be of unequal importance. In an application that classifies individuals as likely or not likely to "churn," for example, the importance of each case may vary with the size of the account. C5.0 has provision for a case weight attribute that quantifies the importance of each case; if this appears, C5.0 attempts to minimize the weighted predictive error rate.

C5.0 has several new data types in addition to those available in C4.5, including dates, times, timestamps, ordered discrete attributes, and case labels. In addition to missing values, C5.0 allows values to be noted as not applicable. Further, C5.0 provides facilities for defining new attributes as functions of other attributes.

Some recent data mining applications are characterized by very high dimensionality, with hundreds or even thousands of attributes. C5.0 can automatically winnow the attributes before a classifier is constructed, discarding those that appear to be only marginally relevant. For high-dimensional applications, winnowing can lead to smaller classifiers and higher predictive accuracy, and can often reduce the time required to generate rulesets.

C5.0 is also easier to use. Options have been simplified and extended -- to support sampling and cross-validation, for instance -- and C4.5's programs for generating decision trees and rulesets have been merged into a single program.

The Windows version, See5, has a user-friendly graphic interface and adds a couple of interesting facilities. For example, the cross-reference window makes classifiers more understandable by linking cases to relevant parts of the classifier.

RuleQuest provides free source code for reading and interpreting See5/C5.0 classifiers. After the classifiers have been generated by See5/C5.0, this code allows you to access them from other programs and to deploy them in your own applications.

For more information on See5/C5.0, please see the tutorial.

© RULEQUEST RESEARCH 2012 Last updated February 2012

home products licensing download contact us