Is See5/C5.0 Better Than C4.5?

C4.5 is a widely used free data mining tool, descended from the earlier ID3 system and succeeded in turn by See5/C5.0. To demonstrate the advances in this new generation, we will compare C4.5 Release 8 with C5.0 Release 2.07 GPL Edition; free source code for both is available for download. The commercial version of See5/C5.0 2.07 is multithreaded and therefore faster, but otherwise generates exactly the same classifiers. (Later releases of See5/C5.0, however, introduce further improvements.)

The comparison uses three sizable datasets: sleep, income, and forest.

Both C4.5 and C5.0 were compiled using the Intel C compiler with the same optimization settings. Times are given in seconds for an unloaded Intel Core i7-6700 (3.4GHz) desktop with 16GB of RAM running 64-bit Ubuntu 16.04.

So, let's see how C5.0 stacks up against C4.5.


Rulesets: often more accurate, much faster, and much less memory-hungry

The following tables show the error rate on unseen test cases, the number of rules produced, and the construction time for each of the three datasets.

sleep      error rate    rules    time (secs)
C4.5          26.9%        710         3,234
C5.0          26.0%        830            12

income     error rate    rules    time (secs)
C4.5           5.0%        190         2,782
C5.0           5.0%         84             4

forest     error rate    rules    time (secs)
C4.5           6.8%      5,316        30,115
C5.0           5.9%      4,269           170

Both C4.5 and C5.0 can produce classifiers expressed either as decision trees or rulesets. In many applications, rulesets are preferred because they are simpler and easier to understand than decision trees, but C4.5's ruleset methods are slow and memory-hungry. C5.0 embodies new algorithms for generating rulesets, and the improvement is substantial.
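
To make the ruleset form concrete, here is a minimal Python sketch of how such a classifier assigns a class: every rule whose conditions a case satisfies votes for its class with its confidence, and a default class covers cases no rule matches. The rules, confidences, and attribute names below are invented for illustration, and the voting scheme is a simplification of C5.0's actual conflict-resolution strategy.

    from collections import defaultdict

    # Each rule: a list of (attribute, test) conditions, a predicted
    # class, and a confidence. These rules are invented for illustration.
    rules = [
        ([("petal_len", lambda v: v <= 1.9)], "setosa", 0.98),
        ([("petal_len", lambda v: v > 1.9),
          ("petal_wid", lambda v: v <= 1.7)], "versicolor", 0.92),
        ([("petal_wid", lambda v: v > 1.7)], "virginica", 0.95),
    ]
    default_class = "versicolor"   # used when no rule fires

    def classify(case):
        """Every satisfied rule votes for its class with its confidence."""
        votes = defaultdict(float)
        for conditions, cls, conf in rules:
            if all(test(case[attr]) for attr, test in conditions):
                votes[cls] += conf
        return max(votes, key=votes.get) if votes else default_class

    print(classify({"petal_len": 4.7, "petal_wid": 1.4}))   # -> versicolor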


Decision trees: faster, smaller

The following tables show the predictive accuracies, numbers of leaves, and construction times for our datasets:

sleep      error rate    leaves    time (secs)
C4.5          27.7%       3,546             3
C5.0          27.0%       2,160             1

income     error rate    leaves    time (secs)
C4.5           5.0%         264             5
C5.0           4.9%         122             1

forest     error rate    leaves    time (secs)
C4.5           6.1%      10,169            62
C5.0           6.1%       9,185             4

C4.5 and C5.0 produce trees with similar predictive accuracy on the forest dataset, but C5.0 is somewhat better for the other two applications. The major differences are in tree size and computation time: C5.0's trees are noticeably smaller, and C5.0 is faster by factors of roughly 3, 5, and 15 respectively.


Boosting: substantially lower error rates

Boosting, based on the research of Freund and Schapire, is an exciting new development that has no counterpart in C4.5: it is a technique for generating and combining multiple classifiers to improve predictive accuracy.

                        sleep    income    forest
C5.0 trees              27.0%      5.0%      6.1%
boosted C5.0 trees      24.6%      4.6%      3.4%
C5.0 rules              26.0%      5.0%      5.9%
boosted C5.0 rules      24.5%      4.5%      3.4%

The table above shows the error rates of C5.0 on the test cases before and after 10-trial boosting, in which ten separate decision trees or rulesets are combined to make predictions. The error rate is reduced for all three datasets, substantially so in the case of forest, for which the error rate of the boosted classifiers is about half that of the corresponding C4.5 classifiers. Unfortunately, boosting doesn't always help -- when the training cases are noisy, boosting can actually reduce classification accuracy. C5.0 uses a novel variant of boosting that is less affected by noise, thereby partly overcoming this limitation.

C5.0 supports boosting with any number of trials, with more trials generally yielding further improvements. Naturally, it takes longer to produce boosted classifiers, but the results can justify the additional computation! Boosting should always be tried when peak predictive accuracy is required, especially when unboosted classifiers are already quite accurate.
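
As an illustration of the underlying idea, here is a minimal Python sketch of the classic Freund and Schapire scheme (AdaBoost.M1): each trial fits a classifier, then re-weights the training cases so the next trial concentrates on those just misclassified. This is the original algorithm, not C5.0's noise-tolerant variant, and scikit-learn's DecisionTreeClassifier (with an arbitrary depth limit) stands in for C5.0's tree builder.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def boost(X, y, trials=10):
        n = len(y)
        w = np.full(n, 1.0 / n)            # start with uniform case weights
        ensemble = []
        for _ in range(trials):
            tree = DecisionTreeClassifier(max_depth=3).fit(X, y, sample_weight=w)
            miss = tree.predict(X) != y
            err = w[miss].sum()            # weighted error of this trial
            if err == 0 or err >= 0.5:     # AdaBoost.M1 stopping conditions
                break
            alpha = 0.5 * np.log((1 - err) / err)   # this trial's voting weight
            ensemble.append((alpha, tree))
            # up-weight misclassified cases, down-weight the rest, renormalize
            w *= np.exp(np.where(miss, alpha, -alpha))
            w /= w.sum()
        return ensemble

    def predict(ensemble, X, classes):
        """Combine the trials by weighted voting."""
        votes = np.zeros((len(X), len(classes)))
        for alpha, tree in ensemble:
            pred = tree.predict(X)
            for j, c in enumerate(classes):
                votes[:, j] += alpha * (pred == c)
        return np.array(classes)[votes.argmax(axis=1)]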


New functionality

C5.0 incorporates several new facilities such as variable misclassification costs. In C4.5, all errors are treated as equal, but in practical applications some classification errors are more serious than others. C5.0 allows a separate cost to be defined for each predicted/actual class pair; if this option is used, C5.0 then constructs classifiers to minimize expected misclassification costs rather than error rates.
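
The decision rule this implies can be sketched in a few lines of Python: given a classifier's class-probability estimates for a case and a cost matrix, predict the class with the lowest expected cost rather than the highest probability. The cost matrix and class names below are invented for illustration.

    # costs[predicted][actual]: the penalty for each predicted/actual pair
    costs = {
        "approve": {"approve": 0, "reject": 10},   # a bad approval is costly
        "reject":  {"approve": 1, "reject": 0},
    }

    def min_cost_class(probs):
        """probs maps each actual class to its estimated probability."""
        expected = {
            pred: sum(p * costs[pred][actual] for actual, p in probs.items())
            for pred in costs
        }
        return min(expected, key=expected.get)

    # With P(reject) only 0.2, plain accuracy would say "approve",
    # but the asymmetric costs flip the decision:
    print(min_cost_class({"approve": 0.8, "reject": 0.2}))   # -> reject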

The cases themselves may also be of unequal importance. In an application that classifies individuals as likely or not likely to "churn," for example, the importance of each case may vary with the size of the account. C5.0 has provision for a case weight attribute that quantifies the importance of each case; if this appears, C5.0 attempts to minimize the weighted predictive error rate.
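
A minimal sketch of the quantity being minimized: each misclassified case contributes its weight rather than counting once. The data and weights below are invented for illustration.

    def weighted_error(actual, predicted, weights):
        wrong = sum(w for a, p, w in zip(actual, predicted, weights) if a != p)
        return wrong / sum(weights)

    actual    = ["churn", "stay", "stay",  "churn"]
    predicted = ["churn", "stay", "churn", "stay"]
    weights   = [5.0, 1.0, 1.0, 10.0]     # e.g. proportional to account size

    # Two of four cases are wrong (plain error rate 50%), but they carry
    # 11 of the 17 units of weight:
    print(weighted_error(actual, predicted, weights))   # 11/17, about 0.65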

C5.0 has several new data types in addition to those available in C4.5, including dates, times, timestamps, ordered discrete attributes, and case labels. In addition to missing values, C5.0 allows values to be noted as not applicable. Further, C5.0 provides facilities for defining new attributes as functions of other attributes.
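
Two of these ideas can be sketched independently of C5.0's own file syntax: treating "not applicable" as a legitimate value distinct from "unknown", and defining a new attribute as a function of existing ones. The representation below is invented for illustration.

    MISSING = None        # the value exists but was not recorded
    NA = "N/A"            # the attribute does not apply to this case

    case = {"income": 52000, "spouse_income": NA, "age": MISSING}

    def debt_ratio(case, debt=13000):
        """A defined attribute computed from existing attributes."""
        if case["income"] in (MISSING, NA):
            return MISSING
        return debt / case["income"]

    print(debt_ratio(case))   # 0.25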

Some recent data mining applications are characterized by very high dimensionality, with hundreds or even thousands of attributes. C5.0 can automatically winnow the attributes before a classifier is constructed, discarding those that appear to be only marginally relevant. For high-dimensional applications, winnowing can lead to smaller classifiers and higher predictive accuracy, and can often reduce the time required to generate rulesets.
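
In spirit, winnowing is a pre-screening step: score each attribute on the training data and discard those that appear only marginally relevant before the classifier is built. The gain-based screen below is a generic Python stand-in, not C5.0's actual winnowing procedure, and the threshold is arbitrary.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(values, labels):
        """Entropy drop from partitioning cases by a discrete attribute."""
        base, n = entropy(labels), len(labels)
        by_value = {}
        for v, y in zip(values, labels):
            by_value.setdefault(v, []).append(y)
        return base - sum(len(ys) / n * entropy(ys) for ys in by_value.values())

    def winnow(cases, labels, attributes, threshold=0.01):
        """Keep only attributes whose gain clears the threshold."""
        return [a for a in attributes
                if info_gain([c[a] for c in cases], labels) >= threshold]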

C5.0 is also easier to use. Options have been simplified and extended -- to support sampling and cross-validation, for instance -- and C4.5's programs for generating decision trees and rulesets have been merged into a single program.

RuleQuest provides free source code for reading and interpreting See5/C5.0 classifiers. After the classifiers have been generated by See5/C5.0, this code allows you to access them from other programs and to deploy them in your own applications.

For more information on See5/C5.0, please see the tutorial.


