Is See5/C5.0 Better Than C4.5?
C4.5 is a widely-used free data mining tool that is descended from an earlier system called ID3 and is followed in turn by See5/C5.0. To demonstrate the advances in this new generation, we will compare C4.5 Release 8 with C5.0 Release 2.07 GPL Edition; free source code for both can be downloaded from the links above. The commercial version of See5/C5.0 2.07 is faster, being multithreaded, but otherwise generates exactly the same classifiers. (Later releases of See5/C5.0, however, introduce some improvements.)
The comparison uses three sizable datasets:
- Sleep stage scoring data (sleep, 105,908 cases). Every case in this monitoring application is described by six numeric-valued attributes and belongs to one of six classes. C5.0 and C4.5 use 52,954 cases to construct classifiers that are tested on the remaining 52,954 cases. The data are available from here.
- Census income data (income, 199,523 cases). The goal of this application is to predict whether a person's income is above or below $50,000 using seven numeric and 33 discrete (nominal) attributes. The data are divided into a training set of 99,762 cases and a test set of 99,761. They can be obtained from the UCI KDD Archive.
- Forest cover type data (forest, 581,012 cases); also from UCI. This application has seven classes (possible types of forest cover), and the cases are described in terms of 12 numeric and two multi-valued discrete attributes. As before, half of the data -- 290,506 cases -- is used for training and the remainder for testing the learned classifiers.
So, let's see how C5.0 stacks up against C4.5.
Rulesets: often more accurate, much faster, and much less memory
The following table shows the error rate on unseen test cases, number of rules produced, and construction time for the three datasets. Results for C5.0 are shown in blue.
Both C4.5 and C5.0 can produce classifiers expressed either as decision trees or rulesets. In many applications, rulesets are preferred because they are simpler and easier to understand than decision trees, but C4.5's ruleset methods are slow and memory-hungry. C5.0 embodies new algorithms for generating rulesets, and the improvement is substantial.
- Accuracy: The C5.0 rulesets have noticeably lower error rates on unseen cases for the sleep and forest datasets. The C4.5 and C5.0 rulesets have the same predictive accuracy for the income dataset, but the C5.0 ruleset is smaller.
- Speed: C5.0 is much faster; it uses different algorithms and is highly optimized. For instance, C4.5 required more than eight hours to find the ruleset for forest, but C5.0 completed the task in under three minutes.
- Memory: C5.0 commonly uses an order of magnitude less memory than C4.5 during ruleset construction. For the forest dataset, C4.5 needs more than 3GB but C5.0 requires less than 200MB.
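The forest figures quoted above imply lower bounds on the improvement. The following quick check uses only the bounds stated in the text and assumes 1GB = 1024MB; the variable names are ours.

```python
# Lower bounds implied by the figures quoted for the forest dataset.
c45_minutes_min = 8 * 60       # C4.5 took more than eight hours
c50_minutes_max = 3            # C5.0 took under three minutes
speedup_lower_bound = c45_minutes_min / c50_minutes_max

c45_memory_mb_min = 3 * 1024   # C4.5 needed more than 3GB (assumption: 1GB = 1024MB)
c50_memory_mb_max = 200        # C5.0 needed less than 200MB
memory_ratio_lower_bound = c45_memory_mb_min / c50_memory_mb_max

print(f"speedup is more than {speedup_lower_bound:.0f}x")
print(f"memory ratio is more than {memory_ratio_lower_bound:.2f}x")
```

On these bounds the ruleset for forest is built at least 160 times faster and with roughly a fifteenth of the memory.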
Decision trees: faster, smaller
The following table shows the predictive accuracies, numbers of leaves, and construction times for our datasets:
C4.5 and C5.0 produce trees with similar predictive accuracies for the forest dataset but C5.0 is somewhat better for the other applications. The major differences are the tree sizes and computation times; C5.0's trees are noticeably smaller and C5.0 is faster by factors of 3, 5, and 15 respectively.
Boosting
Based on the research of Freund and Schapire, boosting is an exciting new development that has no counterpart in C4.5. It is a technique for generating and combining multiple classifiers to improve predictive accuracy.
[Table not recovered: test error rates before and after boosting; surviving row labels are "boosted C5.0 trees" and "boosted C5.0 rules"]
The table above shows the error rates of C5.0 on the test cases before and after 10-trial boosting, where ten separate decision trees or rulesets are combined to make predictions. The error rate is reduced for all three datasets, substantially so in the case of forest for which the error rate of boosted classifiers is about half that of the corresponding C4.5 classifier. Unfortunately, boosting doesn't always help -- when the training cases are noisy, boosting can actually reduce classification accuracy. C5.0 uses a novel variant of boosting that is less affected by noise, thereby partly overcoming this limitation.
C5.0 supports boosting with any number of trials, with more trials generally yielding further improvements. Naturally, it takes longer to produce boosted classifiers, but the results can justify the additional computation! Boosting should always be tried when peak predictive accuracy is required, especially when unboosted classifiers are already quite accurate.
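To give a feel for how boosting combines classifiers, here is a minimal sketch of the textbook AdaBoost scheme of Freund and Schapire for two classes labelled +1 and -1. It is not C5.0's own variant, which is not described here. The one-attribute threshold classifier (a "stump"), the toy data, and all function names are assumptions made for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    # Predict `polarity` at or below the threshold, the opposite class above it.
    return polarity if x <= threshold else -polarity

def train_stump(xs, ys, weights):
    # Exhaustively pick the (threshold, polarity) with the lowest weighted error.
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def boost(xs, ys, trials):
    n = len(xs)
    weights = [1.0 / n] * n            # start with equal case weights
    ensemble = []
    for _ in range(trials):
        err, threshold, polarity = train_stump(xs, ys, weights)
        if err >= 0.5:                 # no better than guessing: stop early
            break
        err = max(err, 1e-10)          # guard against division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)   # vote of this classifier
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified cases, lower it for the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    # Weighted vote of all classifiers in the ensemble.
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

def training_errors(ensemble, xs, ys):
    return sum(predict(ensemble, x) != y for x, y in zip(xs, ys))

# Toy data (hypothetical): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single = boost(xs, ys, trials=1)
boosted = boost(xs, ys, trials=3)
print(training_errors(single, xs, ys), training_errors(boosted, xs, ys))
```

On this toy data the single stump misclassifies three of the ten cases, while the three-trial ensemble classifies all of them correctly. These are training cases, so the example illustrates the mechanism only and says nothing about accuracy on unseen data.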
C5.0 incorporates several new facilities such as variable misclassification costs. In C4.5, all errors are treated as equal, but in practical applications some classification errors are more serious than others. C5.0 allows a separate cost to be defined for each predicted/actual class pair; if this option is used, C5.0 then constructs classifiers to minimize expected misclassification costs rather than error rates.
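The idea of minimizing expected cost rather than error rate can be sketched as follows. This is an illustration of the principle at prediction time only, not C5.0's construction procedure; the class names, probabilities, costs, and the function `min_cost_class` are all hypothetical.

```python
def min_cost_class(class_probs, costs):
    """Return the class whose prediction has the lowest expected cost.

    class_probs maps each class to its estimated probability.
    costs maps (predicted, actual) pairs to a cost; pairs not listed are
    assumed to cost 0 when correct and 1 when wrong.
    """
    def expected_cost(predicted):
        return sum(
            prob * costs.get((predicted, actual), 0.0 if predicted == actual else 1.0)
            for actual, prob in class_probs.items()
        )
    return min(class_probs, key=expected_cost)

# Hypothetical case: "healthy" is more probable, but missing a "sick"
# case is five times as costly as a false alarm.
probs = {"healthy": 0.7, "sick": 0.3}
costs = {("healthy", "sick"): 5.0, ("sick", "healthy"): 1.0}

most_probable = max(probs, key=probs.get)
cheapest = min_cost_class(probs, costs)
print(most_probable, cheapest)   # prints: healthy sick
```

Predicting "healthy" carries an expected cost of 0.3 x 5 = 1.5, whereas predicting "sick" costs 0.7 x 1 = 0.7, so the cost-sensitive choice differs from the most probable class.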
The cases themselves may also be of unequal importance. In an application that classifies individuals as likely or not likely to "churn," for example, the importance of each case may vary with the size of the account. C5.0 has provision for a case weight attribute that quantifies the importance of each case; if this appears, C5.0 attempts to minimize the weighted predictive error rate.
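A weighted error rate of this kind can be computed as the weight of the misclassified cases divided by the total weight. The sketch below assumes that definition; the data and the function name are hypothetical.

```python
def weighted_error_rate(actual, predicted, weights):
    # Share of the total case weight that falls on misclassified cases.
    wrong = sum(w for a, p, w in zip(actual, predicted, weights) if a != p)
    return wrong / sum(weights)

# Hypothetical churn example: the first account is five times as important.
actual    = ["churn", "stay", "stay", "churn"]
predicted = ["stay",  "stay", "stay", "churn"]
weights   = [5.0, 1.0, 1.0, 1.0]

unweighted = weighted_error_rate(actual, predicted, [1.0] * len(actual))
weighted = weighted_error_rate(actual, predicted, weights)
print(unweighted, weighted)   # prints: 0.25 0.625
```

One error in four cases is a 25% error rate, but because the misclassified account carries five of the eight units of weight, the weighted error rate is 62.5%.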
C5.0 has several new data types in addition to those available in C4.5, including dates, times, timestamps, ordered discrete attributes, and case labels. In addition to missing values, C5.0 allows values to be noted as not applicable. Further, C5.0 provides facilities for defining new attributes as functions of other attributes.
Some recent data mining applications are characterized by very high dimensionality, with hundreds or even thousands of attributes. C5.0 can automatically winnow the attributes before a classifier is constructed, discarding those that appear to be only marginally relevant. For high-dimensional applications, winnowing can lead to smaller classifiers and higher predictive accuracy, and can often reduce the time required to generate rulesets.
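The general idea of discarding marginally relevant attributes can be sketched with a simple information-gain filter. This is a stand-in technique chosen for illustration, since C5.0's own winnowing procedure is not described here; the threshold `min_gain`, the toy cases, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Reduction in class entropy from splitting on a discrete attribute.
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def winnow(cases, labels, min_gain):
    # Keep only attributes whose information gain reaches the threshold.
    return [name for name in cases[0]
            if information_gain([case[name] for case in cases], labels) >= min_gain]

# Hypothetical cases: "plan" determines the class, "coin" is unrelated to it.
cases = [
    {"plan": "basic",   "coin": "heads"},
    {"plan": "basic",   "coin": "tails"},
    {"plan": "premium", "coin": "heads"},
    {"plan": "premium", "coin": "tails"},
]
labels = ["churn", "churn", "stay", "stay"]
kept = winnow(cases, labels, min_gain=0.1)
print(kept)
```

Here "plan" has a gain of one bit and is kept, while "coin" has a gain of zero and is discarded before any classifier would be built.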
C5.0 is also easier to use. Options have been simplified and extended -- to support sampling and cross-validation, for instance -- and C4.5's programs for generating decision trees and rulesets have been merged into a single program.
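For readers unfamiliar with cross-validation, the sketch below shows one simple way to divide cases into folds so that each case is used for testing exactly once. How C5.0 itself assigns cases to folds is not described here, so the round-robin assignment and the function name are assumptions.

```python
def k_fold_indices(n_cases, folds):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_cases))
    for f in range(folds):
        test = indices[f::folds]          # every folds-th case, starting at f
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print(len(train), test)
```

A classifier is built on each training part and evaluated on the corresponding held-out part, and the error rates are then averaged over the folds.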
RuleQuest provides free source code for reading and interpreting See5/C5.0 classifiers. After the classifiers have been generated by See5/C5.0, this code allows you to access them from other programs and to deploy them in your own applications.
For more information on See5/C5.0, please see the tutorial.
© RULEQUEST RESEARCH 2017
Last updated February 2017