The methods for finding the bounds within which this combination is invoked have been re-designed to make them both faster and more effective. This option can lead to noticeably better predictive performance and is now recommended for applications with many continuous attributes.
In Release 2.04, an attribute was considered to have been "used" if its value was required to determine which rules applied to a case. This definition proved unsatisfactory because it depended on the order in which rule conditions were checked, and also because many attributes ended up having usage figures around 100%. In Release 2.05, an attribute is "used" to classify a case when it is referenced by one or more conditions of an applicable rule (i.e., a rule whose conditions are all satisfied by the case). Usage figures for trees and rulesets are now more similar.
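The Release 2.05 definition of attribute usage can be illustrated with a short Python sketch. It is not See5/C5.0's own code: the rule representation (a list of attribute/predicate pairs) and the function names `attributes_used` and `usage` are assumptions made for illustration, and unknown attribute values are not handled.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical representation: a rule is a list of conditions, and each
# condition is (attribute name, predicate applied to that attribute's value).
Condition = Tuple[str, Callable[[object], bool]]
Rule = List[Condition]
Case = Dict[str, object]


def attributes_used(case: Case, rules: List[Rule]) -> Set[str]:
    """Attributes referenced by at least one rule that applies to the case."""
    used: Set[str] = set()
    for rule in rules:
        # A rule is applicable only if every one of its conditions holds.
        if all(test(case[attr]) for attr, test in rule):
            used.update(attr for attr, _ in rule)
    return used


def usage(cases: List[Case], rules: List[Rule]) -> Dict[str, float]:
    """Percentage of cases for which each attribute is 'used'."""
    counts: Dict[str, int] = {}
    for case in cases:
        for attr in attributes_used(case, rules):
            counts[attr] = counts.get(attr, 0) + 1
    return {attr: 100.0 * n / len(cases) for attr, n in counts.items()}


rules = [
    [("age", lambda v: v > 30), ("income", lambda v: v > 50)],
    [("age", lambda v: v <= 30)],
]
cases = [
    {"age": 40, "income": 60},  # first rule applies
    {"age": 40, "income": 10},  # no rule applies
    {"age": 20, "income": 10},  # second rule applies
    {"age": 25, "income": 70},  # second rule applies
]
print(usage(cases, rules))
```

Because the result depends only on which rules apply, not on the order in which their conditions are checked, the figures do not change if the conditions within a rule are reordered.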
Two small bugs in the Windows GUI have been rectified. These concern the display of implicitly-defined discrete attributes when the value is unknown, and the possible change of classifier settings when the "Cross-reference" or "Making predictions" windows are invoked immediately after a cross-validation.
cases file.
These errors are now reported via pop-up messages.
The changes in Release 1.20 were as follows:
price has
numeric values, the class specifier
price: 100, 1000, 5000.
price less than or equal to 100
price greater than 100 but less than or equal to 1000
price greater than 1000 but less than or equal to 5000
price greater than 5000.
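The mapping from a numeric value to one of these threshold-defined classes can be sketched in Python as follows. The function name `threshold_class` and the 0-based class index are assumptions for illustration only; See5/C5.0 itself is not written in Python.

```python
from bisect import bisect_left


def threshold_class(value: float, thresholds: list) -> int:
    """Return the 0-based index of the interval that contains value.

    With sorted thresholds t0 < t1 < ... the intervals are
    value <= t0, t0 < value <= t1, ..., value > t_last.
    bisect_left counts the thresholds strictly below value, which is
    exactly the index of the matching interval.
    """
    return bisect_left(thresholds, value)


price_thresholds = [100, 1000, 5000]
for price in (100, 101, 1000, 5000, 5001):
    print(price, threshold_class(price, price_thresholds))
```

Three thresholds therefore give four classes, and a value equal to a threshold falls in the lower class.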
(This may sound as though 1.18 was rather wasteful with memory. The reduction is achieved, however, by compressing certain data structures as they are constructed; when information is required, relevant parts are temporarily restored.)
For example, the following graphs compare the performance of 1.18
to the previous release (1.17) on three large datasets:
[Graphs for three datasets: sleep, income, forest]
Results for 1.18 are shown in blue. Releases 1.17 and 1.18 generate rulesets with similar predictive accuracies, but notice how much faster 1.18 is -- on the largest dataset it is nearly five times as fast as 1.17.
A new option allows global pruning to be disabled if desired.
The graph compares the times required by 1.16 and 1.17 to construct a decision tree for different-sized subsets of a dataset. The releases perform similarly for 25,000 training cases, but 1.17's advantage increases with size -- at around 200,000 cases, 1.17 is almost twice as fast as 1.16.
A new winnowing option helps to overcome this problem by investigating the usefulness of all attributes before any classifier is constructed. Attributes found to be irrelevant or harmful to predictive performance are disregarded ("winnowed") and only the remaining attributes are used to construct decision trees or rulesets.
See5/C5.0 now uses a modified test selection strategy when the training data contains thousands of cases or more. The change is intended to reduce the number of unhelpful tests appearing in classifiers so that they are smaller and/or have higher predictive accuracy. Even without the winnowing option, the classifiers produced by Release 1.16 may differ from those generated by earlier releases.
YYYY-MM-DD HH:MM:SS using a 24-hour clock.
(Recall that See5/C5.0 already has data types for
times and for dates.)
A timestamp is rounded to the nearest minute, and
implicitly defined attributes can be used to compute functions
of timestamps such as the number of minutes between two of them.
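As a rough illustration of the timestamp behaviour described above, the following Python sketch parses a timestamp, rounds it to the nearest minute, and computes the number of minutes between two timestamps. The function names are hypothetical, and rounding exactly 30 seconds upwards is an assumption; the text above does not say how ties are resolved.

```python
from datetime import datetime, timedelta


def parse_timestamp(text: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' (24-hour clock), rounded to the nearest minute."""
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    # Adding 30 seconds and truncating rounds to the nearest minute
    # (assumption: exactly 30 seconds rounds up).
    ts += timedelta(seconds=30)
    return ts.replace(second=0, microsecond=0)


def minutes_between(earlier: str, later: str) -> int:
    """Whole minutes from earlier to later, after rounding both timestamps."""
    delta = parse_timestamp(later) - parse_timestamp(earlier)
    return int(delta.total_seconds() // 60)


print(parse_timestamp("2009-11-05 13:45:29"))
print(minutes_between("2009-11-05 23:59:40", "2009-11-06 00:10:10"))
```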
.names file.
This can be either a list of the allowable attributes or,
alternatively, a list of the attributes to be excluded.
HH:MM:SS.
Implicitly defined attributes can be used to compute functions
of times, such as the number of seconds between two times.
Dates can now be entered as either
YYYY/MM/DD or
YYYY-MM-DD.
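A reader preparing data outside See5/C5.0 might accept both date formats in a similar way. This is a minimal Python sketch; the function name `parse_date` is hypothetical, and rejecting mixed separators is an assumption.

```python
from datetime import date, datetime


def parse_date(text: str) -> date:
    """Accept a date written as YYYY/MM/DD or YYYY-MM-DD."""
    for fmt in ("%Y/%m/%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"not a recognised date: {text!r}")


print(parse_date("2009/11/05") == parse_date("2009-11-05"))
```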
To ease the changeover, See5/C5.0 and the new public code will still read model files (.tree and .rules) generated by Release 1.11.
xval
script. The -X option invokes cross-validation
and specifies the number of folds.
The xval script is still used for multiple
cross-validations. However, the option +d that
preserves detailed outputs now saves one file for each
cross-validation rather than one file for each C5.0 run.
YYYY/MM/DD
and can be used with implicitly defined attributes to determine,
for instance, the number of days between two dates or the day of the
week on which a date falls.
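The two date functions mentioned here, the number of days between two dates and the day of the week, can be sketched in Python. This only illustrates the calculations and is not the syntax of See5/C5.0's implicitly defined attributes; the function names are hypothetical.

```python
from datetime import date, datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_date(text: str) -> date:
    """Parse a date written as YYYY/MM/DD."""
    return datetime.strptime(text, "%Y/%m/%d").date()


def days_between(earlier: str, later: str) -> int:
    """Number of days from earlier to later (negative if later comes first)."""
    return (to_date(later) - to_date(earlier)).days


def day_of_week(text: str) -> str:
    """Name of the weekday on which the date falls."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAYS[to_date(text).weekday()]


print(days_between("2009/11/01", "2009/12/25"))
print(day_of_week("2009/11/01"))
```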
Ordered discrete values are nominal values that have
a natural ordering, such as small, medium, large, XL, XXL.
When an attribute's discrete values are noted as ordered, See5/C5.0
exploits this information to
test subranges of the values,
e.g. [large-XXL].
This tends to produce more compact models with higher predictive accuracy.
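A subrange test such as [large-XXL] amounts to comparing positions in the declared ordering. The Python sketch below shows the idea; the names `SIZE_ORDER` and `in_subrange` are assumptions for illustration, and it says nothing about how See5/C5.0 chooses which subranges to test.

```python
# Declared ordering of the discrete values, lowest first.
SIZE_ORDER = ["small", "medium", "large", "XL", "XXL"]


def in_subrange(value: str, low: str, high: str, order=SIZE_ORDER) -> bool:
    """True if value lies in the subrange [low-high] of the ordered values."""
    rank = {name: position for position, name in enumerate(order)}
    return rank[low] <= rank[value] <= rank[high]


print(in_subrange("XL", "large", "XXL"))
print(in_subrange("medium", "large", "XXL"))
```

A single subrange test can replace a separate branch for each of large, XL and XXL, which is one way such models become more compact.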
A new button on the toolbar allows the previous output to be redisplayed.
Previous releases of See5/C5.0 had a fuzzy thresholds option, but these soft thresholds were used only in interactive classification. The method for finding fuzzy thresholds has been changed in Release 1.10 and they are now used whenever a case is classified by a decision tree. (Note that fuzzy thresholds have no effect on rulesets.)
.test and .cases files
in addition to the .data file.
The cross-reference window itself now indicates whether cases
are misclassified.
| © RULEQUEST RESEARCH 2009 | Last updated November 2009 |