Notes from Previous Releases
- Rule utility analysis
Release 2.08 calculates the utility of each rule as measured by the additional error on the training set that would result from removing that rule. Rules with insignificant utility are deleted.
Previous releases ordered model rules by the average target value over the cases covered by the rule. Release 2.08 instead shows rules starting with the most useful and ending with the least.
Although the utility analysis requires additional computation, Release 2.08 contains efficiency improvements sufficient to cover this cost for most datasets.
- New default option values
Some default options have been changed in line with the larger datasets to which Cubist is often applied. The default extrapolation parameter has been reduced from 10% to 5%, and the default maximum rules parameter increased from 100 to 500. Of course, these new default values can still be overridden by the user.
Note: A revision of this release was issued in August 2015 to address problems when the target attribute has non-zero values of small magnitude (less than 0.0001). For applications with many such values, Cubist could overlook important patterns or display small values with too few significant digits.
- 64-bit Windows support
This release includes 64-bit versions of Cubist and CubistX (the batch executable). These versions allow the use of more than 2GB of memory, as required by some extremely large data mining tasks.
The 32-bit release of Cubist will run under either 32-bit or 64-bit Windows, so there is no need to change unless your tasks may use more than 2GB of memory. The 64-bit version of Cubist will run only under 64-bit Windows XP, Windows Vista, or Windows 7.
The network version of Cubist includes both 32-bit and 64-bit versions for installation on client PCs. A client PC running 64-bit Windows XP/Vista/7 can install and use the 64-bit version, even if the server runs 32-bit Windows.
Cubist continues to be available in both 32-bit and 64-bit versions for Linux.
- New option: unbiased rules
By default, Cubist rules attempt to minimize absolute error on unseen cases. This necessitates minimizing the median rather than the mean residual, so a Cubist rule is generally biased -- its mean prediction differs from the mean of the training cases that it covers. This new option leads to approximately unbiased rules but also to greater absolute error.
This option is recommended for applications where there are many cases with the same target value (such as zero). Unbiased rules will usually give more variation in predicted values near this common value.
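The link between absolute error and the median can be checked numerically. The sketch below is an illustration only and is not Cubist's algorithm: the target values are made up and the function name is hypothetical. It compares a constant prediction equal to the median with one equal to the mean, on data where many cases share the value zero.

```python
from statistics import mean, median

def mean_abs_error(values, prediction):
    """Average absolute difference between each value and a constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)

# Made-up target values with many zeros, as in the situation described above.
targets = [0, 0, 0, 0, 0, 2, 3, 10, 25]

med = median(targets)  # constant that minimizes absolute error
avg = mean(targets)    # unbiased constant: equals the mean of the cases

print("median prediction:", med, "MAE:", mean_abs_error(targets, med))
print("mean prediction:  ", avg, "MAE:", mean_abs_error(targets, avg))
```

On these values the median prediction has the lower absolute error but sits well below the mean of the cases, which is the bias that the new option trades away.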
- Changes to composite models
Release 2.07 is faster when it generates composite models for large applications. Cubist's procedures for setting the number of nearest neighbors have also been changed, and distances are now calculated to a higher precision.
- Improved public code
RuleQuest provides public source code that enables models generated by Cubist to be employed in users' programs.
- The public code now runs in multiple threads and so can benefit from modern multi-core processors.
- The calculation of error limits has been modified slightly.
Warning: The public code for Release 2.07 cannot be used with models produced by previous Cubist releases.
- Bug fixes
For composite models, the distance to a neighbor could be under-estimated under some rarely-occurring circumstances.
The scatterplot of real versus predicted values could fail for very large applications with hundreds of thousands of cases. In such situations Cubist now shows a scatterplot for only a sample of cases, although the statistics displayed are still computed over all cases.
The public code sometimes incorrectly flagged a case as having a value outside the range observed in the training cases.
When the public code was used with the -i option and without a label attribute, incorrect nearest neighbors were shown.
- Changes to public code
We provide public source code to enable models generated by Cubist
to be employed in users' programs. The sample program included with
the code that reads cases and predicts their values has been improved:
- The program now flags any case whose value of a relevant attribute lies outside the range observed in the training data.
- A new option for composite models shows the nearest neighbors for each case and their distances from the case.
- This option can be used by itself or in conjunction with the option to estimate the error bounds of the prediction for each case.
- Revised documentation
The tutorial is now based on a different application with the aim of
better explaining the effects of Cubist's options for building models.
- Faster committee models
Additional multi-threading allows Cubist to construct committee models more
quickly for large applications.
- Bug fix: composite models
- Release 2.06 rectifies a quirk that could allow a composite model to make a prediction outside the permitted extrapolation range.
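As a rough illustration of what an extrapolation limit does, the sketch below clamps a prediction to the training range widened by a percentage on each side. Everything in it is an assumption made for illustration: the function name, the way the margin is computed from the width of the range, and the sample numbers are not taken from Cubist, whose actual procedure may differ.

```python
def limit_prediction(raw, train_min, train_max, extrapolation=0.05):
    """Clamp a raw prediction to the training range plus a margin.

    Assumption: the margin on each side is `extrapolation` times the
    width of the observed range; Cubist's exact rule may differ.
    """
    margin = extrapolation * (train_max - train_min)
    lower = train_min - margin
    upper = train_max + margin
    return max(lower, min(upper, raw))

# Training targets observed between 0 and 100, with a 5% allowance.
for raw in (50.0, 130.0, -20.0):
    print(raw, "->", limit_prediction(raw, 0.0, 100.0))
```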
- More accurate models
Cubist models should now have somewhat lower average absolute error on
- Improved multi-threading
Another bottleneck in Cubist's model-building algorithm has been
parallelized and will now run on multiple CPUs or cores. Assignment of
some tasks to processors has also been adjusted to balance loads
better and so reduce the time taken to process larger applications.
- Faster composite models
- When Cubist constructs a composite model for applications with hundreds of thousands of training cases, a significant proportion of the total run time is taken up by calculating the accuracy of the model on these same cases. Instead, Release 2.05 uses a large sample of the training cases to estimate this accuracy.
- Attribute usage
A new summary highlights the usage of attributes that appear
in a Cubist model. This shows, for each attribute, the percentage
of training cases for which that attribute is used in the
conditions of an applicable rule, and the percentage of cases
for which it is used in an associated linear model.
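The two percentages can be pictured with a small sketch. The rule representation below (a dictionary with an `applies` test and two sets of attribute names) is hypothetical and is not Cubist's model format; the cases are made up. For each case, the sketch collects the attributes used by every applicable rule and then reports the share of cases in which each attribute appears.

```python
def attribute_usage(cases, rules):
    """Return {attribute: (% of cases used in conditions, % used in linear models)}."""
    in_conditions, in_models = {}, {}
    for case in cases:
        cond_attrs, model_attrs = set(), set()
        for rule in rules:
            if rule["applies"](case):
                cond_attrs |= rule["condition_attrs"]
                model_attrs |= rule["model_attrs"]
        for name in cond_attrs:
            in_conditions[name] = in_conditions.get(name, 0) + 1
        for name in model_attrs:
            in_models[name] = in_models.get(name, 0) + 1
    total = len(cases)
    names = sorted(set(in_conditions) | set(in_models))
    return {
        name: (100.0 * in_conditions.get(name, 0) / total,
               100.0 * in_models.get(name, 0) / total)
        for name in names
    }

# Hypothetical two-rule model and four made-up cases.
rules = [
    {"applies": lambda c: c["x"] <= 5,
     "condition_attrs": {"x"}, "model_attrs": {"y"}},
    {"applies": lambda c: c["x"] > 5 and c["z"] == "a",
     "condition_attrs": {"x", "z"}, "model_attrs": {"x", "y"}},
]
cases = [
    {"x": 1, "y": 2.0, "z": "a"},
    {"x": 3, "y": 1.5, "z": "b"},
    {"x": 7, "y": 0.5, "z": "a"},
    {"x": 9, "y": 4.0, "z": "b"},
]

for name, (cond_pct, model_pct) in attribute_usage(cases, rules).items():
    print(f"{name}: conditions {cond_pct:.0f}%, model {model_pct:.0f}%")
```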
- Enhanced multi-threading
Release 2.04 will now use up to four processors and so
will run faster on the new quad-core CPUs and computers with
two dual-core processors.
- Linux GUI
For Linux users who have installed a recent version of
Wine, the new release includes
an optional graphical interface with many features of the Windows version.
(The cross-reference facility in particular
provides information that is not available from the command-line
version.) The Linux GUI calls the native Linux version of Cubist, so
there is no appreciable performance penalty.
- Minor improvements to models
There have been a few changes to Cubist's model-building algorithms
that often lead to more compact models with slightly higher predictive accuracy.
- Bug fix (Windows only)
- Release 2.04a fixes a problem in the Windows GUI version (Cubist.exe), which could sometimes freeze during cross-validation. The batch mode executable CubistX.exe is not affected.
- Improved accuracy on larger datasets
Cubist's model-simplification heuristics have been revised, with
the effect that it will usually produce more rules from large
numbers of training cases. On the positive side, this increase in
complexity is usually matched by increased accuracy on new cases.
On one dataset containing approximately 30,000 cases, for example,
cross-validated average error on unseen cases was
half that of release 2.02.
- Weighting individual cases
By default, all training cases are treated as equal, but
particular applications may need to emphasize some data more than others.
Release 2.03 provides an optional attribute that specifies the
importance of each case; when this is used, Cubist attempts to
minimize case-weighted error or, in other words, pays more attention
to fitting more important cases.
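One way to read "case-weighted error" is as a weighted average of absolute errors, sketched below. This reading is an assumption; the notes do not give Cubist's exact definition, and the function name and numbers are made up.

```python
def weighted_mean_abs_error(actual, predicted, weights):
    """Weighted average of absolute errors; larger weights count for more."""
    total_weight = sum(weights)
    weighted_sum = sum(w * abs(a - p)
                       for a, p, w in zip(actual, predicted, weights))
    return weighted_sum / total_weight

actual = [10.0, 20.0, 30.0]
predicted = [12.0, 20.0, 24.0]

# Equal weights reproduce the ordinary mean absolute error.
print(weighted_mean_abs_error(actual, predicted, [1, 1, 1]))
# Giving the third case three times the importance raises the error,
# because that case is the one fitted worst.
print(weighted_mean_abs_error(actual, predicted, [1, 1, 3]))
```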
- Simpler control of model complexity
Previous releases provided two user-configurable parameters that
affected model complexity. One of these (minimum case cover)
could sometimes prevent Cubist from finding a good model and
has been dropped. User control over model complexity
is now achieved by a straightforward limit on the number of rules in
a model, with default value 100.
- Error bound option for public code
- The free C code for reading and interpreting Cubist 2.03 models has an option to estimate error bounds. If this option is invoked, each predicted value is shown as value +- error, where value is the predicted value and error is the (nominally 95%) absolute error; i.e., approximately 95% of the time, the real value should lie between value - error and value + error. Now, "95%" should not be interpreted too rigorously and will certainly vary from application to application. When this option is used, the public code also shows the actual percentage of cases whose real values lie within the estimated bounds.
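The "actual percentage of cases whose real values lie within the estimated bounds" amounts to a simple count. The sketch below is written in Python rather than taken from the public C code; the function name and the sample values are hypothetical.

```python
def coverage_percent(actual, predicted, errors):
    """Percentage of cases whose real value lies within predicted +- error."""
    inside = sum(1 for a, p, e in zip(actual, predicted, errors)
                 if p - e <= a <= p + e)
    return 100.0 * inside / len(actual)

actual = [10.0, 12.0, 30.0, 41.0]
predicted = [11.0, 15.0, 29.0, 40.0]
errors = [2.0, 2.0, 2.0, 2.0]

# The second case (12 against 15 +- 2) falls outside its bounds.
print(coverage_percent(actual, predicted, errors))  # 75.0
```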
- Faster composite models
Composite models use both nearest-neighbor and rule-based
prediction as described here.
When there are many training cases as potential neighbors, finding the
nearest n of them can be slow. Release 2.02 is considerably
faster in this regard, achieved by using more powerful indexing methods and
by taking advantage of dual processors, dual cores, or Intel
hyper-threading if these are available.
- Smaller memory footprint
Release 2.02 requires less memory for applications with very many
attributes/predictors. (You probably won't notice this, though, unless your
application has hundreds of them.)
- Bug fixes (Windows version)
The Windows version of release 2.01 sometimes crashes when model construction
is interrupted via the Stop button.
Release 2.02 should simply stop as it's supposed to.
Release 2.02 also recovers all previous settings when an application is re-run. Some of the previous settings revert to defaults in 2.01.
- Adaptation to Microsoft bug-fix (network version only)
- To improve security, Microsoft Windows updates have disabled a feature that is used by clients to read on-line help, as documented here. The client installation program has been modified to set appropriate registry entries on the client, and also leaves a local copy of the help (CubistHelp.chm) in the Cubist folder as a workaround in case new Windows updates affect HTMLHelp.
The core of Cubist has been rewritten so that it can take
advantage of computers with dual processors or
Intel PCs with Hyper-Threading Technology.
This can significantly reduce the time taken to process
very large datasets.
- 64-bit Linux version
Cubist is now available in a 64-bit Linux version for
AMD PCs with Athlon64 and Opteron CPUs, and Intel PCs with
Extended Memory 64 Technology.
- Bug fix
In previous releases, use of the cross-reference facility in
the Windows version of Cubist could cause problems when there
were errors in the cases file. These errors are now reported via pop-up messages.
- New distribution format for Windows
- Cubist Release 2.01 is distributed as a self-contained Inno executable.
- Simpler models
The push towards simpler models for large datasets that was begun in
Release 1.12 has been continued in Release 1.13.
The goal is still to make the models easier to understand without
impairing their predictive accuracy.
On an oceanographic application with 178,000 cases and 12 attributes, for instance, the model produced by Release 1.11 has 58 rules, Release 1.12's has 46, and Release 1.13's has only 38.
- Less memory, faster
The memory required to analyze larger datasets has been
reduced, with the added benefit that computation is also speedier.
Run times for the oceanographic application mentioned above
have decreased from 23.6 seconds for Release 1.12 to 18.0 seconds
for 1.13 (both measured on a 3GHz Pentium IV).
- Bug fixes
For some applications with many attributes,
Release 1.12 sometimes produced rules whose linear sub-models
had large coefficients
and these were saved to the model file with insufficient precision.
Both problems have been addressed in Release 1.13.
- Windows on-line help rewritten
- The on-line help for the Windows version has been updated to the more modern HtmlHelp format and now corresponds closely to the tutorial available on the web.
- Target attribute can be defined by formula
In previous releases of Cubist, the target attribute was required to be
one of the explicitly-defined attributes. Release 1.12 allows the
target to be defined as a function of other attributes.
As a simple example, if the data contain values for an attribute X,
the target value to be modeled might be log(X).
- Simpler models
Release 1.12 attempts to simplify models even further
by reducing the numbers of rules.
This mechanism is most noticeable with larger datasets; on an oceanographic
application (178,000 cases, 12 attributes), the model produced by
Release 1.11 has 58 rules while Release 1.12's has 46.
- Further speed improvements
Larger datasets are processed faster by Release 1.12. Run times for
the application cited above have been reduced from Release 1.11's 28.0 seconds
(3GHz Pentium IV) to 23.6 seconds.
- Bug fix
- In previous releases, attributes with constant values could sometimes appear in linear models associated with rules. This bug had only a cosmetic effect since the values computed by the linear models were still correct, but it certainly impeded interpretation of the rules.
- Speed improvement
Release 1.11 is noticeably faster when processing larger datasets.
For instance, the previous release took 64 seconds on a 1GHz PC to build a
model from 30,000 cases with 36 attributes; Release 1.11 does
the same job in just over 33 seconds.
- GUI (Windows version)
- There have been some small changes to the Windows GUI. In particular, the cross-reference window shows predicted values to one decimal place more than the precision of the target values.
- Timestamp attributes
Attributes can now have timestamp values consisting of a date and
a time, e.g. 2001-04-30 13:21:10.
Timestamps are accurate to the nearest minute and subtracting
one timestamp from another gives the number of minutes between them.
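The arithmetic can be mimicked outside Cubist as follows. This is a sketch under stated assumptions: the rounding rule (30 seconds or more rounds up) and the function names are guesses at what "accurate to the nearest minute" means, not a description of Cubist's internals.

```python
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%d %H:%M:%S"

def to_minute(text):
    """Parse a timestamp and round it to the nearest minute (assumed rule)."""
    stamp = datetime.strptime(text, FORMAT)
    rounded = stamp.replace(second=0)
    if stamp.second >= 30:
        rounded += timedelta(minutes=1)
    return rounded

def minutes_between(later, earlier):
    """Number of minutes from `earlier` to `later` after rounding both."""
    delta = to_minute(later) - to_minute(earlier)
    return int(delta.total_seconds() // 60)

print(minutes_between("2001-04-30 13:21:10", "2001-04-30 12:00:00"))  # 81
```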
- More direct control of model complexity
The old `brevity' control has been superseded by an optional parameter
specifying the maximum number of rules in a Cubist model.
Models with a restricted number of rules are easier to understand
but, of course, they may also have lower predictive accuracy
than unrestricted models.
- Estimated model error
Each rule in a Cubist model gives an estimate of the expected
error when the rule is used to predict values for new cases.
A bug that produced erroneously high values for this estimate
has been fixed.
- Simpler linear models
- Linear models generated by Release 1.10 generally have fewer, simpler coefficients.
- Control of attributes used in models
The .names file now has a facility to restrict the attributes that can appear in models.
This allows attributes to be used in formulas defining other attributes but not directly in a model. For example, suppose that the data contain two numeric attributes A and B but background knowledge suggests that only their difference is important. It is now possible to define a new attribute
Diff := A - B.
without allowing A and B themselves to appear in any model.
This same facility makes it much easier to experiment with restricted subsets of the attributes.
- Time attributes
An attribute declared to be a `time' takes values in the form HH:MM:SS. As with dates, attributes defined by formulas can subtract one time from another to give an interval (in seconds).
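For reference, the interval produced by subtracting one HH:MM:SS value from another can be reproduced with a few lines of Python. The function names are hypothetical, and the sketch says nothing about how Cubist stores times internally.

```python
def time_to_seconds(text):
    """Convert an HH:MM:SS string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def interval_seconds(end, start):
    """Seconds from `start` to `end`, both given as HH:MM:SS."""
    return time_to_seconds(end) - time_to_seconds(start)

print(interval_seconds("13:21:10", "12:00:00"))  # 4870
```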
- Composite models
The instance-based component of composite models has been extensively
revised. Two changes that you will notice are:
- Previous releases always used five nearest neighbors. The number of neighbors can now be set to any value from 1 to 9 or, alternatively, Cubist will determine an appropriate value in this range.
- Instances are indexed using kd-trees so that a case's nearest neighbors can now be found more quickly. The kd-tree indexing has also been incorporated into the public code.
- Several key components of Cubist have been optimized to improve, for example, their cache performance. The benefits are particularly noticeable on larger datasets -- Release 1.09 can be more than twice as fast as 1.08.
- Committee models
A new option is available to generate committee models.
As the name implies, a committee model consists of several
distinct Cubist models, all generated from the same training data.
When a prediction is to be made, each model is consulted
and the results from all models are averaged.
Committee models are of most value in applications for which single Cubist models are already pretty accurate.
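The averaging step is simple enough to sketch. Below, each member model is represented by a hypothetical Python function that maps a case to a predicted value; how Cubist actually builds the distinct member models is not covered here.

```python
def committee_predict(models, case):
    """Consult every member model and average their predictions."""
    predictions = [model(case) for model in models]
    return sum(predictions) / len(predictions)

# Three made-up member models for a case with a single attribute x.
models = [
    lambda case: 1.0 + 2.0 * case["x"],
    lambda case: 0.5 + 2.5 * case["x"],
    lambda case: 2.0 + 1.5 * case["x"],
]

print(committee_predict(models, {"x": 2.0}))
```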
- New data values
A new value N/A can be used when the value of an attribute
is not applicable to a case. For example, consider the attributes
`purchased ticket?' with values `yes' and `no', and
`ticket cost' with numeric values. If a case's value of
the former is `no', the appropriate value for the latter
is now `N/A'.
Dates can now be entered as either
- Changes to saved models
Up to Release 1.07, Cubist models have been stored as binary files.
From this release, .model files have been changed to ASCII format, so that models generated on one machine type may be deployed on machines of another type. The source code that facilitates such deployment has also changed substantially.
To ease the changeover, Cubist and the new public code will still read model files generated by Release 1.07.
- New Unix option
Cross-validation has now been incorporated directly into Cubist
rather than being available only through the xval script. The
-X option invokes cross-validation and specifies the number of folds.
The xval script is still used for multiple cross-validations. However, the option
+d that preserves detailed outputs now saves one file for each cross-validation rather than one file for each Cubist run.
- Improved error messages
- Problems with application files (.names, .data, .test etc) can be corrected more easily because the error message identifies the line number of the file in question.
- New data types
Dates are input and output in the form YYYY/MM/DD and can be used with implicitly defined attributes to determine, for instance, the number of days between two dates or the day of the week on which a date falls.
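The two date calculations mentioned above look like this in Python. The sketch only illustrates the arithmetic on YYYY/MM/DD values; the function names are hypothetical and it does not use Cubist's formula syntax.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def parse_date(text):
    """Parse a date written as YYYY/MM/DD."""
    return datetime.strptime(text, "%Y/%m/%d").date()

def days_between(later, earlier):
    """Number of days from `earlier` to `later`."""
    return (parse_date(later) - parse_date(earlier)).days

def day_of_week(text):
    """Name of the weekday on which the date falls."""
    return DAY_NAMES[parse_date(text).weekday()]

print(days_between("2001/04/30", "2001/01/01"))  # 119
print(day_of_week("2001/04/30"))                 # Monday
```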
Ordered discrete values are nominal values that have a natural ordering, such as small, medium, large, XL, XXL. When an attribute's discrete values are noted as ordered, Cubist exploits this information to test subranges of the values, e.g. [large-XXL]. This tends to produce more compact models with higher predictive accuracy.
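A subrange test on ordered values reduces to comparing positions in the declared order. The sketch below is illustrative only; the function name is hypothetical and the representation is not Cubist's.

```python
SIZES = ["small", "medium", "large", "XL", "XXL"]

def in_subrange(value, low, high, order=SIZES):
    """True if `value` lies between `low` and `high` in the declared order."""
    position = order.index(value)
    return order.index(low) <= position <= order.index(high)

# The condition [large-XXL] covers large, XL and XXL but not medium.
print(in_subrange("XL", "large", "XXL"))
print(in_subrange("medium", "large", "XXL"))
```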
- New Unix option
The random number seed can now be set, with the result that
runs with sampling etc. are repeatable.
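The effect of a fixed seed on sampling can be demonstrated with Python's own random number generator. This is an analogy only: Cubist uses its own generator and sampling procedure, and the function name and parameters here are hypothetical.

```python
import random

def sample_split(cases, percent, seed):
    """Split cases into a training sample of `percent` percent and the rest."""
    rng = random.Random(seed)   # private generator, so the seed fully fixes the result
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

cases = list(range(100))
train_a, test_a = sample_split(cases, 70, seed=1)
train_b, test_b = sample_split(cases, 70, seed=1)

# The same seed gives the same split on every run.
print(train_a == train_b and test_a == test_b)
```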
- Improvements to the Windows GUI
The output window is now more readable, and can be copied and printed
directly (without having to switch to WordPad).
A new button on the toolbar allows the previous output to be redisplayed.
- Revision of source code
- The source code for reading and interpreting models constructed by Cubist has been further revised.
- Attributes defined by formulas
It is sometimes convenient to define the value of an attribute
as a function of other attribute values rather than by giving
the value explicitly in data files. Release 1.06a allows
such implicitly-defined attributes to be described by formulas in
an application's names file. The formulas need not be simple -- both
numeric and logical values can be introduced in this way.
- New parameter controlling simplicity/accuracy tradeoff
The issue of simplicity versus accuracy is one of those things
that will always be with us in data mining. Simpler models are
easier to understand and convey more insight, but some
applications require all the accuracy they can get and
insight is not an issue.
Release 1.06a contains a new parameter that influences this tradeoff. When the brevity factor is set to a high value, Cubist will emphasize simplicity (usually at some expense to accuracy). Similarly, a low value puts a premium on accuracy, but may substantially increase model complexity. The choice is now yours!
- Further improvements in models
Some fundamental changes to the model-building mechanism mean that
Cubist models are now noticeably improved.
The rules tend to overlap more, with the result that predictions
change more smoothly as attribute values of a case are varied.
- Changes to the Cubist GUI
There have been several improvements in line with suggestions
made by users (and please keep them coming!):
- A new Edit menu brings up the names file in WordPad, making it easier to change this file.
- The model construction settings last used with an application are stored and are reset whenever that application is selected again. (There's also a new button on the dialog box to reset all of them to their default values.)
- The main window can be clicked on top of the output window.
- Models that are easier to understand
The linear models associated with rules are now ordered so that
attributes with a higher differential effect on the model values
appear before less-important factors.
- More accurate models
Cubist models have also been changed somewhat to improve their
predictive accuracy. Those generated by R1.05 will sometimes
have more rules than those from previous releases.
- Global extrapolation limits
The predictions made by a rule are limited to an extension of
the range of values observed for training cases that match the
rule (see the extrapolation parameter). The same
restriction now applies to "global" predictions that use instances and
- Improved model construction dialog box (Windows version)
- This has been revised so that options can be specified more easily from the keyboard.
- New attribute type label
- In some applications, each case has an identifying code or
serial number; this information can be recorded in a label
attribute. A label attribute does not affect models in
any way, but its value is displayed where possible with information about
the case such as error messages, cross-referencing results etc.
- Sample locking (Windows version)
- The sampling option introduced in Release 1.03 allows random
train/test splits of an application's data to be generated
automatically. In some situations, for instance when investigating
alternative model construction options, it is desirable to be able
to `lock in' a particular sample, and an additional option on the model
construction dialog box is now provided for this purpose.
- Saving cross-referencing results (Windows version)
- Cubist's cross-referencing facility is a powerful tool for finding the cases covered by particular model rules, and rules relevant to particular cases. The information in the cross-reference window at any point in time can now be saved as a text file.
- Sampling option
- Cubist now includes an option to sample from large
datasets. This enables a fixed percentage of the cases in
a data file to be used for training. As an added
convenience, models constructed using the sampling option
are now automatically evaluated on a disjoint set of test
- Batch-mode version of Cubist for Windows
- GUIs are great, but it's sometimes useful to be able to run Cubist non-interactively from a MS-DOS command window. Cubist Release 1.03 includes an additional program CubistX that can be executed as a console application. Options for CubistX are set by command-line parameters in exactly the same way as for the Unix version. (Not included with the free demonstration download.)
- Model Presentation
- The rules in a Cubist model are now ordered by the
average target value of the cases covered.
Rules that tend to predict low values appear before rules
that predict high values.
This is solely to make the models more intelligible --
the order of rules does not affect the value predicted.
- Speed improvements
- Cubist is now considerably faster for large datasets, particularly when using composite models.
© RULEQUEST RESEARCH 2016. Last updated January 2016.