Cubist: Illustrative Examples
- Housing Prices in Boston
- The Fat Content of Meat
- Concrete Compressive Strength
- A Simple Time Series Example: Fraser River
- Now Read On ...
This page illustrates Cubist models and their predictive performance on some diverse applications. Like See5/C5.0, Cubist pays particular attention to the issue of comprehensibility. RuleQuest believes that a data mining system should find patterns that not only facilitate accurate predictions, but also provide insight. We hope this is evident in the following examples, all of which were run with Cubist's default parameter values. Times are for a 2.6GHz Core i7 running 64-bit Linux.
Housing Prices in Boston
This first application uses data on housing prices circa 1980 in Boston tracts. Each case describes average characteristics of houses in a tract that might be expected to affect their price. Here are a few examples:
  Abbrev    Attribute                          Tract 1   Tract 2   Tract 3  .....
  CRIM      crime rate                         7.67      2.24      0.08
  ZN        proportion large lots              -         -         45
  INDUS     proportion industrial              18.1      19.6      3.4
  NOX       nitric oxides ppm                  .69       .61       .44
  RM        av rooms per dwelling              5.7       5.9       7.2
  AGE       proportion pre-1940                98.9      91.8      38.9
  DIS       distance to employment centers     1.6       2.4       4.6
  RAD       accessibility to radial highways   24        5         5
  TAX       property tax rate per $10,000      666       403       398
  PTRATIO   pupil-teacher ratio                20.2      14.7      15.2
  LSTAT     percentage low income earners      19.9      11.6      5.4
  PRICE     average price ($'000)              8.5       22.7      36.4
From 506 such cases, Cubist takes less than a tenth of a second to construct a model consisting of four rules:
Model:

  Rule 1: [163 cases, mean 31.43, range 16.5 to 50, est err 2.75]

    if
        RM > 6.226
        LSTAT <= 9.59
    then
        CMEDV = -4.82 + 2.26 CRIM + 9.2 RM - 0.83 LSTAT - 0.019 TAX
                - 0.7 PTRATIO - 0.71 DIS - 0.039 AGE - 1.7 NOX
                + 0.008 ZN + 0.02 RAD

  Rule 2: [101 cases, mean 13.79, range 5 to 27.5, est err 2.21]

    if
        NOX > 0.668
    then
        CMEDV = 2.05 + 2.03 DIS - 0.37 LSTAT + 21.4 NOX - 0.06 CRIM

  Rule 3: [203 cases, mean 19.42, range 7 to 31, est err 2.13]

    if
        NOX <= 0.668
        LSTAT > 9.59
    then
        CMEDV = 30.72 + 2.6 RM - 0.25 LSTAT - 0.79 PTRATIO - 0.72 DIS
                - 0.038 AGE - 3.7 NOX - 0.0025 TAX + 0.04 RAD
                - 0.03 CRIM + 0.007 ZN

  Rule 4: [43 cases, mean 24.16, range 18.2 to 50, est err 2.68]

    if
        RM <= 6.226
        LSTAT <= 9.59
    then
        CMEDV = -23.77 + 0.95 CRIM + 0.81 RAD + 8.5 RM - 0.83 LSTAT
                + 0.0075 TAX - 0.4 DIS - 0.12 PTRATIO - 0.009 AGE
                + 0.005 ZN
Each rule has three parts: some descriptive information, conditions that must be satisfied before the rule can be used, and a linear formula.
- The descriptive information notes the number of training cases covered by the rule, the mean and range of their values of the target attribute (PRICE, shown as CMEDV in the rules), and a rough estimate of the error to be expected when the rule is used on new data.
- The conditions constrain the values of some of the attributes by thresholding numeric attributes or restricting the values of nominal attributes. The conjunction of conditions establishes a context for the linear formula.
- The linear formula is a simplified fit of the training data covered by the rule. The terms of the formula are ordered with the most important attributes appearing first. Looking at Rule 2, for instance, we see that DIS (distance from employment centers) has the most effect (positive) on housing values for cases covered by this rule; LSTAT is the next most important (a negative effect), and so on.
How can a model like this be used to make predictions about new cases? With some slight suppression of details, the procedure is as follows:
- Identify the rule or rules that cover the case.
- Calculate the values for the case given by the corresponding linear formulas.
- Average them.
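The three steps above can be sketched in a few lines. This is an illustrative sketch of the procedure, not Cubist's implementation; the two toy rules and their coefficients are invented for the example rather than taken from the Boston model.

```python
def predict(rules, case):
    """Each rule is a (condition, formula) pair, where condition(case)
    tests whether the rule covers the case and formula(case) evaluates
    the rule's linear model."""
    # Step 1: identify the rules that cover the case.
    # Step 2: evaluate their linear formulas.
    values = [formula(case) for condition, formula in rules if condition(case)]
    # Step 3: average the formula values.
    return sum(values) / len(values)

# Two invented toy rules for illustration (not from the model above):
rules = [
    (lambda c: c["RM"] > 6.226, lambda c: -4.8 + 9.2 * c["RM"]),
    (lambda c: c["RM"] <= 7.5,  lambda c: 10.0 + 2.6 * c["RM"]),
]

# Both rules cover a case with RM = 7.0, so the prediction is the mean
# of the two formula values: (59.6 + 28.2) / 2 = 43.9
print(predict(rules, {"RM": 7.0}))
```

When exactly one rule covers a case, the average reduces to that rule's formula value, so single-rule and multi-rule cases are handled uniformly.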
All very well, you might say, but how good is the model? Here are the results of a 10-fold cross-validation on this dataset, plotting the predicted value of each unseen case against its real value.
Even though the model is quite simple, these results compare favorably with most published results for this dataset.
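The 10-fold cross-validation used throughout these examples can be outlined as follows. This is a generic sketch of the technique, not Cubist's built-in procedure: the cases are shuffled and dealt into ten disjoint folds, and each fold is held out once as unseen test data while a model is built from the other nine.

```python
import random

def ten_fold_indices(n, seed=0):
    """Deal case indices 0..n-1 into 10 disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

# The Boston data has 506 cases; each appears in exactly one fold, so
# every case is predicted exactly once as unseen data.
folds = ten_fold_indices(506)
print([len(f) for f in folds])
```

Because every case is predicted exactly once while held out, the pooled predictions give an honest picture of performance on genuinely unseen data.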
The Fat Content of Meat
Statlib is a central repository of datasets and software used by statisticians. One of the datasets obtainable from this interesting site concerns estimating the fat content of meat samples from their absorbency in the near infrared spectrum. The data come from a Tecator Infratec Food and Feed Analyzer that records readings on 100 channels. Each attribute is the value of the instrument reading on one channel, so this is a high-dimensional prediction task.
Cubist derives a model with two rules from 240 training examples (again in less than 0.1 seconds):
  Rule 1: [111 cases, mean 26.446, range 1.7 to 58.5, est err 1.752]

    if
        A40 > 3.08971
    then
        Fat = 9.486 + 16452.2 A37 + 15737.4 A53 - 12562 A38
              - 14972.2 A13 - 11479.4 A54 - 11986.9 A28 - 9209.1 A34
              + 8808.5 A81 + 10616.9 A12 + 9129.9 A29 - 8040.2 A80
              + 8468.1 A25 - 6534.6 A52 + 5934.3 A90 - 6832.1 A05
              - 5200.6 A95 + 6535.9 A09 + 5110.3 A44 - 5277.7 A23
              + 4046.6 A57 - 3996 A45 + 3701.5 A99 + 4050.4 A17
              + 3049 A36 - 2226.8 A89 - 1621.9 A98 - 1627.8 A58
              - 1524.8 A85 + 1811.2 A00 - 563.2 A48 - 470.8 A40
              + 345 A49 + 330.2 A30

  Rule 2: [129 cases, mean 11.667, range 0.9 to 36.2, est err 0.823]

    if
        A40 <= 3.08971
    then
        Fat = 7.211 + 7044.2 A38 - 6235.6 A37 + 5775.9 A36 + 4324 A53
              - 3683.7 A34 + 4046.7 A12 - 3265.2 A95 - 2947.1 A40
              - 2954.6 A52 - 3150.5 A05 - 2554.9 A30 - 2787.9 A17
              + 2472.8 A25 + 2120.1 A97 - 2074.4 A70 + 1886 A60
              - 2196.6 A13 + 1765.1 A76 - 1684.2 A54 - 1579.7 A58
              + 1963.5 A07 + 1292.3 A81 + 1339.5 A29 - 1179.6 A80
              + 870.6 A90 + 958.9 A09 + 691.7 A44 - 774.3 A23
              + 593.7 A57 - 586.3 A45 + 543.1 A99 + 490.4 A28
              - 326.7 A89 - 237.9 A98 - 223.7 A85 + 265.7 A00
Despite the high dimensionality of this data, a ten-fold cross-validation shows a very good fit on the unseen cases:
Concrete Compressive Strength
The third example uses a dataset from the UCI Machine Learning Repository. The data, donated by Prof. I-Cheng Yeh, consist of information on 1,030 concrete samples showing, for each, the value of eight relevant properties and its compressive strength.
A model consisting of 21 rules is constructed, again in less than a tenth of a second. Here are the most important rules:
  Rule 1: [106 cases, mean 59.542, range 31.72 to 82.6, est err 5.348]

    if
        Superplasticizer > 7.8
        Age > 28
    then
        Concrete compressive strength = 76.053 + 0.14 Blast Furnace Slag
                + 0.1 Cement - 0.393 Water + 0.098 Fly Ash + 0.091 Age
                - 0.75 Superplasticizer

  Rule 2: [161 cases, mean 23.666, range 7.68 to 52.3, est err 3.174]

    if
        Cement <= 427.5
        Blast Furnace Slag <= 190.1
        Superplasticizer <= 3.9
        Age > 3
        Age <= 28
    then
        Concrete compressive strength = 155.599 + 0.493 Age
                + 3.65 Superplasticizer + 0.057 Cement - 0.262 Water
                - 0.064 Fine Aggregate - 0.057 Coarse Aggregate
                + 0.003 Blast Furnace Slag

  Rule 3: [106 cases, mean 21.444, range 7.32 to 55.51, est err 3.833]

    if
        Cement <= 218.9
        Blast Furnace Slag <= 76
        Age <= 28
    then
        Concrete compressive strength = -226.902 + 0.732 Age
                + 0.297 Cement + 0.223 Blast Furnace Slag
                + 0.165 Fly Ash + 0.093 Fine Aggregate
                + 0.078 Coarse Aggregate

  Rule 4: [94 cases, mean 20.290, range 2.33 to 49.25, est err 3.675]

    if
        Cement <= 218.9
        Blast Furnace Slag > 76
        Fly Ash <= 116
        Superplasticizer <= 7.6
        Age <= 28
    then
        Concrete compressive strength = -76.875 + 0.752 Age
                + 0.133 Cement + 0.94 Superplasticizer
                + 0.059 Blast Furnace Slag + 0.131 Water
                + 0.029 Fine Aggregate + 0.005 Fly Ash
                + 0.002 Coarse Aggregate
A scatter-plot of the results on unseen cases from a ten-fold cross-validation shows that the model is not too bad.
A Simple Time Series Example: Fraser River
This example also comes from Statlib. The data, contributed by Ian McLeod, concern 946 successive mean monthly flows of the Fraser River at Hope, B.C.
The goal in this application is to predict the flow in a particular month in terms of the flows for previous months. In this example, we will use the previous 20 months' mean flows: there are thus 926 cases described by 20 independent attributes and the target attribute, all continuous values.
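The construction of the 926 cases from the 946 monthly flows can be sketched as follows. This is a sketch of the data preparation described above, not of Cubist itself, and the `flows` list is a placeholder rather than the real Fraser River data.

```python
def make_lagged_cases(series, lags=20):
    """Turn a time series into supervised-learning cases: each case has
    the previous `lags` values as attributes and the current value as
    the target."""
    cases = []
    for t in range(lags, len(series)):
        attributes = series[t - lags:t]   # [-20 months] ... [-1 month]
        target = series[t]                # this month's flow
        cases.append((attributes, target))
    return cases

flows = list(range(946))      # placeholder, not the real flow data
cases = make_lagged_cases(flows)
print(len(cases))             # 946 - 20 = 926 cases
```

The first 20 months cannot serve as targets because they lack a full window of history, which is why 946 observations yield only 926 cases.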
Cubist finds five rules from the 926 cases (in 0.1 seconds):
  Rule 1: [591 cases, mean 1399.5, range 482 to 4460, est err 249.6]

    if
        [-12 months] <= 2640
    then
        This month = 128.3 + 0.65 [-1 month] + 0.205 [-11 months]
                - 0.15 [-2 months] + 0.072 [-3 months]
                - 0.056 [-13 months] + 0.047 [-12 months]
                - 0.011 [-10 months] - 0.01 [-14 months]

  Rule 2: [171 cases, mean 5963.9, range 2080 to 10800, est err 812.7]

    if
        [-15 months] <= 4010
        [-13 months] > 2780
        [-12 months] > 2640
    then
        This month = 3876.3 + 0.796 [-8 months] - 0.792 [-9 months]
                + 0.51 [-1 month] + 0.426 [-10 months]
                - 0.42 [-2 months] - 0.333 [-15 months]
                - 0.032 [-5 months] + 0.026 [-12 months]
                + 0.019 [-11 months] - 0.015 [-14 months]
                + 0.015 [-3 months] - 0.011 [-13 months]

  Rule 3: [68 cases, mean 4932.4, range 1220 to 8170, est err 845.9]

    if
        [-15 months] <= 4010
        [-13 months] <= 2780
        [-12 months] > 2640
        [-1 month] > 1050
    then
        This month = 5006.4 - 1.254 [-17 months] - 0.792 [-13 months]
                + 0.68 [-18 months] + 0.47 [-1 month]
                + 0.293 [-12 months] - 0.228 [-8 months]
                - 0.044 [-9 months] - 0.03 [-2 months]
                + 0.018 [-11 months] - 0.018 [-15 months]

  Rule 4: [83 cases, mean 3212.5, range 1300 to 5460, est err 332.6]

    if
        [-15 months] > 4010
        [-12 months] > 2640
    then
        This month = 1435.2 + 0.47 [-1 month] - 0.076 [-15 months]
                + 0.037 [-12 months] - 0.03 [-9 months]
                - 0.03 [-2 months] + 0.029 [-4 months]
                + 0.025 [-8 months] + 0.024 [-11 months]
                - 0.015 [-5 months] - 0.015 [-13 months]
                - 0.011 [-14 months]

  Rule 5: [13 cases, mean 4185.1, range 896 to 6550, est err 856.0]

    if
        [-12 months] > 2640
        [-1 month] <= 1050
    then
        This month = -4938.5 + 8.5 [-1 month] + 0.349 [-10 months]
                - 0.093 [-9 months] + 0.076 [-8 months]
                - 0.07 [-2 months] - 0.053 [-15 months]
                + 0.041 [-11 months]
A scatter-plot of the results of a ten-fold cross-validation shows a reasonably high level of agreement between actual and predicted flows for the unseen cases.
The default "persistence" model obtained by always predicting the previous month's flow explains only 45% of the variance for unseen cases, noticeably less than the 88% explained by the Cubist model.
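A "variance explained" comparison of this kind can be computed for any predictor with a few lines. The sketch below uses the common measure 1 - (residual variance / total variance), which is an assumption about the exact statistic being reported, and toy data rather than the Fraser River results.

```python
def variance_explained(actual, predicted):
    """Fraction of variance in `actual` explained by `predicted`:
    1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    total = sum((a - mean) ** 2 for a in actual)
    residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - residual / total

# A toy series: the persistence model predicts each value from the
# previous one.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
actual = series[1:]
persistence = series[:-1]
print(variance_explained(actual, persistence))   # 0.5 on this toy data
```

A model that always predicted the mean would score 0 on this measure, so it gives a natural baseline for judging both the persistence model and Cubist's rules.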
Now Read On ...
Since these examples were all run using Cubist's default parameter settings, they do not illustrate several additional capabilities:
- Cubist incorporates a novel method for generating composite instance-based (nearest neighbor) and rule-based models. These often improve predictive accuracy, although at the cost of being more difficult to understand than rule-based models alone. Composite models can be selected by an option, or you can let Cubist decide whether they are appropriate for your application.
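One published way to combine instance-based and rule-based models (Quinlan's combined-model scheme) has each of the k nearest training cases contribute its target value, adjusted by the difference between the rule model's prediction for the new case and for that neighbor. The sketch below illustrates the idea only; it is not Cubist's actual code, and the toy model and training set are invented.

```python
def composite_predict(model, train, x, k=3):
    """train: list of (attributes, target) pairs, attributes numeric
    tuples; model(attributes) -> float is the rule-based model."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    # Find the k training cases nearest to x.
    neighbors = sorted(train, key=lambda case: dist(case[0], x))[:k]
    # Each neighbor's target is shifted by the model's predicted
    # difference between x and that neighbor, then the shifts are averaged.
    return sum(y + (model(x) - model(xi)) for xi, y in neighbors) / k

# Invented toy example: a linear "rule model" and three training cases.
model = lambda a: 2 * a[0]
train = [((1.0,), 2.5), ((2.0,), 3.9), ((5.0,), 10.2)]
print(composite_predict(model, train, (1.5,), k=2))   # (3.5 + 2.9) / 2 = 3.2
```

The adjustment term lets the neighbors correct the model's local bias, which is why such composites can beat either component alone, at some cost in interpretability.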
- Cubist can also construct committees of models. The first model is found as usual, the second model attempts to compensate for the errors of the first model, the third tries to compensate for the second, and so on. A committee prediction is obtained by averaging the predictions made by each model in the committee. Committee models are usually both more accurate than single models and faster to evaluate than composite models (since finding nearest neighbors in large datasets is slow).
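The committee scheme just described can be sketched as follows. The `fit_biased_mean` stand-in is invented for illustration (a deliberately under-predicting "model", so the compensation step has a visible effect); Cubist's actual member-construction details differ.

```python
def fit_biased_mean(targets):
    # Stand-in for a real model-building step; it deliberately
    # under-predicts (half the training mean) so later committee
    # members have systematic errors to compensate for.
    m = sum(targets) / len(targets)
    return lambda x: 0.5 * m

def build_committee(xs, ys, members, fit):
    committee, targets = [], list(ys)
    for _ in range(members):
        model = fit(targets)
        committee.append(model)
        # The next member is trained on targets shifted by this
        # member's errors, so it tries to compensate for them.
        targets = [t + (y - model(x)) for x, y, t in zip(xs, ys, targets)]
    return committee

def committee_predict(committee, x):
    # A committee prediction averages the members' predictions.
    return sum(m(x) for m in committee) / len(committee)

xs, ys = [1, 2, 3], [2.0, 4.0, 6.0]
single = build_committee(xs, ys, members=1, fit=fit_biased_mean)
pair = build_committee(xs, ys, members=2, fit=fit_biased_mean)
# The two-member committee (2.5) lands closer to the true mean (4.0)
# than the single biased model (2.0).
print(committee_predict(single, 2), committee_predict(pair, 2))
```

Even with this crude stand-in, the second member pulls the committee's average toward the truth, which is the essence of the error-compensation idea.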
- By default, all cases from which a model is constructed have equal importance. This is not appropriate in some applications -- for example, cases describing high-value loans might be more critical than those of lower value. Cubist allows an optional case weight attribute to indicate the relative importance of each case.
- Most models represent a trade-off between simplicity and accuracy -- simpler models are easier to understand, but may under-fit the data. The balance can be shifted towards simplicity by setting a ceiling on the number of rules that may appear in a model. (All the examples above use the default maximum value.)
- Cubist has built-in support for both cross-validation and sampling from large datasets.
- Source code is provided to enable models constructed by Cubist to be used in your own programs.
Please see the tutorial for more details.
© RuleQuest Research 2016 | Last updated January 2016