Cubist: Illustrative Examples

This page illustrates Cubist models and their predictive performance on some diverse applications. Like See5/C5.0, Cubist pays particular attention to the issue of comprehensibility. RuleQuest believes that a data mining system should find patterns that not only facilitate accurate predictions, but also provide insight. We hope this is evident in the following examples, all of which were run with Cubist's default parameter values. Times are for a 2.4GHz Core 2 Duo running Linux.

Housing Prices in Boston

This first application uses data on housing prices circa 1980 in Boston tracts. Each case describes average characteristics of houses in a tract that might be expected to affect their price. Here are a few examples:

Abbrev   Attribute                      Tract 1   Tract 2   Tract 3    .....

CRIM     crime rate                        7.67      2.24      0.08
ZN       proportion large lots                -         -        45
INDUS    proportion industrial             18.1      19.6       3.4
NOX      nitric oxides ppm                  .69       .61       .44
RM       av rooms per dwelling              5.7       5.9       7.2
AGE      proportion pre-1940               98.9      91.8      38.9
DIS      distance to employment centers     1.6       2.4       4.6
RAD      accessibility to radial highways    24         5         5
TAX      property tax rate per $10,000     666       403       398
PTRATIO  pupil-teacher ratio               20.2      14.7      15.2
LSTAT    percentage low income earners     19.9      11.6       5.4

PRICE    average price ($'000)              8.5      22.7      36.4

From 506 cases like this, Cubist takes 0.1 seconds to construct a model consisting of four rules (the target attribute, shown as PRICE above, is named CMEDV in the model):

  Rule 1: [101 cases, mean 13.79, range 5 to 27.5, est err 2.21]

    if
        NOX > 0.668
    then
        CMEDV = 2.05 + 2.03 DIS - 0.37 LSTAT + 21.4 NOX - 0.06 CRIM

  Rule 2: [203 cases, mean 19.42, range 7 to 31, est err 2.13]

    if
        NOX <= 0.668
        LSTAT > 9.59
    then
        CMEDV = 31.13 + 2.5 RM - 0.24 LSTAT - 0.79 PTRATIO - 0.72 DIS
                - 0.038 AGE - 3.6 NOX - 0.0024 TAX + 0.04 RAD - 0.03 CRIM
                + 0.007 ZN

  Rule 3: [43 cases, mean 24.16, range 18.2 to 50, est err 2.68]

    if
        RM <= 6.226
        LSTAT <= 9.59
    then
        CMEDV = -23.77 + 0.95 CRIM + 0.81 RAD + 8.5 RM - 0.83 LSTAT + 0.0075 TAX
                - 0.4 DIS - 0.12 PTRATIO - 0.009 AGE + 0.005 ZN

  Rule 4: [163 cases, mean 31.43, range 16.5 to 50, est err 2.76]

    if
        RM > 6.226
        LSTAT <= 9.59
    then
        CMEDV = -4.79 + 2.27 CRIM + 9.2 RM - 0.83 LSTAT - 0.019 TAX - 0.72 DIS
                - 0.7 PTRATIO - 0.039 AGE + 0.03 RAD - 1.8 NOX + 0.008 ZN

Each rule has three parts: some descriptive information, conditions that must be satisfied before the rule can be used, and a linear model.

The rules are ordered by the average value of the target attribute for the cases that they cover. Rule 1 is therefore relevant mainly to low-cost houses, Rule 4 to the most expensive.

How can a model like this be used to make predictions about new cases? With some slight suppression of details, the procedure is as follows:
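As a minimal sketch, assume that a prediction is made by finding every rule whose conditions the case satisfies, evaluating each such rule's linear model, and averaging the results. Any further adjustments Cubist may apply are omitted, and the names RULES, predict and tract_3 are illustrative only. Under those assumptions, the four housing rules above can be applied in Python like this:

```python
# Illustrative only: the four Boston housing rules, each written as a
# (condition, linear model) pair. Averaging the applicable rules is an
# assumption, not a statement of Cubist's exact procedure.
RULES = [
    (lambda c: c["NOX"] > 0.668,
     lambda c: 2.05 + 2.03 * c["DIS"] - 0.37 * c["LSTAT"] + 21.4 * c["NOX"]
               - 0.06 * c["CRIM"]),
    (lambda c: c["NOX"] <= 0.668 and c["LSTAT"] > 9.59,
     lambda c: 31.13 + 2.5 * c["RM"] - 0.24 * c["LSTAT"] - 0.79 * c["PTRATIO"]
               - 0.72 * c["DIS"] - 0.038 * c["AGE"] - 3.6 * c["NOX"]
               - 0.0024 * c["TAX"] + 0.04 * c["RAD"] - 0.03 * c["CRIM"]
               + 0.007 * c["ZN"]),
    (lambda c: c["RM"] <= 6.226 and c["LSTAT"] <= 9.59,
     lambda c: -23.77 + 0.95 * c["CRIM"] + 0.81 * c["RAD"] + 8.5 * c["RM"]
               - 0.83 * c["LSTAT"] + 0.0075 * c["TAX"] - 0.4 * c["DIS"]
               - 0.12 * c["PTRATIO"] - 0.009 * c["AGE"] + 0.005 * c["ZN"]),
    (lambda c: c["RM"] > 6.226 and c["LSTAT"] <= 9.59,
     lambda c: -4.79 + 2.27 * c["CRIM"] + 9.2 * c["RM"] - 0.83 * c["LSTAT"]
               - 0.019 * c["TAX"] - 0.72 * c["DIS"] - 0.7 * c["PTRATIO"]
               - 0.039 * c["AGE"] + 0.03 * c["RAD"] - 1.8 * c["NOX"]
               + 0.008 * c["ZN"]),
]


def predict(case):
    """Average the linear models of all rules whose conditions hold."""
    values = [model(case) for applies, model in RULES if applies(case)]
    if not values:
        raise ValueError("no rule covers this case")
    return sum(values) / len(values)


# Tract 3 from the table of example cases.
tract_3 = {"CRIM": 0.08, "ZN": 45, "INDUS": 3.4, "NOX": 0.44, "RM": 7.2,
           "AGE": 38.9, "DIS": 4.6, "RAD": 5, "TAX": 398, "PTRATIO": 15.2,
           "LSTAT": 5.4}

print(round(predict(tract_3), 1))  # 33.8
```

Only Rule 4 covers Tract 3, so the sketch returns that rule's value of about 33.8, against the recorded price of 36.4.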

All very well, you might say, but how good is the model? Here are the results of a 10-fold cross-validation on this dataset, plotting the predicted value of each unseen case against its real value.

Even though the model is quite simple, these results compare favorably with most published results for this dataset.
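For readers unfamiliar with the evaluation method, here is a generic sketch of how a 10-fold cross-validation partitions the cases. How Cubist itself assigns cases to blocks is not described here, so the random assignment, the fixed seed, and the function name k_fold_indices are all assumptions:

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train, test) index lists; each case is in exactly one test block."""
    order = list(range(n_cases))
    random.Random(seed).shuffle(order)
    # Deal the shuffled cases into k blocks of nearly equal size.
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(506))
print(len(splits))                          # 10
print(sorted({len(test) for _, test in splits}))  # [50, 51]
```

A model is built from each train list and evaluated on the matching test list, so every one of the 506 cases is predicted exactly once by a model that never saw it.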


The Fat Content of Meat

Statlib is a central repository used by statisticians. One of the datasets obtainable from this interesting site concerns estimating the fat content of meat samples using absorbance in the near-infrared spectrum. This data comes from the Tecator Infratec Food and Feed Analyzer using 100 channels. Each attribute is the instrument reading in one channel, so this is a high-dimensional prediction task.

Cubist derives a model with four rules from 240 training examples (in 0.1 seconds):

  Rule 1: [13 cases, mean 7.577, range 1.7 to 15.9, est err 0.932]

    if
        A05 > 3.17348
        A89 <= 3.82829
    then
        Fat = -7.31 - 3037.1 A38 + 2932.7 A37 - 3411.4 A13 + 2798.8 A53
              + 2372.9 A39 + 2885.3 A12 - 2183.8 A54 - 1856.9 A34 - 1496.3 A98
              - 1453.4 A40 - 1859.6 A05 + 1229.7 A99 + 1146.5 A57 + 980.3 A60
              - 996.9 A52 - 949.7 A58 + 1090.4 A09 + 930.8 A30 + 801.6 A44
              - 857.7 A28 - 723.3 A61 - 623.2 A95 + 477.5 A97 + 495.5 A25
              + 560 A07 - 445.4 A48 + 547.1 A00 + 416.4 A36 - 330.8 A45
              + 326.5 A49 + 303.4 A93 + 256.8 A90 - 176.6 A50 - 112.5 A89
              - 32.8 A81

  Rule 2: [129 cases, mean 11.667, range 0.9 to 36.2, est err 0.833]

    if
        A40 <= 3.08971
    then
        Fat = 7.263 + 6059.9 A38 - 5593.8 A37 + 5348.6 A36 + 4581.8 A53
              + 5121 A12 - 3990.6 A40 - 3798.5 A34 - 4133.9 A05 - 2824.2 A95
              - 2866.6 A52 - 3267.1 A17 + 2744.9 A60 + 2497.6 A97 - 2793.3 A13
              - 2189.5 A58 - 2056.2 A54 - 2003.8 A70 + 2424 A07 - 1877 A30
              + 1722.9 A39 + 1705 A76 + 1582.1 A28 + 1595.4 A25 - 1210.7 A98
              + 1079.5 A57 + 942.4 A99 + 1026.7 A09 - 687.9 A61 + 635.7 A44
              + 515.1 A00 - 350.3 A45 + 241.7 A90 - 179.7 A49

  Rule 3: [35 cases, mean 26.331, range 10 to 56.6, est err 1.588]

    if
        A05 > 3.17348
        A89 > 3.82829
    then
        Fat = 15.412 + 6740.4 A39 + 6256.3 A49 - 5734 A48 - 5374.6 A38
              - 5371.3 A34 - 4599.2 A40 + 4257.9 A36 - 3282.7 A95 + 3508 A30
              + 3102.5 A93 - 2987 A98 + 2990.1 A99 - 2766 A28 - 2033.9 A50
              + 1903.8 A37 + 1816.9 A53 + 1731.1 A44 - 1925.9 A13 + 1873 A12
              - 1417.7 A54 - 1419.4 A05 + 928 A25 + 744.3 A57 + 636.4 A60
              - 647.2 A52 - 616.5 A58 + 707.9 A09 + 519.8 A45 - 400.1 A61
              - 335.7 A81 + 310 A97 + 363.5 A07 + 355.2 A00 + 166.7 A90

  Rule 4: [63 cases, mean 30.403, range 2.9 to 58.5, est err 1.721]

    if
        A05 <= 3.17348
        A40 > 3.08971
    then
        Fat = 11.101 - 12872.6 A38 + 11360.2 A37 + 10884.9 A39 + 10841.3 A53
              - 11491.9 A13 + 11176.5 A12 - 8549.5 A34 - 8459.4 A54 - 7322.7 A40
              - 8778.5 A05 - 6452.5 A98 + 5477.3 A99 + 4441.1 A57 + 4477.9 A30
              + 4813.9 A09 + 3797.3 A60 - 3861.6 A52 - 3727.5 A95 - 4041.9 A28
              - 3678.8 A58 + 3499.5 A44 + 3038.2 A36 - 2779.3 A61 - 2742.6 A48
              + 2451.8 A49 + 2121.4 A93 + 2116.7 A25 + 1849.5 A97 + 2169.1 A07
              + 2119.4 A00 - 1244.4 A45 - 1101.8 A50 + 994.6 A90 - 294.8 A16
              - 229.5 A81

Despite the high dimensionality of this data, a ten-fold cross-validation shows a very good fit on the unseen cases:


The Age of Abalone

The third example uses a dataset from the UCI Data Repository. The age of an abalone is determined by counting its rings, then adding 1.5. To quote the documentation with the dataset: "The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task." The idea here is to use other, more easily obtained information to estimate the number of rings, and hence the age.

This dataset is divided into a training set of 2800 cases and a separate test set of 1376. Cubist takes 0.1 seconds to find this five-rule model from the training cases:

  Rule 1: [177 cases, mean 5.4, range 1 to 11, est err 0.9]

    if
        Shucked weight <= 0.057
    then
        Rings = 3.2 + 45.71 Whole weight - 61.5 Shucked weight + 5 Shell weight
                + 1.6 Diameter

  Rule 2: [751 cases, mean 8.3, range 4 to 21, est err 1.2]

    if
        Sex = I
        Shucked weight > 0.057
    then
        Rings = 4.9 + 31.7 Shell weight - 10.5 Shucked weight
                + 1.68 Whole weight + 2 Diameter - 1.3 Length - 1 Viscera weight
                + 2.1 Height

  Rule 3: [322 cases, mean 8.9, range 4 to 18, est err 1.4]

    if
        Sex in {M, F}
        Shucked weight > 0.057
        Shell weight <= 0.1685
    then
        Rings = 7 + 17.73 Whole weight - 36.9 Shucked weight + 14.1 Shell weight
                - 10.6 Viscera weight + 0.5 Diameter + 0.9 Height

  Rule 4: [916 cases, mean 11.4, range 6 to 27, est err 1.9]

    if
        Shucked weight <= 0.4445
        Shell weight > 0.1685
    then
        Rings = 11.5 + 20.09 Whole weight - 30.6 Shucked weight
                - 17.1 Viscera weight - 12 Length + 8.6 Shell weight
                + 2.8 Height + 0.4 Diameter

  Rule 5: [871 cases, mean 11.4, range 6 to 29, est err 1.6]

    if
        Shucked weight > 0.4445
        Shell weight > 0.1685
    then
        Rings = 7.6 + 7.45 Whole weight - 16.3 Shucked weight
                + 10.7 Shell weight - 3.8 Viscera weight - 2.6 Length
                + 6.9 Height + 1.4 Diameter

The plot of predicted versus actual values on the remaining 1376 unseen cases shows a reasonable level of agreement:


A Simple Time Series Example: Fraser River

This example also comes from Statlib. The data, contributed by Ian McLeod, concern 946 successive mean monthly flows of the Fraser River at Hope, B.C.

The goal in this application is to predict the flow in a particular month in terms of the flows for previous months. In this example, we will use the previous 20 months' mean flows: there are thus 926 cases described by 20 independent attributes and the target attribute, all continuous values.
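The construction of cases from the series can be sketched as follows. The function name make_lagged_cases is illustrative, and the series used here is a placeholder rather than the real flow data:

```python
def make_lagged_cases(flows, n_lags=20):
    """Turn a series into (previous n_lags values, current value) cases."""
    cases = []
    for t in range(n_lags, len(flows)):
        # Most recent first: lags[0] is [-1 month], lags[19] is [-20 months].
        lags = flows[t - n_lags:t][::-1]
        cases.append((lags, flows[t]))
    return cases


flows = list(range(946))  # placeholder values, not the Fraser River data
cases = make_lagged_cases(flows)
print(len(cases))  # 926
```

The first 20 months have no complete history, which is why 946 monthly flows yield 926 cases.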

Cubist finds five rules from this data (in 0.1 seconds):

  Rule 1: [591 cases, mean 1399.5, range 482 to 4460, est err 249.6]

    if
        [-12 months] <= 2640
    then
        This month = 128.3 + 0.65 [-1 month] + 0.205 [-11 months]
                     - 0.15 [-2 months] + 0.072 [-3 months] - 0.056 [-13 months]
                     + 0.047 [-12 months] - 0.011 [-10 months]
                     - 0.01 [-14 months]

  Rule 2: [83 cases, mean 3212.5, range 1300 to 5460, est err 332.6]

    if
        [-15 months] > 4010
        [-12 months] > 2640
    then
        This month = 1435.2 + 0.47 [-1 month] - 0.076 [-15 months]
                     + 0.037 [-12 months] - 0.03 [-9 months] - 0.03 [-2 months]
                     + 0.029 [-4 months] + 0.025 [-8 months]
                     + 0.024 [-11 months] - 0.015 [-5 months]
                     - 0.015 [-13 months] - 0.011 [-14 months]

  Rule 3: [13 cases, mean 4185.1, range 896 to 6550, est err 856.0]

    if
        [-12 months] > 2640
        [-1 month] <= 1050
    then
        This month = -4938.5 + 8.5 [-1 month] + 0.349 [-10 months]
                     - 0.093 [-9 months] + 0.076 [-8 months] - 0.07 [-2 months]
                     - 0.053 [-15 months] + 0.041 [-11 months]

  Rule 4: [68 cases, mean 4932.4, range 1220 to 8170, est err 845.9]

    if
        [-15 months] <= 4010
        [-13 months] <= 2780
        [-12 months] > 2640
        [-1 month] > 1050
    then
        This month = 5006.4 - 1.254 [-17 months] - 0.792 [-13 months]
                     + 0.68 [-18 months] + 0.47 [-1 month] + 0.293 [-12 months]
                     - 0.228 [-8 months] - 0.044 [-9 months] - 0.03 [-2 months]
                     + 0.018 [-11 months] - 0.018 [-15 months]

  Rule 5: [171 cases, mean 5963.9, range 2080 to 10800, est err 812.7]

    if
        [-15 months] <= 4010
        [-13 months] > 2780
        [-12 months] > 2640
    then
        This month = 3876.3 + 0.796 [-8 months] - 0.792 [-9 months]
                     + 0.51 [-1 month] + 0.426 [-10 months] - 0.42 [-2 months]
                     - 0.333 [-15 months] - 0.032 [-5 months]
                     + 0.026 [-12 months] + 0.019 [-11 months]
                     - 0.015 [-14 months] + 0.015 [-3 months]
                     - 0.011 [-13 months]

A scatter-plot of the results of a ten-fold cross-validation shows a reasonably high level of agreement between actual and predicted flows for the unseen cases.

The default "persistence" model obtained by always predicting the previous month's flow explains only 45% of the variance for unseen cases, noticeably less than the 90% explained by the Cubist model.
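How "variance explained" is calculated is not stated here. A minimal sketch, assuming the common definition of one minus the ratio of squared prediction error to total squared deviation from the mean, and using made-up flow values, shows how a persistence baseline could be scored:

```python
def variance_explained(actual, predicted):
    """1 - SSE/SST: the share of variance in `actual` captured by `predicted`."""
    mean = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1.0 - sse / sst


def persistence_predictions(flows):
    """Predict each month's flow as the previous month's flow."""
    return flows[:-1]


# Made-up monthly flows for illustration only.
flows = [900, 1200, 3100, 5800, 6400, 4200, 2500, 1500, 1100, 950]
actual = flows[1:]
score = variance_explained(actual, persistence_predictions(flows))
print(round(score, 2))
```

The same function applied to a model's predictions for the unseen cases gives a figure directly comparable with the baseline.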


A Larger Example: El Niño

This example originates at the Pacific Marine Environmental Laboratory in the National Oceanic and Atmospheric Administration of the US Department of Commerce, and can be found among the datasets in the UCI KDD Archive. There are 178,080 observations taken between 1980 and 1998 that record values for the following:

Abbrev       Attribute                 Case 1    Case 2    Case 3    .....

year         year of observation           80        80        80
month        month ditto                    3         3         3
day          day of month                   7         8         9
latitude     latitude of buoy           -0.02     -0.02     -0.02
longitude    longitude ditto          -109.46   -109.46   -109.46
zon.winds    zonal winds (M/S)           -6.8      -4.9      -4.5
mer.winds    meridional winds (M/S)       0.7       1.1       2.2
humidity     relative humidity (%)          ?         ?         ?
air temp     air temperature             23.5      23.6     23.62

s.s.temp     sea surface temperature    24.01     23.91     23.82

Here we will attempt to predict the value of the sea surface temperature from the values of the other attributes.

The cases were divided into training and test sets containing 89,040 cases each, and Cubist took 4.7 seconds to build a model containing 100 rules (the default maximum number). Two rules each for the lowest and highest sea temperatures should illustrate the idea:

  Rule 1: [1438 cases, mean 23.430, range 18.87 to 29.69, est err 0.281]

    if
        year <= 96
        latitude <= -1.99
        longitude > -109.66
        longitude <= -94.7
    then
        s.s.temp = -83.4 - 0.8999 longitude + 1.02 air temp - 0.144 latitude
                   - 0.037 year - 0.029 mer.winds + 0.008 humidity
                   + 0.01 zon.winds

  Rule 2: [317 cases, mean 23.467, range 19.37 to 27.22, est err 0.874]

    if
        year <= 94
        month > 4
        latitude > -0.1
        latitude <= 5.31
        longitude > -139.9
        longitude <= -95.15
        air temp > 26.86
        air temp <= 26.869
    then
        s.s.temp = 17.749 + 1.217 latitude - 0.415 month + 0.296 mer.winds
                   - 0.1 zon.winds + 0.097 year - 0.016 day - 0.005 humidity

  Rule 99: [506 cases, mean 29.676, range 26.81 to 30.98, est err 0.296]

    if
        latitude <= -0.57
        longitude > -95.15
        air temp > 26.86
        air temp <= 26.869
    then
        s.s.temp = 28.631 + 0.0087 longitude + 0.097 latitude - 0.03 mer.winds
                   - 0.016 zon.winds

  Rule 100: [885 cases, mean 29.688, range 27.4 to 31.17, est err 0.293]

    if
        year > 93
        year <= 96
        latitude <= 6.68
        longitude > -124.98
        zon.winds > -0.5
        mer.winds <= -0.3
        air temp > 26.869
    then
        s.s.temp = 1.7 + 0.0056 longitude + 0.202 year + 0.29 air temp
                   + 0.058 mer.winds - 0.046 zon.winds - 0.026 latitude

The scatter-plot for the unseen test cases is fairly dense since it contains nearly 90,000 points, but the correlation coefficient (.98) shows that the model is not bad:


Cubist and Classification: Churn

Although Cubist builds regression models, it can also be used for classification, especially when there are just two classes. The idea is to replace the discrete class with an indicator variable for one of the classes and let Cubist develop a model for this value. The training cases all have values of 0 or 1 for the target indicator variable but we can exploit the fact that the Cubist model will generally predict more fine-grained values.
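The recoding step is simple enough to sketch in a few lines. The function name to_indicator and the sample labels are hypothetical:

```python
def to_indicator(labels, positive_class):
    """Replace each class label with 1 for the chosen class and 0 otherwise."""
    return [1 if label == positive_class else 0 for label in labels]


labels = ["no", "yes", "no", "no", "yes"]  # hypothetical class labels
print(to_indicator(labels, "yes"))  # [0, 1, 0, 0, 1]
```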

This dataset for an artificial churn-predicting application comes from the MLC++ site at SGI. Five thousand "customers" are divided into a training set of 3333 cases and a test set of 1667, with the following attributes:

state
account length
area code
phone number (ignored)
international plan
voice mail plan
number vmail messages
total day minutes
total day calls
total day charge
total eve minutes
total eve calls
total eve charge
total night minutes
total night calls
total night charge
total intl minutes
total intl calls
total intl charge
number customer service calls

1(churn) -- 0 if no churn, 1 if churn

Cubist finds a model containing 31 rules (in 0.2 seconds), as exemplified by the following:

  Rule 1: [138 cases, mean 0.0, range 0 to 1, est err 0.0]

    if
        international plan = yes
        total day minutes <= 223.2
        total intl minutes <= 13.1
        total intl calls > 2
        number customer service calls <= 3
    then
        1(churn) = -0.1 + 0.01569 total eve minutes - 0.1826 total eve charge
                   + 0.126 total night charge - 0.00562 total night minutes
                   + 0.156 total intl charge - 0.04 total intl minutes
                   + 0.00057 total day minutes
                   + 0.014 number customer service calls
                   - 0.0014 total day charge - 0.0006 number vmail messages

  Rule 24: [31 cases, mean 0.4, range 0 to 1, est err 0.4]

    if
        state in {NJ, AL, IN, IA, VT, VA, FL, CO, AZ, NE, HI, IL, MD, OR, NC,
                  DC, KY}
        total day minutes > 138.7
        total day minutes <= 223.2
        total eve minutes <= 200.6
        number customer service calls > 3
    then
        1(churn) = 5 - 0.66389 total eve minutes + 7.737 total eve charge
                   - 0.02259 total day minutes + 0.0456 total day charge
                   - 0.00378 total night minutes

  Rule 31: [58 cases, mean 1.0, range 1 to 1, est err 0.0]

    if
        total day minutes <= 138.7
        total eve charge <= 22.41
        number customer service calls > 3
    then
        1(churn) = 1

Cubist predicts values to a precision one decimal place greater than that observed in the training data, so instead of a 0-1 prediction, the values predicted by Cubist are 0.0, 0.1, ... . The following table shows a breakdown of the predicted values for the 1667 unseen test cases:

Predicted
Value       No Churn     Churn

0.0             1301        42
0.1               53         9
0.2               30         2
0.3               24         6
0.4               14        11
0.5               13         8
0.6                2         9
0.7                1         8
0.8                1        12
0.9                2        12
1.0+               2       105

We can turn the predicted value of the indicator back into a categorical classification by thresholding it appropriately. For instance, if we decide that all values of 1(churn) greater than 0.5 will be treated as "yes", the categorical error rate of the Cubist model on the unseen cases is 5.2% (8 false positives and 78 false negatives).
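The arithmetic behind these figures can be checked directly from the breakdown table. In this sketch the names BREAKDOWN and error_counts are illustrative, and the "1.0+" row is entered as 1.0:

```python
# (predicted value, no-churn count, churn count), copied from the table above.
BREAKDOWN = [
    (0.0, 1301, 42), (0.1, 53, 9), (0.2, 30, 2), (0.3, 24, 6),
    (0.4, 14, 11), (0.5, 13, 8), (0.6, 2, 9), (0.7, 1, 8),
    (0.8, 1, 12), (0.9, 2, 12), (1.0, 2, 105),
]


def error_counts(breakdown, threshold):
    """Count errors when predicted values above the threshold mean "yes"."""
    false_pos = sum(no for value, no, yes in breakdown if value > threshold)
    false_neg = sum(yes for value, no, yes in breakdown if value <= threshold)
    return false_pos, false_neg


fp, fn = error_counts(BREAKDOWN, 0.5)
total = sum(no + yes for _, no, yes in BREAKDOWN)
print(fp, fn, round(100 * (fp + fn) / total, 1))  # 8 78 5.2
```

Trying other thresholds with the same function shows the trade-off: a lower threshold catches more churners at the price of more false positives.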

The predicted value of the indicator variable can also be interpreted as a confidence that the case is a "yes". Suppose that a mail-out was proposed to the customers judged most at risk of churning. Mailing the 154 unseen customers with predicted values greater than 0.5 would reach 146 of the 224 churners (65%) at 9% of the cost of a complete mail-out, a lift of 7.1.
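The coverage, cost, and lift figures follow from four counts. A short sketch, in which the function name mailout_summary is illustrative:

```python
def mailout_summary(reached, mailed, churners, customers):
    """Return (coverage, cost, lift) for a targeted mail-out."""
    coverage = reached / churners   # share of all churners reached
    cost = mailed / customers       # share of a complete mail-out
    return coverage, cost, coverage / cost


coverage, cost, lift = mailout_summary(146, 154, 224, 1667)
print(round(100 * coverage), round(100 * cost), round(lift, 1))  # 65 9 7.1
```

A lift of 7.1 means the targeted mail-out reaches churners about seven times as efficiently as mailing customers at random.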


Now Read On ...

Since these examples were all run using Cubist's default parameter settings, they do not illustrate several additional capabilities:

Please see the tutorial for more details.

© RULEQUEST RESEARCH 2008 Last updated March 2008

