| Warning: All use of ODBCHook is at your own risk. Use of ODBCHook or its documentation signifies your acceptance of the licence agreement. If you use ODBCHook to access data on remote machines, you should consult your database administrator about potential security hazards. |
ODBCHook is a data import utility that runs on Windows platforms. Using ODBCHook, you can extract data from ODBC-compliant sources (including Access databases and Excel workbooks) and format it for use with See5/C5.0, Cubist, and GritBot.
This short tutorial will illustrate the way that ODBCHook is used. Since there are numerous ODBC data sources and their associated drivers, the tutorial can cover only a few of the more common situations that you might encounter.
The first time that you run it, ODBCHook will complain that it cannot find a current licence ID. Select the option to re-enter the licence ID, being careful to type it exactly as given to you.
RuleQuest tools See5/C5.0, Cubist, and GritBot all require at least two ASCII files:
Often a potential user already has the data stored in a database or spreadsheet, and converting the data to the format required by RuleQuest programs can be a nuisance.
That's where ODBCHook comes in. If the data resides in an ODBC-accessible source such as the common relational databases, ODBCHook can import the data via the appropriate ODBC driver and prepare the .data file.
To be painfully honest, ODBC can be tricky due to the large variety of potential ODBC sources and drivers, different ODBC versions, and varying conformance levels. (Some older ODBC 2.0 drivers, particularly those that implement only the core conformance level, will not work correctly with ODBCHook.) Different ODBC drivers also need different types of connection information. If your situation is anything other than vanilla, you may need to consult your local ODBC and/or database gurus for help with installing drivers, establishing connection requirements, editing SQL statements and so on.
There are two prerequisites for using ODBCHook:
If you use WordPad to construct the names file, please save it in text mode and remove the additional .txt extension that will be added to the file name. You may need to make file extensions visible via the folder options.
If you plan to use a Microsoft Office tool such as Excel or Access as the data source, the ODBC drivers can be installed from your Office CDROM. This may not have happened automatically if you performed a "typical" install.
Without more ado, let's look at a very simple illustration. The data source in this case is an Excel sheet "Sales" in a workbook "HouseSales.xls" that lists features for property sales:
In addition to the heading there are three rows or cases; in real life there would obviously be many more than this!
The names file HouseSales.names for this application is:
Sale Price. | the target attribute
Listing No: label. | identifies the case
Beds: continuous. | number of bedrooms
Baths: continuous. | number of bathrooms
Age: continuous. | age of house
Sale Date: date. | date sold
Sale Price: continuous. | price at which sold
Auction?: yes, no. | was it an auction?
Average Price 2000: continuous. | average for houses in area
Amenities Value: continuous. | area amenities rating
CBD Distance: continuous. | approx distance to CBD
In this example, each attribute in the name file has a matching column in the Excel sheet and all attributes can be found in the same sheet. As you might expect, ODBCHook finds this example straightforward.
| ODBCHook has a simple main window that shows progress through the four steps. Initially the window looks like this: |
|
The first step is to locate the names file for this run using the familiar Windows browse dialog.
| In this example there is a folder "Data" containing HouseSales.names, the names file we have just seen. We select this and click the "Open" button. |
|
We must next select a data source from which information for the data file will be extracted. If the data source has already been set up via the Windows "ODBC Administrator" application, this part is easy. Let's assume that it has not, so we must create a new data source entry ourselves.
There are two basic types of data sources: information about a machine data source is stored in the Windows Registry, while that for a file data source is saved in a file.
| Here, the "Machine Data Source" tab has been selected and the know data sources are displayed. The data source we want is not on the list, so we click the "New" button. |
|
| Another decision! Do we want our new data source to be available only to us, or to everyone using this machine? We choose the former (a "User" data source) and click "Next". |
|
| Different types of ODBC data sources use different drivers. In this example, the data source is an Excel workbook, so we choose the Microsoft Excel Driver and click "Next". |
|
| The new data source has been named "house-sales-xls" and we have provided a brief description. Now we must identify the particular Excel workbook containing the data, so we click "Select Workbook". |
|
| The browse dialog enables us to locate the Excel file in the same "Data" folder that holds the names file. (They don't have to be together -- the Excel file could be anywhere.) After we click "OK", ... |
|
| ... we arrive back at the "Select Data Source" dialog, but now our new data source is included in the list. All we need to do is click "OK". |
|
From here on in, ODBCHook takes over. Since all the attributes in the names file have matching columns in a single sheet ("Sales") in the Excel workbook, ODBCHook can find everything unassisted.
| ODBCHook asks whether we wish to see or edit the SQL statement that will be used to retrieve the data. We decide to skip this option and press ahead. |
|
| This final message shows that the data file has been written -- we are done. |
|
A new file HouseSales.data has been created and looks like this:
412.0,4.0,2.0,59.0,2001-11-07,502000.0,no,697000.0,82.0,25.0.
721.0,5.0,3.0,31.0,2002-01-11,724000.0,no,697000.0,82.0,25.0.
212.0,3.0,1.0,7.0,2002-02-02,550000.0,yes,560000.0,81.0,11.5.
The contents of the file look very similar to the Excel data except that entries are separated by commas and the dates have been converted to the format used by See5/C5.0, Cubist, and GritBot.
Let's now look at a more realistic example, based on Microsoft Access, that illustrates the need for more input from the user. In the simple example above, all the information that we needed was held in one sheet. This time, it is spread over three tables:
The names file identifies a target attribute (here Sale Price) -- for See5/C5.0 and Cubist, this is the dependent attribute that will be modeled. This attribute plays an important role, because the table that contains the target attribute (here "Sales") will become the main table. ODBCHook will write one record to the data file for each record in the main table; any other tables will provide information that does not appear in the main table.
Let's take the first record of the "Sales" table to illustrate how this happens. The names file includes attributes Average Price 2000, CBD Distance, and Amenities Value that do not appear in the main table "Sales". However, "Sales" contains a column Postcode that can be used as a key to look up entries in the table "Postcodes". Since our first record has postcode 2040, the information in the row 2040 of table "Postcodes" applies to this record. The relevant value of Average Price 2000 that applies to this first record is therefore 560000. (In other words, average prices are stored for each postcode.)
Now, the table "Postcodes" contains another column Zone that can be used similarly as a key for the table "Zones". Our first record has postcode 2040 and so belongs in the zone Inner West. The "Zones" table shows that, for the Inner West zone, the appropriate value for CBD Distance is 11.5 and for Amenities Value is 81.
Now that we see how this process is meant to happen, let's follow the interaction between ODBCHook and the user.
| After the same names file HouseSales.names has been selected as before, ODBCHook immediately warns us that a data file HouseSales.data corresponding to this names file already exists and asks whether we wish to overwrite it. We click "Yes" and press on. |
|
| Once again our data source (the Access file HouseSales.mdb) must be set up; this time the appropriate driver is the Microsoft Access Driver. |
|
| The setup dialog is a little different for this driver but, as before, we provide the data source name and description and... |
|
| ...select the appropriate Access database (on the desktop, this time). |
|
| Now the fun begins! ODBCHook finds the main table, but cannot find any table with a field named Sale Date. It therefore asks for our help in identifying the field that should correspond to this attribute. |
|
| There is a column "Date of Sale" in the "Sales" table that seems to contain the same information. We select that column and click "OK". |
|
| ODBCHook makes a simplifying assumption that all keys must consist of a single column. It notices that the main table has a field Postcode that has the same name as a field in the "Postcodes" database, and asks if this field can be used as the key for extracting information from "Postcodes" for the main table. We click "Yes". |
|
| ODBCHook needs to extract values of Amenities Value from the "Zones" table, and so it needs a key for this table. |
|
| We select the column Zone (since there is a single entry in the "Zones" table for each zone) and click "OK". |
|
| We now must identify the value of another column that contains the value of this key. |
|
| ODBCHook notices another column with the same name in the "Zones" table, and shows this first as a probable choice. We select it and move on. |
|
| ODBCHook now has all the information that it needs. This time, when we are asked whether we would like to see or edit the SQL statement prepared by ODBCHook, we choose "Yes". This is what the statement looks like. |
|
After we click "OK", ODBCHook proceeds to write the data file.
412,4,2,59,2001-11-07,502000,no,697000,82,25.0.
721,5,3,31,2002-01-11,724000,no,697000,82,25.0.
212,3,1,7,2002-02-02,550000,yes,560000,81,11.5.
It differs slightly from the first data file -- this is not
whimsy on ODBCHook's part, but reflects the form in which the
information is provided by the particular ODBC driver.
However, both data files are equivalent for See5/C5.0, Cubist, or
GritBot.
Both the examples above involve data sources on the computer running ODBCHook, but ODBC can also be used to access data over your local area network. Let's look at two illustrations.
Suppose that, instead of being on the computer we are using, the Access database from the second example resides on a computer called RQ5.
The process of setting up the data source name proceeds as for the example above until the step of locating the database HouseSales.mdb.
| Instead of browsing the local directories, we click the "Network" button. |
|
| The new dialog box suggests a drive letter for the network folder containing HouseSales.mdb; we accept this letter and click "Browse". |
|
| The familiar browse dialog allows us to select the folder on the network. In this case, we select the desktop on RQ5 and click "OK". |
|
| The drive letter "F" is now assigned to that folder. We click "Finish". |
|
| Back in the "Select Database" dialog, we select the file HouseSales.mdb (in the network folder), click "OK" and we are done. |
|
A couple more clicks and ODBCHook is reading data from the database over the network.
The next example illustrates connection to a database on a Unix machine. For this example we will use the MySQL database running on a local Sun Sparc with IP address 192.168.0.107. Before we can access a MySQL database through ODBC, the appropriate driver must be installed on our Windows machine -- in this case the Windows NT/2000 driver was downloaded from mysql.com.
| In the driver selection step, we now click on "MySQL" to use the MySQL ODBC driver. |
|
| Different drivers have different requirements, as is clear from this new dialog. The mandatory fields are the DSN that we will use for the connection from Windows (arbitrary), and the Sun's real IP address and MySQL database name. Some other information may be required, depending on how the MySQL server has been set up on the Sun. |
|
After clicking "OK", we are now ready to connect to the database as before.
The whole idea of ODBC is that databases from Vendor 1 can connect with applications from Vendor 2 running on operating systems supplied by Vendor 3, using drivers and "middleware" supplied by Vendor 4. In the case of ODBCHook, the database and middleware vendors are determined by you, the user; as might be expected, quite a lot can go wrong in this scenario.
Errors and error messages can be divided into two primary categories:
Errors generated by ODBCHook include problems with the names file and problems with the values obtained from the data source. For instance, the names file of the examples states that the value of Beds is a number, but the database might return a value of "3-4", say. Here are a couple of examples:
| This is a problem with the names file itself -- continuous was probably what the user intended. Problems with the names file cause the run to be terminated. |
|
| Here, the value obtained from the data source does not match any of the permissible values specified in the names file (i.e., yes or no). Errors of this kind are not fatal -- the spurious values are simply recorded as is in the data file. However, after ten such errors have been found, ODBCHook gives the user the opportunity to terminate the run. |
|
ODBC errors can look something like this:
These errors are generated by the ODBC components and not by ODBCHook, so RuleQuest cannot provide any help with them. For problems like these you will need to consult a local expert.
| © RULEQUEST RESEARCH 2004 | Last updated June 2004 |