A Quick Look at ODBCHook

Warning: All use of ODBCHook is at your own risk. Use of ODBCHook or its documentation signifies your acceptance of the licence agreement. If you use ODBCHook to access data on remote machines, you should consult your database administrator about potential security hazards.

ODBCHook is a data import utility that runs on Windows platforms. Using ODBCHook, you can extract data from ODBC-compliant sources (including Access databases and Excel workbooks) and format it for use with See5/C5.0, Cubist, and GritBot.

This short tutorial will illustrate the way that ODBCHook is used. Since there are numerous ODBC data sources and their associated drivers, the tutorial can cover only a few of the more common situations that you might encounter.

The first time that you run it, ODBCHook will complain that it cannot find a current licence ID. Select the option to re-enter the licence ID, being careful to type it exactly as given to you.


Why ODBCHook?

RuleQuest tools See5/C5.0, Cubist, and GritBot all require at least two ASCII files:

Often a potential user already has the data stored in a database or spreadsheet, and converting the data to the format required by RuleQuest programs can be a nuisance.

That's where ODBCHook comes in. If the data resides in an ODBC-accessible source such as the common relational databases, ODBCHook can import the data via the appropriate ODBC driver and prepare the .data file.

To be painfully honest, ODBC can be tricky due to the large variety of potential ODBC sources and drivers, different ODBC versions, and varying conformance levels. (Some older ODBC 2.0 drivers, particularly those that implement only the core conformance level, will not work correctly with ODBCHook.) Different ODBC drivers also need different types of connection information. If your situation is anything other than vanilla, you may need to consult your local ODBC and/or database gurus for help with installing drivers, establishing connection requirements, editing SQL statements and so on.

Prerequisites for using ODBCHook

There are two prerequisites for using ODBCHook:


Simple example using Excel

Without more ado, let's look at a very simple illustration. The data source in this case is an Excel sheet "Sales" in a workbook "HouseSales.xls" that lists features for property sales:

Excel sheet

In addition to the heading there are three rows or cases; in real life there would obviously be many more than this!

The names file HouseSales.names for this application is:

	Sale Price.                             | the target attribute

	Listing No:             label.          | identifies the case
	Beds:                   continuous.     | number of bedrooms
	Baths:                  continuous.     | number of bathrooms
	Age:                    continuous.     | age of house
	Sale Date:              date.           | date sold
	Sale Price:             continuous.     | price at which sold
	Auction?:               yes, no.        | was it an auction?
	Average Price 2000:     continuous.     | average for houses in area
	Amenities Value:        continuous.     | area amenities rating
	CBD Distance:           continuous.     | approx distance to CBD

In this example, each attribute in the name file has a matching column in the Excel sheet and all attributes can be found in the same sheet. As you might expect, ODBCHook finds this example straightforward.

ODBCHook has a simple main window that shows progress through the four steps. Initially the window looks like this: main window

The first step is to locate the names file for this run using the familiar Windows browse dialog.

In this example there is a folder "Data" containing HouseSales.names, the names file we have just seen. We select this and click the "Open" button. select names file

We must next select a data source from which information for the data file will be extracted. If the data source has already been set up via the Windows "ODBC Administrator" application, this part is easy. Let's assume that it has not, so we must create a new data source entry ourselves.

There are two basic types of data sources: information about a machine data source is stored in the Windows Registry, while that for a file data source is saved in a file.

Here, the "Machine Data Source" tab has been selected and the know data sources are displayed. The data source we want is not on the list, so we click the "New" button. DSN selection 1

Another decision! Do we want our new data source to be available only to us, or to everyone using this machine? We choose the former (a "User" data source) and click "Next". DSN selection 2

Different types of ODBC data sources use different drivers. In this example, the data source is an Excel workbook, so we choose the Microsoft Excel Driver and click "Next". DSN selection 3

The new data source has been named "house-sales-xls" and we have provided a brief description. Now we must identify the particular Excel workbook containing the data, so we click "Select Workbook". DSN selection 4

The browse dialog enables us to locate the Excel file in the same "Data" folder that holds the names file. (They don't have to be together -- the Excel file could be anywhere.) After we click "OK", ... DSN selection 5

... we arrive back at the "Select Data Source" dialog, but now our new data source is included in the list. All we need to do is click "OK". DSN selection 6

From here on in, ODBCHook takes over. Since all the attributes in the names file have matching columns in a single sheet ("Sales") in the Excel workbook, ODBCHook can find everything unassisted.

ODBCHook asks whether we wish to see or edit the SQL statement that will be used to retrieve the data. We decide to skip this option and press ahead. SQL display option

This final message shows that the data file has been written -- we are done. finished!

A new file HouseSales.data has been created and looks like this:

	412.0,4.0,2.0,59.0,2001-11-07,502000.0,no,697000.0,82.0,25.0.
	721.0,5.0,3.0,31.0,2002-01-11,724000.0,no,697000.0,82.0,25.0.
	212.0,3.0,1.0,7.0,2002-02-02,550000.0,yes,560000.0,81.0,11.5.

The contents of the file look very similar to the Excel data except that entries are separated by commas and the dates have been converted to the format used by See5/C5.0, Cubist, and GritBot.


More complex example using Access

Let's now look at a more realistic example, based on Microsoft Access, that illustrates the need for more input from the user. In the simple example above, all the information that we needed was held in one sheet. This time, it is spread over three tables:

Access table 1

Access table 2

Access table 3

The names file identifies a target attribute (here Sale Price) -- for See5/C5.0 and Cubist, this is the dependent attribute that will be modeled. This attribute plays an important role, because the table that contains the target attribute (here "Sales") will become the main table. ODBCHook will write one record to the data file for each record in the main table; any other tables will provide information that does not appear in the main table.

Let's take the first record of the "Sales" table to illustrate how this happens. The names file includes attributes Average Price 2000, CBD Distance, and Amenities Value that do not appear in the main table "Sales". However, "Sales" contains a column Postcode that can be used as a key to look up entries in the table "Postcodes". Since our first record has postcode 2040, the information in the row 2040 of table "Postcodes" applies to this record. The relevant value of Average Price 2000 that applies to this first record is therefore 560000. (In other words, average prices are stored for each postcode.)

Now, the table "Postcodes" contains another column Zone that can be used similarly as a key for the table "Zones". Our first record has postcode 2040 and so belongs in the zone Inner West. The "Zones" table shows that, for the Inner West zone, the appropriate value for CBD Distance is 11.5 and for Amenities Value is 81.

Now that we see how this process is meant to happen, let's follow the interaction between ODBCHook and the user.

After the same names file HouseSales.names has been selected as before, ODBCHook immediately warns us that a data file HouseSales.data corresponding to this names file already exists and asks whether we wish to overwrite it. We click "Yes" and press on. data file exists

Once again our data source (the Access file HouseSales.mdb) must be set up; this time the appropriate driver is the Microsoft Access Driver. DSN selection 7

The setup dialog is a little different for this driver but, as before, we provide the data source name and description and... DSN selection 8

...select the appropriate Access database (on the desktop, this time). DSN selection 9

Now the fun begins! ODBCHook finds the main table, but cannot find any table with a field named Sale Date. It therefore asks for our help in identifying the field that should correspond to this attribute. user input 1

There is a column "Date of Sale" in the "Sales" table that seems to contain the same information. We select that column and click "OK". user input 2

ODBCHook makes a simplifying assumption that all keys must consist of a single column. It notices that the main table has a field Postcode that has the same name as a field in the "Postcodes" database, and asks if this field can be used as the key for extracting information from "Postcodes" for the main table. We click "Yes". user input 3

ODBCHook needs to extract values of Amenities Value from the "Zones" table, and so it needs a key for this table. user input 4

We select the column Zone (since there is a single entry in the "Zones" table for each zone) and click "OK". user input 5

We now must identify the value of another column that contains the value of this key. user input 6

ODBCHook notices another column with the same name in the "Zones" table, and shows this first as a probable choice. We select it and move on. user input 7

ODBCHook now has all the information that it needs. This time, when we are asked whether we would like to see or edit the SQL statement prepared by ODBCHook, we choose "Yes". This is what the statement looks like. user input 8

After we click "OK", ODBCHook proceeds to write the data file.

	412,4,2,59,2001-11-07,502000,no,697000,82,25.0.
	721,5,3,31,2002-01-11,724000,no,697000,82,25.0.
	212,3,1,7,2002-02-02,550000,yes,560000,81,11.5.
It differs slightly from the first data file -- this is not whimsy on ODBCHook's part, but reflects the form in which the information is provided by the particular ODBC driver. However, both data files are equivalent for See5/C5.0, Cubist, or GritBot.


Databases on other computers

Both the examples above involve data sources on the computer running ODBCHook, but ODBC can also be used to access data over your local area network. Let's look at two illustrations.

Suppose that, instead of being on the computer we are using, the Access database from the second example resides on a computer called RQ5.

The process of setting up the data source name proceeds as for the example above until the step of locating the database HouseSales.mdb.

Instead of browsing the local directories, we click the "Network" button. remote DSN 1

The new dialog box suggests a drive letter for the network folder containing HouseSales.mdb; we accept this letter and click "Browse". remote DSN 2

The familiar browse dialog allows us to select the folder on the network. In this case, we select the desktop on RQ5 and click "OK". remote DSN 3

The drive letter "F" is now assigned to that folder. We click "Finish". remote DSN 4

Back in the "Select Database" dialog, we select the file HouseSales.mdb (in the network folder), click "OK" and we are done. remote DSN 5

A couple more clicks and ODBCHook is reading data from the database over the network.

The next example illustrates connection to a database on a Unix machine. For this example we will use the MySQL database running on a local Sun Sparc with IP address 192.168.0.107. Before we can access a MySQL database through ODBC, the appropriate driver must be installed on our Windows machine -- in this case the Windows NT/2000 driver was downloaded from mysql.com.

In the driver selection step, we now click on "MySQL" to use the MySQL ODBC driver. MySQL DSN 1

Different drivers have different requirements, as is clear from this new dialog. The mandatory fields are the DSN that we will use for the connection from Windows (arbitrary), and the Sun's real IP address and MySQL database name. Some other information may be required, depending on how the MySQL server has been set up on the Sun. MySQL DSN 2

After clicking "OK", we are now ready to connect to the database as before.


When things go wrong

The whole idea of ODBC is that databases from Vendor 1 can connect with applications from Vendor 2 running on operating systems supplied by Vendor 3, using drivers and "middleware" supplied by Vendor 4. In the case of ODBCHook, the database and middleware vendors are determined by you, the user; as might be expected, quite a lot can go wrong in this scenario.

Errors and error messages can be divided into two primary categories:

Errors generated by ODBCHook include problems with the names file and problems with the values obtained from the data source. For instance, the names file of the examples states that the value of Beds is a number, but the database might return a value of "3-4", say. Here are a couple of examples:

This is a problem with the names file itself -- continuous was probably what the user intended. Problems with the names file cause the run to be terminated. Hook error 1

Here, the value obtained from the data source does not match any of the permissible values specified in the names file (i.e., yes or no). Errors of this kind are not fatal -- the spurious values are simply recorded as is in the data file. However, after ten such errors have been found, ODBCHook gives the user the opportunity to terminate the run. Hook error 2

ODBC errors can look something like this:

ODBC error 3

ODBC error 4

ODBC error 5

These errors are generated by the ODBC components and not by ODBCHook, so RuleQuest cannot provide any help with them. For problems like these you will need to consult a local expert.


Some ODBC terms

Driver
A database-specific program that is used to connect to a database. Each database type has its own driver; for example, Microsoft Excel is accessed via the Microsoft Excel driver. The driver for a database type must be installed before ODBCHook can be used to extract information from databases of that type.

Data source
A collection of information about a database and how to connect to it. The term is sometimes used interchangeably with DSN (Data Source Name).

File data source
A data source that is stored in a file (usually with the extension .DSN).

Machine data source
A data source that is saved in the Windows Registry.

ODBC
(Open Database Connectivity) A standard protocol for applications to issue queries to database servers, possibly over a network. ODBC drivers must be installed for each type of database that you will use.

SQL
Structured Query Language, a database query and programming language. The language used to query ODBC data sources.

SQL statement
A specific request or set of instructions for retrieving, modifying, inserting, or deleting data in a database. A SELECT statement is an SQL query that returns rows from one or more tables. A SELECT statement can contain specifications for the columns to return, the rows to select, the order to put the rows in, and how to group (summarize) information.

System data source
A machine data source that is accessible by any user.

Table
A basic component of a relational database (which may consist of many tables). Each table is a matrix where a column denotes an attribute or property and a row concerns one entry or case.

User data source
A machine data source that is accessible only by the user who created it.


© RULEQUEST RESEARCH 2004 Last updated June 2004