C5.0: An Informal Tutorial

Welcome to C5.0, a system that extracts informative patterns from data. The following sections illustrate the use of the system and introduce some of its more important options.


Data

We will illustrate C5.0 using a real application from the printing industry. Bob Evans and Doug Fisher (IEEE Expert, February 1994) describe the following problem:

Rotogravure printing involves rotating a chrome-plated, engraved copper cylinder in a bath of ink, scraping off the excess ink, and pressing a continuous supply of paper against the inked image with a rubber roller. .... Sometimes a series of grooves -- called a band -- appears in the cylinder during printing, ruining the finished product. ... [Banding] results in the press being shut down for an average of about 1.5 hours ... Many process features were thought to contribute to banding, but there was no documentation conclusively associating them with the problem.

The plant of R.R. Donnelley and Sons at Gallatin, Tennessee, has collected data from several hundred printing jobs, some affected by banding and others not. In this example we will concentrate on the data for one particular printing press. Each datum or case describes characteristics of one job and the control settings on the printing hardware. Here are a few examples:

        Attribute               Job 1     Job 2     Job 3    .....

        grain screened          no        yes       no
        proof on ctd ink        yes       unknown   yes
        blade mfg               benton    unknown   benton
        paper type              coated    uncoated  coated
        ink type                coated    uncoated  coated
        direct steam use        no        no        yes
        solvent type            line      unknown   line
        type on cylinder        yes       unknown   yes
        cylinder size           tabloid   tabloid   tabloid
        paper mill location     Scand     unknown   Canadian
        proof cut               40        55        30
        viscosity               42        56        51
        caliper                 0.2       0.3       0.333
        ink temperature         14.5      21        16
        humidity                74        80        76
        roughness               1         0.875     0.875
        blade pressure          25        35        25
        varnish pct             8         unknown   19.1
        press speed             2100      1650      2250
        ink pct                 57.5      unknown   42.7
        solvent pct             34.5      unknown   38.2
        ESA Voltage             0         unknown   3
        ESA Amperage            0         unknown   0
        wax                     2.3       2.1       1.2
        hardener                0.7       0.6       0.5
        roller durometer        35        unknown   30
        current density         40        40        35
        anode space ratio       107.4     100       103.3
        chrome content          100       100       100
        result                  band      band      noband

This is exactly the sort of task for which C5.0 was designed. Each case belongs to one of a small number of mutually exclusive classes (band or noband). Properties of every case that may be relevant to its outcome (banding or no banding) are provided (although some cases may have unknown values for some attributes). There are 30 attributes in this example, but C5.0 can deal with any number of attributes.

C5.0's job is to find how to predict a case's class from its attribute values -- here, to decide whether a printing run is likely to suffer from banding by examining its values of the attributes shown above. C5.0 does this by constructing a classifier that makes this prediction. As we will see, C5.0 can construct classifiers expressed as decision trees or as sets of rules.

Every C5.0 application has a short name called a filestem; we will use the filestem banding for this illustration. All files read or written by C5.0 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file.

Two files are essential for all C5.0 applications and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. banding.names) that lists the classes to which cases may belong and the attributes used to describe each case. Attributes are of two types: discrete attributes have a value drawn from a set of possibilities, and continuous attributes have numeric values.

The file banding.names looks like this:

	result. | the target attribute

	grain screened:		yes, no.
	proof on ctd ink: 	yes, no .
	blade mfg:		benton, daetwyler, uddeholm.
	paper type:		uncoated, coated, super.
	ink type:		uncoated, coated, cover.
	direct steam use:	yes, no.
	solvent type:		xylol, lactol, naptha, line, other.
	type on cylinder: 	yes, no .
	cylinder size:		catalog, spiegel, tabloid.
	paper mill location:	northus, southus, canadian, scandanavian, mideuropean.
	proof cut:		continuous.
	viscosity:		continuous.
	caliper:		continuous.
	ink temperature:	continuous.
	humidity:		continuous.
	roughness:		continuous.
	blade pressure:		continuous.
	varnish pct:		continuous.
	press speed:		continuous.
	ink pct:		continuous.
	solvent pct:		continuous.
	ESA Voltage:		continuous.
	ESA Amperage:		continuous.
	wax:			continuous.
	hardener: 		continuous.
	roller durometer: 	continuous.
	current density: 	continuous.
	anode space ratio: 	continuous.
	chrome content:		continuous.
	result:			band, noband.

Whitespace (blank lines, spaces, and tab characters) is ignored except inside a name or value and can be used to improve legibility. The vertical bar `|' can appear anywhere in a file: it causes the remainder of the line to be ignored and is handy for including comments.

The first line of the names file gives the classes, either by naming a discrete attribute that contains the class value (as in this example), or by listing them explicitly. The attributes are then defined in the order that they will be given for each case. The name of each attribute is followed by a colon `:' and a description of the values taken by the attribute. There are five possibilities:

continuous
	The attribute takes numeric values.

a comma-separated list of names
	The attribute takes discrete values, and these are the allowable values.

discrete N for some integer N
	Ditto, but the values are assembled from the data itself; N is the maximum number of such values. (This is not recommended, since the data cannot be checked, but it can be handy for discrete attributes with many values.)

ignore
	The values of the attribute should be ignored.

label
	This attribute contains an identifying label for each case. The value of the attribute is ignored when classifiers are constructed, but is used when referring to individual cases. A label attribute can facilitate location of errors in the data and cross-referencing of results to individual cases. If there are two or more label attributes, only the last is used.
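
For illustration, here is a small hypothetical names file -- invented for this tutorial, not part of the banding application -- that uses all five kinds of value description:

	outcome.			| the classes are given by this attribute

	case id:	label.		| identifies individual cases
	batch code:	discrete 12.	| up to 12 values, collected from the data
	operator:	ignore.		| plays no part in classification
	machine type:	alpha, beta, gamma.
	temperature:	continuous.
	outcome:	good, faulty.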

The second essential file, the application's data file (e.g. banding.data), provides information on the training cases from which C5.0 will extract patterns. The entry for each case consists of one or more lines that give the values for all attributes. If the classes are listed explicitly in the names file, the attribute values are followed by the case's class value. If an attribute value is not known, it is replaced by a question mark `?'. Values are separated by commas and the entry is optionally terminated by a period. Once again, anything on a line after a vertical bar is ignored. (If the information for a case occupies more than one line, make sure that the line breaks occur after commas.)

The first three cases from file banding.data are:

	no,yes,benton,coated,coated,no,line,yes,tabloid,scandanavian,40,42,0.2,
	    14.5,74,1,25,8,2100,57.5,34.5,0,0,2.3,0.7,35,40,107.4,100,band.
	yes,?,?,uncoated,uncoated,no,?,?,tabloid,?,55,56,0.3,
	    21,80,0.875,35,?,1650,?,?,?,?,2.1,0.6,?,40,100,100,band.
	no,yes,benton,coated,coated,yes,line,yes,tabloid,canadian,30,51,0.333,
	    16,76,0.875,25,19.1,2250,42.7,38.2,3,0,1.2,0.5,30,35,103.3,100,noband.

Don't forget the commas between values! If you leave them out, C5.0 will not be able to process your data.

Of course, the value of predictive patterns lies in their ability to make accurate predictions! It is difficult to judge the accuracy of a classifier by measuring how well it does on the cases used in its construction; the performance of the classifier on new cases is much more informative. (For instance, any number of gurus tell us about patterns that `explain' the rise/fall behavior of the stock market in the past. Even though these patterns may appear plausible, they are only valuable to the extent that they make useful predictions about future rises and falls.) The third kind of file used by C5.0 is a test file of new cases (e.g. banding.test) on which the classifier can be evaluated. This file is optional and, if used, has exactly the same format as the data file.

Another optional file, the cases file (e.g. banding.cases), differs from a test file only in allowing the cases' classes to be unknown. The cases file is used primarily with the public source code described later on.

The last kind of file, the costs file (e.g. banding.costs), is also optional and sets out differential misclassification costs. In some applications there is a much higher penalty for certain types of mistakes. In this application, a prediction that banding will not occur could be very costly if in fact it does occur. On the other hand, predicting incorrectly that banding will occur may only cause certain plant parameters to be changed needlessly, with a lower cost. C5.0 allows different misclassification costs to be associated with each combination of real class and predicted class. We will return to this topic near the end of the tutorial.

Decision Trees

Now everything is ready to begin using C5.0. The files banding.names, banding.data and banding.test have been set up as above. The command

	c5.0 -f banding

invokes C5.0 with the -f option that identifies the application name (here banding). If no filestem is specified using this option, C5.0 uses a default filestem that is unlikely to be correct. (Moral: always use the -f option!)

The output generated by this command is:

	C5.0 INDUCTION SYSTEM [Release 1.09]	Mon Aug  3 09:57:25 1998
	------------------------------------

	    Options:
		File stem <banding>

	Class specified by attribute result

	Read 138 cases (30 attributes) from banding.data

	Decision tree:

	paper type = super: band (9.0)
	paper type = uncoated:
	:...ink type = cover: band (1.0)
	:   ink type = coated:
	:   :...viscosity <= 40: band (2.0)
	:   :   viscosity > 40: noband (17.0/1.0)
	:   ink type = uncoated:
	:   :...blade pressure > 25: band (12.0)
	:       blade pressure <= 25:
	:       :...ESA Voltage <= 0: noband (17.0/1.0)
	:           ESA Voltage > 0:
	:           :...proof cut > 55: noband (2.0)
	:               proof cut <= 55:
	:               :...proof cut <= 42.5: noband (3.0/1.0)
	:                   proof cut > 42.5: band (7.0)
	paper type = coated:
	:...current density <= 37: noband (18.0)
	    current density > 37:
	    :...cylinder size = catalog: noband (0.0)
	        cylinder size = spiegel: band (2.0)
	        cylinder size = tabloid:
	        :...proof cut > 45: noband (12.3/0.3)
	            proof cut <= 45:
	            :...viscosity > 55: noband (14.1/1.1)
	                viscosity <= 55:
	                :...chrome content <= 95: noband (2.0)
	                    chrome content > 95:
	                    :...type on cylinder = no: band (3.0)
	                        type on cylinder = yes:
	                        :...hardener <= 0.8: band (8.0/1.0)
	                            hardener > 0.8: noband (8.6/2.6)


	Evaluation on training data (138 cases):

		    Decision Tree   
		  ----------------  
		  Size      Errors  

		    18    8( 5.8%)    <<


		   (a)   (b)	<-classified as
		  ----  ----
		    43     7	(a): class band
		     1    87	(b): class noband


	Evaluation on test data (100 cases):

		    Decision Tree   
		  ----------------  
		  Size      Errors  

		    18   24(24.0%)    <<


		   (a)   (b)	<-classified as
		  ----  ----
		    24    13	(a): class band
		    11    52	(b): class noband


	Time: 0.2 secs

(Since hardware platforms can differ in floating point precision and rounding, the output that you see might differ very slightly from the above.)

The first part identifies the version of C5.0, the run date, and the options with which the system was invoked. C5.0 constructs a decision tree from the 138 training cases in the file banding.data, and this appears next. Although it may not look much like a tree, this output can be paraphrased as:

	if paper type = super then band
	else
	if paper type = uncoated then
	   if ink type = cover then band
	   else
	   if ink type = coated then
	      if viscosity <= 40 then band
	      else
	      . . . .
	if paper type = coated then
	   if current density <= 37 then noband
	   else
	   . . . .

and so on. The tree employs a case's attribute values to map it to a leaf containing one of the classes band or noband. Every leaf of the tree is followed by a cryptic (n) or (n/m). For instance, the last leaf of the decision tree is noband (8.6/2.6), for which n is 8.6 and m is 2.6. The value of n is the number of cases in the file banding.data that are mapped to this leaf, and m (if it appears) is the number of them that are classified incorrectly by the leaf. (A non-integral number of cases can arise because, when the value of an attribute in the tree is not known, C5.0 splits the case and sends a fraction down each branch.)
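
For example, a common way to realise this splitting -- a sketch only, with invented branch counts -- is to divide the case's weight among the branches in proportion to the training cases that followed each branch when the attribute's value was known:

	#include <stdio.h>

	/* Sketch: one case with an unknown attribute value is divided
	   across the three branches of a test in proportion to the
	   (invented) counts of known-value training cases per branch. */
	int main(void)
	{
	    double branch_cases[3] = {9.0, 17.0, 112.0};
	    double total = 0.0, weight = 1.0;   /* one full case */
	    int i;

	    for (i = 0; i < 3; i++)
	        total += branch_cases[i];
	    for (i = 0; i < 3; i++)
	        printf("branch %d receives %.3f of the case\n",
	               i, weight * branch_cases[i] / total);
	    return 0;
	}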

The last section of the C5.0 output concerns the evaluation of the decision tree, first on the cases in banding.data from which it was constructed, and then on the new cases in banding.test. The size of the tree is its number of leaves and the column headed Errors shows the number and percentage of cases misclassified. The tree, with 18 leaves, misclassifies 8 of the 138 given cases, an error rate of just under 6%. Performance on these cases is further analysed in a confusion matrix that pinpoints the kinds of errors made. In this example, the decision tree misclassifies seven of the band cases as noband and a single noband case as band.

A very simple classifier (called a majority classifier) predicts that every new case belongs to the most common class in the training data. In this example, 88 of the 138 training cases belong to class noband so that a majority classifier would always opt for noband. The 100 test cases from file banding.test include 37 belonging to class band, so a simple majority classifier would have an error rate of 37%. The decision tree has a lower error rate of 24% on the new cases, but notice that this is considerably higher than its error rate on the training cases. The confusion matrix for the test cases again shows the detailed breakdown of the classification performance.

The construction of a decision tree is usually completed quickly, even when there are thousands of cases. Some of the options described later, such as ruleset generation and boosting, can slow things down considerably. The progress of C5.0 on long runs can be monitored by examining the last few lines of the temporary file filestem.tmp (e.g. banding.tmp). This file displays the stage that C5.0 has reached and, for most stages, gives an indication of the fraction of the stage that has been completed.
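
On Unix-like systems, for example, the standard tail utility can be used to watch the end of this file while C5.0 runs:

	tail -f banding.tmp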

Discrete Value Subsets

The topmost node of the decision tree tests the value of the discrete attribute paper type, with one branch for each of its three possible values super, uncoated, and coated. This is the `standard' way in which C5.0 constructs tests on discrete attributes, but tests with a high fan-out can have the undesirable side-effect of fragmenting the data during construction of the decision tree. C5.0 has an option -s that can mitigate this fragmentation to some extent: attribute values are grouped into subsets and each subtree is associated with a subset rather than with a single value. Invoking this option (here, with the command c5.0 -f banding -s) gives the tree

	paper type = super: band (9.0)
	paper type in {uncoated,coated}:
	:...ink type = cover: band (1.0)
	    ink type = uncoated:
	    :...blade pressure > 25: band (12.0)
	    :   blade pressure <= 25:
	    :   :...ESA Voltage <= 0: noband (17.0/1.0)
	    :       ESA Voltage > 0:
	    :       :...proof cut > 55: noband (2.0)
	    :           proof cut <= 55:
	    :           :...proof cut <= 42.5: noband (3.0/1.0)
	    :               proof cut > 42.5: band (7.0)
	    ink type = coated:
	    :...current density <= 37: noband (20.0)
	        current density > 37:
	        :...ink pct <= 56.8: noband (30.3/3.3)
	            ink pct > 56.8:
	            :...proof on ctd ink = no: noband (4.5/0.3)
	                proof on ctd ink = yes:
	                :...anode space ratio > 106.9: band (6.0)
	                    anode space ratio <= 106.9:
	                    :...roughness > 0.8125: noband (7.2)
	                        roughness <= 0.8125:
	                        :...type on cylinder = no: band (4.7/1.2)
	                            type on cylinder = yes:
	                            :...varnish pct > 0: noband (2.3/0.3)
	                                varnish pct <= 0:
	                                :...press speed <= 2200: band (8.4/2.8)
	                                    press speed > 2200: noband (3.7)

that is quite different from the tree shown earlier. Notice that the values of paper type are now divided into two subsets, instead of three single values as before. This tree has two fewer leaves than the first one but has a slightly higher error rate of 26% on the test cases.

Rulesets

Decision trees can sometimes be very difficult to understand. An important feature of C5.0 is its mechanism to convert trees into collections of rules. The option -r causes rules to be derived from trees produced as above, with or without the subsetting option -s. The command

	c5.0 -f banding -r

gives the following rules:

	Rule 1: (cover 17)
	    	ink type = uncoated
	    	blade pressure > 25
		->  class band  [0.947]

	Rule 2: (cover 10)
	    	ink type = uncoated
	    	proof cut > 42.5
	    	proof cut <= 55
	    	ESA Voltage > 0
		->  class band  [0.917]

	Rule 3: (cover 9)
	    	paper type = super
		->  class band  [0.909]

	Rule 4: (cover 9)
	    	paper type = coated
	    	proof cut <= 45
	    	viscosity <= 55
	    	hardener <= 0.8
	    	current density > 37
		->  class band  [0.818]

	Rule 5: (cover 3)
	    	paper type = coated
	    	type on cylinder = no
	    	viscosity <= 55
	    	current density > 37
	    	chrome content > 95
		->  class band  [0.800]

	Rule 6: (cover 2)
	    	paper type = uncoated
	    	ink type = coated
	    	viscosity <= 40
		->  class band  [0.750]

	Rule 7: (cover 1)
	    	ink type = cover
		->  class band  [0.667]

	Rule 8: (cover 17)
	    	ink type = uncoated
	    	blade pressure <= 25
	    	ESA Voltage <= 0
		->  class noband  [0.895]

	Rule 9: (cover 17)
	    	paper type = uncoated
	    	ink type = coated
	    	viscosity > 40
		->  class noband  [0.895]

	Rule 10: (cover 6)
	    	proof cut > 55
	    	blade pressure <= 25
		->  class noband  [0.875]

	Rule 11: (cover 14)
	    	paper type = uncoated
	    	proof cut <= 42.5
	    	blade pressure <= 25
		->  class noband  [0.875]

	Rule 12: (cover 68)
	    	paper type = coated
		->  class noband  [0.757]

	Default class: noband

Each rule consists of a rule number (which serves only to identify the rule); a statistic (cover n) giving the number of training cases covered by the rule; one or more conditions that must all be satisfied if the rule is to be applicable; the class predicted by the rule; and a value between 0 and 1, shown in square brackets, that indicates the confidence with which this prediction is made.

When a ruleset like this is used to classify a case, it may happen that several of the rules are applicable (that is, all their conditions are satisfied). If the applicable rules predict different classes, there is an implicit conflict that could be resolved in two ways: we could believe the rule with the highest confidence, or we could attempt to aggregate the rules' predictions to reach a verdict. C5.0 adopts the latter strategy -- each applicable rule votes for its predicted class with a voting weight equal to its confidence value, the votes are totted up, and the class with the highest total vote is chosen as the final prediction. There is also a default class, here noband, that is used when none of the rules apply.

For instance, the case labelled Job 1 above satisfies rules 4 and 12. Since rule 4's vote for band (confidence 0.818) outweighs rule 12's vote for noband (confidence 0.757), the case is correctly classified as band.
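
This voting strategy can be sketched as follows; the code is an illustration only, not C5.0's source, with the two rules applicable to Job 1 hard-coded:

	#include <stdio.h>

	#define BAND   0
	#define NOBAND 1

	/* Sketch of confidence-weighted voting: every applicable rule adds
	   its confidence to the total vote for its predicted class, and the
	   class with the larger total wins.  If no rule applies, the
	   default class is used. */
	int main(void)
	{
	    double vote[2] = {0.0, 0.0};
	    int default_class = NOBAND;

	    vote[BAND]   += 0.818;           /* rule 4 applies to Job 1  */
	    vote[NOBAND] += 0.757;           /* rule 12 applies to Job 1 */

	    if (vote[BAND] + vote[NOBAND] == 0.0)
	        printf("no rule applies: default class %s\n",
	               default_class == BAND ? "band" : "noband");
	    else
	        printf("predicted class: %s\n",
	               vote[BAND] > vote[NOBAND] ? "band" : "noband");
	    return 0;
	}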

Rulesets are generally much simpler to understand than trees since each rule describes a specific context associated with a class. Furthermore, a ruleset generated from a tree usually has fewer rules than the tree has leaves, another plus for comprehensibility. In this example, the first decision tree with 18 leaves is reduced to twelve rules. Finally, rules are often more accurate predictors than decision trees -- a point not illustrated here, since both have an error rate of 24% on the test cases. For very large datasets, however, generating rules with the -r option can require considerably more computer time.

Boosting

Another innovation incorporated in C5.0 is adaptive boosting, based on the work of Rob Schapire and Yoav Freund. The idea is to generate several classifiers (either decision trees or rulesets) rather than just one. When a new case is to be classified, each classifier votes for its predicted class and the votes are counted to determine the final class.

But how can we generate several classifiers from the same data? As the first step, a single decision tree or ruleset is constructed as before from the training data (e.g. banding.data). This classifier will usually make mistakes on some cases in the data; the first decision tree, for instance, gives the wrong class for 8 cases in banding.data. When the second classifier is constructed, more attention is paid to these cases in an attempt to get them right. As a consequence, the second classifier will generally be different from the first. It also will make errors on some cases, and these become the focus of attention during construction of the third classifier. This process continues for a pre-determined number of iterations.
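
The tutorial does not spell out C5.0's reweighting scheme, but its flavour can be sketched as follows, assuming an AdaBoost-style update in which misclassified cases have their weights increased before the next classifier is built:

	#include <stdio.h>

	#define CASES 10

	/* Sketch of one boosting iteration (an AdaBoost-style update is
	   assumed -- not necessarily C5.0's exact scheme).  Cases that the
	   current classifier gets wrong have their weights increased, and
	   all weights are then renormalised to sum to the number of cases. */
	int main(void)
	{
	    double w[CASES], err = 0.0, total = 0.0, factor;
	    int wrong[CASES] = {0,0,1,0,0,1,0,0,0,0};   /* invented mistakes */
	    int i;

	    for (i = 0; i < CASES; i++)
	        w[i] = 1.0;                  /* all cases start equal */
	    for (i = 0; i < CASES; i++)      /* weighted error rate */
	        if (wrong[i]) err += w[i];
	    err /= CASES;

	    factor = (1.0 - err) / err;      /* > 1 whenever err < 0.5 */
	    for (i = 0; i < CASES; i++) {
	        if (wrong[i]) w[i] *= factor;
	        total += w[i];
	    }
	    for (i = 0; i < CASES; i++) {    /* renormalise */
	        w[i] *= CASES / total;
	        printf("case %d: weight %.2f\n", i, w[i]);
	    }
	    return 0;
	}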

The option -t x instructs C5.0 to construct up to x classifiers in this manner; an alternative option -b is equivalent to -t 10. Naturally, constructing x distinct classifiers takes about x times as much computer time as constructing a single classifier, so boosting is slower -- but it can be worth it! Trials over numerous datasets, large and small, show that 10-classifier boosting on average reduces the number of errors on test cases by about 25%. In this example, the command

	c5.0 -f banding -r -b

causes ten rulesets to be generated. The summary of the rulesets' individual and aggregated performance on the 100 test cases is:

	Trial	    Decision Tree           Rules     
	-----	  ----------------    ----------------
		  Size      Errors      No      Errors

	   0	    18   24(24.0%)      12   24(24.0%)
	   1	    15   40(40.0%)      11   39(39.0%)
	   2	    16   38(38.0%)       9   41(41.0%)
	   3	    16   26(26.0%)      14   23(23.0%)
	   4	    20   31(31.0%)      12   34(34.0%)
	   5	    12   18(18.0%)      11   18(18.0%)
	   6	     9   42(42.0%)       8   37(37.0%)
	   7	    15   35(35.0%)       6   35(35.0%)
	   8	    14   31(31.0%)       9   32(32.0%)
	   9	    15   36(36.0%)      10   30(30.0%)
	boost	         24(24.0%)           20(20.0%)   <<


		   (a)   (b)	<-classified as
		  ----  ----
		    25    12	(a): class band
		     8    55	(b): class noband

(Again, different floating point hardware can lead to slightly different results.) The performance of the classifier constructed at each iteration or trial is summarised on a separate line, while the line labelled boost shows the result of voting all the previous classifiers. The tree and ruleset constructed on Trial 0 are identical to those produced without the -b option. Some of the subsequent trees and rulesets produced by paying more attention to certain cases have quite high overall error rates. When the ten rulesets are combined by voting, however, the final predictions have an error rate of 20% on the test cases -- somewhat better than that of the single ruleset.

Additional Options

Three further options enable aspects of the classifier-generation process to be tweaked. These are all best regarded as advanced options that should be used sparingly (if at all), so that this section can be skipped without much loss.

C5.0 constructs decision trees in two phases. A large tree is first grown to fit the data reasonably well. This tree is then pruned to avoid over-fitting by removing parts of the tree that have a high predicted error rate on new cases. The option -c CF affects this prediction and hence the severity of pruning; values smaller than the default (25%) cause more of the initial tree to be pruned, while larger values result in less pruning.

The option -m cases constrains the degree to which the initial tree can fit the data. At each branch point in the decision tree, the stated minimum number of training cases must follow at least two of the branches. Values higher than the default (2 cases) can lead to an initial tree that fits the training data only approximately -- a form of pre-pruning. (This option is complicated by the presence of missing attribute values and by the use of differential misclassification costs, discussed below. Both cause adjustments to the apparent number of cases following a branch.)

Finally, the option -p affects the way that thresholds are interpreted when classifiers are used interactively (see the later section "Using Classifiers"). The usual interpretation of a condition such as Weight > 125 is cut and dried -- the condition is either satisfied or it isn't. When this option is invoked, however, attribute values near the cutoff (125 in this example) are taken to satisfy the condition with some probability. If Weight has the value 126, for instance, the condition may be interpreted as only weakly satisfied. In such circumstances, predictions are made assuming that the condition is satisfied and then assuming that it is not, and the results are combined probabilistically. Please note that this option does not affect the classification of cases in any test file -- only interactive use of a classifier is affected.
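
One way to picture such a soft threshold -- a sketch only, with invented bounds; the tutorial does not give C5.0's actual interpolation -- is a satisfaction probability that ramps linearly across an interval around the cutoff:

	#include <stdio.h>

	/* Sketch of a `soft' interpretation of the condition Weight > 125:
	   below lb the condition is treated as false, above ub as true, and
	   in between it is satisfied with a linearly increasing probability.
	   The bounds lb and ub are invented for illustration. */
	static double p_satisfied(double value, double lb, double ub)
	{
	    if (value <= lb) return 0.0;
	    if (value >= ub) return 1.0;
	    return (value - lb) / (ub - lb);
	}

	int main(void)
	{
	    double p = p_satisfied(126.0, 115.0, 135.0);

	    /* The predictions made under `satisfied' and `not satisfied'
	       would then be combined with weights p and 1 - p. */
	    printf("P(Weight > 125 is satisfied) = %.2f\n", p);   /* 0.55 */
	    return 0;
	}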

Cross-Validation Trials

As we saw earlier, the predictive accuracy of a classifier constructed from the cases in a data file can be estimated from its performance on new cases in a test file. Unless there are a very large number of cases in both files, this estimate can be rather erratic. If the cases in banding.data and banding.test were to be shuffled and divided into a new 138-case training set and a 100-case test set, C5.0 would probably construct a different classifier whose error rate on the test cases might vary considerably.

One way to get a more reliable estimate of predictive accuracy is by f-fold cross-validation. The cases (including those in the test file, if it exists) are divided into f blocks of roughly the same size and class distribution. For each block in turn, a classifier is constructed from the cases in the remaining blocks and tested on the cases in the hold-out block. In this way, each case is used just once as a test case. The error rate of a classifier produced from all the cases is estimated as the ratio of the total number of errors on the hold-out cases to the total number of cases.
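
In outline, the accounting works as sketched below; the classifier-building step appears only as a placeholder comment, and the simulated mistakes are invented:

	#include <stdio.h>

	#define CASES 238                    /* e.g. 138 data + 100 test cases */
	#define FOLDS 10

	/* Sketch of f-fold cross-validation: every case is held out exactly
	   once, and the error estimate is the total number of hold-out
	   errors divided by the total number of cases. */
	int main(void)
	{
	    int fold, c, total_errors = 0;

	    for (fold = 0; fold < FOLDS; fold++) {
	        /* ... build a classifier from the other FOLDS-1 blocks ... */
	        for (c = 0; c < CASES; c++)
	            if (c % FOLDS == fold)   /* case c is in the hold-out block */
	                if (c % 6 == 0)      /* pretend these are misclassified */
	                    total_errors++;
	    }
	    printf("estimated error rate: %.1f%%\n",
	           100.0 * total_errors / CASES);
	    return 0;
	}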

This cross-validation procedure can be repeated for different random partitions of the cases into blocks. The average error rate from these distinct cross-validations is then a relatively reliable estimate of the error rate of the single classifier produced from all the cases.

A shell script and associated programs for carrying out (multiple) cross-validations are included with C5.0. The shell script xval is invoked with any combination of C5.0 options and some further options that describe the cross-validations themselves:

F=n specifies the number of cross-validation folds (default 10)
R=r causes the cross-validation to be repeated r times (default 1)
+s for string s, adds an identifying suffix +s to all files
+d retains the files output by individual runs

If detailed results are retained via the +d option, they appear in files named filestem.oi.j[+s] where i is the cross-validation number (0 to r-1) and j is the number of the fold in that cross-validation (0 to f-1). A summary of the cross-validations is written to file filestem.res[+s].

As an example, the command

	xval -f banding -b -r R=10 +br

has the effect of running ten 10-fold cross-validations. Each classifier is produced using 10-trial boosting with rules extracted from the trees. So this causes a total of 1000 trees to be generated, each of which is converted to rule form. File banding.res+br contains the following summary:

	   XVal      Decision Tree           Rules     
	   ----    ----------------    ----------------
	             Size    Errors        No    Errors
	
	     0          *    18.4%          *    18.5%   
	     1          *    14.7%          *    15.5%   
	     2          *    17.3%          *    18.6%   
	     3          *    18.9%          *    19.3%   
	     4          *    17.7%          *    17.6%   
	     5          *    17.2%          *    18.0%   
	     6          *    17.6%          *    17.6%   
	     7          *    15.6%          *    14.7%   
	     8          *    16.3%          *    15.1%   
	     9          *    18.4%          *    18.1%   
	
	   Mean              17.2%               17.3%   
	   SE                 0.4%                0.5%   

(The size values are omitted here since, with boosting, the classifiers are not single trees or rulesets.) The SE figures (the standard errors of the means) provide an estimate of the variability of these results.
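
The SE figures are consistent with the usual standard error of a mean: if s is the sample standard deviation of the r individual cross-validation results, then SE = s / sqrt(r). For the ten decision-tree error rates above, s is roughly 1.3%, giving 1.3% / sqrt(10), or about 0.4%.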

Since a single cross-validation fold uses only part of the application's data, running a cross-validation does not result in a classifier being saved. To save a classifier for later use, simply run C5.0 without employing cross-validation.

Sampling From Large Datasets

Even though C5.0 is relatively fast, building classifiers from very large numbers of cases can take an inconveniently long time, especially when options such as boosting are employed. C5.0 incorporates a facility to extract a random sample from a dataset, construct a classifier from the sample, and then test the classifier on a disjoint collection of cases. By using a smaller set of training cases in this way, the process of generating a classifier is expedited, but at the cost of a possible reduction in the classifier's predictive performance.

The option -S x has two consequences. Firstly, a random sample containing x% of the cases in the application's data file is used to construct the classifier. Secondly, the classifier is evaluated on a non-overlapping set of test cases consisting of another (disjoint) sample of the same size as the training set (if x is less than 50%), or all cases that were not used in the training set (if x is greater than or equal to 50%).

As an example, suppose that the application's data file contains 100,000 cases. If a sample of 10% is used, the classifier will be constructed from a sample of 10,000 cases and tested on a disjoint sample of 10,000 cases. Alternatively, sampling with 60% will cause the classifier to be constructed from 60,000 cases and tested on the remaining 40,000 cases.

The random sample above changes every time that a classifier is constructed. As a result, successive runs of C5.0 with the same parameters will usually produce different results.

Differential Misclassification Costs

Up to this point, all errors have been treated as equal -- we have simply counted the number of errors made by a classifier to summarize its performance. Let us now turn to the situation in which the `cost' associated with a classification error depends on the predicted and true class of the misclassified case.

C5.0 allows costs to be assigned to any combination of predicted and true class via entries in the optional file filestem.costs. Each entry has the form

    predicted class, true class: cost

where cost is a non-negative real number. The file may contain any number of entries; if a particular combination is not specified explicitly, its cost is taken to be 0 if the predicted class is correct and 1 otherwise.

To illustrate the idea, consider a hypothetical file banding.costs consisting of the single line

        noband, band: 5.5

This specifies that the cost of misclassifying a band situation as noband is 5.5 units. Since it is not given explicitly, the converse error (misclassifying a noband situation as band) is 1 unit. In other words, the first kind of error is 5.5 times more costly.
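
To see how such a cost matrix can change predictions, the sketch below applies the standard minimum-expected-cost decision rule -- a textbook illustration of the general idea, not necessarily C5.0's internal procedure -- with invented class probabilities:

	#include <stdio.h>

	#define BAND   0
	#define NOBAND 1

	/* cost[p][t] is the cost of predicting class p when the true class
	   is t, as set out in banding.costs (unspecified entries default to
	   0 on the diagonal and 1 off it). */
	static double cost[2][2] = {
	    /* true:      band  noband */
	    /* band   */ { 0.0,  1.0 },
	    /* noband */ { 5.5,  0.0 },
	};

	int main(void)
	{
	    double prob[2] = {0.3, 0.7};     /* invented class probabilities */
	    double expected[2];
	    int p, t;

	    for (p = 0; p < 2; p++) {
	        expected[p] = 0.0;
	        for (t = 0; t < 2; t++)
	            expected[p] += prob[t] * cost[p][t];
	    }
	    /* Here predicting band costs 0.7 * 1 = 0.7 while predicting
	       noband costs 0.3 * 5.5 = 1.65, so band is chosen even though
	       noband is the more probable class. */
	    printf("predict %s\n",
	           expected[BAND] < expected[NOBAND] ? "band" : "noband");
	    return 0;
	}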

The presence of this costs file affects the classifiers produced by C5.0. The command

	c5.0 -f banding

now gives the following output:

	C5.0 INDUCTION SYSTEM [Release 1.09]	Mon Aug  3 10:09:23 1998
	------------------------------------

	    Options:
		File stem <banding>

	Class specified by attribute result
	Read 138 cases (30 attributes) from banding.data
	Read misclassification costs from banding.costs

	Decision tree:

	current density <= 35:
	:...roller durometer <= 35: noband (22.0)
	:   roller durometer > 35: band (1.2/0.1)
	current density > 35:
	:...press speed > 2200:
	    :...type on cylinder = yes: noband (16.2)
	    :   type on cylinder = no:
	    :   :...press speed <= 2300: noband (2.0)
	    :       press speed > 2300: band (2.0)
	    press speed <= 2200:
	    :...blade pressure > 29: band (41.0/11.0)
	        blade pressure <= 29:
	        :...ink type = cover: band (0.0)
	            ink type = coated:
	            :...paper mill location = northus: noband (3.9/0.2)
	            :   paper mill location in {southus,mideuropean}: band (0.0)
	            :   paper mill location = scandanavian: band (5.2/2.5)
	            :   paper mill location = canadian:
	            :   :...ink pct <= 61.7: noband (15.4/0.8)
	            :       ink pct > 61.7: band (1.3)
	            ink type = uncoated:
	            :...hardener <= 0.75: noband (5.0)
	                hardener > 0.75:
	                :...proof cut > 45: band (12.2/2.0)
	                    proof cut <= 45:
	                    :...wax <= 2.6: noband (9.4/0.5)
	                        wax > 2.6: band (1.2)


	Evaluation on training data (138 cases):

		       Decision Tree       
		  -----------------------  
		  Size      Errors   Cost  

		    16   24(17.4%)   0.17   <<


		   (a)   (b)	<-classified as
		  ----  ----
		    50      	(a): class band
		    24    64	(b): class noband


	Evaluation on test data (100 cases):

		       Decision Tree       
		  -----------------------  
		  Size      Errors   Cost  

		    16   31(31.0%)   0.67   <<


		   (a)   (b)	<-classified as
		  ----  ----
		    29     8	(a): class band
		    23    40	(b): class noband


	Time: 0.2 secs

This new decision tree has a higher error rate than the first decision tree for both the training and test cases, and might therefore appear entirely inferior to it. The real difference comes when we add up the total cost of misclassifications for the original and new tree. For the first decision tree, which was derived without reference to the differential costs, the result is:

	Cases   Predicted Class   Real Class   Total Cost
	-----   ---------------   ----------   ----------
	  7         noband           band         38.5
	  1          band           noband         1.0

for a total cost of 39.5. For the new tree we have:

	Cases   Predicted Class   Real Class   Total Cost
	-----   ---------------   ----------   ----------
	 24          band           noband        24.0

for a cost of 24.0. That is, the total misclassification cost associated with the training cases is noticeably lower than that of the old tree. The misclassification cost of the predictions made for the new cases in banding.test is similarly much reduced (67 units for the new tree versus 82.5 for the first).

Also notice that the new decision tree is lopsided in its error distribution. If the cost of misclassifying a band case is increased sufficiently, the tree will classify all cases as band to be on the safe side! Large costs should therefore be used with caution.

Using Classifiers

Once a classifier has been constructed, an interactive interpreter can be used to assign new cases to classes. The command to do this is

	predict

whose options are:

-f filestem to identify the application
-r if rules are to be used
-p causes the classifier to be printed as a reminder

This is illustrated in the following dialog that uses the first decision tree to predict the class of a case. Input typed by the user appears after each prompt, with the return key shown as ¤.

	paper type: uncoated¤
	ink type: uncoated¤
	blade pressure: 35¤

	    ->	band  [1.00]

	Retry, new case or quit [r,n,q]: r¤

	paper type [uncoated]: ¤
	ink type [uncoated]: ¤
	blade pressure [35]: 20¤
	ESA Voltage: ?¤
	proof cut: 55¤

	    ->	noband  [0.55]

	Retry, new case or quit [r,n,q]: q¤

Since the values of all attributes may not be needed for classification, predict prompts only for the values of those attributes that are required. The reply `?' indicates that a requested attribute value is unknown. When all the relevant information has been entered, the most probable class (or classes) are printed, each with a certainty value. Next, predict asks whether the same case is to be tried again with changed attribute values (a kind of what-if scenario), a new case is to be classified, or the session is finished. If a case is retried, each prompt for an attribute value shows the previous value in square brackets. A new value can be entered, followed by a carriage return, or a carriage return alone can be used to indicate that the value is unchanged.

The first case is positively identified as a banding risk. When it is retried with a reduced blade pressure, however, predict asks for the values of two more attributes before deciding that the new situation is marginally non-banding.

Linking to Other Programs

The classifiers generated by C5.0 are retained in binary files, filestem.tree for decision trees and filestem.rules for rulesets. C source code to read these classifier files and to use them to make predictions is freely available. Using this code, it is possible to call C5.0 classifiers from other programs. As an example, the source includes a program to read cases from a cases file, and to show how each is classified by boosted or single trees or rulesets.
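
In outline, a calling program might look like the sketch below. The Classifier type and the LoadClassifier and Classify functions are placeholders invented for this illustration, given stub bodies so that the sketch compiles; the public source defines its own (different) types and entry points.

	#include <stdio.h>

	/* Placeholder interface invented for this sketch -- consult the
	   public C5.0 source for the real types and functions. */
	typedef struct { const char *filestem; } Classifier;

	static Classifier *LoadClassifier(const char *filestem)
	{
	    static Classifier c;             /* stub: would read filestem.tree
	                                        or filestem.rules */
	    c.filestem = filestem;
	    return &c;
	}

	static const char *Classify(Classifier *clf, const char *case_values)
	{
	    (void)clf; (void)case_values;    /* stub: would apply the classifier */
	    return "noband";
	}

	int main(void)
	{
	    Classifier *clf = LoadClassifier("banding");

	    printf("predicted class: %s\n",
	           Classify(clf, "no,yes,benton,coated,..."));
	    return 0;
	}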

The public source is available for download as a gzipped tar file.

© RULEQUEST RESEARCH 1997, 1998