About Decision Trees

Decision trees allow users to visually explore their data and build powerful rule-based process models.

They are hierarchical classification structures that group the observations and relate a number of independent variables to a single, discrete, dependent variable. Independent variables are also known as attributes, and dependent variables are known as classes.

Commonly used in operations research, decision trees help identify the process strategy most likely to achieve a given process goal. As a decision support tool, a decision tree visually represents an analytical model of decisions and their possible consequences as a tree-like graph, and is described in terms of a root, nodes, branches and leaves.

They also provide a descriptive means of calculating conditional probabilities and the expected values of competing alternatives. In this way, decision trees are used to represent the mathematical relationships underlying observations for a particular problem.
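As a rough illustration of comparing the expected values of competing alternatives, the following Python sketch weights each outcome's payoff by its probability and picks the alternative with the highest total. The alternative names, probabilities and payoffs are made up for the example and do not come from any real process.

```python
# Each alternative maps to a list of (probability, payoff) outcomes.
# All names and numbers here are hypothetical, for illustration only.
alternatives = {
    "repair": [(0.75, 100.0), (0.25, -40.0)],
    "replace": [(0.5, 200.0), (0.5, -100.0)],
}


def expected_value(outcomes):
    """Return the probability-weighted sum of the payoffs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)


# Expected value of every alternative, and the one with the highest value.
values = {name: expected_value(o) for name, o in alternatives.items()}
best = max(values, key=values.get)

print(values)  # {'repair': 65.0, 'replace': 50.0}
print(best)    # repair
```

In a decision tree used this way, each alternative would correspond to a branch leaving a decision point, and each outcome to a branch leaving a chance point.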

Decision trees are predictive models, and as such, map observations about a process to conclusions about the selected target field. In these models, the branches represent a choice of classifying features, and leaves represent the actual classifications.

A decision tree consists of a root, nodes and branches.

A node can be either:
- a decision node: a classification test for an attribute that divides the current subset of observations into two or more smaller subsets. The attribute data type may be discrete or continuous.
- a terminal node: classifies the remaining subset of observations in that node with a particular class label. The class variable must be of a discrete data type.

Branches indicate the path that must be followed as decisions are made at each decision node until a terminal node is reached.
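The two node types and the branches connecting them can be sketched as a small data structure. This is a minimal Python sketch, not the representation used by any particular product: the class names, the paths helper, and the attributes and thresholds in the sample tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, Tuple, Union


@dataclass
class TerminalNode:
    """Leaf: assigns a discrete class label to the observations that reach it."""
    label: str


@dataclass
class DecisionNode:
    """Internal node: tests one attribute and routes to a child by outcome."""
    attribute: str
    # Maps an attribute value (discrete or continuous) to a branch name.
    test: Callable[[object], str]
    # One child per branch name; a decision node has two or more branches.
    branches: Dict[str, "Node"] = field(default_factory=dict)


Node = Union[DecisionNode, TerminalNode]


def paths(node: Node, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (branch path, class label) for every terminal node."""
    if isinstance(node, TerminalNode):
        yield prefix, node.label
        return
    for name, child in node.branches.items():
        yield from paths(child, prefix + (f"{node.attribute} {name}",))


# Hypothetical tree: attribute names and thresholds are illustrative only.
tree = DecisionNode(
    attribute="temperature",
    test=lambda v: "<= 20" if v <= 20 else "> 20",
    branches={
        "<= 20": TerminalNode("Low"),
        "> 20": DecisionNode(
            attribute="material",
            test=lambda v: "is A" if v == "A" else "is not A",
            branches={"is A": TerminalNode("Medium"), "is not A": TerminalNode("High")},
        ),
    },
)

rules = list(paths(tree))
for path, label in rules:
    print(" AND ".join(path), "->", label)
```

Each path from the root to a terminal node reads as one rule, which is why a decision tree can be viewed as a rule-based model.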

Decision tree example

A typical example of a decision tree is illustrated below. Here seawater corrosion of stainless steel is described in terms of a number of attributes or features. These are:

- the type of stainless steel alloy,
- the depth in meters below the sea surface at which the corrosion took place,
- the period in days over which the metal alloy was exposed to seawater,
- the maximum depth in millimeters of the pitting that was observed, and
- the depth in millimeters to which crevices formed in the metal.

The class variable is the amount of corrosion that occurred, i.e. Low, Medium or High.

Figure 1: A typical decision tree

The classification of a new observation, for which a class label is not known, occurs as follows. A path is followed from the root of the tree (Maximum Pit Depth in Figure 1), applying the test of each decision node to the attributes of the new observation, until a terminal node is reached. The label of the terminal node is then taken as the class of the new observation.
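The traversal described above can be sketched in a few lines of Python. For brevity this sketch assumes a compact dictionary representation with binary threshold tests on continuous attributes only. The classify function, the attribute names x1 and x2, and the thresholds are hypothetical and are not those of Figure 1.

```python
from typing import Any, Dict, Union

# A terminal node is a plain string (the class label). A decision node is a
# dict holding the attribute to test, a threshold, and the two subtrees.
Tree = Union[str, Dict[str, Any]]


def classify(tree: Tree, observation: Dict[str, float]) -> str:
    """Walk from the root to a terminal node and return its class label."""
    node = tree
    while not isinstance(node, str):  # still at a decision node
        value = observation[node["attribute"]]
        # Apply the node's test and follow the matching branch.
        node = node["low"] if value <= node["threshold"] else node["high"]
    return node


# Hypothetical tree; attributes and thresholds are illustrative only.
toy_tree: Tree = {
    "attribute": "x1",
    "threshold": 5.0,
    "low": "A",
    "high": {"attribute": "x2", "threshold": 10.0, "low": "B", "high": "C"},
}

print(classify(toy_tree, {"x1": 7.0, "x2": 3.0}))  # B
```

Here the observation fails the root test (x1 is greater than 5.0), follows the "high" branch, passes the second test (x2 is at most 10.0), and so reaches the terminal node labeled B.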

An observation with attribute values of Maximum Pit Depth = 120 mm, Depth Beneath Surface = 1400 m and Exposure Time = 100 days would therefore be classified as having a Low level of corrosion.