McGill University: School of Computer Science
Winter 1997 Class Notes for 308-251B DATA STRUCTURES AND ALGORITHMS
Topic #26: Depth-First Search

Introduction

We have already seen various tree traversals, for example preorder, inorder and postorder; now we look at graph traversals. We will use the notation (V,E) for a graph, where V represents the set of vertices and E the set of edges. Graph traversals are used constantly when dealing with graphs, e.g. when looking for sinks or cycles, or when colouring. There are two types of graph traversals: depth-first search and breadth-first search, often abbreviated DFS and BFS respectively. In some instances one type of search or the other may be more desirable. With the Adjacency List implementation of the graph, both types of traversals give O( |V| + |E| ) time for a complete traversal. This time is optimal, since in order to completely traverse a graph using any method, you must visit all of the vertices and edges. The key to a fast traversal is avoiding loops - you don't want to get stuck going around and around a cycle endlessly.

Undirected Graphs

We will begin the discussion of DFS with undirected graphs. The material covered in this lecture can be found in the text in Chapter 23. The basic object of the traversal is to visit each node at least once, but the algorithm will actually give us much more than that. The DFS algorithm is quite famous and can be used to solve a variety of graph problems. Three main concepts are involved:

1) Colouring nodes.

As the algorithm proceeds, nodes are coloured, reflecting the passage of time:
 Phase   Condition of Node                 Colour
 1       undiscovered                      white
 2       discovered and being processed    grey
 3       finished                          black

Colouring the nodes acts as a memory device - it is like dropping bread crumbs in a maze in order to know where you have already been.

2) Keeping time.

Each node u has two times associated with it: the discovery time d[u] and the finish time f[u]. The discovery time corresponds to the colour change from white to grey and the finish time corresponds to the colour change from grey to black. Time starts at 1, and with each labelling of an event (discovery or finish) the time is incremented by one. One drawback of this is that with many nodes, the storage required for all of these times becomes huge; however, even if they are not stored, the times are useful as abstract concepts to aid us in understanding the algorithm.

3) DFS tree.

Each connected component of the graph can be made into a tree by removing some of the edges. Each node u in the tree will have a pointer to its parent p[u], making this a parent-pointer tree.

The DFS Algorithm

This fragment performs an initialization and starts the depth-first-search of a connected graph:
```
for all nodes u do
{   colour[u] = white;
    p[u] = NULL;
}
time = 0;

for all nodes u do
    if colour[u] == white then DFS(u);
```
This is the algorithm itself, which performs the depth-first search:
```
DFS(u):

visit(u);
time = time + 1;
d[u] = time;
colour[u] = grey;

for all nodes v adjacent to u do
{   if colour[v] == white then
    {   p[v] = u;
        DFS(v);
    }
}

time = time + 1;
f[u] = time;
colour[u] = black;
```
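The pseudocode above can be rendered directly in Python, as a sketch: the graph is assumed to be an adjacency-list dictionary mapping each node to its list of neighbours, and the visit(u) call is omitted.

```python
# Sketch of the DFS pseudocode in Python.  `graph` maps each node to
# its adjacency list; visit(u) from the pseudocode is omitted.
WHITE, GREY, BLACK = "white", "grey", "black"

def dfs_all(graph):
    colour = {u: WHITE for u in graph}
    p = {u: None for u in graph}       # parent pointers
    d, f = {}, {}                      # discovery and finish times
    time = 0

    def dfs(u):
        nonlocal time
        time += 1
        d[u] = time
        colour[u] = GREY
        for v in graph[u]:             # neighbours in list order
            if colour[v] == WHITE:
                p[v] = u
                dfs(v)
        time += 1
        f[u] = time
        colour[u] = BLACK

    for u in graph:                    # one DFS per component
        if colour[u] == WHITE:
            dfs(u)
    return d, f, p
```

Note that recursion depth is bounded by the number of nodes, so a very deep graph may call for an iterative version with an explicit stack.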

Example Traversal

By convention, the neighbours in the following example are considered in alphabetical order. The traversal starts at node I.
1. I turns grey, d[I] = 1
2. E turns grey, d[E] = 2
3. E has two neighbours, B and F. By the convention B is visited first: B turns grey, d[B] = 3
4. Similarly, A, C, & D are visited
5. D has 4 neighbours: A, C, G, H. A and C are grey so G is visited first, then F.
6. All of F's neighbours are grey so F is finished: f[F] = 9, F turns black.
7. Similarly G is finished
8. H is visited because we have not finished visiting all of D's neighbours
9. The remaining nodes are finished in the reverse order, backtracking along the original path, and ending with I.
The resultant parent-pointer tree has its root at I, since this is the first node visited. Each new node visited becomes the child of the most recently visited node. Also, note that while I is the first node to turn grey, it is the last node to turn black (within the main connected component). This is due to the recursion: each of I's neighbours must be processed and finished (turned black) before I can be finished. As well, every edge of the graph which is not included in the tree (not used in the traversal) runs between a node and one of its ancestors (because the ancestor is grey when the edge is examined); these are known as vertical edges. This property distinguishes the DFS tree from the BFS tree.

The maximum time index in the example is 18, equal to 2 times the number of nodes. This is because time is incremented exactly when a discovery time or a finish time is assigned, and each node is assigned one of each during its lifetime. The time should not be confused with the complexity of the algorithm, which we will discuss soon.

Properties of the DFS Algorithm and Tree

If u is discovered before v, then there are two possibilities:

1) If v is discovered while u is being processed (is grey), then v will be finished before u. This is called the Nested Interval Property, and is a consequence of the recursive nature of the algorithm: when v is discovered it is "pushed" on to the recursion stack, and u being processed at that time means that u is lower down in the stack. Therefore v will be "popped" before u, i.e. finished before u.

2) The other possibility is that u is finished before v is discovered. In the sample graph above, F and H are an example of this: their start and finish times are 8, 9 and 11, 12 respectively.

Note that a third situation, in which the intervals [d[u], f[u]] and [d[v], f[v]] overlap without one containing the other, is impossible by the argument in (1) above.

DFS trees are usually elongated, but are useful because they display the entire recursion path. As well, to determine whether a node v is a descendant of a node u, one need only compare their times: v is a descendant of u exactly when d[u] < d[v] and f[v] < f[u]. This is much faster and easier than performing the same task for ordinary binary trees without additional information.
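The ancestor test can be sketched with the d and f dictionaries produced by a completed traversal (an illustrative helper, not from the notes):

```python
def is_descendant(v, u, d, f):
    """True iff v is a proper descendant of u in the DFS forest.

    By the nested interval property, v's interval [d[v], f[v]] lies
    strictly inside u's interval [d[u], f[u]] exactly when v is a
    descendant of u.
    """
    return d[u] < d[v] and f[v] < f[u]
```

With the times from the example above (d[I] = 1, f[I] = 18, d[F] = 8, f[F] = 9), is_descendant('F', 'I', d, f) is true, while F and H, whose intervals are disjoint, are not related.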

Here is a snapshot of a DFS tree in mid-traversal:

The grey nodes in the tree always form a chain. This represents the current chain of recursion. Note that the black hanging pieces cannot be connected to other parts of the grey chain, since the black section is finished - anything connected to it which is not above it in the tree (below it in the recursion stack) must also be black.

An important property is the white path property: v is a descendant of u if and only if, at the time of discovery of u, there is a white path from u to v (a path all of whose nodes are white).

If v is a descendant of u, then at time d[u] the nodes on the eventual tree path from u to v are all still white, so there is at least one white path for the DFS algorithm to follow. Conversely, if there is a white path from u to v at time d[u], then by induction along the path each of its nodes is discovered while u is still grey, and therefore becomes a descendant of u; in particular v does.

One useful aspect of the DFS algorithm is that it traverses connected components one at a time, and thus it can be used to identify the connected components in a given graph. As an example, if a handwriting sample is scanned into a pixelized image and DFS is performed on that image, all connected components below a certain threshold size (which represent unwanted dirt in the sample) can be removed.
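The component-labelling idea can be sketched as follows; an iterative DFS is used here to avoid deep recursion, and the function names are illustrative:

```python
def connected_components(graph):
    """Label each vertex of an undirected graph with a component number.

    One DFS is started per still-unlabelled vertex; everything that
    DFS reaches belongs to the same component.
    """
    label = {}
    count = 0
    for start in graph:
        if start in label:
            continue
        count += 1
        label[start] = count
        stack = [start]            # explicit stack instead of recursion
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in label:
                    label[v] = count
                    stack.append(v)
    return count, label
```

For the handwriting-cleaning application, one would then discard every component whose vertex count falls below the chosen threshold.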

Complexity

There are two loops in the algorithm; the first loops through all vertices and the second loops through all neighbours of a particular vertex. All other operations performed within these loops, such as changing times and colours, are considered O(1). The time spent on assignment statements in the outer loop is O(|V|). The loop through all neighbours of all vertices takes time proportional to the sum of the degrees of the vertices, which is 2|E| because each edge is counted once from each of its two endpoints; this part is therefore O(|E|).

Therefore the total complexity is O(|V| + |E|), which is optimal, as mentioned in the introduction.

Searching for Cycles

Problem: Find if an undirected graph has a cycle, in O(|V|) time.

We start by looking at a cycle in an undirected graph.

Assume that d[u] < d[v]. The straight edge between u and v is looked at twice during DFS, once from u and once from v (while looping through neighbours). At time d[u], v is white and at time d[v], u is grey (keep in mind that there is no way to ever look at a black node - this follows from the comment about the snapshot of a DFS tree).

More generally, we have the following theorem:

An undirected graph has a cycle if and only if at some point during the traversal, when u is grey, one of the neighbours v of u considered in the inner loop of the DFS is coloured grey and is not p[u].

Proof: If, while processing u, we find a grey neighbour v with v ≠ p[u], then v must be an ancestor of u (the grey nodes form a chain), and the tree path from v down to u together with the edge between u and v forms a cycle in the graph. Conversely, suppose u and v are included in a cycle:

Since d[u] < d[v], at time d[u] the cycle provides a white path from u to v, so by the white path property v is discovered while u is still grey; thus when v is reached and becomes grey, it has a grey neighbour. Taking v to be the last node of the cycle to be discovered, both of its cycle neighbours are grey and at most one of them can be p[v], so v has a grey non-parent neighbour. The result can be summarized as:

grey neighbour (not parent) if and only if a cycle exists.

If we apply this theorem to the problem posed above, the search for a cycle in an undirected graph becomes the search for grey neighbours. If there is no cycle then the maximum number of edges is |V| - 1, and if there is a cycle, it will be found before the |V|th edge is looked at, so in both cases the DFS algorithm will terminate its search for a cycle in O(|V|) time.
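A sketch of the resulting cycle test, assuming a simple graph (no parallel edges, for which the parent exception would misfire); the search returns as soon as a grey non-parent neighbour is seen, which by the theorem happens within O(|V|) edge examinations:

```python
def has_cycle(graph):
    """Return True iff the undirected graph contains a cycle.

    A grey neighbour that is not the parent means the current node can
    reach an ancestor by a non-tree edge, closing a cycle.  If there is
    no cycle the graph has at most |V| - 1 edges, and if there is one
    it is reported before the |V|th edge is examined, so the running
    time is O(|V|) either way.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {u: WHITE for u in graph}

    def dfs(u, parent):
        colour[u] = GREY
        for v in graph[u]:
            if colour[v] == GREY and v != parent:
                return True            # grey non-parent neighbour: cycle
            if colour[v] == WHITE and dfs(v, u):
                return True
        colour[u] = BLACK
        return False

    return any(colour[u] == WHITE and dfs(u, None) for u in graph)
```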

Directed Graphs

The situation is now completely different. The same DFS algorithm can be used, but the properties derived earlier no longer necessarily apply. For example, with undirected graphs DFS visits each connected component in turn, so it can be used to find the connected components, such as the tiny "islands" of dirt in a handwriting sample. With directed graphs, connected components in that sense are not even well-defined (explanation below).

We now define 4 types of DFS tree edges found in the DFS tree of a directed graph. Each edge is classified at the time it is first looked at, and from the point of view of the node occupied by DFS at that point.

1. Tree edge - an edge to a white neighbour.
2. Back edge - an edge to a grey neighbour (an ancestor).
3. Forward edge - an edge to a black descendant (impossible in an undirected graph). The edge from A to C is an example.
4. Cross edge - an edge to a black neighbour which is not a descendant, e.g. the edge from D to C.
Note that if the traversal began at C, C would turn black immediately, thereby forming its own component (if we regard DFS as visiting components one at a time). Starting at A, however, makes the whole graph into one component, so the number of components depends on where the traversal starts; this is why connected components are not well-defined for directed graphs. For directed graphs it remains true that cycles occur exactly where grey neighbours are found, but the parent is no longer excluded: an edge from u back to p[u] is a directed edge distinct from the tree edge, and it completes a cycle of length two.
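The four cases can be sketched in Python; the small graph below is a hypothetical stand-in for the missing figure, chosen so that the edge from A to C comes out forward and the edge from D to C comes out cross:

```python
def classify_edges(graph):
    """Classify the edges of a directed graph during DFS.

    Edge (u, v) is a tree edge if v is white, a back edge if v is
    grey, and, using discovery times, a forward edge if v is black
    with d[u] < d[v] (v was discovered inside u's processing, hence
    a descendant) or a cross edge if v is black with d[v] < d[u].
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {u: WHITE for u in graph}
    d = {}
    time = 0
    kinds = {}

    def dfs(u):
        nonlocal time
        time += 1
        d[u] = time
        colour[u] = GREY
        for v in graph[u]:
            if colour[v] == WHITE:
                kinds[(u, v)] = "tree"
                dfs(v)
            elif colour[v] == GREY:
                kinds[(u, v)] = "back"
            elif d[u] < d[v]:
                kinds[(u, v)] = "forward"
            else:
                kinds[(u, v)] = "cross"
        colour[u] = BLACK

    for u in graph:
        if colour[u] == WHITE:
            dfs(u)
    return kinds
```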

Euler Tours

The problem is to traverse the graph while visiting each edge exactly once, finishing at the starting point (such a traversal is known as an Euler tour). An Euler graph is a graph which contains an Euler tour; the figure on top is an Euler graph, but the figure below it is not. The reason is the difference in degree sequences. The degree of a node in an undirected graph is the number of edges emanating from it. If the graph is connected and all the degrees are even, then the graph can be traversed in the manner described. This is because each node must be arrived at and departed from the same number of times, each time using a different edge. The converse is also true: every connected undirected Euler graph has even degree at all nodes.

In a directed graph, define an "in" degree to be the number of incoming edges, and an "out" degree to be the number of outgoing edges.

A connected directed graph is an Euler graph if and only if every node's "in" degree equals its "out" degree. It is relatively easy to check if a graph is an Euler graph, but the more difficult task is to actually find the Euler Tour. This requires a modified version of the DFS (see Cormen, Leiserson and Rivest on page 485, Exercise 23.3-9), and can be done in O(|V| + |E|) time.
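The degree condition is easy to check; here is a sketch for the directed case, with connectivity assumed to be verified separately:

```python
def is_euler_directed(graph):
    """Check the degree condition for a directed Euler graph:
    every node's in-degree must equal its out-degree.

    `graph` maps each node to the list of nodes its outgoing edges
    point to; connectivity is assumed to be checked elsewhere.
    """
    indeg = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    return all(indeg[u] == len(graph[u]) for u in graph)
```

A directed triangle passes the test, while any graph with a node that only receives edges fails it.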

References

T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms. McGraw-Hill, 1990. pp. 477-485.

R. Decker. Data Structures. Prentice Hall, 1989. pp. 125-126, 263-265.

E. Horowitz, S. Sahni. Fundamentals of Data Structures. Computer Science Press, 1976. pp. 292-293.