Preview
This is a short illustration of how the algorithm extracts parallelism
from acyclic code regions, using a simple example
(free of complicated issues).
If this is the kind of algorithm you are
looking for, then it might be worth retrieving and
printing the whole article.
Input Data
The algorithm extracts parallelism from an acyclic code region represented
by an acyclic Control Flow Graph
(with a single Entry and a single eXit vertex) and a
Data Dependence Graph.
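The two input graphs can be pictured with a small sketch. The code below is illustrative only: the graph class, the vertex names, and the straight-line shape of the CFG are my assumptions (the article's figure is not reproduced here), though the data dependence edges match the ones discussed in this walk-through.

```python
from collections import defaultdict

class Digraph:
    """Minimal directed graph over hashable vertex names."""
    def __init__(self, edges=()):
        self.succ = defaultdict(set)  # vertex -> set of successors
        self.pred = defaultdict(set)  # vertex -> set of predecessors
        for u, v in edges:
            self.add_edge(u, v)

    def add_edge(self, u, v):
        self.succ[u].add(v)
        self.pred[v].add(u)

# Acyclic CFG with a single Entry and a single eXit vertex.
# The exact shape is hypothetical; only the Entry/eXit convention
# comes from the article.
cfg = Digraph([("Entry", "1"), ("1", "2"), ("2", "3"),
               ("3", "4"), ("4", "5"), ("5", "eXit")])

# Data dependence graph over the same vertices, matching the
# dependences mentioned below: 2->3, 5->eXit, 3->eXit, 4->eXit.
ddg = Digraph([("2", "3"), ("5", "eXit"),
               ("3", "eXit"), ("4", "eXit")])
```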
Upper Bound for Parallelism
The algorithm then builds the Control Dependence Graph
and uses it to build a Petri Net that models parallel
execution of that code region constrained by control dependences.
(If you are unfamiliar with the symbols
used in this net, check the key table).
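The first step, building the Control Dependence Graph from the CFG, can be sketched using the classic postdominance-based definition of control dependence (a standard construction, not necessarily the exact one the article uses): v is control dependent on u when u has a successor that v postdominates, but v does not postdominate u itself. The tiny branching CFG below is illustrative, not the article's figure.

```python
from collections import defaultdict

def postdominators(succ, exit_node):
    """For each vertex, the set of vertices every path to eXit
    must pass through (computed by iteration to a fixpoint)."""
    nodes = set(succ) | {w for vs in succ.values() for w in vs}
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            outs = succ.get(n, set())
            new = {n} | (set.intersection(*(pdom[s] for s in outs))
                         if outs else set())
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom

def control_dependences(succ, exit_node):
    """v is control dependent on u when u has a successor that v
    postdominates, but v does not (strictly) postdominate u."""
    pdom = postdominators(succ, exit_node)
    cdg = defaultdict(set)
    for u, outs in succ.items():
        for v in outs:
            for w in pdom[v]:
                if w == u or w not in pdom[u]:
                    cdg[u].add(w)
    return dict(cdg)

# A branches to B or C, which rejoin at D: only B and C are
# control dependent (on the branch at A).
succ = {"Entry": {"A"}, "A": {"B", "C"},
        "B": {"D"}, "C": {"D"}, "D": {"eXit"}}
print(control_dependences(succ, "eXit"))
```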
Sequencing code
The algorithm then looks at each data dependence, and attempts to
sequence code to prevent parallel executions in which memory locations
are accessed out of order.
In this example, places 2 and 3 in the net are sequenced
in order to enforce an execution order that respects
the data dependence between vertices 2 and 3.
The eXit place is also sequenced with place 5 with this
algorithm, because of the data dependence between vertex 5 and
eXit. Data dependences from vertices 3 and 4 cannot be handled
by sequencing code
once the transformation involving place 5 and eXit is performed.
Essential Dependences
Milind Girkar's algorithm
(see
"Functional Parallelism: Theoretical foundations and Implementation")
is then applied
to find which data dependences still
need explicit synchronization to
enforce a correct execution precedence.
In this example, the requirement that vertices 3 and 4 finish execution
before vertex eXit starts
has to be enforced using explicit synchronization.
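Girkar's algorithm is not reproduced here, but the underlying test it relies on can be sketched in a simplified form (my own reduction, under the assumption that the net's enforced orderings can be summarized as a precedence graph): a data dependence still needs explicit synchronization only if the ordering it requires is not already implied transitively by orderings the net enforces.

```python
def reachable(succ, src, dst):
    """Depth-first reachability in the graph of enforced orderings."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        stack.extend(succ.get(n, ()))
    return False

def essential(enforced_succ, data_deps):
    """Keep only dependences whose ordering is not already implied
    (transitively) by the enforced orderings."""
    return [(u, v) for u, v in data_deps
            if not reachable(enforced_succ, u, v)]

# Orderings enforced after the sequencing step of the example
# (2 before 3, and 5 before eXit) -- a simplification of the net.
enforced = {"2": {"3"}, "5": {"eXit"}}
deps = [("2", "3"), ("5", "eXit"), ("3", "eXit"), ("4", "eXit")]
print(essential(enforced, deps))  # only 3->eXit and 4->eXit remain
```

Consistent with the walk-through, the dependences 2->3 and 5->eXit come out as redundant (already enforced by sequencing), while 3->eXit and 4->eXit remain essential.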
Explicit Synchronization
Explicit synchronization
mechanisms are inserted to enforce
execution precedence.
The dotted lines around some net regions represent possible code
regions that could be gathered into a single procedure, if we were to
generate conventional multi-threaded code where each thread must have a
procedure as an entry point.
The dashed filled circle represents an explicit synchronization
mechanism created to enforce execution precedence for the eXit
vertex.
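In conventional multi-threaded code, such a synchronization mechanism could be realized with ordinary event objects. The runtime sketch below is hypothetical (thread-per-place execution and the `done_*` names are my assumptions): a pair of events plays the role of the dashed circle, making eXit wait until vertices 3 and 4 have finished.

```python
import threading

done_3 = threading.Event()
done_4 = threading.Event()
log = []

def task_3():
    log.append("3")   # work of vertex 3
    done_3.set()      # signal completion

def task_4():
    log.append("4")   # work of vertex 4
    done_4.set()

def task_exit():
    done_3.wait()     # explicit synchronization: eXit may not
    done_4.wait()     # start before 3 and 4 have finished
    log.append("eXit")

threads = [threading.Thread(target=f) for f in (task_3, task_4, task_exit)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert log[-1] == "eXit"  # eXit is always observed last
```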
Partitioning
One of the advantages of this algorithm is that it can be used within
an iterative partitioning framework (such as the one described by the
following algorithm) to
generate solutions with an increasingly larger grain of parallelism, up
to a desired level.
In this example, if we consider that place 4 represents a task
that is too small to justify the overheads of creating a thread just for
executing it, we can add artificial dependences around vertex 4
and re-execute the parallelism extraction algorithm to produce a
solution with less parallelism.
In this case, we add data dependences from vertex 2 and vertex 3 (the
vertices in its neighboring concurrent region) onto vertex 4 (2->4 and
3->4 added to the Data Dependence Graph).
The resulting solution (in this particular case) has no parallelism, and
efficient sequential code can be generated for it.
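The effect of the artificial dependences can be checked with a toy computation (my own simplification, which looks only at the data dependences among the three vertices and ignores the control structure of the net): vertices left unordered by the transitive dependence relation are the candidates for running in parallel.

```python
from itertools import combinations

def closure(succ, nodes):
    """Transitive 'must run before' relation via DFS from each vertex."""
    order = set()
    for src in nodes:
        stack = list(succ.get(src, ()))
        while stack:
            n = stack.pop()
            if (src, n) not in order:
                order.add((src, n))
                stack.extend(succ.get(n, ()))
    return order

def concurrent_pairs(succ, nodes):
    """Pairs left unordered by the dependences: parallel candidates."""
    order = closure(succ, nodes)
    return {(a, b) for a, b in combinations(sorted(nodes), 2)
            if (a, b) not in order and (b, a) not in order}

nodes = {"2", "3", "4"}
ddg = {"2": {"3"}}                   # before: 4 unordered w.r.t. 2 and 3
before = concurrent_pairs(ddg, nodes)
ddg["2"].add("4")                    # artificial dependence 2->4
ddg["3"] = {"4"}                     # artificial dependence 3->4
after = concurrent_pairs(ddg, nodes)
print(before, after)                 # after the additions, no pair is left
```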