Some Experimental Results

The algorithm described in this paper has been implemented and integrated into a parallelizing compiler for the ANDF language (an ANDF-to-ANDF parallelizing compiler). The results presented here have been obtained using this implementation.

(To obtain more information on this project, please contact the authors.)

Parallelizing Independent Loops

This set of results illustrates an optimal case in which a parallelizing compiler employing this algorithm can achieve very good results. This kind of optimal case arises when an acyclic code region analyzed by the parallelizer contains two or more blocks of code with no data dependences between them, so the blocks can execute concurrently. Two such examples are presented here. The data compares runs from three different compilers:
tcc
A front-end driver for a set of ANDF tools performing the C-to-ANDF-to-machine translation steps.
xparallelise
The parallelizing compiler (driven by a graphical interface). tcc is used by the parallelizer to generate the ANDF capsule file from the C source, and also to generate the final executable from the parallelized ANDF capsule.
gcc
The GNU C compiler (version 2.6.1). Does it need any introduction?
Results parallelizing independent loops

Parallelizing a Compiler

TNC is an ANDF tool that converts ANDF binary capsules into a readable text format and vice-versa (a notation compiler).

The following data was gathered by using the ANDF parallelizing compiler to parallelize the top-level acyclic code region of each procedure in TNC.

Results parallelizing a compiler

Some Considerations

These runs cannot be taken as reference results for evaluating the algorithm. Many more runs would have to be performed on a larger set of programs, and more elaborate data dependence analysis and preliminary code transformations can still be implemented. Nevertheless, we feel that the run on "Parallelizing a Compiler" somewhat reflects our experience with non-number-crunching applications written in C, which typically manipulate heap-based data structures (lists and so on) and perform I/O operations.

The bad news in our experience is that, although some fine-grain parallelism (almost at the instruction level) can be found in most functions, it cannot be profitably exploited as conventional multi-threaded code, given the (relatively) large overheads of thread management operations.

The algorithm's results depend on the quality of the data dependence analysis and grain estimation components. It is still unclear whether a much more elaborate data dependence analysis would significantly improve the results.

The good news is that, even with a simple heuristic method for determining the minimum acceptable grain for parallelism, the parallelizer was able to successfully exploit the cases where this algorithm achieves good speedup results (as in "Parallelizing Independent Loops") and to avoid generating parallel code in the cases where inefficient exploitation of fine-grained parallelism would result in a loss of speed (as in "Parallelizing a Compiler").
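The paper does not give the actual heuristic, but a minimal grain check of this kind can be sketched as a simple threshold test: parallelize a block only when its estimated cost clearly exceeds the estimated thread-management overhead. The function name, the work-unit scale, and both constants below are hypothetical, chosen only to illustrate the idea.

```c
/* Hypothetical grain heuristic (not the paper's actual method):
   a candidate block is worth running in its own thread only when
   its estimated cost, in abstract work units, exceeds the
   estimated thread-management overhead by a safety factor. */
#define THREAD_OVERHEAD_UNITS 5000  /* assumed cost of create+join */
#define SAFETY_FACTOR 4             /* assumed margin over overhead */

int worth_parallelizing(long estimated_grain_units) {
    return estimated_grain_units >
           (long)THREAD_OVERHEAD_UNITS * SAFETY_FACTOR;
}
```

Under this kind of test, large independent loops pass easily, while the fine-grained blocks found in most compiler procedures fall below the threshold and are left sequential, which matches the behavior reported above.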
