The Parallel Systems and Computer Architecture Lab

PASCAL

Speculative Execution Model for GPGPU:

Introduction

Unlike the typical CPU, a GPU is specialized in data-parallel computations where hundreds of threads can simultaneously execute the same set of instructions on independent pieces of data. However, because of its optimized architecture design, a GPGPU suffers from synchronization problems which prevent its general use. Even though some techniques may help reduce or even suppress the usage of synchronizations, they are not always practical and often add significant complexity to the programming step. Complex dependency patterns also affect the performance by reducing the chances of parallelism.

Modern computation needs are already well within the Peta range, and some applications are even approaching the Exa scale. To satisfy these future demands, the enormous computing power which GPGPUs may yield will be quite important. The problem is how to exploit this potential and enhance the utilization and programmability of GPGPUs

Goals and Approach

We have observed that some important workloads not only have much intrinsic data parallelism, but also plenty of redundancy of data values, so value prediction can open a chance to the issues of synchronization. Therefore, we are working on the following tasks:

Develop a theoretical framework of value prediction schemes.
Measure how much value prediction schemes can enhance the performance of GPGPU architecture
Explore the programming interface and task structure
Enhance our previously suggested speculative model and apply it to broader use of programming

Results

As a first step, we observed the characteristics of workloads which are frequently used in the benchmark suits. Traditionally, data level parallelism of workloads has been the target of parallel processing. The limit is the degree of synchronization, as we have seen with Amdahl’s law. Our observation is there are abundant redundancy of data in the workloads, and in many cases, because of redundancy, the dependencies can be predicted accurately and frequently. We measured the degree of predictability using Bayes’ Theorem.

	Routine	BBL		Routine	BBL		Routine	BBL
Barnes	0.95	0.76	Radix	0.97	0.90	Canneal	0.63	0.79
Radiosity	0.91	0.47	Raytrace	0.98	0.79	Dedup	0.73	0.95
FFT	0.88	0.61	Volrend	0.84	0.86	Facesim	0.68	0.69
FMM	0.88	0.77	Swaptions	0.82	0.83	Fluidanimate	0.83	0.85
LU	0.96	0.69	Bodytrack	0.73	0.84	Freqmine	0.70	0.78
Ocean	0.96	0.72	Blackscholes	0.89	0.88	Streamcluster	0.29	0.86

Table 1 shows the results of predictability with our suggesting prediction scheme. More details can be found in [2].

The next stage was to bring our idea into real world. Figure 1 depicts overall structure of our speculative execution model to overcome synchronization problem on GPGPU architecture. Loop carried dependencies were our first target. The code in the loop can be expressed as computation kernels, and they will be mapped in stage 2 of our frame work. Prediction is made in the stage 1, and evaluation is made in stage 3.

Here are some results of our framework. In Figure 2, transform_PRED is a loop selected from fp_tree of freqmine in PARSEC. In Figure 3, decompress_SPEC is a loop in the decompress routine inside bzip2 of SPEC 2006. The two workloads shows different degree of intrinsic parallelism, but our frame work, especially in Figure 3 with abundant dependencies, helps the workload to get benefit from GPGPU architecture. More details and results can be found in [1].

Current Status

Still, our framework is experimental stage under the consideration with programming interface. We are extending this framework into a more productive stage so that can be used and tested with more general workloads.

Publications

[1] Speculative Execution on GPU: An Exploratory Study
Shaoshan Liu, Christine Eisenbeis and Jean-Luc Gaudiot
Proceedings of the Thirty Ninth International Conference on Parallel Processing, San Diego, California, September 13-16, 2010

[2] A Theoretical Framework for Value Prediction in Parallel Systems
Shaoshan Liu, Christine Eisenbeis and Jean-Luc Gaudiot
Proceedings of the Thirty Ninth International Conference on Parallel Processing, San Diego, California, September 13-16, 2010

[3] Value Prediction in Modern Many-Core Systems
Shaoshan Liu (Ph.D. Student) and Jean-Luc Gaudiot (Advisor)
Proceedings of the 23^rd IEEE International Parallel & Distributed Processing Symposium (IPDPS 2009), TCPP-Ph.D. Forum, Rome, Italy, May 25-29, 2009

More coming…

Lab Equipment

A Workstation configured with 2 Tesla C1060 boards
A Workstation configured with 2 Tesla C2070 boards
Thanks to NVIDIA for donating GPGPU boards

Specification of C1060	Description
Chip	Tesla T10 GPU
Processor Clock	1296 MHz
Memory Clock	800 MHz
Memory size	4 GB
Stream Processor Cores	240
Peak Single Precision Floating Point Performance	933 GFlops
Peak Double Precision Floating Point Performance	78 GFlops
Memory Bandwidth	102 GB/sec

Specification of C2070	Description
Chip	Tesla T20 GPU
Processor Clock	1.15 GHz
Memory Clock	1.50 GHz
Memory size	6 GB
Stream Processor Cores	448
Peak Single Precision Floating Point Performance	1.03 TFlops
Peak Double Precision Floating Point Performance	515 GFlops
Memory Bandwidth	144 GB/sec