The goal of developing an analytical model is to predict the performance of an application given a new machine specified by some hardware parameters. Since the advent of RISC, a very accurate model based on instruction mix, costs of instructions, and cache hit rates has been developed to predict program performance. Our goal is to develop a performance model for parallel programs that would be as comprehensive as the model proposed by Hennessey and Patterson for RISC processor performance. However, at this stage, we have not been able to develop a model that characterizes aspect of a parallel machine. In particular, our model is defficient in capturing contention issues that result in poor program performance. The two common scenarios that are not captured in our model are bandwidth limitation and lock-based serialization. In both cases, the program runs slower due to interference between the activities of different processors, which could be either non-deterministic or at least timing specific.
We would like our model to capture the load balance characteristics of a program running on a new machine as well as predict the performance penalties incurred due to communication. We, therefore, would like to measure the following aspects by running the program through a simulator.
We need to measure distance (in terms of instructions) between an issue of a remote memory reference and its subsequent usage. Organize these numbers in the form of an histogram. When we actually plug in the machine parameters for remote memory latency, we would use charge different costs based on this distance. For example, if a value gets used after k cycles and if an access costs l cycles, the cost of the instruction would be (l - k) cycles if (k < l) or 1 cycle if (k > l).
We also need to measure instruction mix of each parallel thread in every phase of the program (between every pair of barrier synchronizations). The hardware model for the machine would provide the scaling parameters for these instruction counts (just like the CPI calculation for RISC processors). By combining the parameters of the hardware model with the inherent behaviour of the program, we can calculate the execution time of the program on a new machine. Also, since we do not make any assumptions about instruction latencies in this model, we can also calculate the load balance of a program running on a specific machine, which addresses the issues raised in the concurrency page.