Technical Sessions
The following are questions people asked after the presentations. Again,
they are not complete, for the reasons listed on the front page. Questions
and comments from the audience are in bold italic; answers by the presenters
are in normal type.
The Case for a Single-Chip Multiprocessor
- What about the OS? It could behave differently. Was that included
in the measurements? The OS was involved in the pmake benchmark.
The total timing included OS activity, but there was no separate study
that attempted to isolate OS performance.
- The choice of benchmarks was questionable...
- How does this compare to the less conventional research conducted
elsewhere (such as the M-Machine, I think)? This approach
requires less interaction and coordination among the processors, so it
requires less real estate for control.
- Studies have shown that the real performance bottleneck (at least
for databases and OLTP environments) is off-chip bandwidth. How does
your approach work for these workloads? This approach requires
the applications to be reprogrammed (either by the compiler or the
programmer) to use multiple threads explicitly. The multiple threads give
the OLTP programs a better chance of tolerating latency.
Synchronization and Communication in the T3E
Multiprocessor
- It seems that more steps are involved in a remote memory operation
compared to the T3D. What effect does that have on latency? The
use of "E-registers" introduced one drawback for the T3E: its
latency is actually worse than that of its predecessor, the T3D.
The Rio File Cache: Surviving Operating System
Crashes
- How does this prevent the user from mmap'ing
the file and then screwing it up? It doesn't. Rio doesn't prevent
users from screwing themselves. It prevents the kernel from screwing the
file system.
- This seems awfully similar to the micro-kernel
idea. You have a small core kernel and the rest of the functionalities
are provided by subsystems that are protected from each other.
- Forcing kernel physical addresses through
the TLB might incur a performance hit by polluting the TLB.
Multiple-Block Ahead Branch Predictors
- Your benchmarks are statically linked Ultrix
binaries. Experience shows that the branch behavior of dynamically linked
binaries is very different.
- Again, the use of SPEC benchmarks is questionable.
Analysis of Branch Prediction Via Data Compression
- A 2-bit counter may be better than the Markov
model. Therefore you can't claim the Markov model is an upper bound.
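For context, the 2-bit saturating counter at issue in this objection is the classic textbook predictor; the sketch below is a generic illustration (my own, not code from the paper), showing why it does well on loop-closing branches:

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor.
# Counter values 0-1 predict not-taken, 2-3 predict taken; saturation
# means a single anomalous outcome cannot flip a strongly biased branch.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start weakly taken

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# A loop-closing branch: taken 9 times, then not taken at loop exit,
# repeated for three invocations of the loop.
p = TwoBitPredictor()
hits = 0
for taken in ([True] * 9 + [False]) * 3:
    if p.predict() == taken:
        hits += 1
    p.update(taken)
print(hits)  # 27 of 30: only the loop-exit branches mispredict
```

The point of the question is that a table of such counters is not a true Markov model of the branch stream, so the compression-based Markov bound does not automatically dominate it.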
Value Locality and Load Value Prediction
- Floating-point values are less predictable,
and FP operations tend to have more ILP, so this technique is less useful
for them. Others say that they have looked at FP values independently; some
bits, though not all, of the FP values are still very predictable.
- Might want to look into exploiting the
zero values. Many predictable values are zeros.
- In an MP, coherence might require the invalidation
of the prediction tables. This could get tricky.
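The mechanism being discussed in these bullets can be sketched as a last-value predictor: a table indexed by load PC remembers the value each load last returned and predicts it will recur. The table size, hashing, and PC below are all hypothetical; this is an illustration of the idea, not the authors' implementation:

```python
# Sketch of last-value load prediction: a hypothetical 1024-entry
# direct-mapped table maps a load's PC to the value that load last
# returned. (In an MP, coherence traffic would also have to invalidate
# these entries -- the complication raised in the last bullet.)

TABLE_SIZE = 1024  # assumed size, for illustration only

class LoadValuePredictor:
    def __init__(self):
        self.table = {}  # pc index -> last observed value

    def predict(self, pc):
        # Predicted value, or None on a table miss.
        return self.table.get(pc % TABLE_SIZE)

    def update(self, pc, actual_value):
        self.table[pc % TABLE_SIZE] = actual_value

pred = LoadValuePredictor()
hits = 0
# A load that mostly returns 0 -- the common "zero value" case above.
for value in [0, 0, 0, 0, 7, 0]:
    if pred.predict(0x40B0) == value:  # hypothetical load PC
        hits += 1
    pred.update(0x40B0, value)
print(hits)  # 3: the repeated zeros predict, the transitions miss
```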
Reducing Network Latency Using Subpages in
a Global Memory Environment
- The base case seems to be a very poor implementation. It seems that
you could get a lot of improvement simply by doing a better job of
pipelining on the sender and the AN2 network controller.
- How complicated is the implementation for architectures that
have an intrinsic 4K page size? The 2K subpages are not complicated
to implement at all. The main concern is TLB coverage; the rest is just
simple hacking.
- (This is a question I asked Hank in person afterwards.) The 2K
subpages are not quite pages. Although you get the benefit of pipelining,
you could still be transferring unnecessary data, because all the subpages
within a real page are always brought in even when they are not all needed.
So why not just use 2K pages and eliminate this subpage hack? I realize
doing so might introduce a TLB coverage problem, but that's a known problem
with known solutions, i.e., you could have a 2K page size plus some superpage
size. Why bother with a complicated three-level solution of subpage/page/superpage
when a two-level solution should suffice? This could be true. One
concern I have, other than the TLB coverage problem, is the page fault overhead,
which could increase with a 2K page size.
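The pipelining benefit that both sides of this exchange take for granted can be quantified with a toy restart-latency model. All the numbers below are invented for illustration; none come from the paper:

```python
# Back-of-the-envelope model of why subpaging (or small pages) helps:
# the faulting processor can restart as soon as the first subpage
# arrives, overlapping the rest of the page transfer with computation.
# Both parameters below are made-up illustrative values.

PER_MSG_OVERHEAD_US = 30.0   # fixed software send/receive overhead
WIRE_US_PER_KB = 5.0         # transfer time per KB on the wire

def restart_latency_us(page_kb, unit_kb):
    """Time until the faulting access can be satisfied, assuming the
    needed word lands in the first transfer unit delivered."""
    if unit_kb >= page_kb:
        # Whole-page transfer: must wait for the entire page.
        return PER_MSG_OVERHEAD_US + page_kb * WIRE_US_PER_KB
    # Pipelined units: only the first one must arrive before restart.
    return PER_MSG_OVERHEAD_US + unit_kb * WIRE_US_PER_KB

whole = restart_latency_us(4, 4)     # 30 + 4*5 = 50 us
subpaged = restart_latency_us(4, 2)  # 30 + 2*5 = 40 us
print(whole, subpaged)
```

Under this model a plain 2K page and a 2K subpage restart equally fast; the difference, as the question notes, is that subpaging still drags in the rest of the 4K page while 2K pages transfer only what faults.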
Improving Cache Performance with Balanced Tag
and Data Paths
- The study assumes a blocking cache, a limitation
compared to a more realistic memory system.
Randy Wang (rywang.public@gmail.com)
(last updated on October 10, 1996)