Technical Sessions
The following are questions people asked after the presentations. Again,
they are not complete, for the reasons listed on the front page. Questions
and comments from the audience are in bold italic; answers by the presenters
are in normal type.
The Case for a Single-Chip Multiprocessor
- What about the OS? It could behave differently. Was that included
in the measurements? The OS was involved in the pmake benchmark.
The total timing included OS activity, but there was no separate study
that attempted to isolate OS performance.
- The choice of benchmarks was questionable...
- How does this compare to the less conventional research conducted
elsewhere (such as the M-Machine, I think)? This approach
requires less interaction and coordination among the processors, so it
requires less real estate for control.
- Studies have shown that the real performance bottleneck (at least
for databases and OLTP environments) is off-chip bandwidth. How does
your approach work for these workloads? This approach requires
the applications to be reprogrammed (either by the compiler or the
programmer) to use multiple threads explicitly. The multiple threads give
the OLTP programs a better chance of tolerating latency.
Synchronization and Communication in the T3E
Multiprocessor
- It seems that more steps are involved in a remote memory operation
compared to the T3D. What effect does that have on latency? The
use of "E-registers" introduced one drawback for the T3E: its
latency is actually worse than that of its predecessor, the T3D.
The Rio File Cache: Surviving Operating System
Crashes
- How does this prevent the user from mmap'ing
the file and then screwing it up? It doesn't. Rio doesn't prevent
users from screwing themselves. It prevents the kernel from screwing the
file system.
- This seems awfully similar to the micro-kernel
idea. You have a small core kernel and the rest of the functionalities
are provided by subsystems that are protected from each other.
- Forcing kernel physical addresses through
the TLB might incur a performance hit by polluting the TLB.
Multiple-Block Ahead Branch Predictors
- Your benchmarks are statically linked Ultrix
binaries. Experience shows that the branch behavior of dynamically linked
binaries is very different.
- Again, the use of SPEC benchmarks is questionable.
Analysis of Branch Prediction Via Data Compression
- A 2-bit counter may be better than the Markov
model. Therefore you can't claim the Markov model is an upper bound.
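For context, the 2-bit saturating counter at issue in this objection is the classic textbook predictor; the sketch below is a generic illustration (my own, not code from the paper), showing why it does well on loop-closing branches:

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor.
# Counter values 0-1 predict not-taken, 2-3 predict taken; saturation
# means a single anomalous outcome cannot flip a strongly biased branch.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start weakly taken

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# A loop-closing branch: taken 9 times, then not taken at loop exit,
# repeated for three invocations of the loop.
p = TwoBitPredictor()
hits = 0
for taken in ([True] * 9 + [False]) * 3:
    if p.predict() == taken:
        hits += 1
    p.update(taken)
print(hits)  # 27 of 30: only the loop-exit branches mispredict
```

The point of the question is that a table of such counters is not a true Markov model of the branch stream, so the compression-based Markov bound does not automatically dominate it.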
Value Locality and Load Value Prediction
- Floating-point values are less predictable,
and FP operations tend to have more ILP, so this technique is less useful
for them. Others say that they have looked at FP values independently; some
bits, though not all, of the FP values are still very predictable.
- Might want to look into exploiting the
zero values. Many predictable values are zeros.
- In an MP, coherence might require the invalidation
of the prediction tables. This could get tricky.
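The mechanism being discussed in these bullets can be sketched as a last-value predictor: a table indexed by load PC remembers the value each load last returned and predicts it will recur. The table size, hashing, and PC below are all hypothetical; this is an illustration of the idea, not the authors' implementation:

```python
# Sketch of last-value load prediction: a hypothetical 1024-entry
# direct-mapped table maps a load's PC to the value that load last
# returned. (In an MP, coherence traffic would also have to invalidate
# these entries -- the complication raised in the last bullet.)

TABLE_SIZE = 1024  # assumed size, for illustration only

class LoadValuePredictor:
    def __init__(self):
        self.table = {}  # pc index -> last observed value

    def predict(self, pc):
        # Predicted value, or None on a table miss.
        return self.table.get(pc % TABLE_SIZE)

    def update(self, pc, actual_value):
        self.table[pc % TABLE_SIZE] = actual_value

pred = LoadValuePredictor()
hits = 0
# A load that mostly returns 0 -- the common "zero value" case above.
for value in [0, 0, 0, 0, 7, 0]:
    if pred.predict(0x40B0) == value:  # hypothetical load PC
        hits += 1
    pred.update(0x40B0, value)
print(hits)  # 3: the repeated zeros predict, the transitions miss
```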
Reducing Network Latency Using Subpages in
a Global Memory Environment
- The base case seems to be a very poor implementation. It seems that
you could get a lot of improvement simply by doing a better job of
pipelining on the sender and the AN2 network controller.
- How complicated is the implementation for architectures that
have an intrinsic 4K page size? The 2K subpages are not complicated
to implement at all. The main concern is TLB coverage; the rest is just
simple hacking.
- (This is a question I asked Hank in person afterwards.) The 2K
subpages are not quite pages. Although you get the benefit of pipelining,
you could still be transferring unnecessary data, because all the subpages
within a real page are always brought in even when they are not all needed.
So why not just use 2K pages and eliminate this subpage hack? I realize
doing so might introduce a TLB coverage problem, but that's a known problem
with known solutions, i.e., you could have a 2K page size plus some superpage
size. Why bother with a complicated three-level solution of subpage/page/superpage
when a two-level solution should suffice? This could be true. One
concern I have, other than the TLB coverage problem, is the page fault overhead,
which could increase with a 2K page size.
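The pipelining benefit that both sides of this exchange take for granted can be quantified with a toy restart-latency model. All the numbers below are invented for illustration; none come from the paper:

```python
# Back-of-the-envelope model of why subpaging (or small pages) helps:
# the faulting processor can restart as soon as the first subpage
# arrives, overlapping the rest of the page transfer with computation.
# Both parameters below are made-up illustrative values.

PER_MSG_OVERHEAD_US = 30.0   # fixed software send/receive overhead
WIRE_US_PER_KB = 5.0         # transfer time per KB on the wire

def restart_latency_us(page_kb, unit_kb):
    """Time until the faulting access can be satisfied, assuming the
    needed word lands in the first transfer unit delivered."""
    if unit_kb >= page_kb:
        # Whole-page transfer: must wait for the entire page.
        return PER_MSG_OVERHEAD_US + page_kb * WIRE_US_PER_KB
    # Pipelined units: only the first one must arrive before restart.
    return PER_MSG_OVERHEAD_US + unit_kb * WIRE_US_PER_KB

whole = restart_latency_us(4, 4)     # 30 + 4*5 = 50 us
subpaged = restart_latency_us(4, 2)  # 30 + 2*5 = 40 us
print(whole, subpaged)
```

Under this model a plain 2K page and a 2K subpage restart equally fast; the difference, as the question notes, is that subpaging still drags in the rest of the 4K page while 2K pages transfer only what faults.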
Improving Cache Performance with Balanced Tag
and Data Paths
- The study assumes a blocking cache, a limitation
compared to a more realistic memory system.
Randy Wang (rywang.public@gmail.com)
(last updated on October 10, 1996)