Second NOW/Cluster Workshop
I thought the last panel session was one of the more interesting ones.
The panelists reflected on the past, present, and future of cluster
computing research. All seemed to agree that clustered SMPs are the future.
We have made significant progress, but much work still needs to be done
in the areas of fault tolerance, system administration, and the programming
model. We need to address these problems in order to meet the challenges
posed by SMPs.
Arvind (MIT)
- Two- to four-node clusters are not interesting. They will be wiped out
by SMPs. The larger systems are the more interesting ones.
- Clusters are here to stay. Important applications (such as databases,
OLTP systems, and web servers) need incremental growth, and clusters are
the only viable alternative.
- The research challenge is how to make a large cluster look like an
SMP. One thing we need to provide is a high-level multi-threaded programming
model so programmers don't have to reason about clusters (see the sketch
after this list).
- Addressing fault tolerance is important.
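To make Arvind's point concrete, here is a minimal sketch (mine, not
from the talk) of the kind of code such a model would let programmers
write: ordinary shared-memory threads, with a hypothetical DSM runtime
left to place the threads and back the shared arrays across the nodes
of a cluster. Compile with -lpthread.

    /* Plain POSIX-threads summation; nothing cluster-specific appears.
     * Under the kind of model Arvind described, a DSM runtime would
     * (hypothetically) spread these threads and the shared arrays
     * across cluster nodes. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N        1000000            /* divisible by NTHREADS */

    static double data[N];              /* "shared memory" -- DSM-backed on a cluster */
    static double partial[NTHREADS];

    static void *sum_range(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += data[i];
        partial[id] = s;                /* no node names, no messages */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        double total = 0.0;
        for (long i = 0; i < N; i++)
            data[i] = 1.0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sum_range, (void *)i);
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("sum = %g\n", total);    /* prints sum = 1e+06 */
        return 0;
    }

The point is what is absent: the same source could run on one SMP or,
under a suitable DSM layer, across a cluster, which is exactly the
"don't reason about clusters" property Arvind asked for.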
David Wood (University of Wisconsin)
- Before NOW:
- the goal was utilizing idle workstations, e.g., Condor.
- "turn-key" applications, e.g., Tiger.
- lousy network.
- lousy portability.
- lousy administration.
- NOW:
- everything is still lousy.
- but we have made some progress in the areas of DSM, networking,
cooperative caching, and scheduling.
- Challenges:
- the question is which of the following two alternatives is better:
- SMPs plus "zero-admin" clients, or
- desktop workstations.
- SMPs are getting bigger and faster. Your best SAN (system area
network) is still a very lousy ICN (interconnection network).
- the future is clusters of SMPs.
- Prospects:
- DSM support, making apps portable and reliable.
- very fast network (< 1 us latency).
- user administration.
Ajei Gopal (IBM Power Parallel)
The following are the most important problems we need to address for
successful cluster computing.
- Availability: we need to recover all the failed components (the OS,
the network, the application), a task that is difficult for SMPs.
- Scalability.
- Easy programming model.
- Seamless growth.
- Better system administration.
- Disaster protection (geographical dispersal, for example, is almost
impossible for a very tightly coupled system).
Thorsten von Eicken (Cornell University)
- Progress and future research:
- We have learned how to deliver wire performance to applications (see
the sketch after this list).
- We still lack higher level tools to manage:
- fault tolerance,
- load balancing, and
- system administration.
- A lot of what we do depends on the definition of clusters:
- the original point was to give us a 1.5-2 year edge over the best
MPPs.
- the proposed solution was commodity hardware and software.
- in reality, clusters probably mean the following:
- 90% of the systems are 2-node (failover) systems.
- 10% of the systems are small-scale clusters (I forgot the exact terms
he used).
- 0.0001% of the systems are true MPPs in disguise.
- Open Issues -- these depend on the definition of clusters and on our goals.
- If we are interested in "MPPs in disguise", then we need
to examine:
- what fraction of the components are commodity parts,
- the parallel programming model,
- NIC design, and
- language support.
- If we are interested in harvesting idle cycles, then let's talk about
applications.
- If we are interested in back-office applications, then fault tolerance
is important.
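On the first point above, here is a rough sketch (mine, not from the
talk) of the active-message style of interface behind "delivering wire
performance": the sender names a handler that the receiver runs as soon
as the message arrives, keeping OS calls and extra buffering off the
critical path. The am_* names are hypothetical, and the "network" here
is a local in-memory queue so the sketch stays self-contained; a real
layer would write directly to memory-mapped NIC queues.

    #include <stdio.h>
    #include <string.h>

    typedef void (*am_handler_t)(const void *payload, int nbytes);

    struct am_msg { am_handler_t handler; int nbytes; char payload[64]; };
    static struct am_msg queue[16];     /* stand-in for the NIC's receive queue */
    static int head, tail;

    /* Send = enqueue the handler plus a small payload; a real layer
     * would write these bytes straight to the NIC's send queue. */
    static void am_send(am_handler_t handler, const void *payload, int nbytes)
    {
        struct am_msg *m = &queue[tail++ % 16];
        m->handler = handler;
        m->nbytes = nbytes;
        memcpy(m->payload, payload, nbytes);
    }

    /* Receive = poll the queue and run each message's handler
     * immediately; there is no intermediate buffering layer. */
    static void am_poll(void)
    {
        while (head != tail) {
            struct am_msg *m = &queue[head++ % 16];
            m->handler(m->payload, m->nbytes);
        }
    }

    static void on_greeting(const void *payload, int nbytes)
    {
        printf("handler got: %.*s\n", nbytes, (const char *)payload);
    }

    int main(void)
    {
        am_send(on_greeting, "hello", 5);
        am_poll();                      /* prints: handler got: hello */
        return 0;
    }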
Greg Papadopoulos (Sun Microsystems)
- Good News:
- we have found a NOW killer app -- the scalable network service (referring
to extending the Inktomi experience; the idea is that any network
service demands incremental growth, and the software running there is largely
invisible to the end user).
- we are engaged in competitions in the right directions (referring to,
for example, the battle for the lowest latency).
- convergence on memory semantics (referring to some common characteristics
provided by cluster interconnects, such as non-coherent read/write, strong
ordering, and error models).
- significant movement in the multi-user and multi-process direction.
- Bad News:
- end-user experience? i.e., not much experience with new media.
- I/O?
- processor/memory? which growth curve are we attached to?
- we are building I/O channels. everything we do is limited by PCI:
- latency and bandwidth.
- visibility of memory transactions.
- 2-10 us today is like 30-50 us in 1990; extrapolating the same rate
of improvement, in about six more years we should expect 200 ns latency?
- we are feeling the squeeze from both directions:
- real DSM machines attack from above, e.g., SGI Spider.
- Ethernet + PCs attack from below.
- Issues:
- high availability / fault tolerance.
- not a problem (for the communication layer).
- fundamentally an end-to-end problem.
- quality of service (availability is part of it).
- management.
- need a better OS view of resource management (I think he was referring
to some issues brought up in some of the previous panels, such as Alan's
point that AM2's paging endpoints could use a well-defined OS interface).
Randy Wang (rywang.public@gmail.com)
(last updated on October 10, 1996)