Embedded Chip-Level Integrated Parallel SupErcomputer (ECLIPSE) is an
architectural framework for general purpose chip multiprocessors (CMP) and
multiprocessor systems on chip (MP-SOC), but is extendable also to multichip
constellations [Forsell02]. It borrows many ideas from our early work on the
Instruction-Level Parallel Shared Memory (IPSM) machine originally reported in
[Forsell97] as well as earlier PRAM realization research [Ranade91,
Leppänen96] and network on chip (NOC) research [Jantsch03].
Unfortunately, the original ECLIPSE architecture supports only the exclusive
read exclusive write (EREW) PRAM model. EREW cannot match the performance of
the multioperation concurrent read concurrent write (MCRCW) PRAM: for a large
number of parallel computational problems it requires logarithmically longer
execution times, even when optimal parallel algorithms are used. In addition,
the original ECLIPSE does not execute functionalities with low thread-level
parallelism (low-TLP) efficiently because, for organizational reasons, it
features a relatively high minimum number of threads per processor. For a
functionality with only one thread, the utilization of a core can drop to as
low as the reciprocal of that minimum.
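The two limitations above can be illustrated with a minimal Python sketch. It is not part of the ECLIPSE tool chain: the function names, the sequential simulation, and the example value of 16 threads per processor are assumptions made for illustration. The first function sums values by pairwise tree reduction, where each round reads and writes every cell at most once (as EREW requires), and counts the rounds. The second models core utilization as the fraction of thread slots that hold an active thread.

```python
def erew_sum(values):
    """Sum by pairwise tree reduction and count the parallel rounds.

    Each round combines disjoint neighbour pairs, so no cell is read or
    written by more than one (simulated) thread, as EREW requires.
    """
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            paired.append(vals[-1])  # odd element is carried to the next round
        vals = paired
        rounds += 1
    return (vals[0] if vals else 0), rounds


def core_utilization(active_threads, threads_per_processor):
    """Fraction of thread slots doing useful work (illustrative model)."""
    if threads_per_processor <= 0:
        raise ValueError("threads_per_processor must be positive")
    return min(active_threads, threads_per_processor) / threads_per_processor


total, rounds = erew_sum(range(1024))
print(total, rounds)            # 1024 inputs need 10 rounds (log2 of 1024)
print(core_utilization(1, 16))  # hypothetical minimum of 16 threads per processor
```

Under these assumptions the reduction needs a number of rounds that grows logarithmically with the input size, whereas a multioperation on an MCRCW PRAM would complete the same sum in a single step.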
Our recent proposal for a universal general purpose CMP is the TOTAL ECLIPSE
architecture, which realizes the arbitrary MCRCW PRAM model and supports NUMA
execution for processor-wise thread bunches. This makes execution of low-TLP
functionalities as efficient as on standard sequential processors using the
NUMA convention [Forsell10, Forsell11]. The REPLICA architecture is an
improved version of the TOTAL ECLIPSE architecture that implements the
PRAM-NUMA model of computation. Its improvements include support for full NUMA
operation, a better memory system employing local memories and a
virtual/off-chip memory system, an I/O system, support for native floating
point operations, an improved communication network, improved memory modules
with halved operating frequency, and various architectural techniques that
reduce power consumption.
A REPLICA consists of P Tp-threaded, F-functional-unit MBTAC processor cores
(T = P·Tp threads in total) with dedicated instruction memory and local data
memory modules, P Tp-line step caches and scratchpads attached to the
processors, P fast data memory modules, and a high-bandwidth multimesh
interconnection network (see Figure 1).
Fig. 1. An early view of the REPLICA architecture.
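The relation between the parameters P, Tp, F, and T in the description above can be captured in a small configuration record. This is a sketch only: the class name and field names are hypothetical, P = 64 is taken from the 64-processor example in the Fig. 2 caption, and the values Tp = 512 and F = 5 are placeholders chosen for illustration, not figures from the REPLICA design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaConfig:
    """Illustrative parameter record; names are not from the REPLICA sources."""
    processors: int             # P: number of MBTAC processor cores
    threads_per_processor: int  # Tp: threads per core, also step cache lines
    functional_units: int       # F: functional units per core

    @property
    def total_threads(self) -> int:
        # T = P * Tp
        return self.processors * self.threads_per_processor

    @property
    def fast_data_memory_modules(self) -> int:
        # One fast data memory module per processor (P in total)
        return self.processors


# P = 64 as in the Fig. 2 example; Tp and F are placeholder values.
cfg = ReplicaConfig(processors=64, threads_per_processor=512, functional_units=5)
print(cfg.total_threads)
```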
New architectural techniques and ideas to be employed in REPLICA include, but
are not limited to:
- an easy-to-program strong MCRCW PRAM model of computation, implemented via
  multithreaded high-throughput computing
- combining the threads within a processor to mimic the NUMA model, in order
  to support sequential/NUMA legacy code
- truly scalable latency hiding via high-throughput computing and a
  high-bandwidth interconnect
- efficient wave synchronization, dropping the cost of synchronization from
  O(100) down to O(1/100)
- concurrent memory access for advanced parallel algorithms
- multioperations for computing prefixes and reductions in constant time
- virtual instruction-level parallelism exploitation
- pipeline hazard elimination
- memory hashing for eliminating hot spots in intercommunication
- implicitly synchronous multithreaded execution
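To make the multioperation idea concrete, the following Python sketch models what one multiprefix addition step returns. It is a sequential simulation of the semantics only, not of the hardware mechanism, and it rests on assumptions: the function name, the request format, and the choice that each thread receives the memory content plus the contributions of lower-numbered threads (an exclusive prefix ordered by thread id) are all illustrative.

```python
def multiprefix_add(memory, requests):
    """Simulate one multiprefix-add step on a shared memory.

    memory   -- dict mapping address -> current value (updated in place)
    requests -- iterable of (thread_id, address, value), one per thread
    Returns a dict mapping thread_id -> prefix value seen by that thread.
    """
    results = {}
    # Assumption: contributions are ordered by thread id within the step.
    for thread_id, address, value in sorted(requests):
        before = memory.get(address, 0)
        results[thread_id] = before        # sum of earlier contributions
        memory[address] = before + value   # running total ends up in memory
    return results


memory = {0: 10}
# Three threads target the same address within a single step.
prefixes = multiprefix_add(memory, [(0, 0, 1), (1, 0, 2), (2, 0, 3)])
print(prefixes)   # {0: 10, 1: 11, 2: 13}
print(memory[0])  # 16
```

The simulation takes time linear in the number of requests. The point of the technique listed above is that the machine delivers the same results for all threads within a single step, and a reduction is the special case in which only the final memory value is used.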
REFERENCES
[Forsell97] M. Forsell, Implementation of Instruction-Level and Thread-Level
Parallelism in Computers, Dissertations 2, Department of Computer Science,
University of Joensuu, Joensuu, 1997.
[Forsell02] M. Forsell, A Scalable High-Performance Computing Solution for
Network on Chips, IEEE Micro 22, 5 (September-October 2002), 46-55.
[Forsell10] M. Forsell, TOTAL ECLIPSE—An Efficient Architectural Realization
of the Parallel Random Access Machine, In Parallel and Distributed Computing
(edited by Alberto Ros), IN-TECH, Vienna, 2010, 39-64.
[Forsell11] M. Forsell, A PRAM-NUMA Model of Computation for Addressing
Low-TLP Workloads, International Journal of Networking and Computing 1, 1
(January 2011), 21-35.
[Jantsch03] A. Jantsch and H. Tenhunen (editors), Networks on Chip, Kluwer
Academic Publishers, Boston, 2003, 173-192.
[Leppänen10] V. Leppänen, M. Penttonen and M. Forsell, Layouts for Sparse
Networks Supporting Throughput Computing, In the proceedings of the 2010
International Conference on Parallel and Distributed Processing Techniques and
Applications (PDPTA’10), July 12-15, 2010, Las Vegas, USA, 443-449.
[Ranade91] A. Ranade, How to Emulate Shared Memory, Journal of Computer and
System Sciences 42 (1991), 307-326.
Fig. 2. Early block diagrams of Mc-way double acyclic multimesh network (top), superswitch (middle), and switch element (bottom) for a 64-processor REPLICA CMP.