IBM Research
Achieving Strong Scaling On Blue Gene/L: Case
Study with NAMD

Sameer Kumar,
Blue Gene Software Group,
IBM T J Watson Research Center,
Yorktown Heights, NY
sameerk@us.ibm.com

© 2005 IBM Corporation


Outline
▪ Motivation
▪ NAMD and Charm++
▪ BGL Techniques
  – Problem mapping
  – Overlap of communication with computation
  – Grain size
  – Load-balancing
  – Communication optimizations
▪ Summary

Blue Gene/L


Blue Gene/L System Packaging

System (64 Racks, 64x32x32): 180/360 TF/s, 32 TB
Rack (32 Node Cards): 2.8/5.6 TF/s, 512 GB
Node Card (32 chips, 4x4x2; 16 compute, 0-2 I/O cards): 90/180 GF/s, 16 GB
Compute Card (2 chips, 1x2x1): 5.6/11.2 GF/s, 1.0 GB
Chip (2 processors): 2.8/5.6 GF/s, 4 MB


Application Scaling
▪ Weak scaling
  – Problem size increases with the number of processors
▪ Strong scaling
  – Constant problem size
  – Linear to sub-linear decrease in computation time with more processors
  – Cache performance
  – Communication overhead
    • Communication-to-computation ratio
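As a quick reference, the standard definitions of strong-scaling speedup and parallel efficiency on P processors are:

    $$ S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P} $$

Strong scaling is considered good when E(P) stays close to 1 as P grows.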


Scaling on Blue Gene/L
▪ Several applications have demonstrated weak scaling
▪ Strong scaling on a large number of benchmarks still needs to be achieved

NAMD and Charm++


NAMD: A Production MD program
▪ Fully featured program
▪ NIH-funded development
▪ Distributed free of charge (thousands of downloads so far)
▪ Binaries and source code
▪ Installed at NSF centers
▪ User training and support
▪ Large published simulations (e.g., the aquaporin simulation featured in the keynote)

Aquaporin Simulation


NAMD, CHARMM27, PME
NpT ensemble at 310 or 298 K
1 ns equilibration, 4 ns production
Protein: ~15,000 atoms
Lipids (POPE): ~40,000 atoms
Water: ~51,000 atoms
Total: ~106,000 atoms
3.5 days/ns - 128 O2000 CPUs
11 days/ns - 32 Linux CPUs
0.35 days/ns - 512 LeMieux CPUs

F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001)
M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)


Molecular Dynamics in NAMD
▪ Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (10,000 - 500,000)
▪ At each time-step
  – Calculate forces on each atom
    • Bonded terms
    • Non-bonded: electrostatic and van der Waals
      – Short-range: every timestep
      – Long-range: using PME (3D FFT)
      – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions
▪ Challenge: femtosecond time-step, millions of steps needed!
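To make the per-timestep structure concrete, here is a minimal velocity Verlet step in C++ (an illustrative sketch, not NAMD source; computeForces stands in for the bonded and non-bonded force evaluation above):

    #include <vector>

    struct Vec3 { double x, y, z; };

    // One velocity Verlet timestep: half-kick, drift, force evaluation, half-kick.
    void timestep(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
                  std::vector<Vec3>& f, const std::vector<double>& mass, double dt,
                  void (*computeForces)(const std::vector<Vec3>&, std::vector<Vec3>&))
    {
        for (size_t i = 0; i < pos.size(); ++i) {
            vel[i].x += 0.5 * dt * f[i].x / mass[i];   // half-kick
            vel[i].y += 0.5 * dt * f[i].y / mass[i];
            vel[i].z += 0.5 * dt * f[i].z / mass[i];
            pos[i].x += dt * vel[i].x;                 // drift
            pos[i].y += dt * vel[i].y;
            pos[i].z += dt * vel[i].z;
        }
        computeForces(pos, f);                         // bonded + non-bonded forces
        for (size_t i = 0; i < pos.size(); ++i) {
            vel[i].x += 0.5 * dt * f[i].x / mass[i];   // second half-kick
            vel[i].y += 0.5 * dt * f[i].y / mass[i];
            vel[i].z += 0.5 * dt * f[i].z / mass[i];
        }
    }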


NAMD Benchmarks

BPTI: 3K atoms
Estrogen Receptor: 36K atoms (1996)
ATP Synthase: 327K atoms (2001)


Parallel MD: Easy or Hard?
▪ Easy
  – Tiny working data
  – Spatial locality
  – Uniform atom density
  – Persistent repetition
  – Multiple time-stepping
▪ Hard
  – Sequential timesteps
  – Very short iteration time
  – Full electrostatics
  – Fixed problem size
  – Dynamic variations

NAMD Computation
▪ Application data divided into data objects called patches
  – Sub-grids whose size is determined by the cutoff radius
▪ Computation performed by migratable computes
  – 13 computes per patch pair, hence much more parallelism
  – Computes can be further split to increase parallelism


NAMD

▪ Scalable molecular dynamics simulation
▪ Two types of objects, patches and computes, to expose more parallelism
▪ Requires more careful load balancing

Communication to Computation Ratio

▪ Scalable
  – Constant with the number of processors
  – In practice grows at a very small rate
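A rough way to see why: patch size is fixed by the cutoff, so both the communication and the computation attached to one patch are roughly constant, and the per-processor ratio is nearly independent of the processor count:

    $$ \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \approx \frac{(\text{patches/proc}) \cdot c_{\mathrm{comm}}}{(\text{patches/proc}) \cdot c_{\mathrm{comp}}} = \frac{c_{\mathrm{comm}}}{c_{\mathrm{comp}}} $$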


Charm++ and Converse
(Figure: the user view is a collection of interacting objects; in the system implementation, an interface layer and a per-processor scheduler drive send and receive message queues over the network.)
▪ Charm++: object-based, asynchronous, message-driven parallel programming paradigm
▪ Converse: communication layer for Charm++
  – Send, receive, and progress at the node level
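As a flavor of the model, a minimal generic Charm++ example (not NAMD code): invoking an entry method on a chare is an asynchronous message send, and the scheduler runs the method when the message is delivered.

    // hello.ci (interface description)
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      chare Hello {
        entry Hello();
        entry void greet(int from);
      };
    };

    // hello.C
    #include "hello.decl.h"

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        delete m;
        CProxy_Hello h = CProxy_Hello::ckNew();  // create a chare on some processor
        h.greet(CkMyPe());                       // asynchronous entry-method invocation
      }
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      void greet(int from) {                     // runs when the scheduler picks up the message
        CkPrintf("hello on PE %d, sent from PE %d\n", CkMyPe(), from);
        CkExit();
      }
    };

    #include "hello.def.h"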

Optimizing NAMD on Blue Gene/L


Single Processor Performance
▪ Worked with IBM Toronto for 3 weeks
  – Inner loops slightly altered to enable software pipelining
  – Aliasing issues resolved through the use of #pragma disjoint (*ptr1, *ptr2)
  – 40% serial speedup
  – Current best performance is with the 440 target
▪ Continued efforts with Toronto to get good 440d (dual FPU) performance
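A hedged illustration of the pragma (not NAMD's actual loop; names are made up): with IBM XL C/C++, #pragma disjoint asserts that the listed pointers never refer to the same storage, which removes the aliasing hazard that blocks software pipelining.

    // Hypothetical inner loop; 'forces' and 'coords' are illustrative names.
    void scale_add(double *forces, double *coords, double s, int n)
    {
    #pragma disjoint(*forces, *coords)   /* XL: these pointers never alias */
        for (int i = 0; i < n; ++i)
            forces[i] += s * coords[i];  /* loop can now be software-pipelined */
    }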


NAMD on BGL
▪ Advantages
  – Both the application and the hardware are 3D grids
  – Large 4 MB L3 cache
    • On a large number of processors NAMD will run from L3
  – Higher bandwidth for short messages
    • Midpoint of peak bandwidth is reached quickly
  – Six outgoing links from each node
  – No OS daemons


NAMD on BGL
▪ Disadvantages
  – Slow embedded CPU
  – Small memory per node
  – Low bisection bandwidth
  – Hard to scale full electrostatics
  – Limited support for overlap of computation and communication
    • No cache coherence between the two cores


BGL Parallelization
▪ Topology-driven problem mapping
▪ Load-balancing schemes
▪ Overlap of computation and communication
▪ Communication optimizations


Problem Mapping

(Figure: the application data space, a 3D grid with X, Y, and Z dimensions, alongside the X x Y x Z processor grid.)


Problem Mapping
(Figure: the application data space oriented to align its X, Y, and Z dimensions with the processor grid.)


Problem Mapping

(Figure: the application data space mapped onto the X x Y x Z processor grid.)


Problem Mapping

(Figure: data objects (patches) and cutoff-driven compute objects placed on the X x Y x Z processor grid.)
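A minimal sketch of topology-aware placement (hypothetical, not NAMD's actual mapping code): a patch at index (px, py, pz) in the patch grid is assigned a torus coordinate by scaling each dimension proportionally, so neighboring patches land on nearby nodes.

    struct Coord { int x, y, z; };

    // Map patch (px,py,pz) of a PX x PY x PZ patch grid onto a TX x TY x TZ torus.
    Coord mapPatchToTorus(int px, int py, int pz,
                          int PX, int PY, int PZ,
                          int TX, int TY, int TZ)
    {
        Coord c;
        c.x = (px * TX) / PX;   // proportional scaling in each dimension keeps
        c.y = (py * TY) / PY;   // neighboring patches on nearby torus nodes
        c.z = (pz * TZ) / PZ;
        return c;
    }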


Two Away Computation
▪ Each data object (patch) is split along a dimension
  – Patches now interact with neighbors of neighbors
  – Makes the application more fine-grained
  – Improves load balancing
  – Messages of smaller size are sent to more processors
  – Improves torus bandwidth utilization


Two Away X


Load Balancing Steps
(Figure: load-balancing timeline: regular timesteps, then instrumented timesteps, then a detailed, aggressive load balancing step, followed by refinement load balancing.)


Load-balancing Metrics
▪ Balancing load
▪ Minimizing communication hop-bytes (see the sketch below)
  – Place computes close to their patches
  – Biased through placement of proxies on near neighbors
▪ Minimizing the number of proxies
  – Affects the connectivity of each data object
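A minimal sketch of the hop-bytes metric (illustrative, not the Charm++ load balancer's code): each message is weighted by the Manhattan hop distance it travels on the torus, and the balancer tries to keep the total small.

    #include <cstdlib>

    // Hop distance between two coordinates in one torus dimension (with wraparound).
    static int torusHops(int a, int b, int dim)
    {
        int d = std::abs(a - b);
        return d < dim - d ? d : dim - d;
    }

    // Hop-bytes contribution of one message of 'bytes' bytes sent from
    // node (sx,sy,sz) to node (dx,dy,dz) on a TX x TY x TZ torus.
    long hopBytes(int bytes, int sx, int sy, int sz, int dx, int dy, int dz,
                  int TX, int TY, int TZ)
    {
        int hops = torusHops(sx, dx, TX) + torusHops(sy, dy, TY) + torusHops(sz, dz, TZ);
        return (long)bytes * hops;
    }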


Overlap of Computation and Communication

▪ Each FIFO has 4 packet buffers
▪ The progress engine should be called every ~4400 cycles
▪ Each call has an overhead of about 200 cycles
  – Roughly a 5% increase in computation time
▪ The remaining time can be used for computation
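The ~5% figure follows from the numbers above: a ~200-cycle progress call issued every ~4400 cycles costs about 200 / 4400 ≈ 4.5% of the compute time.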


Network Progress Calls
▪ NAMD makes progress engine calls from the compute loops
  – Typical frequency is 10000 cycles, dynamically tunable

Compute loop with embedded progress calls:

    for ( i = 0; i < (i_upper SELF(- 1)); ++i ) {
      CmiNetworkProgress();
      const CompAtom &p_i = p_0[i];
      // ...
      // Compute Pairlists
      for (k=0; k< ...

Progress call, rate-limited by a cycle counter:

    void CmiNetworkProgress() {
      new_time = rts_get_timebase();
      if (new_time < lastProgress + PERIOD) {
        return;
      }
      lastProgress = new_time;
      // ... advance the communication engine
    }

MPI Scalability
▪ Charm++ MPI driver
  – MPI_Iprobe-based implementation (see the sketch below)
  – Higher progress overhead when polling with MPI_Test
  – Statically pinned FIFOs for point-to-point communication
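For reference, a minimal sketch (not the actual Charm++ machine layer) of an MPI_Iprobe-based polling step that a message-driven runtime can use to pull in whichever message arrives next:

    #include <mpi.h>
    #include <vector>

    // Poll once: if any message is pending, receive it and hand it to the scheduler.
    bool pollNetwork(MPI_Comm comm)
    {
        int flag = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &st);
        if (!flag) return false;

        int nbytes = 0;
        MPI_Get_count(&st, MPI_BYTE, &nbytes);
        std::vector<char> buf(nbytes);
        MPI_Recv(buf.data(), nbytes, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
                 comm, MPI_STATUS_IGNORE);
        // handToScheduler(buf) would enqueue the message for Converse (hypothetical helper)
        return true;
    }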


Charm++ Native Driver
▪ BGX message layer (developed by George Almasi)
  – Lower progress overhead
  – Active messages
    • Make it easy to design complex communication protocols
  – Dynamic FIFO mapping
  – Low-overhead remote memory access
  – Interrupts
▪ The Charm++ BGX driver was developed by Chao Huang over this summer


BG/L Msglayer
(Figure: structure of the BGX message layer. Message objects (SpadMessage, TreeMessage, TorusMessage, TorusDirectMessage) are posted to per-network message queues (scratchpad, torus, collective network). The advance loop drains these queues, performs FIFO pinning onto the torus injection FIFOs (I0, I1) and reception FIFOs (R0, R1: x+, x-, y+, y-, z+, z-, H), and produces TorusPacket and TreePacket packets, which can be deterministically or dynamically routed; incoming packets are dispatched through the torus and collective-network packet registries.)

(This slide is taken from G. Almási's talk on the "new" msglayer.)


Optimized Multicast

▪ pinFifo algorithms (a hypothetical rule is sketched below)
  – Decide which of the 6 injection FIFOs to use when sending a message to {x, y, z, t}
  – Cones, Chessboard
▪ Dynamic FIFO mapping
  – A special send queue whose messages can be injected from whichever FIFO is not full
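One plausible pinFifo rule, sketched below as a guess (the actual Cones and Chessboard heuristics differ): send the message through the injection FIFO for the direction in which the packet must travel farthest on the torus.

    #include <cstdlib>

    // Returns a FIFO index 0..5 standing for directions x+, x-, y+, y-, z+, z-.
    int pinFifo(int sx, int sy, int sz, int dx, int dy, int dz,
                int TX, int TY, int TZ)
    {
        // signed shortest displacement in one torus dimension
        auto disp = [](int s, int d, int dim) {
            int fwd = (d - s + dim) % dim;               // hops in the + direction
            return (fwd <= dim - fwd) ? fwd : fwd - dim; // or go the - way if shorter
        };
        int v[3] = { disp(sx, dx, TX), disp(sy, dy, TY), disp(sz, dz, TZ) };
        int best = 0;
        for (int i = 1; i < 3; ++i)
            if (std::abs(v[i]) > std::abs(v[best])) best = i;
        return 2 * best + (v[best] < 0 ? 1 : 0);         // longest axis, +/- direction
    }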


Communication Pattern in PME

(Figure: communication pattern in PME, shown as a 108 x 108 processor communication matrix.)


PME
▪ Plane decomposition for the 3D FFT
▪ PME objects placed close to patch objects on the torus
▪ PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
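For background, PME splits the electrostatic sum into a short-range part evaluated directly within the cutoff every timestep and a smooth reciprocal-space part evaluated on a charge grid with a 3D FFT (standard correction terms omitted):

    $$ E_{\mathrm{elec}} = \frac{1}{2} \sum_{i \neq j,\; r_{ij} < r_c} \frac{q_i q_j\, \operatorname{erfc}(\beta r_{ij})}{r_{ij}} \; + \; E_{\mathrm{recip}} $$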

Performance Results


BGX Message layer vs MPI
▪ The fully non-blocking version performed below par on MPI
  – Polling overhead is high for a list of posted receives
▪ The BGX message layer works well with asynchronous communication
NAMD Co-Processor Mode Performance (ms/step), APoA1 Benchmark

                    Cutoff                With PME
# Nodes       Msglayer      MPI*     Msglayer      MPI*
4                 2250      2250            -         -
32                 314       316          356       371
128                 85      91.6          103         -
512               22.7      23.8         26.7      27.8
1024              13.2      13.9         14.4      17.3
2048               7.9       8.1          9.7      10.2
4096               4.8       4.9          6.8       7.3

Note: the message layer has sender-side blocking communication here.


Blocking vs Overlap

APoA1 Benchmark in Co-Processor Mode (ms/step)

                       Cutoff                          With PME
# Nodes    Blocking Sender   Non-Blocking    Blocking Sender   Non-Blocking
32                     314            313                356            347
128                     85             82                103           97.2
512                   22.7           21.7               26.7           23.7
1024                  13.2           11.9               14.4           13.8
2048                   7.9            7.3                9.7            8.6
4096                   4.8            4.3                6.8            6.2
8192                     -            3.7                  -              -


Effect of Network Progress

(Projections timeline of a 1024-node run without aggressive network progress)

▪ Network progress is not aggressive enough: communication gaps eat up utilization


Effect of Network Progress (2)

(Projections timeline of a 1024-node run with aggressive network progress)

▪ More frequent advance calls close the gaps


Virtual Node Mode

(Figure: APoA1 step time (ms) with PME vs. number of processors, in virtual node mode.)


Spring vs Now

(Figure: APoA1 step time (ms) with PME vs. number of processors, comparing the spring version of the code with the current one.)

Summary


Summary
▪ Demonstrated good scaling to 4K processors for APoA1, with a speedup of 2100
  – Still working on 8K results
▪ ATPase scales well to 8K processors, with a speedup of 4000+


Lessons Learnt
▪ Eager messages lead to contention
▪ Rendezvous messages don't perform well with mid-size messages
▪ Topology optimizations are a big winner
▪ Overlap of computation and communication is possible
  – Overlap, however, makes the compute load less predictable
▪ The lack of operating system daemons enables massive scaling


Future Plans
▪ Experiment with new communication protocols
  – Remote memory access
  – Adaptive eager
  – Fast asynchronous collectives
▪ Improve load balancing
  – Newer distributed strategies
  – Heavily loaded processors dynamically offload work to neighbors
▪ Pencil decomposition for PME
▪ Using the double hummer (440d dual FPU)
