Parallelization for Model Checking of Probabilistic Systems with GPU Computing

Presenter Notes

Introduction

Motivation:

  • learn about High Performance Computing.
  • work with real problems.
  • parallelization of data structures and algorithms which are not embarrassingly parallel.

Challenge

To adapt the MTBDD data structure so that it pays off on GPU architectures. Why? An MTBDD is a rooted, directed acyclic graph: a symbolic data structure which is apparently difficult to decompose into subproblems with coalesced memory access (a GPU performance requirement).

Goal

To improve the performance of DTMC and CTMC verification in the PRISM model checker without a large impact on memory usage.

Presenter Notes

Verification of probabilistic systems

(Discrete-Time Markov Chains and Continuous-Time Markov Chains)

Presenter Notes

Jacobi Method and generic algorithm

  • Parallelization advantage $\to$ iteration $k+1$ depends only on iteration $k$, so all entries can be updated independently
  • Drawbacks $\to$ stores 2 solution vectors, needs more iterations to converge than Gauss-Seidel.

Method

$$\mathbf{A} \underline{x} = \underline{b} \;\; \to \;\; \mathbf{A} = \mathbf{D} + \mathbf{R} \quad \text{where} \quad \mathbf{D} = \text{diagonal of } \mathbf{A}$$

In each iteration

$$\underline{x}^{k+1} = \mathbf{D}^{-1} (\underline{b} - \mathbf{R} \underline{x}^{k})$$ $$\max_i \left( \frac{| x_i^{k+1} - x_i^{k} |}{x_i^{k}} \right) < \epsilon \;\; \text{or} \;\; \max_i \left( | x_i^{k+1} - x_i^{k} | \right) < \epsilon$$
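
A tiny worked instance (illustrative, not part of the original slides): take $\mathbf{A} = \begin{pmatrix} 4 & 1 \\ 2 & 5 \end{pmatrix}$, $\underline{b} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$ and $\underline{x}^{0} = \underline{0}$, so that $\mathbf{D} = \begin{pmatrix} 4 & 0 \\ 0 & 5 \end{pmatrix}$ and $\mathbf{R} = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}$. Then

$$\underline{x}^{1} = \mathbf{D}^{-1} (\underline{b} - \mathbf{R} \underline{x}^{0}) = \begin{pmatrix} 0.25 \\ 0.2 \end{pmatrix}, \quad \underline{x}^{2} = \mathbf{D}^{-1} (\underline{b} - \mathbf{R} \underline{x}^{1}) = \begin{pmatrix} 0.2 \\ 0.1 \end{pmatrix}, \quad \dots \to \begin{pmatrix} 2/9 \\ 1/9 \end{pmatrix}$$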

Algorithm

  1. Matrix-vector product ($\underline{res} = \mathbf{R} \underline{x}^k$)
  2. Vector-vector operations ($\mathbf{D}^{-1} (\underline{b} - \underline{res})$)
  3. Max reduction for the convergence check (a CUDA sketch of steps 2 and 3 follows this list)
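
A minimal CUDA sketch of steps 2 and 3, assuming step 1 has already filled res and that dinv holds the reciprocals of the diagonal; all names are illustrative, not PRISM code:

```cuda
// Step 2: x_new[i] = (1/D[i][i]) * (b[i] - res[i]), one thread per entry.
__global__ void jacobi_update(const double *dinv, const double *b,
                              const double *res, double *x_new, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x_new[i] = dinv[i] * (b[i] - res[i]);
}

// Step 3: per-block partial max of |x_new[i] - x_old[i]| in shared memory;
// launch with 256 threads per block, then reduce block_max on the host.
__global__ void max_diff(const double *x_new, const double *x_old,
                         double *block_max, int n)
{
    __shared__ double sdata[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    sdata[t] = (i < n) ? fabs(x_new[i] - x_old[i]) : 0.0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            sdata[t] = fmax(sdata[t], sdata[t + s]);
        __syncthreads();
    }
    if (t == 0)
        block_max[blockIdx.x] = sdata[0];
}
```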

Presenter Notes

GPU computational performance

CPU vs. GPU performance

  • Old figures $\to$ the newer NVIDIA Kepler architecture exceeds 1 TFLOPS (floating-point operations per second) in double precision

Presenter Notes

GPU computational performance

CPU vs. GPU performance

Presenter Notes

GPGPU - A Scalable Programming Model

3 key abstractions

  • a hierarchy of thread groups
  • shared memory
  • barrier synchronization

These guide the programmer to:

  • Divide the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads
  • Divide each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within a block, using shared memory and synchronization primitives

This also enables automatic scalability: each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially.

Presenter Notes

CUDA: Programming model as C

Defines a special kind of function, called a kernel, that, when called, is executed N times in parallel by N different CUDA threads (the programmer defines N); a minimal kernel sketch follows the list below.

  • Thread Hierarchy :

    • Threads are organized in 1-dim, 2-dim or 3-dim blocks of threads.
    • Blocks are organized in a 1-dim, 2-dim or 3-dim grid of blocks.

    • Thread blocks are required to execute independently: in any order, in parallel or in series (this allows scalability).

    • Threads within a block can cooperate using some (fast) shared memory and by synchronizing their execution to coordinate memory accesses.
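
A minimal kernel sketch (vector addition, used here only to illustrate the thread hierarchy; it is not part of the model checker):

```cuda
// Each of the N threads computes one element of c = a + b.
__global__ void vec_add(const double *a, const double *b, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launch: a 1-dim grid of 1-dim blocks, enough blocks to cover n elements.
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```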

Presenter Notes

CUDA: Programming model as C

  • Memory Hierarchy

[Figure: memory hierarchy]

Presenter Notes

Hardware implementation

  • Implements many independent cores or Streaming Multiprocessors (SMs).
  • During a kernel execution, blocks are distributed among the available SMs.
  • Blocks of threads are divided into groups of 32 parallel threads called warps.
  • Threads in a warp execute the same instruction at the same time (SIMD).
  • If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path (see the sketch after this list).
  • Different warps execute independently.
  • The warp context is maintained by the multiprocessor during the entire lifetime of the warp.
  • Switching from one execution context to another has no cost (multiprocessor scheduler).
  • Memory access latency can be hidden with calculations.
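
A small illustrative kernel (not from the original work) where threads of the same warp may take different branches; the warp then executes both paths serially, masking off the inactive threads:

```cuda
__global__ void divergent(const int *flag, float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Neighbouring threads with different flag[i] values diverge here.
        if (flag[i])
            x[i] = x[i] * 2.0f;   // one serialized path
        else
            x[i] = x[i] + 1.0f;   // the other serialized path
    }
}
```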

Presenter Notes

Performance Guidelines:

  • Maximize parallel execution to achieve maximum utilization
    • keep many warps resident in each SM
  • Optimize memory usage to achieve maximum memory throughput
    • coalesced memory access (see the contrast sketched below)
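
An illustrative contrast (not PRISM code): in the first kernel consecutive threads of a warp read consecutive addresses (coalesced); in the second, each thread strides through memory and the 32 accesses of a warp hit scattered segments, lowering throughput:

```cuda
// Coalesced: thread i of a warp touches element i.
__global__ void copy_coalesced(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: thread i touches element i * stride (stride > 1).
__global__ void copy_strided(const double *in, double *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((long)i * stride < n)
        out[i * stride] = in[i * stride];
}
```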

Presenter Notes

BDD and MTBDD

(rooted, directed acyclic graph)

An MTBDD represents a function $f:\mathbb{B}^{n} \to \mathbb{R}$ where variables are ordered: $x_1, \dots , x_n \in \mathbb{B}$, $x_1 < \dots < x_n$

MTBDD example

Presenter Notes

BDD and MTBDD

MTBDDs can often provide compact storage because they are stored in a reduced form.

  • If nodes m and m' are identical (var(m) == var(m'), then(m) == then(m'), else(m) == else(m')), only one copy of the node is stored. This is called sharing of nodes.
  • If a node m satisfies then(m) == else(m), it is removed and any incoming edges from levels above are redirected to its unique child. This is called a skipped level.

This results in a canonical form (a node-creation sketch of the two rules follows).
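
A minimal node-creation sketch of these two rules (the unique-table lookup find_or_insert is assumed, and the names are illustrative, not CUDD or PRISM code):

```cuda
// One MTBDD node: a decision variable with then/else children,
// or a terminal carrying a real value (var == -1).
typedef struct Node {
    int var;                      // variable index (level); -1 for terminals
    double value;                 // terminal value (valid only for terminals)
    struct Node *then_, *else_;   // children (unused for terminals)
} Node;

Node *find_or_insert(int var, Node *t, Node *e);  // unique table, assumed

Node *make_node(int var, Node *t, Node *e)
{
    if (t == e)                        // then(m) == else(m):
        return t;                      //   skip the level entirely
    return find_or_insert(var, t, e);  // identical node already stored:
                                       //   share it instead of duplicating
}
```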

Presenter Notes

Matrix and vector representation

Map matrix or vector indices to their Boolean representation.

Define,

$enc : Integer \to \mathbb{B}^{n}$ is an encoding function and

$f : \mathbb{B}^{n} \times \mathbb{B}^{n} \to \mathbb{R}$ is an MTBDD function

Then,

$M(i, j) = f(enc(i), enc(j)) \;\;\; \forall i,j : 0 \leq i,j < 2^{n}$

  • Take as matrix size the first power of two greater than or equal to the real dimension. Fill the remaining entries with zeros.
  • The size of the final structure depends on the order of the variables.
  • If the matrix representation interleaves row and column variables, i.e. $f(r_0, c_0, r_1, c_1, \dots, r_n, c_n)$, the resulting MTBDD is more compact and has an easier access pattern (see the worked entry below).
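
A small worked entry (illustrative; it assumes the most significant bit comes first in $enc$): for a $4 \times 4$ matrix, $n = 2$, $enc(2) = (1, 0)$ and $enc(1) = (0, 1)$, so with the interleaved ordering

$$M(2, 1) = f(enc(2), enc(1)) = f(r_0, c_0, r_1, c_1) = f(1, 0, 0, 1)$$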

Presenter Notes

Matrix and vector representation

Example

Presenter Notes

Hybrid Engine

The transition matrix is represented with an MTBDD

Vectors are represented with arrays

  • Problem to solve

    • product of an MTBDD matrix and an array vector
  • Algorithm :

    • DFS from the root to each terminal node: build the row and column indices along the path and read the terminal value, "(r, c) = v"
    • res[r] := res[r] + v * b[c] (a sequential sketch follows this list)
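
A sequential sketch of this traversal, reusing the Node struct from the MTBDD sketch above; it ignores skipped levels and offset labels and assumes the interleaved variable ordering, so it is illustrative rather than PRISM's actual code:

```cuda
typedef struct Node { int var; double value; struct Node *then_, *else_; } Node;

// DFS from the root; half starts at 2^(n-1) for a 2^n x 2^n matrix.
// Row levels (even) choose the top/bottom half of the rows, column levels
// (odd) choose the left/right half of the columns, then halve the span.
void mv_mult(const Node *m, int level, long r, long c, long half,
             const double *b, double *res)
{
    if (m->var == -1) {                       // terminal: entry (r, c) = value
        res[r] += m->value * b[c];
        return;
    }
    if (level % 2 == 0) {                     // row variable
        mv_mult(m->else_, level + 1, r,        c, half, b, res);
        mv_mult(m->then_, level + 1, r + half, c, half, b, res);
    } else {                                  // column variable
        mv_mult(m->else_, level + 1, r, c,        half / 2, b, res);
        mv_mult(m->then_, level + 1, r, c + half, half / 2, b, res);
    }
}
```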

Presenter Notes

Offset-labelled MTBDD

(First improvement (PRISM))

  • Eliminates steady states
    • those generate zeros in the iteration vectors which never change
    • improvement: makes the solution vector more compact / less memory usage

Algorithm:

  • add skipped levels
  • add an offset to each node:
    • for a row node, the offset indicates how many reachable rows there are in the top submatrix
    • for a column node, the offset indicates how many reachable columns there are in the left submatrix

Presenter Notes

Offset-labelled MTBDD

(First improvement (PRISM))

Graphical example

Presenter Notes

Matrix injection on offset-labelled MTBDD

(Second improvement (PRISM))

An MTBDD is a recursive structure: each node is itself an MTBDD.

  • In a matrix MTBDD, each node represents a sub-matrix of the original matrix.
  • In the sub-matrix, indices are relative; an offset, which depends on the path taken from the root to the injected matrix, is added to recover the absolute indices.

  • Add explicit sub-matrices as cached copies of the nodes of a specific level.
  • In the matrix-vector product, stop at those nodes instead of at the terminal nodes.
  • Matrices are injected into all the nodes of a specific level, chosen depending on memory usage.

Presenter Notes

Finally, my work

Problem

Try to divide the problem into parallel independent tasks; each task must be solved by a block of threads.

  • Possible solution: one task for each path between the root and an injected sub-matrix (path instance). Store an explicit representation of the path (a sketch of this representation follows the list).
    • Task representation: (row, column, submatrix)
      • row and column are integers which represent the relative row and column
      • submatrix is a pointer.
    • Possible number of tasks?
      • First level -> one task, one matrix (explicit engine) -> a lot of memory.
      • Last level -> ~$3 \cdot 10^9$ tasks, approx. $100$ matrices of size $1 \times 1$ -> a lot of memory.
    • Other levels?
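
A hedged sketch of the task representation described above (field names are mine, and SubMatrix stands for whatever explicit sub-matrix representation is injected):

```cuda
typedef struct SubMatrix SubMatrix;  // explicit injected sub-matrix, defined elsewhere

// One task = one path from the MTBDD root down to an injected sub-matrix.
typedef struct {
    long row;             // relative row offset accumulated along the path
    long col;             // relative column offset accumulated along the path
    const SubMatrix *sm;  // pointer to the injected explicit sub-matrix
} PathInstance;
```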

Presenter Notes

Finally, my work

Memory Usage

The final amount of memory is the sum of the task representations and the injected matrix representations.

I believe this was the "main observation of the work".

Presenter Notes

New improvement - vector partition

  • Injected matrices are in COO representation $\to$ a vector of tuples (r, c, value).

  • Divide the result vector into slices of fixed size [256].

  • Each task processes a slice of the result vector (a group of path instances).

    • Task representation: the group of path instances which fall inside the slice (a kernel sketch follows this list).
    • New kind of path instance: (row, column, *sm, offset_begin, offset_end)
      • offset_begin is the first element in the sm vector of tuples which falls in the slice.
      • offset_end is the first element in the sm vector of tuples which does not fall in the slice.
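
A hedged CUDA sketch of the per-slice kernel, launched with one 256-thread block per slice; the names, the task_begin index and the simplification that every thread scans all entries of its slice (keeping only those that land on its own row) are mine; a production kernel would split the entries among the threads of the block:

```cuda
#define SLICE 256

typedef struct { int r, c; double v; } CooEntry;   // one (r, c, value) tuple

typedef struct {
    long row, col;                 // offsets of the injected sub-matrix
    const CooEntry *sm;            // COO vector of the injected sub-matrix
    int offset_begin, offset_end;  // range of sm entries falling in the slice
} PathInstance;

// Block b owns slice [b*SLICE, (b+1)*SLICE) of the result vector;
// task_begin[b] .. task_begin[b+1] indexes the path instances of that slice.
__global__ void slice_mv(const PathInstance *tasks, const int *task_begin,
                         const double *x, double *res)
{
    long myrow = (long)blockIdx.x * SLICE + threadIdx.x;
    double acc = 0.0;

    for (int t = task_begin[blockIdx.x]; t < task_begin[blockIdx.x + 1]; ++t) {
        PathInstance p = tasks[t];
        for (int e = p.offset_begin; e < p.offset_end; ++e) {
            CooEntry ce = p.sm[e];
            if (p.row + ce.r == myrow)               // entry lands on my row
                acc += ce.v * x[p.col + ce.c];
        }
    }
    res[myrow] = acc;                                // one write per thread
}
```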

Presenter Notes

Example

Presenter Notes

Memory requirements for this approach

  • offset-labelled MTBDD $\to$ same memory as the hybrid engine
  • 2 iteration vectors $\to$ same memory as the hybrid engine
  • diagonal vector $\to$ same memory as the hybrid engine
  • b vector $\to$ same memory as the hybrid engine; in general it is zero!

New memory requirements :

  • injected matrices + path instances

Where to inject matrices?

  • the level with minimum memory usage $\to$ around 4 times faster, does not use too much memory
  • some levels further down (2, 3 or 4) $\to$ 10x faster on average, up to 18x in some examples, less than 10% of the iteration vector size in memory usage.

Presenter Notes

Results

Presenter Notes

To Do

Figure out a heuristic to define

"some level down the level of minimum memory usage."

Presenter Notes

Thanks!!

Presenter Notes