\chapter{Controlling the code generation}
Several options of the \faust compiler make it possible to control the generated C++ code. By default, the computations are done sample by sample in a single loop, but the compiler can also generate \textit{vector} and \textit{parallel} code.
\section{Vector Code generation}
Modern C++ compilers are able to do autovectorization, that is, to use SIMD instructions to speed up the code. These instructions can typically operate in parallel on short vectors of 4 single-precision floating-point numbers, thus leading to a theoretical speedup of $\times 4$.
Autovectorization of C/C++ programs is a difficult task. Current compilers are very sensitive to the way the code is arranged. In particular, overly complex loops can prevent autovectorization. The goal of the vector code generation is to rearrange the C++ code in a way that facilitates the autovectorization job of the C++ compiler. Instead of generating a single sample computation loop, it splits the computation into several simpler loops that communicate through vectors.
The vector code generation is activated by passing the \lstinline!--vectorize! (or \lstinline!-vec!) option to the \faust compiler. Two additional options are available: \lstinline!--vec-size <n>! controls the size of the vector (by default 32 samples) and \lstinline!--loop-variant 0/1! gives some additional control on the loops.
To illustrate the difference between scalar code and vector code, let's take the computation of the RMS (Root Mean Square) value of a signal. Here is the \faust code that computes the Root Mean Square of a sliding window of 1000 samples:
// Root Mean Square of n consecutive samples
RMS(n) = square : mean(n) : sqrt ;
// Mean of n consecutive samples of a signal
// (uses fixpoint to avoid the accumulation of
mean(n) = float2fix : integrate(n) :
// Sliding sum of n consecutive samples
integrate(n,x) = x - x@n : +~_ ;
// Conversion between float and fix point
float2fix(x) = int(x*(1<<20));
fix2float(x) = float(x)/(1<<20);
// Root Mean Square of 1000 consecutive samples
The \lstinline'compute()' method generated in scalar mode is the following:
virtual void compute (int count,
    float* input0 = input[0];
    float* output0 = output[0];
    for (int i=0; i<count; i++) {
        float fTemp0 = input0[i];
        int iTemp1 = int(1048576*fTemp0*fTemp0);
        iVec0[IOTA&1023] = iTemp1;
        iRec0[0] = ((iVec0[IOTA&1023] + iRec0[1])
                    - iVec0[(IOTA-1000)&1023]);
        output0[i] = sqrtf(9.536744e-10f *
The \lstinline!-vec! option leads to the following reorganization of the code:
virtual void compute (int fullcount,
    int* iRec0 = &iRec0_tmp[4];
    for (int index=0; index<fullcount; index+=32)
        int count = min (32, fullcount-index);
        float* input0 = &input[0][index];
        float* output0 = &output[0][index];
        for (int i=0; i<4; i++)
            iRec0_tmp[i]=iRec0_perm[i];
        for (int i=0; i<count; i++) {
            iYec0[(iYec0_idx+i)&2047] =
                int(1048576*input0[i]*input0[i]);
        for (int i=0; i<count; i++) {
            iRec0[i] = ((iYec0[i] + iRec0[i-1]) -
                iYec0[(iYec0_idx+i-1000)&2047]);
        for (int i=0; i<count; i++) {
            output0[i] = sqrtf((9.536744e-10f *
        iYec0_idx = (iYec0_idx+count)&2047;
        for (int i=0; i<4; i++)
            iRec0_perm[i]=iRec0_tmp[count+i];
While the second version of the code is more complex, it turns out to be much easier for the C++ compiler to vectorize efficiently. Using Intel icc 11.0 with the exact same compilation options, \texttt{-O3 -xHost -ftz -fno-alias -fp-model fast=2}, the scalar version reaches a throughput of 129.144 MB/s, while the vector version achieves 359.548 MB/s, a speedup of $\times 2.8$.
\includegraphics[scale=0.75]{images/compiler-stack}
\caption{\faust's stack of code generators}
The vector code generation is built on top of the scalar code generation (see Figure~\ref{fig:stack}). Every time an expression needs to be compiled, the compiler checks whether it requires a separate loop, using a few simple rules. Expressions that are shared (and are complex enough) are good candidates to be compiled in a separate loop, as are recursive expressions and expressions used in delay lines.
The result is a directed graph in which each node is a computation loop (see Figure~\ref{fig:loopgraph}). This graph is stored in the klass object and a topological sort is applied to it before printing the code.
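As a rough illustration of this ordering step, the following C++ sketch sorts a small loop graph topologically with Kahn's algorithm. The \lstinline!LoopGraph! type and the \lstinline!topologicalOrder! function are hypothetical names introduced for this example and are not taken from the \faust compiler sources; the sketch only shows that a loop is printed after all the loops it depends on.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical representation: deps[i] lists the loops that loop i reads from.
using LoopGraph = std::vector<std::vector<int>>;

// Returns the loops in an order where every loop appears after its
// dependencies. Returns an empty vector if the graph contains a cycle.
std::vector<int> topologicalOrder(const LoopGraph& deps)
{
    const std::size_t n = deps.size();
    std::vector<int> pending(n, 0);           // unresolved dependencies per loop
    std::vector<std::vector<int>> users(n);   // reverse edges: who reads loop i
    for (std::size_t i = 0; i < n; i++) {
        pending[i] = static_cast<int>(deps[i].size());
        for (int d : deps[i]) users[d].push_back(static_cast<int>(i));
    }

    std::queue<int> ready;                    // loops whose inputs are all available
    for (std::size_t i = 0; i < n; i++)
        if (pending[i] == 0) ready.push(static_cast<int>(i));

    std::vector<int> order;
    while (!ready.empty()) {
        int l = ready.front();
        ready.pop();
        order.push_back(l);
        for (int u : users[l])
            if (--pending[u] == 0) ready.push(u);
    }
    if (order.size() != n) order.clear();     // a cycle prevented some loops from being emitted
    return order;
}
```

Any order produced this way is a valid printing order for the loops; the particular order among independent loops is a detail of this sketch.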
\includegraphics[scale=0.75]{graphs/loopgraph2}
\caption{The result of the -vec option is a directed acyclic graph (DAG) of small computation loops}
\label{fig:loopgraph}
\section{Parallel Code generation}
The parallel code generation is activated by passing either the \lstinline!--openMP! (or \lstinline!-omp!) option or the \lstinline!--scheduler! (or \lstinline!-sch!) option. Both imply the \lstinline!-vec! option, as the parallel code generation is built on top of the vector code generation.
\subsection{The OpenMP code generator}
\includegraphics[scale=0.5,angle=-90]{images/openmp-model}
\caption{OpenMP is based on a fork-join model}
The \lstinline!--openMP! (or \lstinline!-omp!) option given to the \faust compiler will insert appropriate OpenMP directives in the C++ code. OpenMP (http://www.openmp.org) is a well-established API that is used to explicitly direct multi-threaded, shared-memory parallelism. It is based on a fork-join model of parallelism (see Figure~\ref{fig:openmp}).
Parallel regions are delimited by \lstinline!#pragma omp parallel! constructs. At the entrance of a parallel region, a team of parallel threads is activated. The code within a parallel region is executed by each thread of the parallel team until the end of the region.
// the code here is executed simultaneously by
// every thread of the parallel team
In order not to have every thread redundantly doing the exact same work, OpenMP provides specific \textit{work-sharing} directives. For example, \lstinline!#pragma omp sections! breaks the work into separate, discrete sections, each section being executed by one thread:
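A minimal, self-contained sketch of a parallel region containing a \lstinline!sections! construct could look as follows. The function \lstinline!fillBoth! and its two buffers are hypothetical and only serve as an illustration; this is not code produced by the \faust compiler. When the code is compiled without OpenMP support, the pragmas are ignored and the two sections simply run one after the other.

```cpp
#include <cstddef>
#include <vector>

// Fills two independent buffers. Inside the parallel region, the sections
// construct hands each section to one thread of the team.
void fillBoth(std::vector<float>& a, std::vector<float>& b)
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                // job 1: touches only buffer a
                for (std::size_t i = 0; i < a.size(); i++) a[i] = 1.0f;
            }
            #pragma omp section
            {
                // job 2: touches only buffer b
                for (std::size_t i = 0; i < b.size(); i++) b[i] = 2.0f;
            }
        }   // implied barrier: both sections are finished here
    }
}
```

The two sections are safe to run concurrently only because they write to disjoint data.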
\subsection{Adding OpenMP directives}
As said before, the parallel code generation is built on top of the vector code generation. The graph of loops produced by the vector code generator is topologically sorted in order to detect the loops that can be computed in parallel. The first set $S_0$ (loops $L1$, $L2$ and $L3$ in the DAG of Figure~\ref{fig:loopgraph}) contains the loops that don't depend on any other loops, the set $S_1$ contains the loops that only depend on loops of $S_0$ (that is, loops $L4$ and $L5$), etc.
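One way to read this construction is that a loop belongs to the set whose index is one more than the highest set index among its dependencies. The C++ sketch below computes the sets under that assumption. The \lstinline!levelSets! function and its input format are hypothetical, and the sketch assumes that the loops are already numbered in topological order (every dependency has a smaller index than the loop that uses it).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// deps[i] lists the loops that loop i depends on (hypothetical format).
// Precondition (assumed): every dependency of loop i has an index < i.
// Returns sets[n] = indices of the loops belonging to S_n.
std::vector<std::vector<int>> levelSets(const std::vector<std::vector<int>>& deps)
{
    std::vector<int> level(deps.size(), 0);
    std::vector<std::vector<int>> sets;
    for (std::size_t i = 0; i < deps.size(); i++) {
        // a loop sits one level above its deepest dependency
        for (int d : deps[i]) level[i] = std::max(level[i], level[d] + 1);
        if (sets.size() <= static_cast<std::size_t>(level[i]))
            sets.resize(static_cast<std::size_t>(level[i]) + 1);
        sets[level[i]].push_back(static_cast<int>(i));
    }
    return sets;
}
```

For instance, with made-up dependencies in which loops 3 and 4 read only from loops 0, 1 and 2, the first set holds loops 0 to 2 and the second set holds loops 3 and 4.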
As all the loops of a given set $S_n$ can be computed in parallel, the compiler will generate a \lstinline!sections! construct with a \lstinline!section! for each loop.
If a given set contains only one loop, then the compiler checks whether the loop can be parallelized (no recursive dependencies). If it can be parallelized, it generates:
otherwise it generates a \lstinline!single! construct so that only one thread will execute the loop:
\subsection{Example of parallel OpenMP code}
To illustrate how \faust uses the OpenMP directives, here is a very simple example: two 1-pole filters in parallel connected to an adder (see Figure~\ref{fig:parfilter} for the corresponding block diagram):
filter(c) = *(1-c) : + ~ *(c);
process = filter(0.9), filter(0.9) : +;
\includegraphics[width=8cm]{images/filter2}
\caption{two filters in parallel connected to an adder}
\label{fig:parfilter}
The corresponding \lstinline'compute()' method obtained using the \lstinline!-omp! option is the following:
virtual void compute (int fullcount,
    float fRec0_tmp[32+4];
    float fRec1_tmp[32+4];
    float* fRec0 = &fRec0_tmp[4];
    float* fRec1 = &fRec1_tmp[4];
    #pragma omp parallel firstprivate(fRec0,fRec1)
    for (int index = 0; index < fullcount;
        int count = min (32, fullcount-index);
        float* input0 = &input[0][index];
        float* input1 = &input[1][index];
        float* output0 = &output[0][index];
        for (int i=0; i<4; i++)
            fRec0_tmp[i]=fRec0_perm[i];
        for (int i=0; i<4; i++)
            fRec1_tmp[i]=fRec1_perm[i];
        for (int i=0; i<count; i++) {
            fRec0[i] = ((0.1f * input1[i])
                + (0.9f * fRec0[i-1]));
        for (int i=0; i<count; i++) {
            fRec1[i] = ((0.1f * input0[i])
                + (0.9f * fRec1[i-1]));
        for (int i=0; i<count; i++) {
            output0[i] = (fRec1[i] + fRec0[i]);
        for (int i=0; i<4; i++)
            fRec0_perm[i]=fRec0_tmp[count+i];
        for (int i=0; i<4; i++)
            fRec1_perm[i]=fRec1_tmp[count+i];
This code requires some comments:
\item The parallel construct \lstinline!#pragma omp parallel! is the fundamental construct that starts parallel execution. The number of parallel threads is generally the number of CPU cores, but it can be controlled in several ways.
\item Variables external to the parallel region are shared by default. The clause \lstinline!firstprivate(fRec0,fRec1)! indicates that each thread should have its private copy of fRec0 and fRec1. The reason is that accessing shared variables requires an indirection and is quite inefficient compared to private copies.
\item The top level loop \lstinline!for (int index = 0;...)...! is executed by all threads simultaneously. The subsequent work-sharing directives inside the loop indicate how the work must be shared between the threads.
\item Please note that an implied barrier exists at the end of each work-sharing region. All threads must have executed the barrier before any of them can continue.
\item The work-sharing directive \lstinline!#pragma omp single! indicates that this first section will be executed by only one thread (any of them).
\item The work-sharing directive \lstinline!#pragma omp sections! indicates that each corresponding \lstinline!#pragma omp section!, here our two filters, will be executed in parallel.
\item The loop construct \lstinline!#pragma omp for! specifies that the iterations of the associated loop will be executed in parallel. The iterations of the loop are distributed across the parallel threads. For example, if we have two threads, the first one can compute indices between 0 and count/2 and the other one between count/2 and count.
\item Finally, \lstinline!#pragma omp single! in section 3 indicates that this last section will be executed by only one thread (any of them).
\subsection{The scheduler code generator}
With the \lstinline!--scheduler! (or \lstinline!-sch!) option given to the \faust compiler, the computation graph is cut into separate computation loops (called ``tasks''), and a ``Work Stealing Scheduler'' is used to activate and execute them following their dependencies. A pool of worker threads is created and each thread uses its own local WSQ (Work Stealing Queue) of tasks. A WSQ is a special queue with a Push operation, a ``private'' LIFO Pop operation and a ``public'' FIFO Pop operation.
Starting from a ready task, each thread follows the dependencies, possibly pushing ready sub-tasks into its own local WSQ. When no more tasks can be activated on a given computation path, the thread pops a task from its local WSQ. If the WSQ is empty, then the thread is allowed to ``steal'' tasks from the WSQs of other threads.
The local LIFO Pop operation allows better cache locality, and the FIFO steal Pop allows a ``larger chunk'' of work to be taken. The reason for this is that many work stealing workloads are divide-and-conquer in nature: stealing one of the oldest tasks implicitly also steals a (potentially) large subtree of computations that will unfold once that piece of work is stolen and run.
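The three operations of a WSQ can be sketched with a deque protected by a mutex. This is a deliberately simplified, lock-based illustration: the class \lstinline!WorkStealingQueue! and its method names are hypothetical, tasks are reduced to integer identifiers, and the sketch makes no claim about how the queue in \textit{architecture/scheduler.h} is actually implemented.

```cpp
#include <deque>
#include <mutex>

// Simplified work stealing queue holding task numbers (illustration only).
class WorkStealingQueue {
public:
    // Push: the owner thread adds a task that has become ready.
    void push(int task)
    {
        std::lock_guard<std::mutex> lock(fMutex);
        fTasks.push_back(task);
    }

    // "Private" LIFO pop: the owner takes the most recently pushed task,
    // whose data is the most likely to still be in its cache.
    bool pop(int& task)
    {
        std::lock_guard<std::mutex> lock(fMutex);
        if (fTasks.empty()) return false;
        task = fTasks.back();
        fTasks.pop_back();
        return true;
    }

    // "Public" FIFO pop: another thread steals the oldest task.
    bool steal(int& task)
    {
        std::lock_guard<std::mutex> lock(fMutex);
        if (fTasks.empty()) return false;
        task = fTasks.front();
        fTasks.pop_front();
        return true;
    }

private:
    std::mutex      fMutex;
    std::deque<int> fTasks;
};
```

In this sketch the owner and the thieves work on opposite ends of the same deque, which is what gives the LIFO behaviour locally and the FIFO behaviour for stealing. A production implementation would typically try to avoid taking a lock on every operation.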
Compared to the OpenMP model (\lstinline!-omp!), the new model is worse for simple \faust programs and usually starts to behave comparably or sometimes better for ``complex enough'' \faust programs. In any case, since OpenMP does not behave so well with GCC compilers (only quite recent versions like GCC 4.4 start to show some improvements), and is unusable on OSX in real-time contexts, this new scheduler option has its own value. We plan to improve it by adding a ``pipelining'' idea in the future.
\subsection{Example of parallel scheduler code}
To illustrate how \faust generates the scheduler code, here is a very simple example: two 1-pole filters in parallel connected to an adder (see Figure~\ref{fig:parfilter} for the corresponding block diagram):
filter(c) = *(1-c) : + ~ *(c);
process = filter(0.9), filter(0.9) : +;
When the \lstinline!-sch! option is used, the content of the additional \textit{architecture/scheduler.h} file is inserted in the generated code. It contains code to deal with WSQ and thread management. The \lstinline'compute()' and \lstinline'computeThread()' methods are the following:
virtual void compute (int fullcount,
    this->output = output;
    for (fIndex = 0; fIndex < fullcount; fIndex += 32) {
        fFullCount = min (32, fullcount-fIndex);
        // Initialize end task
        fGraph.InitTask(1,1);
        // Only initialize tasks with inputs
        fGraph.InitTask(4,2);
        fThreadPool.SignalAll(fDynamicNumThreads - 1);
        while (!fThreadPool.IsFinished()) {}
    StopMeasure(fStaticNumThreads,
void computeThread (int cur_thread) {
    float* fRec0 = &fRec0_tmp[4];
    float* fRec1 = &fRec1_tmp[4];
    int count = fFullCount;
    // Init input and output
    FAUSTFLOAT* input0 = &input[0][fIndex];
    FAUSTFLOAT* input1 = &input[1][fIndex];
    FAUSTFLOAT* output0 = &output[0][fIndex];
    int task_list_size = 2;
    int task_list[2] = {2,3};
    taskqueue.InitTaskList(task_list_size, task_list, fDynamicNumThreads, cur_thread, tasknum);
    while (!fIsFinished) {
        case WORK_STEALING_INDEX: {
            tasknum = TaskQueue::GetNextTask(cur_thread);
        case LAST_TASK_INDEX: {
        for (int i=0; i<4; i++) fRec0_tmp[i]=fRec0_perm[i];
        for (int i=0; i<count; i++) {
            fRec0[i] = ((1.000000e-01f * (float)input1[i]) + (0.9f * fRec0[i-1]));
        for (int i=0; i<4; i++) fRec0_perm[i]=fRec0_tmp[count+i];
        fGraph.ActivateOneOutputTask(taskqueue, 4, tasknum);
        for (int i=0; i<4; i++) fRec1_tmp[i]=fRec1_perm[i];
        for (int i=0; i<count; i++) {
            fRec1[i] = ((1.000000e-01f * (float)input0[i]) + (0.9f * fRec1[i-1]));
        for (int i=0; i<4; i++) fRec1_perm[i]=fRec1_tmp[count+i];
        fGraph.ActivateOneOutputTask(taskqueue, 4, tasknum);
        for (int i=0; i<count; i++) {
            output0[i] = (FAUSTFLOAT)(fRec1[i] + fRec0[i]);
        tasknum = LAST_TASK_INDEX;