CRI, MINES ParisTech - Faustine.git/blob

   1 \chapter{Controlling the code generation}
   2 Several options of the \faust compiler allow to control the generated C++ code. By default the computations are done sample by sample in a single loop. But the compiler can also generate \textit{vector} and \textit{parallel} code.
   3
   4
   5 \section{Vector Code generation}
   6 Modern C++ compilers are able to do autovectorization, that is to use SIMD instructions to speedup the code. These instructions can typically operate in parallel on short vectors of 4 simple precision floating point numbers thus leading to a theoretical speedup of $\times4$.
   7 Autovectorization of C/C++ programs is a difficult task. Current compilers are very sensitive to the way the code is arranged. In particular too complex loops can prevent autovectorization. The goal of the vector code generation is to rearrange the C++ code in a way that facilitates the autovectorization job of the C++ compiler. Instead of generating a single sample computation loop, it splits the computation into several simpler loops that communicates by vectors.
   8
   9 The vector code generation is activated by passing the \lstinline!--vectorize! (or \lstinline!-vec!) option to the \faust compiler. Two additional options are available:  \lstinline!--vec-size <n>! controls the size of the vector (by default 32 samples) and \lstinline!--loop-variant 0/1! gives some additional control on the loops.
  10
  11 To illustrate the difference between scalar code and vector code, let's take the computation of the RMS (Root Mean Square) value of a signal.  Here is the \faust code that computes the Root Mean Square of a sliding window of 1000 samples:
  12 \label{rms}
  13 \begin{lstlisting}
  14 // Root Mean Square of n consecutive samples
  15 RMS(n) = square : mean(n) : sqrt ;
  16
  17 // Square of a signal
  18 square(x) = x * x ;
  19
  20 // Mean of n consecutive samples of a signal
  21 // (uses fixpoint to avoid the accumulation of
  22 // rounding errors)
  23 mean(n) = float2fix : integrate(n) :
  24           fix2float : /(n);
  25
  26 // Sliding sum of n consecutive samples
  27 integrate(n,x) = x - x@n : +~_ ;
  28
  29 // Convertion between float and fix point
  30 float2fix(x) = int(x*(1<<20));
  31 fix2float(x) = float(x)/(1<<20);
  32
  33 // Root Mean Square of 1000 consecutive samples
  34 process = RMS(1000) ;
  35 \end{lstlisting}
  36
  37 The compute() method generated in scalar mode is the following:
  38
  39 \begin{lstlisting}
  40 virtual void compute (int count,
  41                       float** input,
  42                       float** output)
  43 {
  44   float* input0 = input[0];
  45   float* output0 = output[0];
  46   for (int i=0; i<count; i++) {
  47     float fTemp0 = input0[i];
  48     int iTemp1 = int(1048576*fTemp0*fTemp0);
  49     iVec0[IOTA&1023] = iTemp1;
  50     iRec0[0] = ((iVec0[IOTA&1023] + iRec0[1])
  51                     - iVec0[(IOTA-1000)&1023]);
  52     output0[i] = sqrtf(9.536744e-10f *
  53                        float(iRec0[0]));
  54     // post processing
  55     iRec0[1] = iRec0[0];
  56     IOTA = IOTA+1;
  57   }
  58 }
  59 \end{lstlisting}
  60
  61 The \lstinline!-vec! option leads to the following reorganization of the code:
  62 \begin{lstlisting}
  63 virtual void compute (int fullcount,
  64                       float** input,
  65                       float** output)
  66 {
  67   int     iRec0_tmp[32+4];
  68   int*    iRec0 = &iRec0_tmp[4];
  69   for (int index=0; index<fullcount; index+=32)
  70   {
  71     int count = min (32, fullcount-index);
  72     float* input0 = &input[0][index];
  73     float* output0 = &output[0][index];
  74     for (int i=0; i<4; i++)
  75       iRec0_tmp[i]=iRec0_perm[i];
  76     // SECTION : 1
  77     for (int i=0; i<count; i++) {
  78       iYec0[(iYec0_idx+i)&2047] =
  79                int(1048576*input0[i]*input0[i]);
  80     }
  81     // SECTION : 2
  82     for (int i=0; i<count; i++) {
  83       iRec0[i] = ((iYec0[i] + iRec0[i-1]) -
  84                iYec0[(iYec0_idx+i-1000)&2047]);
  85     }
  86     // SECTION : 3
  87     for (int i=0; i<count; i++) {
  88       output0[i] = sqrtf((9.536744e-10f *
  89                  float(iRec0[i])));
  90     }
  91     // SECTION : 4
  92     iYec0_idx = (iYec0_idx+count)&2047;
  93     for (int i=0; i<4; i++)
  94       iRec0_perm[i]=iRec0_tmp[count+i];
  95   }
  96 }
  97 \end{lstlisting}
  98
  99 While the second version of the code is more complex, it turns out to be much easier to vectorize efficiently by the C++ compiler. Using Intel icc 11.0, with the exact same compilation options: \texttt{-O3 -xHost -ftz -fno-alias -fp-model fast=2}, the scalar version leads to a throughput performance of 129.144  MB/s, while the vector version achieves 359.548  MB/s, a speedup of x2.8 !
 100
 101 \begin{figure}[htb]
 102   \centering
 103   \includegraphics[scale=0.75]{images/compiler-stack}
 104   \caption{\faust's stack of code generators}
 105   \label{fig:stack}
 106 \end{figure}
 107
 108
 109 The vector code generation is built on top of the scalar code generation (see figure \ref{fig:stack}). Every time an expression needs to be compiled, the compiler checks if it requires a separate loop or not. It applies some simple rules for that. Expressions that are shared (and are complex enough) are good candidates to be compiled in a separate loop, as well as recursive expressions and expressions used in delay lines.
 110
 111 The result is a directed graph in which each node is a computation loop (see Figure \ref{fig:loopgraph}). This graph is stored in the klass object and a topological sort is applied to it before printing the code.
 112
 113 \begin{figure}[htb]
 114   \centering
 115   \includegraphics[scale=0.75]{graphs/loopgraph2}
 116   \caption{The result of the -vec option is a directed acyclic graph (DAG) of small computation loops}
 117   \label{fig:loopgraph}
 118 \end{figure}
 119
 120
 121 \section{Parallel Code generation}
 122
 123 The parallel code generation is activated by passing either the \lstinline!--openMP! (or \lstinline!-omp!) option or the \lstinline!--scheduler! (or \lstinline!-sch!) option. It implies the \lstinline!-vec! options as the parallel code generation is built on top of the vector code generation.
 124
 125
 126 \subsection{The OpenMP code generator}
 127
 128 \begin{figure}[htb]
 129   \centering
 130   \includegraphics[scale=0.5,angle=-90]{images/openmp-model}
 131   \caption{OpenMP is based on a fork-join model}
 132   \label{fig:openmp}
 133 \end{figure}
 134
 135 The \lstinline!--openMP! (or \lstinline!-omp!) option given to the \faust compiler will insert appropriate OpenMP directives in the C++ code. OpenMP (http://wwww.openmp.org) is a well established API that is used to explicitly define direct multi-threaded, shared memory parallelism. It is based on a fork-join model of parallelism (see figure \ref{fig:openmp}).
 136 Parallel regions are delimited by \lstinline!#pragma omp parallel! constructs. At the entrance of a parallel region a team of parallel threads is activated. The code within a parallel region is executed by each thread of the parallel team until the end of the region.
 137
 138 \begin{lstlisting}
 139 #pragma omp parallel
 140 {
 141   // the code here is executed simultaneously by
 142   // every thread of the parallel team
 143   ...
 144 }
 145 \end{lstlisting}
 146
 147 In order not to have every thread doing redundantly the exact same work, OpemMP provides specific \textit{work-sharing} directives. For example \lstinline!#pragma omp sections! allows to break the work into separate, discrete sections, each section being executed by one thread:
 148
 149 \begin{lstlisting}
 150 #pragma omp parallel
 151 {
 152   #pragma omp sections
 153   {
 154     #pragma omp section
 155     {
 156       // job 1
 157     }
 158     #pragma omp section
 159     {
 160       // job 2
 161     }
 162     ...
 163   }
 164
 165   ...
 166 }
 167 \end{lstlisting}
 168
 169 \subsection{Adding OpenMP directives}
 170 As said before the parallel code generation is built on top of the vector code generation. The graph of loops produced by the vector code generator is topologically sorted in order to detect the loops that can be computed in parallel. The first set $S_0$ (loops $L1$, $L2$ and $L3$ in the DAG of Figure \ref{fig:loopgraph}) contains the loops that don't depend on any other loops, the set $S_1$ contains the loops that only depend on loops of $S_0$, (that is loops $L4$ and $L5$), etc..
 171
 172 As all the loops of a given set $S_n$ can be computed in parallel, the compiler will generate a \lstinline!sections! construct with a \lstinline!section! for each loop.
 173 \begin{lstlisting}
 174   #pragma omp sections
 175   {
 176     #pragma omp section
 177     for (...) {
 178       // Loop 1
 179     }
 180     #pragma omp section
 181     for (...) {
 182       // Loop 2
 183     }
 184     ...
 185   }
 186 \end{lstlisting}
 187
 188 If a given set contains only one loop, then the compiler checks to see if the loop can be parallelized (no recursive dependencies) or not. If it can be parallelized, it generates:
 189 \begin{lstlisting}
 190   #pragma omp for
 191   for (...) {
 192    // Loop code
 193   }
 194 \end{lstlisting}
 195 otherwise it generates a \lstinline!single! construct so that only one thread will execute the loop:
 196 \begin{lstlisting}
 197   #pragma omp single
 198   for (...) {
 199    // Loop code
 200   }
 201 \end{lstlisting}
 202
 203 \subsection{Example of parallel OpenMP code}
 204 To illustrate how \faust uses the OpenMP directives, here is a very simple example, two 1-pole filters in parallel connected to an adder (see figure \ref{fig:parfilter} the corresponding block-diagram):
 205
 206 \begin{lstlisting}
 207 filter(c) = *(1-c) : + ~ *(c);
 208 process = filter(0.9), filter(0.9) : +;
 209 \end{lstlisting}
 210
 211 \begin{figure}[htb]
 212   \centering
 213   \includegraphics[width=8cm]{images/filter2}
 214   \caption{two filters in parallel connected to an adder}
 215   \label{fig:parfilter}
 216 \end{figure}
 217
 218 The corresponding compute() method obtained using the -omp option is the following:
 219 \begin{lstlisting}
 220
 221 virtual void compute (int fullcount,
 222                       float** input,
 223                       float** output)
 224 {
 225   float   fRec0_tmp[32+4];
 226   float   fRec1_tmp[32+4];
 227   float*  fRec0 = &fRec0_tmp[4];
 228   float*  fRec1 = &fRec1_tmp[4];
 229   #pragma omp parallel firstprivate(fRec0,fRec1)
 230   {
 231     for (int index = 0; index < fullcount;
 232                                 index += 32)
 233     {
 234       int count = min (32, fullcount-index);
 235       float* input0 = &input[0][index];
 236       float* input1 = &input[1][index];
 237       float* output0 = &output[0][index];
 238       #pragma omp single
 239       {
 240         for (int i=0; i<4; i++)
 241           fRec0_tmp[i]=fRec0_perm[i];
 242         for (int i=0; i<4; i++)
 243           fRec1_tmp[i]=fRec1_perm[i];
 244       }
 245       // SECTION : 1
 246       #pragma omp sections
 247       {
 248         #pragma omp section
 249         for (int i=0; i<count; i++) {
 250           fRec0[i] = ((0.1f * input1[i])
 251                    + (0.9f * fRec0[i-1]));
 252         }
 253         #pragma omp section
 254         for (int i=0; i<count; i++) {
 255           fRec1[i] = ((0.1f * input0[i])
 256                    + (0.9f * fRec1[i-1]));
 257         }
 258       }
 259       // SECTION : 2
 260       #pragma omp for
 261       for (int i=0; i<count; i++) {
 262         output0[i] = (fRec1[i] + fRec0[i]);
 263       }
 264       // SECTION : 3
 265       #pragma omp single
 266       {
 267         for (int i=0; i<4; i++)
 268           fRec0_perm[i]=fRec0_tmp[count+i];
 269         for (int i=0; i<4; i++)
 270           fRec1_perm[i]=fRec1_tmp[count+i];
 271       }
 272     }
 273   }
 274 }
 275
 276 \end{lstlisting}
 277
 278 This code requires some comments:
 279
 280 \begin{enumerate}
 281 \item The parallel construct \lstinline!#pragma omp parallel! is the fundamental construct that starts parallel execution. The number of parallel threads is generally the number of CPU cores but it can be controlled in several ways.
 282
 283 \item Variables external to the parallel region are shared by default. The pragma \lstinline!firstprivate(fRec0,fRec1)! indicates that each thread should have its private copy of fRec0 and fRec1. The reason is that accessing shared variables requires an indirection and is quite inefficient compared to private copies.
 284
 285 \item The top level loop \lstinline!for (int index = 0;...)...! is executed by all threads simultaneously. The subsequent work-sharing directives inside the loop will indicate how the work must be shared between the threads.
 286
 287 \item Please note that an implied barrier exists at the end of each work-sharing region. All threads must have executed the barrier before any of them can continue.
 288
 289 \item The work-sharing directive \lstinline!#pragma omp single! indicates that this first section will be executed by only one thread (any of them).
 290
 291 \item The work-sharing directive \lstinline!#pragma omp sections! indicates that each corresponding \lstinline!#pragma omp section!, here our two filters, will be executed in parallel.
 292
 293 \item The loop construct \lstinline!#pragma omp for! specifies that the iterations of the associated loop will be executed in parallel. The iterations of the loop are distributed across the parallel threads. For example, if we have two threads, the first one can compute indices between 0 and count/2 and the other one between count/2 and count.
 294
 295 \item Finally \lstinline!#pragma omp single!  in section 3 indicates that this last section will be executed by only one thread (any of them).
 296
 297 \end{enumerate}
 298
 299 \subsection{The scheduler code generator}
 300  With the \lstinline!--scheduler! (or \lstinline!-sch!) option given to the \faust compiler, the computation graph is cut into separated computation loops (called "tasks"), and a "Work Stealing Scheduler" is used to activate and execute them following their dependencies. A pool of worked threads is created and each thread uses it's own local WSQ (Work Stealing Queue) of tasks. A WSQ is a special queue with a Push operation, a "private" LIFO Pop operation and a "public" FIFO Pop operation.
 301
 302 Starting from a ready task, each thread follows the dependencies, possibly pushing ready sub-tasks into it's own local WSQ. When no more tasks can be activated on a given computation path, the thread pops a task from it's local WSQ. If the WSQ is empty, then the thread is allowed to "steal" tasks from other threads WSQ.
 303
 304 The local LIFO Pop operation allows better cache locality and the FIFO steal Pop "larger chuck" of work to be done. The reason for this is that many work stealing workloads are divide-and-conquer in nature, stealing one of the oldest task implicitly also steals a (potentially) large subtree of computations that will unfold once that piece of work is stolen and run.
 305
 306 Compared to the OpenMP model (-omp) the new model is worse for simple \faust  programs and usually starts to behave comparable or sometimes better for "complex enough" \faust  programs. In any case, since OpenMP does not behave so well with GCC compilers (only quite recent versions like GCC 4.4 start to show some improvements), and is unusable on OSX in real-time contexts, this new scheduler option has it's own value.  We plan to improve it adding a "pipelining" idea in the future.
 307
 308 \subsection{Example of parallel scheduler code}
 309 To illustrate how \faust generates the scheduler code, here is a very simple example, two 1-pole filters in parallel connected to an adder (see figure \ref{fig:parfilter} the corresponding block-diagram):
 310
 311 \begin{lstlisting}
 312 filter(c) = *(1-c) : + ~ *(c);
 313 process = filter(0.9), filter(0.9) : +;
 314 \end{lstlisting}
 315
 316
 317 When \lstinline!-sch! option is used, the content of the additional \textit{architecture/scheduler.h} file is inserted in the generated code. It contains code to deal with WSQ and thread management. The \lstinline'compute()' and \lstinline'computeThread()' methods are the following:
 318 \begin{lstlisting}
 319
 320 virtual void compute (int fullcount,
 321                       float** input,
 322                       float** output)
 323 {
 324         GetRealTime();
 325         this->input = input;
 326         this->output = output;
 327         StartMeasure();
 328         for (fIndex = 0; fIndex < fullcount; fIndex += 32) {
 329                 fFullCount = min (32, fullcount-fIndex);
 330                 TaskQueue::Init();
 331                 // Initialize end task
 332                 fGraph.InitTask(1,1);
 333                 // Only initialize tasks with inputs
 334                 fGraph.InitTask(4,2);
 335                 fIsFinished = false;
 336                 fThreadPool.SignalAll(fDynamicNumThreads - 1);
 337                 computeThread(0);
 338                 while (!fThreadPool.IsFinished()) {}
 339         }
 340         StopMeasure(fStaticNumThreads,
 341                 fDynamicNumThreads);
 342 }
 343 void computeThread (int cur_thread) {
 344         float*  fRec0 = &fRec0_tmp[4];
 345         float*  fRec1 = &fRec1_tmp[4];
 346         // Init graph state
 347         {
 348                 TaskQueue taskqueue;
 349                 int tasknum = -1;
 350                 int count = fFullCount;
 351                 // Init input and output
 352                 FAUSTFLOAT* input0 = &input[0][fIndex];
 353                 FAUSTFLOAT* input1 = &input[1][fIndex];
 354                 FAUSTFLOAT* output0 = &output[0][fIndex];
 355                 int task_list_size = 2;
 356                 int task_list[2] = {2,3};
 357                 taskqueue.InitTaskList(task_list_size, task_list, fDynamicNumThreads, cur_thread, tasknum);
 358                 while (!fIsFinished) {
 359                         switch (tasknum) {
 360                                 case WORK_STEALING_INDEX: {
 361                                         tasknum = TaskQueue::GetNextTask(cur_thread);
 362                                         break;
 363                                 }
 364                                 case LAST_TASK_INDEX: {
 365                                         fIsFinished = true;
 366                                         break;
 367                                 }
 368                                 // SECTION : 1
 369                                 case 2: {
 370                                         // LOOP 0x101111680
 371                                         // pre processing
 372                                         for (int i=0; i<4; i++) fRec0_tmp[i]=fRec0_perm[i];
 373                                         // exec code
 374                                         for (int i=0; i<count; i++) {
 375                                                 fRec0[i] = ((1.000000e-01f * (float)input1[i]) + (0.9f * fRec0[i-1]));
 376                                         }
 377                                         // post processing
 378                                         for (int i=0; i<4; i++) fRec0_perm[i]=fRec0_tmp[count+i];
 379
 380                                         fGraph.ActivateOneOutputTask(taskqueue, 4, tasknum);
 381                                         break;
 382                                 }
 383                                 case 3: {
 384                                         // LOOP 0x1011125e0
 385                                         // pre processing
 386                                         for (int i=0; i<4; i++) fRec1_tmp[i]=fRec1_perm[i];
 387                                         // exec code
 388                                         for (int i=0; i<count; i++) {
 389                                                 fRec1[i] = ((1.000000e-01f * (float)input0[i]) + (0.9f * fRec1[i-1]));
 390                                         }
 391                                         // post processing
 392                                         for (int i=0; i<4; i++) fRec1_perm[i]=fRec1_tmp[count+i];
 393
 394                                         fGraph.ActivateOneOutputTask(taskqueue, 4, tasknum);
 395                                         break;
 396                                 }
 397                                 case 4: {
 398                                         // LOOP 0x101111580
 399                                         // exec code
 400                                         for (int i=0; i<count; i++) {
 401                                                 output0[i] = (FAUSTFLOAT)(fRec1[i] + fRec0[i]);
 402                                         }
 403
 404                                         tasknum = LAST_TASK_INDEX;
 405                                         break;
 406                                 }
 407                         }
 408                 }
 409         }
 410 }
 411
 412 \end{lstlisting}
 413