* pre-processing
+ inlining, full_unroll, dead code elim, flatten_code
+ image reuse
+ scalar deps with proper effects
+ check => ok => deadcode (#define)
+ while-unrolling should check the two last stages for convergence!
- bug: partial eval is over-optimistic when pointers are involved

* input
+ may skip unrelated statements?
+ deal with |= ||= and similar operators
+ detect redundant operations? replace them with copies?
+ freia_aipo_cast: should it be managed as a copy? as an option
+ aliasing pointers (img1 = img2) are not really managed,
  but they are detected
- images used in the computations must all be "freia_data2d *"
- what about full expressions: aipo_x() || aipo_y() || aipo_z()? atomizer?
- check that there are references where expected?
- handle different types of images (e.g. distinct BPP)
  issue: it is not clear how the image bpp can be derived from the source
  or if it is a purely dynamic property. BPP 8 is not supported by Terapix.
- check/remove more implicit hypotheses?
- image operations not in aipo/cipo: freia_common_draw_{line,rect}
  should it be in aipo or cipo? should I consider this operation?

* optimizations
+ remove useless images (computed but not used...) see licensePlate
+ take advantage of commutative operators to remove operations
+ copy to remove in freia_46 and freia_47
+ remove redundant measures, esp for terapix
+ other simple or complex algebraic optimization ideas?
  maybe at least A xor A => A = 0 would be useful?
- sup/inf(a, b, b) -> copy
+ check optimality of test cases (for spoc)
+ redundancy: max coord includes max
- min commutes with dilate & max with erode, or vice versa?
  there is no tree rewriting in the dag, just simple operation optimizations
- should keep AIPO/software calls if accelerator calls are slower?
  How would I know that?
+ erode/dilate {000 010 000} -> copy
+ erode/dilate {000 000 000} -> cst(?)
- convol 3x3 or others? copy/cst
- copy(a,a) -> nope
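
The local rewrites listed above (A xor A, sup/inf with twice the same input, erode/dilate with a center-only kernel, copy onto itself) can be sketched as a per-node simplifier. This is only an illustration: the op_node structure and the OP_* names are hypothetical and are not the compiler's actual DAG types. The all-zero kernel case is left out because the expected constant is still an open question.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation kinds, for illustration only. */
typedef enum {
  OP_COPY, OP_CST, OP_XOR, OP_SUP, OP_INF, OP_ERODE, OP_DILATE, OP_NOP
} op_kind;

typedef struct {
  op_kind kind;
  int out, in0, in1;  /* image identifiers; in1 is unused by unary ops */
  int kernel[9];      /* 3x3 structuring element for erode/dilate */
  int cst;            /* value for OP_CST */
} op_node;

/* True for the kernel {000 010 000}: only the center is selected. */
static bool kernel_is_center_only(const int k[9]) {
  for (int i = 0; i < 9; i++)
    if ((k[i] != 0) != (i == 4)) return false;
  return true;
}

/* Rewrite one node in place; return true if something changed.
   Call repeatedly until it returns false to reach a fixed point. */
static bool simplify(op_node *n) {
  assert(n);
  switch (n->kind) {
  case OP_XOR:                       /* A xor A => constant 0 */
    if (n->in0 == n->in1) { n->kind = OP_CST; n->cst = 0; return true; }
    break;
  case OP_SUP: case OP_INF:          /* sup/inf(a, b, b) => copy(a, b) */
    if (n->in0 == n->in1) { n->kind = OP_COPY; return true; }
    break;
  case OP_ERODE: case OP_DILATE:     /* center-only kernel => copy */
    if (kernel_is_center_only(n->kernel)) { n->kind = OP_COPY; return true; }
    break;
  case OP_COPY:                      /* copy(a, a) => nothing to do */
    if (n->out == n->in0) { n->kind = OP_NOP; return true; }
    break;
  default:
    break;
  }
  return false;
}
```

Rewrites cascade: sup(a, a, a) first becomes copy(a, a), and a second call turns it into a no-op.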

* code
+ cleanup data structures?
+ could separate DAG optimization as a separate phase? AIPO -> AIPO
- record scalar deps in dag?
- split first on scalar deps, as it is true for any accelerator?
  not that simple with limited depth?
- remove type ctxcontent?
- should make elementary ops absdiff(x,y) = abs(diff(x, y)) (no.. unsigned)
- and then match back if necessary to available low level ops.

* output
+ symmetry (flipping)
+ compaction
+ handle wiring
+ tell pipsdbm about generated files
+ pipe overflow
+ show DAG!
+ select node shape depending on hardware?
+ add more comments to generated code
+ improve generated comments
+ if only one image is used as input, put it anyway on both sides?
  may help some schedules if there are multiple successors.
+ use commutator to detect more redundancy
+ must not handle copies through the pipeline...
+ should also detect "duplicate" measurements with a copy of the scalars
+ should detect "included" measurements (maxcoord includes max)?
  (done indirectly by the code generation, but could be done on the DAG?)
+ remove dead code on dag optimization
+ DAG dump should differentiate input/output images if same...
- add checks in generated code (img size and so)?
- parametric img size/depth? same as code?
- the generated code is just a file to pips. ok???
- should move allocs out of loops? hmmm...
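
One possible reading of "use commutator to detect more redundancy": normalize the operand order of commutative operations before comparing nodes, so that add(a, b) and add(b, a) are recognized as the same computation. The sketch below assumes hypothetical vop/vnode types and assumes that an image identifier is never redefined (SSA-like); it is not the actual DAG code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { V_ADD, V_SUB, V_INF, V_SUP } vop;  /* hypothetical ops */

typedef struct { vop op; int in0, in1; } vnode;   /* inputs are image ids */

/* In this toy set, everything but the subtraction is commutative. */
static bool is_commutative(vop op) { return op != V_SUB; }

/* Put the operands of commutative operations in increasing id order. */
static vnode canonical(vnode n) {
  if (is_commutative(n.op) && n.in0 > n.in1) {
    int t = n.in0; n.in0 = n.in1; n.in1 = t;
  }
  return n;
}

static bool same_computation(vnode a, vnode b) {
  a = canonical(a);
  b = canonical(b);
  return a.op == b.op && a.in0 == b.in0 && a.in1 == b.in1;
}

/* Count nodes which recompute an earlier node of the array. */
static int count_redundant(const vnode *nodes, int n) {
  assert(nodes && n >= 0);
  int dup = 0;
  for (int i = 1; i < n; i++)
    for (int j = 0; j < i; j++)
      if (same_computation(nodes[i], nodes[j])) { dup++; break; }
  return dup;
}
```

A redundant node found this way could be replaced by a copy of the earlier result, which the copy removal would then handle.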

* post-processing
+ remove malloc/free if not used
+ cleanup declarations
+ it seems that some unused images are not removed (license_plate_copies)
+ should cleanup dead code in generated code (ret |= 0; ...)
+ there may be useless copies in the output...
  see antibio external after copies?
- spoc hwac calls the accelerator for copies? so it shows

* validation
+ check all AIPO
+ add more (elementary) tests?
+ add more application-level tests (promised: 5 CMM, 1 TRT)

* known bugs
+ allocs may be in the middle of the calls
+ license_plate_copies takes too much time to compile? 1mn -> 4.5s
+ reuse of images in some cases when nodes are reordered... SSA?
+ wrong copy extraction order in vs_core_2, and copies not really removed
- should turn = into image copies? shuffle is at least checked
- WW scalar dependencies may result in split that could be avoided?

* spoc target
+ check for not implemented functions, and skip them! convolution,
   other?
- convolution could be implemented at cpp level?
  would need constant array partial eval?

* terapix target
+ must manage memory allocation in tiles
+ memory for measures?
+ I/O tiles : read before write
+ double buffering with additional tiles max(in-tiles,out-tiles)
+ in place operators
+ shadow 4-way declaration
+ handling of various parameters...
+ detect that an argument is used several times...
+ sequence extraction shouldn't include ops not implemented for the
  target? no, managed later on.
+ use global max length/ops/critical path as a decision driver?
  beware: do not extract input, even if live?!
+ extension: interface with SG terapix microcode generation
  hard constraint: 3 "pointers" are available; border management?
  is there any way to somehow get the values of these pointers in a register
  and extract the six parameters independently?
+ complete measure ops! there is a needed initialization (setcst)...
+ check operators with constant arguments.
+ skip non implemented ops? fast_correlation?
+ cleanup declarations in the generated code, when not used?
+ dynamic optimal tile size computation instead of static?
+ split along connected components?
- implement "enumerate" dag cutting strategy
- it would be great if the terapix simulator timings are ok for a paper...
- wrong placement of tmp images under REPEAT for antibio
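
A back-of-the-envelope sketch of the tile accounting behind the double buffering and tile size items above. It assumes one tile per image live in the computation, plus max(in-tiles, out-tiles) transfer tiles when double buffering, on the reading that read-before-write lets input and output transfers share tiles. The function names and the "rows" memory unit are hypothetical.

```c
#include <assert.h>

static int max_int(int a, int b) { return a > b ? a : b; }

/* Tiles to reserve: n_compute tiles for the computation itself, plus
   max(n_in, n_out) extra transfer tiles when double buffering. */
static int tiles_needed(int n_compute, int n_in, int n_out, int double_buffer) {
  assert(n_compute >= 0 && n_in >= 0 && n_out >= 0);
  return n_compute + (double_buffer ? max_int(n_in, n_out) : 0);
}

/* Largest tile height (in rows) when mem_rows rows of memory are
   split evenly between n_tiles tiles. */
static int tile_height(int mem_rows, int n_tiles) {
  assert(mem_rows > 0 && n_tiles > 0);
  return mem_rows / n_tiles;
}
```

Computing tile_height(mem_rows, tiles_needed(...)) per extracted sequence is one way to get a dynamic tile size instead of a static one.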

* OpenCL target
+ whole image computations are okay: see freia_aipo_compiler
+ compilation benefit: only freia DAG optimizations
+ expensive memory accesses => aggregate computations would help?
+ could there be other benefits from a compiler?
+ use on-demand image transfers in "clever" runtime? (MB)
+ merge one erode/dilate/convol with subsequent arith ops?
+ generate kernel-specialized erode/dilate/convol
+ merge two or more erode/dilate/convol on the same input?
+ merge reductions with same input kernel ops (anr999)?
+ generate some ILP: load before use?
+ possibly generate tiling on the height dimension
+ merge: E+min (freia_24? no - kernel is unknown)
+ mergeable: separate by connected components?
+ that is the place to deal with the one reduction constraint
- borders: one int & masks? would also help convolution with an associated constant array?
- above: should accept more reductions in mergeable? need fixing runtime
- detect constant transposed kernels (retina, burner)?
- MA: what about starpu http://runtime.bordeaux.inria.fr/StarPU/
- could use terapix-like tiling to help with cache?
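
To illustrate why merging an erode with a subsequent arithmetic operation helps with expensive memory accesses, here is a plain C sketch (not OpenCL) of a fused out = erode(a, k) + b. The intermediate eroded image is never written to memory. Assumptions made for the sketch: 8-bit pixels, a 3x3 kernel, a saturated addition, and border pixels simply left untouched; none of these come from the actual runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Fused 3x3 erosion (min over the pixels selected by k) and pixelwise
   saturated add: out = erode(a, k) + b, for interior pixels only. */
static void erode_add_fused(uint8_t *out, const uint8_t *a, const uint8_t *b,
                            int w, int h, const int k[9]) {
  assert(out && a && b && k && w >= 3 && h >= 3);
  for (int y = 1; y < h - 1; y++) {
    for (int x = 1; x < w - 1; x++) {
      int m = 255;  /* neutral element of min for 8-bit pixels */
      for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
          if (k[(dy + 1) * 3 + (dx + 1)]) {
            int v = a[(y + dy) * w + (x + dx)];
            if (v < m) m = v;
          }
      int s = m + b[y * w + x];  /* the add is applied while m is in a register */
      out[y * w + x] = (uint8_t)(s > 255 ? 255 : s);
    }
  }
}
```

The unfused version would store erode(a, k) to a temporary image and read it back for the add, which costs one extra write and one extra read per pixel.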

* hardware generation target
- use optimized expression DAG(s) for hardware generation?

* general purpose software generation target
+ use optimized DAG
- generate tiled code which mimics spoc with its delay lines?
- also reuse automatic image boundary management with constants?
- what if something different should be done on boundaries?

* multi target ?
- choose the best hardware depending on the compilations: criterion?
- possibly add dynamic reconfigurations.
- not easy to do with the current implementation
- could do something simple greedy first, spoc first, implemented first...
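
The "simple greedy" option could look like the sketch below: targets are tried in a fixed priority order and the first one which implements the operation wins. The target enumeration, its ordering and the support table are all hypothetical; a real criterion would need timing information.

```c
#include <assert.h>
#include <stdbool.h>

enum { N_OPS = 4, N_TARGETS = 3 };

/* Hypothetical targets, listed in priority order (spoc first). */
typedef enum { T_SPOC, T_TERAPIX, T_SOFTWARE } target;

/* Return the first target, in priority order, which implements op.
   implemented[t][op] is a made-up support table; software is the
   fallback when nothing else matches. */
static target pick_target(const bool implemented[N_TARGETS][N_OPS], int op) {
  assert(implemented && op >= 0 && op < N_OPS);
  for (int t = 0; t < N_TARGETS; t++)
    if (implemented[t][op]) return (target) t;
  return T_SOFTWARE;
}
```

Choosing per operation like this ignores transfer costs between accelerators, so a real version would more likely choose per extracted sequence.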