* pre-processing
 + inlining, full_unroll, dead code elim, flatten_code
 + image reuse
 + scalar deps with proper effects
 + check => ok => deadcode (#define)
 + while-unrolling should check the two last stages for convergence!
 - bug: partial eval is over-optimistic when pointers are involved
* input
 + may skip unrelated statements?
 + deal with |= ||= and similar
 + detect redundant operations? replace them with copies?
 + freia_aipo_cast: should it be managed as a copy? as an option
 + aliasing pointers img1 = img2 are not really managed but are detected
 - images used in the computations must all be "freia_data2d *"
 - what about full expressions: aipo_x() || aipo_y() || aipo_z()? atomizer?
 - check that there are references where expected?
 - handle different types of images (e.g. distinct BPP)
   issue: it is not clear how the image bpp can be derived from the source
   or if it is a purely dynamic property.
   BPP 8 is not supported by Terapix.
 - check/remove more implicit hypotheses?
 - image operations not in aipo/cipo: freia_common_draw_{line,rect}
   should it be in aipo or cipo? should I consider this operation?
* optimizations
 + remove useless images (computed but not used...) see licensePlate
 + take advantage of commutative operators to remove operations
 + copy to remove in freia_46 and freia_47
 + remove redundant measures, esp for terapix
 + other simple or complex algebraic optimization ideas?
   maybe at least A xor A => A = 0 would be useful?
 - sup/inf(a, b, b) -> copy
 + check optimality of test cases (for spoc)
 + redundancy: max coord includes max
 - min commutes with dilate & max with erode, or vice-versa?
   there is no tree rewriting in the dag, just simple operation optimizations
 - should keep AIPO/software calls if accelerator calls are slower?
   how would I know that?
 + erode/dilate {000 010 000} -> copy
 + erode/dilate {000 000 000} -> cst(?)
 - convol 3x3 or others? copy/cst
 - copy(a,a) -> nope
* code
 + cleanup data structures?
 + could separate DAG optimization as a separate phase? AIPO -> AIPO
 - record scalar deps in dag?
 - split first on scalar deps, as it is true for any accelerator?
   not that simple with limited depth?
 - remove type ctxcontent?
 - should make elementary ops absdiff(x,y) = abs(diff(x, y)) (no.. unsigned)
 - and then match back if necessary to available low level ops.
* output
 + symmetry (flipping)
 + compaction
 + handle wiring
 + tell pipsdbm about generated files
 + pipe overflow
 + show DAG!
 + select node shape depending on hardware?
 + add more comments to generated code
 + improve generated comments
 + if only one image is used as input, put it anyway on both sides?
   may help some schedules if there are multiple successors.
 + use commutator to detect more redundancy
 + must not handle copies through the pipeline...
 + should also detect "duplicate" measurements with a copy of the scalars
 + should detect "included" measurements (maxcoord includes max)?
   (done indirectly by the code generation, but could be done on the DAG?)
 + remove dead code on dag optimization
 + DAG dump should differentiate input/output images if same...
 - add checks in generated code (img size and so on)?
 - parametric img size/depth? same as code?
 - the generated code is just a file to pips. ok???
 - should move allocs out of loops? hmmm...
* post-processing
 + remove malloc/free if not used
 + cleanup declarations
 + it seems that some unused images are not removed (license_plate_copies)
 + should cleanup dead code in generated code (ret |= 0; ...)
 + there may be useless copies in the output... see antibio
   external after copies?
 - spoc hwac calls the accelerator for copies? so it shows
* validation
 + check all AIPO
 + add more (elementary) tests?
 + add more application-level tests (promised: 5 CMM, 1 TRT)
* known bugs
 + allocs may be in the middle of the calls
 + license_plate_copies takes too much time to compile? 1mn -> 4.5s
 + reuse of images in some cases when nodes are reordered... SSA?
 + wrong copy extraction order in vs_core_2, and copies not really removed
 - should turn = into image copies? shuffle is at least checked
 - WW scalar dependencies may result in split that could be avoided?
* spoc target
 + check for not implemented functions, and skip them! convolution, other?
 - convolution could be implemented at cpp level?
   would need constant array partial eval?
* terapix target
 + must manage memory allocation in tiles
 + memory for measures?
 + I/O tiles: read before write
 + double buffering with additional tiles max(in-tiles,out-tiles)
 + in place operators
 + shadow 4-way declaration
 + handling of various parameters...
 + detect that an argument is used several times...
 + sequence extraction shouldn't include ops not implemented for the target?
   no, managed later on.
 + use global max length/ops/critical path as a decision driver?
   beware: do not extract input, even if live?!
 + extension: interface with SG terapix microcode generation
   hard constraint: 3 "pointers" are available; border management?
   is there any way to somehow get the values of these pointers
   in a register and extract the six parameters independently?
 + complete measure ops! there is a needed initialization (setcst)...
 + check operators with constant arguments.
 + skip non implemented ops? fast_correlation?
 + cleanup declarations in the generated code, when not used?
 + dynamic optimal tile size computation instead of static?
 + split along connected components?
 - implement "enumerate" dag cutting strategy
 - it would be great if the terapix simulator timings are ok for a paper...
 - wrong placement of tmp images under REPEAT for antibio
* OpenCL target
 + whole image computations are okay: see freia_aipo_compiler
 + compilation benefit: only freia DAG optimizations
 + expensive memory accesses => aggregate computations would help?
 + could there be other benefits from a compiler?
 + use on-demand image transfers in "clever" runtime? (MB)
 + merge one erode/dilate/convol with subsequent arith ops?
 + generate kernel-specialized erode/dilate/convol
 + merge two or more erode/dilate/convol on the same input?
 + merge reductions with same input kernel ops (anr999)?
 + generate some ILP: load before use?
 + possibly generate tiling on the height dimension
 + merge: E+min (freia_24? no - kernel is unknown)
 + mergeable: separate by connected components?
 + that is the place to deal with the one reduction constraint
 - borders: one int & masks?
   would also help convolution with an associated constant array?
 - above: should accept more reductions in mergeable? needs fixing the runtime
 - detect constant transposed kernels (retina, burner)?
 - MA: what about starpu http://runtime.bordeaux.inria.fr/StarPU/
 - could use terapix-like tiling to help with cache?
* hardware generation target
 - use optimized expression DAG(s) for hardware generation?
* general purpose software generation target
 + use optimized DAG
 - generate tiled code which mimics spoc with its delay lines?
 - also reuse automatic image boundary management with constants?
 - what if something different should be done on boundaries?
* multi target ?
 - choose the best hardware depending on the compilations: criterion?
 - possibly add dynamic reconfigurations.
 - not easy to do with the current implementation
 - could do something simple greedy first, spoc first, implemented first...
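
note: the simple per-operation rewrites listed under "optimizations"
(erode/dilate {000 010 000} -> copy, A xor A => 0, sup/inf(b, b) -> copy,
copy(a,a) dropped, commutative operators used to find redundant operations)
could be sketched as below. This is only an illustration, not the hwac
implementation: the Op record, the operator names, the "cst0" marker and the
COMMUTATIVE set are all hypothetical, and the sketch assumes each image is
written only once (SSA-like), which is not guaranteed by the real input.

```python
from typing import List, NamedTuple, Optional, Tuple

# Hypothetical names: nothing here is taken from the PIPS/FREIA sources.
COMMUTATIVE = {"add", "xor", "sup", "inf", "and", "or"}
IDENTITY_KERNEL = (0, 0, 0, 0, 1, 0, 0, 0, 0)  # {000 010 000}


class Op(NamedTuple):
    name: str                            # operation name, e.g. "erode"
    out: str                             # image written by the operation
    ins: Tuple[str, ...]                 # images read by the operation
    kernel: Optional[Tuple[int, ...]] = None  # 3x3 kernel, row-major


def simplify(op: Op) -> Optional[Op]:
    """Apply the local rewrites; return None when the op can be dropped."""
    # erode/dilate restricted to the center pixel is a plain copy
    if op.name in ("erode", "dilate") and op.kernel == IDENTITY_KERNEL:
        op = Op("copy", op.out, op.ins)
    same_args = len(op.ins) == 2 and op.ins[0] == op.ins[1]
    # A xor A => constant 0 image
    if op.name == "xor" and same_args:
        return Op("cst0", op.out, ())
    # sup(b, b) and inf(b, b) are copies of b
    if op.name in ("sup", "inf") and same_args:
        op = Op("copy", op.out, (op.ins[0],))
    # copy(a, a) does nothing
    if op.name == "copy" and op.ins == (op.out,):
        return None
    return op


def optimize(ops: List[Op]) -> List[Op]:
    """Simplify each op, then turn recomputations into copies.

    Assumes every image is written at most once (SSA-like input).
    """
    seen = {}    # canonical (name, inputs, kernel) -> image holding the result
    result = []
    for op in ops:
        op = simplify(op)
        if op is None:
            continue
        # sort operands of commutative ops so add(a,b) matches add(b,a)
        ins = tuple(sorted(op.ins)) if op.name in COMMUTATIVE else op.ins
        key = (op.name, ins, op.kernel)
        if op.name != "copy" and key in seen:
            result.append(Op("copy", op.out, (seen[key],)))
        else:
            seen[key] = op.out
            result.append(op)
    return result
```

With this sketch, add(a,b) followed by add(b,a) keeps the first operation and
turns the second into a copy of its result. The copies that are introduced
would still need the separate copy-removal step mentioned above.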