[Mono-dev] Mono generates inefficient vectorized code

Tue Apr 13 20:10:13 EDT 2010

Please do, Sergei. I am also very much interested in the code.

Rodrigo Kumpera wrote:
> Hi Sergei,
>
> I'm glad to hear about your improvements. Can you share the code?
>
> I believe this is not the best approach. Mono.Simd was never intended 
> to be a variable-width SIMD API, and turning it into one would make
> coding against it significantly harder.
>
> My suggestion is to implement scalar replacement and then force 
> inlining of all Mono.Simd operations.
>
> For example:
>
> Vector4f a,b,c;
> ...
> a = b + c;
>
> SR would replace it with:
> float a0,a1,a2,a3,b0....
>
> a0 = b0 + c0;
> a1 = b1 + c1;
> ...
>
> This will have acceptable performance and result in equivalent 
> execution semantics, which is a much more usable model.
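
To make the equivalence concrete, here is a minimal C sketch of the same
addition written twice: once over a four-float aggregate and once with the
aggregate replaced by independent scalars. This is not Mono code. The vec4f
struct and both function names are made up for illustration, and the
horizontal sum is only there so the two forms can be compared by value.

```c
/* Hypothetical stand-in for Mono.Simd's Vector4f; not a Mono type. */
typedef struct { float x, y, z, w; } vec4f;

/* Aggregate form: a = b + c on whole vectors, then a horizontal sum. */
static float sum_of_add_vector(vec4f b, vec4f c)
{
    vec4f a;
    a.x = b.x + c.x;
    a.y = b.y + c.y;
    a.z = b.z + c.z;
    a.w = b.w + c.w;
    return ((a.x + a.y) + a.z) + a.w;
}

/* Scalar-replaced form: the vector locals are gone; each lane is an
 * independent float that a register allocator can place freely. */
static float sum_of_add_scalar_replaced(float b0, float b1, float b2, float b3,
                                        float c0, float c1, float c2, float c3)
{
    float a0 = b0 + c0;
    float a1 = b1 + c1;
    float a2 = b2 + c2;
    float a3 = b3 + c3;
    return ((a0 + a1) + a2) + a3;
}
```

Both functions perform the same four additions in the same order, so they
return the same value for the same lane inputs.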
>
> Scalar replacement requires two major changes in the JIT. First, we 
> need to convert all valuetype operations to use a higher-level IR
> without explicit memory operations. Second, with this new IR, we can 
> scalar replace all vector types that have no memory ops over them. 
> IOW, something like:
>
> Right now "a = new Vector4f (1,2,3,4)" generates an IR similar to this:
>
> ldaddr R10 <- R8
> storer4_membase [R10 + 0], 1
> storer4_membase [R10 + 4], 2
> storer4_membase [R10 + 8], 3
> storer4_membase [R10 + 12], 4
>
> This forces the vector type to live in memory. If we instead generate 
> something like:
>
> vzero R8
> storer4_field [x] R8, 1
> storer4_field [y] R8, 2
> storer4_field [z] R8, 3
> storer4_field [w] R8, 4
>
> This new IR has no memory ops over the vector type, so we can scalar 
> replace it with something like:
>
> r4_const R11, 0
> r4_const R12, 0
> r4_const R13, 0
> r4_const R14, 0
>
> r4_const R11, 1
> r4_const R12, 2
> r4_const R13, 3
> r4_const R14, 4
>
> The first four instructions (the zero initialization from vzero) are
> dead and will be removed by the DCE pass.
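
Why the zero initialization disappears can be sketched with a toy backward
liveness scan over one straight-line block. This is only an illustration
under stated assumptions: the toy_inst layout, MAX_VREGS, and toy_dce are
hypothetical and much simpler than Mono's real IR and DCE pass, and the only
opcode modeled is "r4_const dreg, value".

```c
#include <stdbool.h>
#include <stddef.h>

enum { MAX_VREGS = 32 };

/* Toy instruction "r4_const dreg, value". Hypothetical layout;
 * dreg is assumed to be in the range 0..MAX_VREGS-1. */
typedef struct {
    int   dreg;
    float value;
    bool  dead;
} toy_inst;

/* Walk the block backwards. A definition is dead when its destination
 * is not live at that point, i.e. it is overwritten before any read.
 * live_out[r] says whether vreg r is read after the block.
 * Returns the number of instructions marked dead. */
static size_t toy_dce(toy_inst *code, size_t n, const bool live_out[MAX_VREGS])
{
    bool live[MAX_VREGS];
    size_t removed = 0;

    for (int r = 0; r < MAX_VREGS; r++)
        live[r] = live_out[r];

    for (size_t i = n; i-- > 0; ) {
        toy_inst *ins = &code[i];
        if (!live[ins->dreg]) {
            ins->dead = true;          /* nobody reads this value */
            removed++;
        } else {
            live[ins->dreg] = false;   /* this def satisfies the later read */
        }
        /* r4_const has no source operands, so nothing becomes live here. */
    }
    return removed;
}
```

Fed the eight r4_const instructions above with R11..R14 live on exit, the
scan keeps the last four and marks the first four dead.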
>
> I have a WIP patch to do the first part of the transformation. It's 
> based on a three-month-old trunk and has a bunch of bugs, so it requires 
> quite some work before it's functional. I can send it to you if you 
> want to continue working on it.
>
>
> On Tue, Apr 13, 2010 at 12:01 PM, Sergei Dyshel 
> <qyron.private at gmail.com <mailto:qyron.private at gmail.com>> wrote:
>
>     Hello Rodrigo,
>     Regarding your question, unfortunately I cannot apply for GSoC due
>     to time and other constraints.
>
>     With your tips I managed to extend linear scan to vector
>     registers, and now SIMD code runs much faster. Thank you!
>
>     My next (:]) question is about "scalarization", i.e. running
>     programs with SIMD intrinsics on non-SIMD platforms (I am just
>     simulating this with -O=-simd). The current implementation in Mono
>     simply treats vectors as vtypes and passes them by value on the
>     stack, thus doing a lot of superfluous memory copies. As a result,
>     "scalarized" code runs slowly, far behind code without vector
>     intrinsics.
>
>     A better solution I'm thinking of is to "reduce" the vector size to 1,
>     i.e. interpret Mono.Simd datatypes as corresponding scalar types.
>     For example:
>     Vector4i a;
>     Vector4i b;
>     Vector4i c = op_addition (a, b); 
>     will be transformed to something like:
>     int a;
>     int b;
>     int c = op_addition (a,b);
>
>     Of course, not all code allows such a transformation (it must not use
>     a hard-coded SIMD size but rather some kind of get-vector-size
>     intrinsic). I tried some tests by manually replacing the assembly, and
>     it showed great results. But now I want to do the transformation
>     inside the JIT.
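
The constraint on user code can be illustrated with a small C sketch of a
width-agnostic kernel. Everything here is hypothetical: toy_veci stands in
for a Mono.Simd integer vector, toy_vec_width for the "get-vector-size"
intrinsic, and the TOY_VEC_WIDTH macro for the decision the JIT would make
in the proposal. The point is only that the kernel never mentions the
number 4, so it still works when the width is 1.

```c
#include <stddef.h>

/* Hypothetical lane count: 4 models a SIMD target, 1 models the
 * proposed scalar interpretation of the same vector type. */
#ifndef TOY_VEC_WIDTH
#define TOY_VEC_WIDTH 1
#endif

typedef struct { int lane[TOY_VEC_WIDTH]; } toy_veci;

/* Stand-in for a get-vector-size intrinsic. */
static size_t toy_vec_width(void) { return TOY_VEC_WIDTH; }

static toy_veci toy_load(const int *p)
{
    toy_veci v;
    for (size_t k = 0; k < TOY_VEC_WIDTH; k++)
        v.lane[k] = p[k];
    return v;
}

static void toy_store(int *p, toy_veci v)
{
    for (size_t k = 0; k < TOY_VEC_WIDTH; k++)
        p[k] = v.lane[k];
}

static toy_veci toy_add(toy_veci a, toy_veci b)
{
    toy_veci r;
    for (size_t k = 0; k < TOY_VEC_WIDTH; k++)
        r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

/* Width-agnostic kernel: the stride comes from toy_vec_width().
 * n is assumed to be a multiple of the width; a tail shorter than
 * one vector is left untouched. */
static void add_arrays(int *dst, const int *a, const int *b, size_t n)
{
    size_t w = toy_vec_width();
    for (size_t i = 0; i + w <= n; i += w)
        toy_store(dst + i, toy_add(toy_load(a + i), toy_load(b + i)));
}
```

Code that instead indexed lanes 0..3 directly or stepped by a literal 4
would break under the width-1 interpretation.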
>
>     Can you please help me find the place in the JIT where I
>     can do the transformation? I tried searching through
>     'method-to-ir.c' but could not figure out where exactly vtypes can be
>     transformed to scalar types.
>     Thanks!
>     -- 
>     Regards,
>     Sergei Dyshel
>
>
>
>     On Thu, Apr 8, 2010 at 18:08, Rodrigo Kumpera <kumpera at gmail.com
>     <mailto:kumpera at gmail.com>> wrote:
>
>         Hi Sergei,
>
>         On Thu, Apr 8, 2010 at 11:59 AM, Sergei Dyshel
>         <qyron.private at gmail.com <mailto:qyron.private at gmail.com>> wrote:
>
>             Hello Rodrigo,
>             Just picking up this conversation we had some time ago. I
>             was asking why the JIT does unneeded loads and stores, and you
>             answered that this behavior is due to the lack of a global
>             reg allocator. As I understand it, any vreg which is
>             used in different basic blocks is "promoted" to a "memory
>             variable" and hence gets loaded and stored each time.
>             Then I asked why bare "global" 'ints' are treated
>             differently (and more efficiently), and you said that there
>             are callee-saved iregs, so there is a specialized allocator
>             for them.
>             Can you please point me to the relevant place in the code?
>
>
>         Look into liveness.c / linear_scan.c. 
>         In liveness.c look for mono_analyze_liveness
>         In linear_scan.c look for mono_linear_scan
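
For readers unfamiliar with the technique, a generic linear-scan allocator
over live intervals can be sketched as below. This is a textbook-style toy
and not the code in linear_scan.c: the toy_interval type, TOY_NUM_REGS, and
the policy of spilling whichever interval ends last are all assumptions
made for illustration.

```c
#include <stddef.h>

enum { TOY_NUM_REGS = 2, TOY_SPILLED = -1 };

typedef struct {
    int start, end;   /* live range as inclusive instruction indices */
    int reg;          /* assigned register, or TOY_SPILLED */
} toy_interval;

/* iv must be sorted by increasing start. Each interval gets a register
 * for its whole range or is spilled to memory for its whole range. */
static void toy_linear_scan(toy_interval *iv, size_t n)
{
    int owner[TOY_NUM_REGS];   /* interval index holding each register, -1 if free */

    for (int r = 0; r < TOY_NUM_REGS; r++)
        owner[r] = -1;

    for (size_t i = 0; i < n; i++) {
        int free_reg = -1;     /* first free register, if any */
        int longest = -1;      /* register whose owner ends last */

        for (int r = 0; r < TOY_NUM_REGS; r++) {
            /* Expire owners whose range ended before this one starts. */
            if (owner[r] >= 0 && iv[owner[r]].end < iv[i].start)
                owner[r] = -1;
            if (owner[r] < 0) {
                if (free_reg < 0)
                    free_reg = r;
            } else if (longest < 0 || iv[owner[r]].end > iv[owner[longest]].end) {
                longest = r;
            }
        }

        if (free_reg >= 0) {
            iv[i].reg = free_reg;
            owner[free_reg] = (int)i;
        } else if (iv[owner[longest]].end > iv[i].end) {
            /* Take the register from the interval that ends last. */
            iv[i].reg = longest;
            iv[owner[longest]].reg = TOY_SPILLED;
            owner[longest] = (int)i;
        } else {
            iv[i].reg = TOY_SPILLED;
        }
    }
}
```

With two registers and intervals [0,10], [1,3], [2,8], [4,6], the long
first interval is the one that ends up spilled, and the other three share
the two registers.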
>
>
>
>             On Altivec we have callee-saved vector registers too. Is
>             it possible to use the same trick with them, in order to
>             remove unnecessary loads/stores?
>
>          
>         Yes, it might be possible to do so; I'm not sure how much work it
>         will be, though.
>
>
>
>