[Mono-dev] Mono generates inefficient vectorized code

Thu Apr 15 12:15:43 EDT 2010

Hello Rodrigo,

> I'm glad to hear about your improvements. Can you share the code?

Of course I will share my code, but I need to do it through my IP
department.

> I believe this is not the best approach. Mono.Simd was never intended to be
> a variable width simd API. Making such proposition
> makes coding over it significantly harder.
>
Certainly, but our team is actually trying to develop a variable-width
SIMD API. For example, the following loop:
for (int i = 0; i < 1024; i++)
    c[i] = a[i] + b[i] + i;

can be "portably" vectorized to something like:
for (int i = 0; i < 1024; i += vector_size()) {
    vector tmp = load_aligned(a[i]) + load_aligned(b[i]) + vector_uniform(i);
    store_aligned(c[i], tmp);
}

In this example vector_size can be changed to any size (like 16 bytes for SSE,
or 32 for future AVX). So for scalarization we just need to reduce it to 1
and then apply the transformation I wrote about.
Of course, for Mono.Simd in its current form your solution is the only one
possible.
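
To make the idea more concrete, here is a minimal C sketch of the loop above.
It is only an illustration, not our actual API: VECTOR_SIZE, the vector
struct and the vec_* helpers are made-up names, the width is counted in int
lanes rather than bytes, and the lanes are emulated with plain ints instead
of real SIMD registers. vec_index(i) plays the role of vector_uniform(i) in
the pseudocode; it fills lane k with i + k so that each element gets its own
index added, as in the scalar loop.

```c
#include <assert.h>

/* Hypothetical lane count: 4 ints = 16 bytes (SSE-like).
 * Build with -DVECTOR_SIZE=1 to get the "scalarized" variant. */
#ifndef VECTOR_SIZE
#define VECTOR_SIZE 4
#endif

/* Emulated vector: VECTOR_SIZE int lanes (an assumption for illustration). */
typedef struct { int lane[VECTOR_SIZE]; } vector;

/* Load VECTOR_SIZE consecutive ints starting at p. */
static vector vec_load(const int *p) {
    vector v;
    for (int k = 0; k < VECTOR_SIZE; k++) v.lane[k] = p[k];
    return v;
}

/* Store all lanes of v to VECTOR_SIZE consecutive ints starting at p. */
static void vec_store(int *p, vector v) {
    for (int k = 0; k < VECTOR_SIZE; k++) p[k] = v.lane[k];
}

/* Lane-wise addition. */
static vector vec_add(vector x, vector y) {
    vector r;
    for (int k = 0; k < VECTOR_SIZE; k++) r.lane[k] = x.lane[k] + y.lane[k];
    return r;
}

/* Lane k holds i + k, i.e. the loop index of the element in that lane. */
static vector vec_index(int i) {
    vector v;
    for (int k = 0; k < VECTOR_SIZE; k++) v.lane[k] = i + k;
    return v;
}

/* c[i] = a[i] + b[i] + i for i in [0, n); n must be a multiple of VECTOR_SIZE. */
static void add_arrays(const int *a, const int *b, int *c, int n) {
    assert(n % VECTOR_SIZE == 0);
    for (int i = 0; i < n; i += VECTOR_SIZE) {
        vector tmp = vec_add(vec_add(vec_load(&a[i]), vec_load(&b[i])),
                             vec_index(i));
        vec_store(&c[i], tmp);
    }
}
```

With VECTOR_SIZE set to 1 every helper becomes a one-element operation, so
the loop body is just the original scalar statement; that is the "reduce it
to 1" case. The same source is meant to work for wider settings as long as
the array length is a multiple of the width.
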
I would greatly appreciate it if you could send me your patch. I think it's
closely related to what I need to do.
Thank you very much.

-- 
Regards,
Sergei Dyshel

On Tue, Apr 13, 2010 at 19:18, Rodrigo Kumpera <kumpera at gmail.com> wrote:

> Hi Sergei,
>
> I'm glad to hear about your improvements. Can you share the code?
>
> I believe this is not the best approach. Mono.Simd was never intended to be
> a variable width simd API. Making such proposition
> makes coding over it significantly harder.
>
> My suggestion is to implement both scalar replacement and then force
> inlining of all Mono.Simd operations.
>
> For example:
>
> Vector4f a,b,c;
> ...
> a = b + c;
>
> SR would replace it with:
> float a0,a1,a2,a3,b0....
>
> a0 = b0 + c0;
> a1 = b1 + c1;
> ...
>
> This will have acceptable performance and result in equivalent execution
> semantics, which is a much more usable model.
>
> Scalar replacement requires two major changes in the JIT. First we need to
> convert all valuetype operations to use a higher level IR
> without explicit memory operations. Second, with this new IR, we can scalar
> replace all vector types that have no memory ops over them. IOW, something
> like:
>
> Right now "a  = new Vector4f (1,2,3,4)" generates an IR similar to this:
>
> ldaddr R10 <- R8
> storer4_membase [R10 + 0], 1
> storer4_membase [R10 + 4], 2
> storer4_membase [R10 + 8], 3
> storer4_membase [R10 + 12], 4
>
> Which imposes that the vector type must be in memory. If we generate
> something like:
>
> vzero R8
> storer4_field [x] R8, 1
> storer4_field [y] R8, 2
> storer4_field [z] R8, 3
> storer4_field [w] R8, 4
>
> This new IR has no memory ops over the vector type, so we can scalar
> replace it to something like:
>
> r4_const R11, 0
> r4_const R12, 0
> r4_const R13, 0
> r4_const R14, 0
>
> r4_const R11, 1
> r4_const R12, 2
> r4_const R13, 3
> r4_const R14, 4
>
> The first four stores will be removed by the DCE pass.
>
> I have a WIP patch to do the first part of the transformation. It's based
> on a 3 months old trunk and has a bunch of bugs, so it requires quite some
> work before it's functional. I can send it to you, if you want to continue
> working on it.
>
>
> On Tue, Apr 13, 2010 at 12:01 PM, Sergei Dyshel <qyron.private at gmail.com> wrote:
>
>> Hello Rodrigo,
>> Regarding your question unfortunately I cannot apply for GSoC due to time
>> and other constraints.
>>
>> With your tips I managed to extend linear scan on to vector registers and
>> now SIMD code runs much faster. Thank you!
>>
>> My next (:]) question is about "scalarization", i.e. running programs with
>> SIMD intrinsics on non-SIMD platforms (just simulating this with -O=-simd).
>> Current implementation in Mono simply treats vectors as vtypes and passes
>> them by value using stack, thus doing a lot of superfluous memory copies.
>> Therefore "scalarized" code runs slow, way behind code without vector
>> intrinsics.
>>
>> A better solution I'm thinking of is to "reduce" vector size to 1, i.e.
>> interpret Mono.Simd datatypes as corresponding scalar types. For example:
>> Vector4i a;
>> Vector4i b;
>> Vector4i c = op_addition (a, b);
>> will be transformed to something like:
>> int a;
>> int b;
>> int c = op_addition (a,b);
>>
>> Of course, not all code allows such a transformation (it must not use a
>> hard-coded SIMD size but some kind of get-vector-size intrinsic). I
>> tried some tests by manually replacing assembly and it showed great results.
>> But now I want to do the transformation inside the JIT.
>>
>> Can you please help me find the corresponding place in the JIT where I can do
>> the transformation? I tried searching through 'method-to-ir.c' but could
>> not figure out where exactly vtypes can be transformed to scalar types.
>> Thanks!
>> --
>> Regards,
>> Sergei Dyshel
>>
>>
>>
>> On Thu, Apr 8, 2010 at 18:08, Rodrigo Kumpera <kumpera at gmail.com> wrote:
>>
>>> Hi Sergei,
>>>
>>> On Thu, Apr 8, 2010 at 11:59 AM, Sergei Dyshel <qyron.private at gmail.com> wrote:
>>>
>>>> Hello Rodrigo,
>>>> Just picking up this conversation we had some time ago. I was asking why
>>>> JIT does unneeded loads and stores and you answered that this behavior is
>>>> because of lack of global reg allocator. I understand it so that any vreg
>>>> which is used in different basic blocks is "promoted" to "memory variable"
>>>> and hence gets loaded and stored each time.
>>>> Then I asked why bare "global" 'ints' are treated differently (and more
>>>> effectively) and you said that there are callee-saved iregs so there is a
>>>> specialized allocator for them.
>>>> Can you please point at the relevant place in code?
>>>>
>>>
>>> Look into liveness.c / linear_scan.c.
>>> In liveness.c look for mono_analyze_liveness
>>> In linear_scan.c look for mono_linear_scan
>>>
>>>
>>>
>>>> On Altivec we have callee-saved vector registers too. Is it possible to
>>>> use the same trick with them , in order to remove unnecessary loads/stores?
>>>>
>>>
>>> Yes, it might be possible to do so, not sure how much work it will be
>>> though.
>>>
>>>
>>>
>>
>