[Mono-dev] Mono generates inefficient vectorized code

Tue Apr 13 12:18:25 EDT 2010

Hi Sergei,

I'm glad to hear about your improvements. Can you share the code?

I don't think this is the best approach. Mono.Simd was never intended to be
a variable-width SIMD API, and turning it into one would make coding
against it significantly harder.

My suggestion is to implement scalar replacement (SR) and then force
inlining of all Mono.Simd operations.

For example:

Vector4f a,b,c;
...
a = b + c;

SR would replace it with:
float a0,a1,a2,a3,b0....

a0 = b0 + c0;
a1 = b1 + c1;
...

This will have acceptable performance and result in equivalent execution
semantics, which is a much more usable model.
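
To make the before/after concrete, here is a rough C sketch of the same idea.
It is only an illustration, not Mono code: the vector4f struct and both
function names are made up for the example.

```c
/* Hypothetical stand-in for Mono.Simd.Vector4f; not an actual Mono type. */
typedef struct { float x, y, z, w; } vector4f;

/* Before: the vector is an aggregate that is passed and returned by value,
 * which is where the extra memory copies come from. */
static vector4f vector4f_add(vector4f b, vector4f c)
{
    vector4f a = { b.x + c.x, b.y + c.y, b.z + c.z, b.w + c.w };
    return a;
}

/* After inlining and scalar replacement: "a = b + c" becomes four
 * independent float additions on values that can live in registers.
 * The result is written to out[] only so a caller can inspect it. */
static void add_scalar_replaced(float b0, float b1, float b2, float b3,
                                float c0, float c1, float c2, float c3,
                                float out[4])
{
    float a0 = b0 + c0;
    float a1 = b1 + c1;
    float a2 = b2 + c2;
    float a3 = b3 + c3;

    out[0] = a0;
    out[1] = a1;
    out[2] = a2;
    out[3] = a3;
}
```

Both versions compute the same four lanes; the second just never needs the
vector to exist as an object in memory.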

Scalar replacement requires two major changes in the JIT. First, we need to
convert all valuetype operations to use a higher-level IR without explicit
memory operations. Second, with this new IR, we can scalar replace all vector
types that have no memory ops over them.

For example, right now "a = new Vector4f (1,2,3,4)" generates IR similar to
this:

ldaddr R10 <- R8
storer4_membase [R10 + 0], 1
storer4_membase [R10 + 4], 2
storer4_membase [R10 + 8], 3
storer4_membase [R10 + 12], 4

This forces the vector type to live in memory. If we instead generate
something like:

vzero R8
storer4_field [x] R8, 1
storer4_field [y] R8, 2
storer4_field [z] R8, 3
storer4_field [w] R8, 4

This new IR has no memory ops over the vector type, so we can scalar replace
it with something like:

r4_const R11, 0
r4_const R12, 0
r4_const R13, 0
r4_const R14, 0

r4_const R11, 1
r4_const R12, 2
r4_const R13, 3
r4_const R14, 4

The first four instructions (the zero initialization that came from vzero)
are dead and will be removed by the DCE pass.
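
As a rough sketch of those two steps, here is a toy version in C. None of
this is actual JIT code: the toy_ins struct, the opcode names, and both
functions are made up for illustration, and the toy IR has no reads, so its
"DCE" is far simpler than a real pass that needs liveness information.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_LANES 4
#define TOY_MAX_VREGS 64

/* Hypothetical opcodes loosely mirroring the IR listings above. */
enum toy_opcode { TOY_VZERO, TOY_STORER4_FIELD, TOY_R4_CONST };

struct toy_ins {
    enum toy_opcode op;
    int dreg;    /* destination vreg */
    int field;   /* lane 0..3 (x, y, z, w); used by TOY_STORER4_FIELD only */
    float imm;   /* constant operand */
};

/* Rewrite every op on vector vreg `vec` into r4_const ops on the scalar
 * vregs base..base+3. Returns the number of instructions written to `out`,
 * or 0 if `out` (capacity `cap`) is too small. */
static size_t toy_scalar_replace(const struct toy_ins *in, size_t n,
                                 int vec, int base,
                                 struct toy_ins *out, size_t cap)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].dreg == vec && in[i].op == TOY_VZERO) {
            if (k + TOY_LANES > cap) return 0;
            for (int lane = 0; lane < TOY_LANES; lane++)
                out[k++] = (struct toy_ins){ TOY_R4_CONST, base + lane, 0, 0.0f };
        } else if (in[i].dreg == vec && in[i].op == TOY_STORER4_FIELD) {
            if (k + 1 > cap) return 0;
            out[k++] = (struct toy_ins){ TOY_R4_CONST, base + in[i].field, 0, in[i].imm };
        } else {
            if (k + 1 > cap) return 0;
            out[k++] = in[i];
        }
    }
    return k;
}

/* Dead code elimination for this toy IR only: since nothing reads a vreg,
 * a definition is dead exactly when a later instruction redefines the same
 * vreg. Scans backward, keeps the last definition of each vreg, compacts
 * `code` in place and returns the new length. */
static size_t toy_dce(struct toy_ins *code, size_t n)
{
    bool redefined[TOY_MAX_VREGS] = { false };
    size_t w = n;   /* kept instructions are packed into code[w..n) */

    for (size_t i = n; i > 0; i--) {
        struct toy_ins ins = code[i - 1];
        bool in_range = ins.dreg >= 0 && ins.dreg < TOY_MAX_VREGS;
        if (in_range && redefined[ins.dreg])
            continue;               /* overwritten later: drop it */
        if (in_range)
            redefined[ins.dreg] = true;
        code[--w] = ins;            /* w >= i here, so this never clobbers unread input */
    }
    memmove(code, code + w, (n - w) * sizeof *code);
    return n - w;
}
```

Feeding it the vzero/storer4_field sequence for R8 with base vreg R11 gives
the eight r4_const instructions shown above, and the toy DCE then keeps only
the last four.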

I have a WIP patch that does the first part of the transformation. It's based
on a three-month-old trunk and has a bunch of bugs, so it needs quite some
work before it's functional. I can send it to you if you want to continue
working on it.

On Tue, Apr 13, 2010 at 12:01 PM, Sergei Dyshel <qyron.private at gmail.com> wrote:

> Hello Rodrigo,
> Regarding your question: unfortunately, I cannot apply for GSoC due to time
> and other constraints.
>
> With your tips I managed to extend linear scan to vector registers, and
> now SIMD code runs much faster. Thank you!
>
> My next (:]) question is about "scalarization", i.e. running programs with
> SIMD intrinsics on non-SIMD platforms (just simulating this with -O=-simd).
> Current implementation in Mono simply treats vectors as vtypes and passes
> them by value using stack, thus doing a lot of superfluous memory copies.
> Therefore "scalarized" code runs slow, way behind code without vector
> intrinsics.
>
> A better solution I'm thinking of is to "reduce" vector size to 1, i.e.
> interpret Mono.Simd datatypes as corresponding scalar types. For example:
> Vector4i a;
> Vector4i b;
> Vector4i c = op_addition (a, b);
> will be transformed to something like:
> int a;
> int b;
> int c = op_addition (a,b);
>
> Of course, not all code allows such a transformation (it must not use a
> hard-coded SIMD size, but rather some kind of get-vector-size intrinsic). I
> tried some tests by manually replacing the assembly, and they showed great
> results. But now I want to do the transformation inside the JIT.
>
> Can you please help me find the place in the JIT where I can do the
> transformation? I tried searching through 'method-to-ir.c' but couldn't
> figure out where exactly vtypes can be transformed to scalar types.
> Thanks!
> --
> Regards,
> Sergei Dyshel
>
>
>
> On Thu, Apr 8, 2010 at 18:08, Rodrigo Kumpera <kumpera at gmail.com> wrote:
>
>> Hi Sergei,
>>
>> On Thu, Apr 8, 2010 at 11:59 AM, Sergei Dyshel <qyron.private at gmail.com> wrote:
>>
>>> Hello Rodrigo,
>>> Just picking up this conversation we had some time ago. I was asking why
>>> JIT does unneeded loads and stores and you answered that this behavior is
>>> because of lack of global reg allocator. I understand it so that any vreg
>>> which is used in different basic blocks is "promoted" to "memory variable"
>>> and hence gets loaded and stored each time.
>>> Then I asked why bare "global" 'ints' are treated differently (and more
>>> effectively) and you said that there are callee-saved iregs so there is a
>>> specialized allocator for them.
>>> Can you please point at the relevant place in code?
>>>
>>
>> Look into liveness.c / linear_scan.c.
>> In liveness.c look for mono_analyze_liveness
>> In linear_scan.c look for mono_linear_scan
>>
>>
>>
>>> On Altivec we have callee-saved vector registers too. Is it possible to
>>> use the same trick with them , in order to remove unnecessary loads/stores?
>>>
>>
>> Yes, it might be possible to do so, though I'm not sure how much work it
>> will be.
>>
>>
>>
>