[Mono-dev] Mono generates inefficient vectorized code
Jerry Maine - KF5ADY
crashfourit at gmail.com
Tue Apr 13 20:10:13 EDT 2010
Please do, Sergei. I am also very much interested in the code.
Rodrigo Kumpera wrote:
> Hi Sergei,
>
> I'm glad to hear about your improvements. Can you share the code?
>
> I believe this is not the best approach. Mono.Simd was never intended
> to be a variable-width SIMD API, and turning it into one would make
> coding against it significantly harder.
>
> My suggestion is to implement scalar replacement and then force
> inlining of all Mono.Simd operations.
>
> For example:
>
> Vector4f a,b,c;
> ...
> a = b + c;
>
> Scalar replacement (SR) would replace it with:
> float a0,a1,a2,a3,b0....
>
> a0 = b0 + c0;
> a1 = b1 + c1;
> ...
>
> This will have acceptable performance and result in equivalent
> execution semantics, which is a much more usable model.
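To make the scalar replacement example above concrete, here is a minimal sketch in plain C rather than C#. The Vector4f struct and all function names are hypothetical stand-ins, not Mono.Simd or JIT code. The first variant keeps the vector as a struct that is passed and returned by value; the second is what the same computation might look like once every lane is an independent scalar local.

```c
#include <assert.h>

/* Hypothetical stand-in for Mono.Simd's Vector4f: four float lanes. */
typedef struct { float x, y, z, w; } Vector4f;

/* Un-scalarized form: operands and result are whole structs. */
static Vector4f vec_add(Vector4f b, Vector4f c)
{
    Vector4f a = { b.x + c.x, b.y + c.y, b.z + c.z, b.w + c.w };
    return a;
}

/* Sum of the lanes of (b + c), going through the struct type. */
static float lane_sum_struct(const float *b, const float *c)
{
    assert(b && c);
    Vector4f vb = { b[0], b[1], b[2], b[3] };
    Vector4f vc = { c[0], c[1], c[2], c[3] };
    Vector4f a = vec_add(vb, vc);
    return a.x + a.y + a.z + a.w;
}

/* Same computation after inlining and scalar replacement:
 * each lane is a separate scalar that can live in a register. */
static float lane_sum_scalarized(const float *b, const float *c)
{
    assert(b && c);
    float b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];
    float c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    float a0 = b0 + c0;
    float a1 = b1 + c1;
    float a2 = b2 + c2;
    float a3 = b3 + c3;
    return a0 + a1 + a2 + a3;
}
```

Both variants perform the same additions in the same order, which is the "equivalent execution semantics" point: only where the values live changes.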
>
> Scalar replacement requires two major changes in the JIT. First, we
> need to convert all valuetype operations to use a higher-level IR
> without explicit memory operations. Second, with this new IR, we can
> scalar replace all vector types that have no memory ops over them.
> In other words, something like the following.
>
> Right now "a = new Vector4f (1,2,3,4)" generates an IR similar to this:
>
> ldaddr R10 <- R8
> storer4_membase [R10 + 0], 1
> storer4_membase [R10 + 4], 2
> storer4_membase [R10 + 8], 3
> storer4_membase [R10 + 12], 4
>
> This forces the vector type to live in memory. If we instead
> generate something like:
>
> vzero R8
> storer4_field [x] R8, 1
> storer4_field [y] R8, 2
> storer4_field [z] R8, 3
> storer4_field [w] R8, 4
>
> This new IR has no memory ops over the vector type, so we can scalar
> replace it with something like:
>
> r4_const R11, 0
> r4_const R12, 0
> r4_const R13, 0
> r4_const R14, 0
>
> r4_const R11, 1
> r4_const R12, 2
> r4_const R13, 3
> r4_const R14, 4
>
> The first four instructions (the zero initializers) are dead and
> will be removed by the DCE pass.
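The two steps described above (scalar replacement, then dead code elimination) can be sketched on a toy IR. Everything here is an assumption made for illustration: the TOY_* opcodes, the Ins layout and both passes are invented and are not Mono's IR or its real passes. The toy IR has no instructions that read a register, so a constant definition is dead as soon as a later instruction redefines the same register.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcodes loosely modelled on the IR listings above (hypothetical). */
typedef enum { TOY_VZERO, TOY_STORE_FIELD, TOY_R4_CONST } ToyOp;

typedef struct {
    ToyOp op;
    int   reg;    /* vector vreg, or scalar vreg for TOY_R4_CONST */
    int   field;  /* 0..3 for x, y, z, w (TOY_STORE_FIELD only) */
    float imm;    /* constant operand */
} Ins;

/* Rewrite vector ops into per-lane scalar constants.
 * Lane f of the (single) vector vreg maps to scalar vreg base + f. */
static size_t scalar_replace(const Ins *in, size_t n,
                             Ins *out, size_t cap, int base)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].op == TOY_VZERO) {
            for (int f = 0; f < 4; f++) {
                assert(m < cap);
                out[m++] = (Ins){ TOY_R4_CONST, base + f, 0, 0.0f };
            }
        } else if (in[i].op == TOY_STORE_FIELD) {
            assert(m < cap);
            assert(in[i].field >= 0 && in[i].field < 4);
            out[m++] = (Ins){ TOY_R4_CONST, base + in[i].field, 0, in[i].imm };
        } else {
            assert(m < cap);
            out[m++] = in[i];
        }
    }
    return m;
}

/* Drop every constant definition that is overwritten later.
 * Valid only because this toy IR contains no register reads. */
static size_t dce(Ins *code, size_t n)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        int dead = 0;
        if (code[i].op == TOY_R4_CONST) {
            for (size_t j = i + 1; j < n; j++) {
                if (code[j].op == TOY_R4_CONST && code[j].reg == code[i].reg) {
                    dead = 1;
                    break;
                }
            }
        }
        if (!dead)
            code[m++] = code[i];   /* compact in place; m <= i always */
    }
    return m;
}
```

Feeding it the vzero plus four field stores from the listing, with base 11, yields eight r4_const instructions after scalar_replace and four after dce, matching the R11..R14 example.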
>
> I have a WIP patch to do the first part of the transformation. It's
> based on a three-month-old trunk and has a bunch of bugs, so it
> requires quite some work before it's functional. I can send it to
> you if you want to continue working on it.
>
>
> On Tue, Apr 13, 2010 at 12:01 PM, Sergei Dyshel
> <qyron.private at gmail.com> wrote:
>
> Hello Rodrigo,
> Regarding your question, unfortunately I cannot apply for GSoC due
> to time and other constraints.
>
> With your tips I managed to extend linear scan to vector
> registers, and now SIMD code runs much faster. Thank you!
>
> My next (:]) question is about "scalarization", i.e. running
> programs with SIMD intrinsics on non-SIMD platforms (for now I am
> just simulating this with -O=-simd). The current implementation in
> Mono simply treats vectors as vtypes and passes them by value on
> the stack, thus doing a lot of superfluous memory copies. As a
> result, "scalarized" code runs slowly, well behind code without
> vector intrinsics.
>
> A better solution I'm thinking of is to "reduce" the vector size
> to 1, i.e. interpret Mono.Simd datatypes as the corresponding
> scalar types. For example:
> Vector4i a;
> Vector4i b;
> Vector4i c = op_addition (a, b);
> will be transformed to something like:
> int a;
> int b;
> int c = op_addition (a,b);
>
> Of course, not all code allows such a transformation (it must not
> hard-code the SIMD size but use some kind of get-vector-size
> intrinsic). I tried some tests by manually replacing the assembly,
> and they showed great results. Now I want to do the transformation
> inside the JIT.
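The constraint mentioned above (code must query the vector size instead of hard-coding it) can be sketched in C as follows. VEC_WIDTH, VecI and the vec_* helpers are hypothetical names invented for this sketch; Mono.Simd exposes no such variable-width API. The kernel is written only against VEC_WIDTH, so the same source works when the width is reduced to 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "get-vector-size" value: 4 with SIMD, 1 when scalarized. */
#ifndef VEC_WIDTH
#define VEC_WIDTH 1
#endif

/* Hypothetical integer vector with VEC_WIDTH lanes. */
typedef struct { int lane[VEC_WIDTH]; } VecI;

static VecI vec_load(const int *p)
{
    VecI v;
    for (int i = 0; i < VEC_WIDTH; i++)
        v.lane[i] = p[i];
    return v;
}

static void vec_store(int *p, VecI v)
{
    for (int i = 0; i < VEC_WIDTH; i++)
        p[i] = v.lane[i];
}

static VecI vec_add(VecI a, VecI b)
{
    VecI r;
    for (int i = 0; i < VEC_WIDTH; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Width-agnostic kernel: the loop step is VEC_WIDTH, never a literal 4. */
static void add_arrays(const int *a, const int *b, int *c, size_t n)
{
    assert(n % VEC_WIDTH == 0);
    for (size_t i = 0; i < n; i += VEC_WIDTH)
        vec_store(c + i, vec_add(vec_load(a + i), vec_load(b + i)));
}
```

A loop that stepped by a literal 4 would skip three out of every four elements once the width became 1, which is why such code could not be transformed this way.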
>
> Can you please help me find the place in the JIT where I
> can do the transformation? I tried searching through
> 'method-to-ir.c' but could not work out where exactly vtypes can
> be transformed to scalar types.
> Thanks!
> --
> Regards,
> Sergei Dyshel
>
>
>
> On Thu, Apr 8, 2010 at 18:08, Rodrigo Kumpera
> <kumpera at gmail.com> wrote:
>
> Hi Sergei,
>
> On Thu, Apr 8, 2010 at 11:59 AM, Sergei Dyshel
> <qyron.private at gmail.com> wrote:
>
> Hello Rodrigo,
> Just picking up the conversation we had some time ago. I
> asked why the JIT does unneeded loads and stores, and you
> answered that this behavior is due to the lack of a global
> register allocator. As I understand it, any vreg that is
> used in more than one basic block is "promoted" to a "memory
> variable" and hence gets loaded and stored each time.
> Then I asked why bare "global" 'ints' are treated
> differently (and more efficiently), and you said that there
> are callee-saved iregs, so there is a specialized allocator
> for them.
> Can you please point me to the relevant place in the code?
>
>
> Look into liveness.c / linear_scan.c.
> In liveness.c look for mono_analyze_liveness
> In linear_scan.c look for mono_linear_scan
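For readers unfamiliar with the technique those two files are named after, here is a minimal sketch of textbook linear scan allocation over precomputed live intervals. It is not Mono's mono_linear_scan; the Interval type, NUM_REGS and the furthest-end spill heuristic are assumptions made for the sketch. Liveness analysis is assumed to have already produced one [start, end] range per variable, sorted by start.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 2     /* hypothetical number of callee-saved registers */
#define SPILLED  (-1)  /* variable stays in memory */

typedef struct {
    int start, end;    /* live range [start, end] in instruction indices */
    int reg;           /* assigned register, or SPILLED */
} Interval;

/* iv[] must be sorted by increasing start. */
static void linear_scan(Interval *iv, size_t n)
{
    Interval *active[NUM_REGS];   /* interval holding each register */
    for (int r = 0; r < NUM_REGS; r++)
        active[r] = NULL;

    for (size_t i = 0; i < n; i++) {
        assert(i == 0 || iv[i - 1].start <= iv[i].start);

        /* Expire intervals that ended before this one starts. */
        for (int r = 0; r < NUM_REGS; r++)
            if (active[r] && active[r]->end < iv[i].start)
                active[r] = NULL;

        /* Take a free register if there is one. */
        int free_reg = -1;
        for (int r = 0; r < NUM_REGS; r++)
            if (!active[r]) { free_reg = r; break; }
        if (free_reg >= 0) {
            iv[i].reg = free_reg;
            active[free_reg] = &iv[i];
            continue;
        }

        /* No free register: spill whichever interval ends last. */
        int furthest = 0;
        for (int r = 1; r < NUM_REGS; r++)
            if (active[r]->end > active[furthest]->end)
                furthest = r;
        if (active[furthest]->end > iv[i].end) {
            active[furthest]->reg = SPILLED;
            iv[i].reg = furthest;
            active[furthest] = &iv[i];
        } else {
            iv[i].reg = SPILLED;
        }
    }
}
```

Variables that end up SPILLED are the ones that would be loaded and stored around each use, which is the behavior discussed above for vregs that do not get a register.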
>
>
>
> On AltiVec we have callee-saved vector registers too. Is
> it possible to use the same trick with them, in order to
> remove unnecessary loads/stores?
>
>
> Yes, it might be possible to do so; I'm not sure how much work it
> will be, though.