AS3 – Fast memory access without Alchemy

With Flash Player 10, Adobe added a new set of instructions that make it possible to compile C/C++ into code the AVM2 can execute. With a little bit of glue code written in C, Alchemy lets you reuse some of the numerous open source C libraries available.

And once you have seen the speed of Alchemy-compiled C code, you may wonder how it can possibly be so much faster than AS3. Unfair!

What makes Alchemy code so fast? The main secret is faster memory access, because obviously C/C++ is all about pointers & malloc’ing. ByteArray in AS3 is kind of slow, so Adobe had to hack the AVM2 to remove this bottleneck or Alchemy would have been pointless.

And the good news for us AS3 geeks is that it is possible to use this fast memory in AS3…

 

Fast memory in AS3 implementations

A few smart people have managed to expose this feature to “regular” Flash languages.

First, at the end of 2008, Nicolas Cannasse gave a short description of the technical details, and shortly afterwards his haXe language got a nicely integrated fast memory API.

One year later, Joa Ebert hacked fast memory into AS3 and released it as a feature of TDSI, an AS3 bytecode optimizer tool.

Likewise, a week ago, Burak Kalayci (you know, the ASV guy) released Azoth, a little tool specialized in enabling a fast memory API too.

Update June 2010: Joa & others are working on Apparat, a serious general purpose bytecode optimizer which includes TDSI. A recent update of Apparat fixes some issues I found with TDSI.

Apparat vs Azoth vs haXe

All have a similar API using static methods:

  1. select a target ByteArray for fast access (don’t change the target ByteArray during computation)
  2. read/write numbers.
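
The two steps above can be sketched as follows. This is only an illustration, assuming Apparat's apparat.memory.Memory class with select(byteArray), writeInt(value, address) and readInt(address); check the signatures against the version you use. Azoth and haXe follow the same pattern under different names. The class name FastMemorySketch is made up for the example.

```actionscript
package {
    import apparat.memory.Memory; // assumption: Apparat's memory API is on the class path
    import flash.display.Sprite;
    import flash.system.ApplicationDomain;
    import flash.utils.ByteArray;
    import flash.utils.Endian;

    // Hypothetical example class, not taken from the test sources.
    public class FastMemorySketch extends Sprite {
        public function FastMemorySketch() {
            var count:int = 256; // number of ints to store

            var bytes:ByteArray = new ByteArray();
            // The fast memory opcodes read and write in little-endian order.
            bytes.endian = Endian.LITTLE_ENDIAN;
            // The player rejects buffers smaller than its documented minimum.
            bytes.length = Math.max(count * 4, ApplicationDomain.MIN_DOMAIN_MEMORY_LENGTH);

            // Step 1: select the target ByteArray once, before the computation.
            Memory.select(bytes);

            // Step 2: read/write numbers at byte addresses (4 bytes per int).
            for (var i:int = 0; i < count; i++) {
                Memory.writeInt(i * i, i << 2);
            }
            trace(Memory.readInt(10 << 2));
        }
    }
}
```

Addresses are byte offsets into the selected ByteArray, so the i-th int lives at i << 2. Without the post-compilation step, the same calls run through the slower fallback code.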

Both Apparat and Azoth act as an optional post compilation step: the SWF is decompiled, optimized and recompiled. They can easily be automated to be executed after each compilation.

Joa’s approach using Apparat is interesting: this optimization fits naturally within its generic optimization engine.

Azoth, being completely specialized to this optimization, is much faster than TDSI.

Finally haXe: the memory API is built into the compiler so there’s no additional step. Oh, and compilation is just insanely faster than with the Flex SDK.

Some real stats now

As TDSI and Azoth processing is optional, I also measured the timings of the non-optimized SWFs: the numbers are surprising.

The test is very basic and does very little computation: a ByteArray is filled with a gradient and then copied into a BitmapData.
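
For readers who want a picture of the baseline, here is a minimal sketch of what a "direct ByteArray access" version of such a test might look like. It uses only standard Flash Player classes. The 512×256 size, the gradient formula and the class name are assumptions for illustration, not the code from the downloadable sources.

```actionscript
package {
    import flash.display.Bitmap;
    import flash.display.BitmapData;
    import flash.display.Sprite;
    import flash.utils.ByteArray;
    import flash.utils.getTimer;

    // Hypothetical baseline: fill a ByteArray with a gradient, then blit it.
    public class DirectByteArrayGradient extends Sprite {
        private static const W:int = 512; // assumed size
        private static const H:int = 256;

        public function DirectByteArrayGradient() {
            var bmp:BitmapData = new BitmapData(W, H, false, 0);
            var bytes:ByteArray = new ByteArray();
            bytes.length = W * H * 4; // one 32-bit ARGB value per pixel

            var start:int = getTimer();
            bytes.position = 0;
            for (var y:int = 0; y < H; y++) {
                for (var x:int = 0; x < W; x++) {
                    var r:int = (x * 255) / (W - 1); // red grows left to right
                    var g:int = (y * 255) / (H - 1); // green grows top to bottom
                    bytes.writeUnsignedInt(0xFF000000 | (r << 16) | (g << 8));
                }
            }
            var elapsed:int = getTimer() - start;

            // setPixels() reads one unsigned int per pixel from the current position.
            bytes.position = 0;
            bmp.setPixels(bmp.rect, bytes);
            addChild(new Bitmap(bmp));
            trace("fill took " + elapsed + " ms");
        }
    }
}
```

The optimized variants replace the writeUnsignedInt() call in the inner loop with a fast memory write; the rest of the test stays the same.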

The haXe version is honestly identical to the AS3 one. The generated bytecode appeared to be quite similar (only slightly shorter) and no methods were inlined.

Download the test sources and binaries here.

Timings*:

Test                              Debug player   Release player (ActiveX)
Direct ByteArray access           1010ms         543ms
Azoth, non optimized              1428ms         1184ms
Azoth, optimized                  133ms          49ms
TDSI, old version non optimized   > 16s          3856ms
Apparat, non optimized            2214ms         646ms
Apparat, optimized                133ms          49ms
haXe, direct ByteArray access     984ms          555ms
haXe, optimized                   115ms          51ms

* “Release” compilation, Windows 7 32-bit, Core 2 E6600 2.4GHz. Timings are identical using the Flex 3 or Flex 4 SDK.

Some conclusions

The 50ms timing in this test (writing about 6.5 million ints in memory) means you could draw pixel by pixel at 40fps a 1920×1024 bitmap. Not bad I guess.
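
A rough check of that figure, assuming exactly 6.5 million writes (the test only says "about") and that write time scales linearly with the number of pixels:

```latex
\begin{align*}
\text{pixels per frame} &= 1920 \times 1024 = 1\,966\,080 \\
\text{frames written in 50 ms} &\approx \frac{6\,500\,000}{1\,966\,080} \approx 3.3 \\
\text{write time per frame} &\approx \frac{50\ \text{ms}}{3.3} \approx 15\ \text{ms} \\
\text{frame budget at 40 fps} &= \frac{1000\ \text{ms}}{40} = 25\ \text{ms}
\end{align*}
```

Under those assumptions the raw writes take roughly 15 ms of a 25 ms frame, which leaves about 10 ms for the actual per-pixel calculations.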

Please note: this test was entirely focused on memory access, with almost no calculations, so it should only give you an idea of what you can save for one particular operation (memory access). It says nothing meaningful about other compiler optimizations.

haXe first:

haXe, almost a one-man project, has provided this memory API since FP10 came out.

haXe-compiled code is generally more optimized than Adobe’s, but this test does not let haXe show its strength in this regard: if you want to be blown away by the fast memory API + haXe’s awesome compiler optimizations, read this post about a haXe version of an Alchemy experiment.

Azoth is a handy little tool:

Azoth is specialized in this task and it’s very fast – you know you’re not using a Java tool.

It does a great job of providing decent performance even when running the non optimized version (unlike TDSI) – I believe this is a real “selling point” of this tool, especially if you are working with Flash CS, which hardly offers any post-compilation automation.

Sadly this is a Windows-only tool, which excludes a large group of potential users. Being a command line application, it would be reasonable to port it to MacOS/Linux, but I kind of doubt this will happen – it’s apparently just a (nice) little side project from Burak.

Apparat wins as the most practical:

Apparat has improved a lot and is becoming the tool of choice for code optimization (inlining, fast math functions, etc.) – it offers many practical optimization tools, like the MemoryPool, which makes it easy to manage small memory chunks inside the single Fast Memory buffer.

The memory API fallback code is now pretty decent, thanks to the pre-optimized SWC shipping with the tool instead of raw code.

Does anyone at Adobe work on the compiler?

Although I can agree that this memory API is not a major feature to add, these tools prove it is really usable and it would only represent a very small addition to the existing compiler. It could/should have been added in the Flex 4 SDK already.

As an example the latest Flex 4 SDK does not even optimize numeric calculations (like: var len:int = 512 * 256 * 4;). WTF!

Honestly this somewhat worries me. No compiler improvement, marginal VM speed improvement: the Silverlight VM is already way faster than Flash and it won’t be long until JavaScript in the browser catches up.

Applications of the Fast Memory APIs

Admittedly most Flashers are not going to use it but some geeks may find a few uses:

  • Pixel by pixel image manipulation, for example fancy bitmap effects – see Azoth sample bitmap animation,
  • Data crunching – like tons of 3D particles,
  • Binary data reading/writing – binary file formats, image encoders/decoders.

Links to get your hands dirty:

  1. Alex says:

    Download link isn’t working.

  2. Philippe says:

    Sorry for the broken link ;)

  3. Mem's says:

    “The 50ms timing in this test means you could draw pixel by pixel at 40fps a 1920×1024 bitmap”: do you mean 20fps for 50ms per frame?

  4. Philippe says:

    @Mem 50ms timing is for writing 6.5 million pixels in memory – nearly 4 times 1920×1024 pixels.

  5. pleclech says:

    Thanks for your post, the TDSI alchemy memory fallback has been fixed in the latest Apparat version

    http://code.google.com/p/apparat/

    Patrick.

  6. Dominic says:

    It’s sad we have to go through a bunch of different tools to get basic memory access…
    And the only time I’ve ever used alchemy, the code was actually slower than pure AS3.
    Flash is a little disappointing lately in my opinion.

    Nice post though, thanks for sharing.

  7. layola says:

    Hi, great project.
    I'd like to ask a question: how can I rewrite this code using this project?

    canvas.lock();
    canvas.applyFilter(canvas, rect, new Point(), new BlurFilter(2, 2));
    canvas.colorTransform(rect, ct);
    for (var i:uint = 0; i < particles.length; i++) {
        particles[i].x += particles[i].vx;
        particles[i].y += particles[i].vy;
        if (particles[i].x < 0 || particles[i].x > WIDTH) particles[i].vx *= -1;
        if (particles[i].y < 0 || particles[i].y > HEIGHT) particles[i].vy *= -1;
        canvas.setPixel32(particles[i].x, particles[i].y, particles[i].color);
    }
    canvas.unlock();
    I don’t know how to convert setPixel32 to setPixels…

  8. Joa Ebert says:

    The Memory API has been optimized. In fact it is now using the Alchemy opcodes even if you do not use TDSI.

  9. Philippe says:

    @Joa that’s basically what I guessed and meant to say with: “[TDSI] memory API fallback code is now pretty decent, thanks to the pre-optimized SWC”

  10. Matt Bolt says:

    >> Does anyone at Adobe work on the compiler?

    The answer is no. It looks like ASC is closely monitored by only a small handful of devs. Most of the changes made are specific bug fixes. However, they tend to try and “fix” the issues in MXMLC before ever touching ASC, which, sure, covers the bug, but also explains why it takes so long to compile :-P

  11. Rinick says:

    Thank you for this information.
    I find you use ( i / 8 ) << […] >> 3) << 8 and now the "Azoth optimized" version only takes 22ms

posted on 2011-03-01 13:10 by jiahuafu