<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Squeak People diary for willembryce</title>
    <description>Squeak People diary for willembryce</description>
    <link>http://people.squeakfoundation.org/person/willembryce/</link>
    <item>
      <title>5 Oct 2008</title>
      <pubDate>Sun, 05 Oct 2008 18:37:16 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=89</link>
      <description>&lt;b&gt; An update on Exupery's progress&lt;/b&gt;

&lt;p&gt; The three things that have been added since the last release are:
 &lt;li&gt; More x86 addressing modes 
 &lt;li&gt; Generic primitives (calling out to the interpreter's primitives) 
 &lt;li&gt; Eliot's closure bytecodes and FreeType support merged into the VM 

&lt;p&gt; More x86 addressing modes and generic primitives have helped out
with performance.

&lt;p&gt; The addressing modes help out in send/return code which accesses
VM variables. Previously the address of the variable had to be loaded
into a register, then that used to access it. Now it can be done in
a single instruction.

&lt;p&gt; Generic primitive support increases the number of cases where
Exupery can improve performance. If a primitive is called that Exupery
doesn't know how to compile, it can now dispatch to it via a PIC, which
is much faster than running through the interpreter's lookup code.
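
&lt;p&gt; As a rough illustration of why dispatching through a PIC is cheaper
than a full lookup, here is a minimal Python sketch of a polymorphic
inline cache. It is not Exupery's implementation. The names (CallSite,
make_lookup, the class and method tables) are hypothetical, and the
cache is just a small dictionary attached to one send site.

&lt;pre&gt;
```python
# Hypothetical sketch of a polymorphic inline cache (PIC) for one send site.
# None of these names come from Exupery or the Squeak VM.

class CallSite:
    def __init__(self, selector, full_lookup, max_entries=4):
        self.selector = selector
        self.full_lookup = full_lookup   # slow path: walks the class hierarchy
        self.max_entries = max_entries
        self.cache = {}                  # receiver class name to target function
        self.slow_lookups = 0

    def send(self, receiver_class, *args):
        target = self.cache.get(receiver_class)
        if target is None:
            # Cache miss: fall back to the slow lookup, then remember the result.
            self.slow_lookups += 1
            target = self.full_lookup(receiver_class, self.selector)
            if len(self.cache) != self.max_entries:
                self.cache[receiver_class] = target
        return target(*args)


def make_lookup(method_table, superclass_of):
    # Slow path: search the receiver class, then each superclass in turn.
    def lookup(cls, selector):
        while cls is not None:
            method = method_table.get((cls, selector))
            if method is not None:
                return method
            cls = superclass_of.get(cls)
        raise LookupError(selector)
    return lookup


superclass_of = {"SmallInteger": "Integer", "Integer": "Object", "Object": None}
method_table = {("Object", "hash"): lambda x: x * 31}   # stands in for a primitive
site = CallSite("hash", make_lookup(method_table, superclass_of))
results = [site.send("SmallInteger", n) for n in (1, 2, 3)]
```
&lt;/pre&gt;

&lt;p&gt; Only the first send walks the class hierarchy. The later sends hit
the cache, which is the saving a PIC is after.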

&lt;p&gt; Eliot's closure bytecodes have been merged in because the Pharo
developers would like to start using them. I'd like to see proper
closure support. At the moment the VM will run closure code but
Exupery will not be able to compile it. That will take another VM
change, and an Exupery compiler change to free up the &quot;unused&quot; slot in
MethodContext.

&lt;p&gt; The subpixel changes to bitblt have been merged into the VM, and
the exupery-dev project on the Universes has been modified to load the
packages required to build the FreeType plugin. I've got makefiles that
work. They're not ideal, but I ran into autoconf/automake issues that
prevented using a better solution.


&lt;p&gt; Also, I think I know what one of the remaining bugs is. When returning
into compiled code from interpreted code the compiled context is not
being marked as a root. Contexts are real objects, and need to maintain
the garbage collector's book-keeping. Any objects in old space must be
marked as roots if they point to new space.  To avoid write barrier
checks on context writes (all stack operations, and temporary
assignments), all contexts are marked as roots if they're in old space
when they're entered. Exupery's code does the marking in the return
bytecode because there are fewer returns than points to return to.
Every send, or potential send, has an entry point, so there are many
more entry points than return bytecodes.
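
&lt;p&gt; The rule can be sketched in Python. This is a simplified model, not
the VM's code: the Context class, the in_old_space flag and the
root_table list are hypothetical stand-ins for the real object header
bit and root table.

&lt;pre&gt;
```python
# Simplified model of marking old-space contexts as roots on entry, so that
# later stores into the context need no write barrier check.
# All names here are hypothetical; the real VM uses header bits and a root table.

class Context:
    def __init__(self, in_old_space):
        self.in_old_space = in_old_space
        self.is_root = False
        self.slots = []

root_table = []

def enter_context(context):
    # Called whenever execution enters or returns into a context.
    if context.in_old_space and not context.is_root:
        context.is_root = True
        root_table.append(context)

def push(context, value):
    # No barrier check here: entry already made the context a root if needed.
    context.slots.append(value)

old_context = Context(in_old_space=True)
young_context = Context(in_old_space=False)
for ctx in (old_context, young_context, old_context):
    enter_context(ctx)
push(old_context, "a new-space object")
```
&lt;/pre&gt;

&lt;p&gt; If enter_context is skipped on one path, such as a return from
interpreted code into a compiled context, an old-space context can end
up pointing at new space without being in the root table.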

&lt;p&gt; Next, I'll investigate this potential bug then fix it if it's real.

</description>
    </item>
    <item>
      <title>10 Aug 2008</title>
      <pubDate>Sun, 10 Aug 2008 11:52:37 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=88</link>
      <description>&lt;b&gt; Exupery now supports literal indirect addressing &lt;/b&gt;

&lt;p&gt; The new benchmarks are:
&lt;pre&gt;
arithmaticLoopBenchmark 490 compiled 100 ratio: 4.900
bytecodeBenchmark 857 compiled 299 ratio: 2.866
sendBenchmark 784 compiled 443 ratio: 1.770
doLoopsBenchmark 455 compiled 434 ratio: 1.048
pointCreation 498 compiled 506 ratio: 0.984
largeExplorers 181 compiled 185 ratio: 0.978
compilerBenchmark 278 compiled 271 ratio: 1.026
Cumulative Time 450.292 compiled 283.126 ratio 1.590
&lt;/pre&gt;

&lt;p&gt; The overall ratio has increased from 1.54 to 1.59 and the compiler benchmark has increased from 0.98 to 1.03. The compiler benchmark is beating the interpreter again. It used to on the Athlon, but the Core 2 is much more efficient at running the interpreter.

&lt;p&gt; What's changed in the generated code is that sequences like:
&lt;pre&gt;
(mov 2400 eax)
(mov #rootTableCount ecx)
(cmp eax (ecx))
&lt;/pre&gt;
are now:
&lt;pre&gt;
(mov 2400 eax)
(cmp eax (#rootTableCount))
&lt;/pre&gt;
the optimal encoding would be:
&lt;pre&gt;
(cmp 2400 (#rootTableCount))
&lt;/pre&gt;
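
&lt;p&gt; The rewrite above can be sketched as a small peephole pass. This is
only an illustration in Python, not Exupery's instruction selector.
Instructions are modelled as tuples, a string starting with # stands for
a literal address, and a one-element tuple stands for a memory operand.
The pass assumes the address register is dead after the folded
instruction; a real compiler would have to check liveness.

&lt;pre&gt;
```python
# Illustrative peephole: fold (mov #literal reg) followed by a use of (reg)
# into a single instruction that addresses (#literal) directly.
# Assumption: reg is not read again afterwards; real code must check liveness.

def is_literal(operand):
    return isinstance(operand, str) and operand.startswith("#")

def fold_literal_indirect(instructions):
    result = []
    i = 0
    while i != len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 != len(instructions) else None
        if (following is not None
                and current[0] == "mov"
                and is_literal(current[1])
                and (current[2],) in following[1:]):
            literal, reg = current[1], current[2]
            # Replace the memory operand (reg) with (#literal) and drop the mov.
            folded = tuple((literal,) if op == (reg,) else op for op in following)
            result.append(folded)
            i += 2
        else:
            result.append(current)
            i += 1
    return result

before = [
    ("mov", 2400, "eax"),
    ("mov", "#rootTableCount", "ecx"),
    ("cmp", "eax", ("ecx",)),
]
after = fold_literal_indirect(before)
```
&lt;/pre&gt;

&lt;p&gt; On the three instruction sequence shown earlier this yields the two
instruction form, with the compare reading (#rootTableCount) directly.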

&lt;p&gt; There's still plenty of room to improve performance by improving instruction selection but the gains are likely to be smaller for each individual improvement.

&lt;p&gt; Next up are the ^true, ^false, and ^ nil primitives. These only handle methods that just return true, false, or nil. Bytecodes are used to do ^ true from inside a method that does more work. These primitives exist to avoid creating a context for such trivial methods.</description>
    </item>
    <item>
      <title>8 Aug 2008</title>
      <pubDate>Fri, 08 Aug 2008 23:26:29 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=87</link>
      <description>&lt;b&gt;Planning for Exupery 0.15&lt;/b&gt;

&lt;p&gt; The two areas most in need of improvement before a 1.0 now are run
time performance and reliability. Hopefully 0.15 will lead to a decent
improvement in both. Run time performance comes first, as the end of 0.14
involved a decent round of testing and debugging.

&lt;p&gt; Here's some benchmarks:
&lt;pre&gt;
  arithmaticLoopBenchmark  417 compiled  94 ratio: 4.436
  bytecodeBenchmark        725 compiled 262 ratio: 2.767
  sendBenchmark            692 compiled 403 ratio: 1.717
  doLoopsBenchmark         389 compiled 385 ratio: 1.010
  pointCreation            423 compiled 426 ratio: 0.993
  largeExplorers           198 compiled 199 ratio: 0.995
  compilerBenchmark        245 compiled 249 ratio: 0.984
  Cumulative Time          401 compiled 260 ratio 1.542
&lt;/pre&gt;

&lt;p&gt; The primary goal is to improve the last two benchmarks, the
two macro benchmarks. Both benchmarks use a profiler to decide
what to compile; the goal is to compile enough methods to make
a difference reasonably quickly so the benchmark doesn't take
too long to run.

&lt;p&gt; Here's the profile for compilerBenchmark:
&lt;pre&gt;
CPU: Core 2, speed 3005.67 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
Counted INST_RETIRED.ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        samples  %        image name               app name                 symbol name
4122385  62.5654  4860169  58.3687  squeak                   squeak                   interpret
447635    6.7937  715498    8.5928  anon (tgid:6321 range:0xb1c91000-0xb7bf0000) squeak                   (no symbols)
224375    3.4053  412666    4.9560  squeak                   squeak                   exuperyCreateContext
157809    2.3951  269117    3.2320  squeak                   squeak                   exuperyIsNativeContext
126427    1.9188  230642    2.7699  squeak                   squeak                   allocateheaderSizeh1h2h3doFillwith
96316     1.4618  107350    1.2892  squeak                   squeak                   sweepPhase
87506     1.3281  47304     0.5681  squeak                   squeak                   lookupMethodInClass
53014     0.8046  71782     0.8621  squeak                   squeak                   markAndTrace
52262     0.7932  84112     1.0102  squeak                   squeak                   exuperySetupMessageSend
51999     0.7892  76301     0.9163  squeak                   squeak                   exuperyCallMethod
50920     0.7728  84568     1.0156  squeak                   squeak                   instantiateContextsizeInBytes
47231     0.7168  31047     0.3729  no-vmlinux               no-vmlinux               (no symbols)
42841     0.6502  52130     0.6261  MiscPrimitivePlugin      MiscPrimitivePlugin      primitiveStringHash
42560     0.6459  77447     0.9301  squeak                   squeak                   activateNewMethod
&lt;/pre&gt;

&lt;p&gt; Only 14% of the time is going into code compiled by Exupery and its
helper functions. 62% of the time is still in the main interpreter
loop. Interestingly the ratio between time in native code and time in
exuperyCreateContext is the same as in the send benchmark so it's likely
that the native code is mostly in send processing, either being called
from interpreted code or sending to compiled code.  The native code is
executing 1.6 instructions per cycle; the CPU maxes out at 4
instructions per cycle. 1.6 instructions per cycle would be excellent
for an Athlon but Cores are more efficient. It's still good though.

&lt;p&gt; Half of the time spent in exuperySetupMessageSend is going to
dispatching to unhandled primitives; the other half will be
going to sends to interpreted code.

&lt;p&gt; There's a few obvious things to do to improve performance:
 &lt;li&gt;Implement more addressing modes
 &lt;li&gt;Natively compile calls to C primitives
 &lt;li&gt;Implement the ^true, ^false, and ^ nil primitives
 &lt;li&gt;Remove jumps to jumps.

&lt;p&gt; Implementing more addressing modes looks the most promising. It should
speed up most of the benchmarks as all but the bytecode benchmark
spend significant time in code that suffers badly from a single
missing addressing mode, especially object creation code and
send/return code. 

&lt;p&gt; The current send optimisation is PICs which only work when sending
from compiled code to compiled code. Sends to and from interpreted
code are about the same speed or a little slower than interpreted to
interpreted sends. It's true that this can be avoided by compiling
more methods so most sends are compiled to compiled but it's much
easier to decide what to compile if compiling anything is likely
to lead to a speed improvement and not risk a speed loss.

&lt;p&gt; Compiling the call to the primitive function into native code will
allow the primitives to be dispatched via PIC instead of needing to go
through exuperySetupMessageSend. Half of the calls to
exuperySetupMessageSend in the compiler benchmark are for primitives,
in the large explorers benchmark three quarters of the calls are for
primitives. That time will disappear. Evaluating blocks uses a
primitive send which takes a large proportion of the block dispatch
time. 

&lt;p&gt; There's a handful of primitives that are implemented inside the main
interpret loop. ^ true, ^ false, and ^ nil are some of them. They
often show up when they fail to inline as Exupery can not yet compile
them. If code uses them, then compiling it will cause a large time
loss due to using a full primitive dispatch compared with the
interpreter. Given how simple they are, implementing them makes sense.

&lt;p&gt; Exupery can create code that jumps directly to an unconditional
jump. This does happen in some inner loops. The jumps should be
modified to go to the target jump's destination. Jumping to a jump
makes life difficult for the CPU's front end. In the compiler benchmark
the reservation stations are full only 9% of the time, which
indicates that for most of the time the front end can not keep up with
instruction execution.
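
&lt;p&gt; Removing jumps to jumps is a standard clean-up, often called jump
threading. A minimal Python sketch follows, assuming a hypothetical
representation where each block is a list of instruction tuples and a
block consisting of a single unconditional jmp is the thing to bypass.
The block names are made up and this is not Exupery's code.

&lt;pre&gt;
```python
# Sketch of jump threading over a hypothetical block table.
# blocks maps a block name to a list of instruction tuples.

JUMPS = {"jmp", "jnz", "jumpSignedGreaterEqualThan", "jumpUnsignedGreaterEqualThan"}

def final_target(blocks, name):
    # Follow chains of blocks that contain nothing but an unconditional jmp.
    seen = set()
    while name not in seen:
        seen.add(name)               # guards against a cycle of empty jumps
        body = blocks.get(name, [])
        if len(body) == 1 and body[0][0] == "jmp":
            name = body[0][1]
        else:
            break
    return name

def thread_jumps(blocks):
    threaded = {}
    for name, body in blocks.items():
        new_body = []
        for instruction in body:
            if instruction[0] in JUMPS:
                instruction = (instruction[0], final_target(blocks, instruction[1]))
            new_body.append(instruction)
        threaded[name] = new_body
    return threaded

blocks = {
    "loop": [("cmp", "eax", "ebx"), ("jnz", "hop")],
    "hop": [("jmp", "exit")],
    "exit": [("mov", "eax", "ecx")],
}
threaded = thread_jumps(blocks)
```
&lt;/pre&gt;

&lt;p&gt; After the pass the conditional jump in the loop block goes straight
to exit, so the front end no longer has to fetch the intermediate jump.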

&lt;p&gt; Here's an example of the kind of code that's commonly generated with
addressing mode problems. This example is from the method return
sequence. Every compiled method goes through a block like this when
returning:
&lt;pre&gt;
  (block24
    (mov #nilObj eax)
    (mov (eax) eax)
    (mov eax (8 ecx))
    (mov #activeContext eax)
    (mov ebx (eax))
    (mov #youngStart eax)
    (mov #activeContext ebx)
    (mov (ebx) ebx)
    (cmp (eax) ebx)
    (jumpUnsignedGreaterEqualThan block25)
    (mov #activeContext eax)
    (mov (eax) eax)
    (mov (eax) ebx)
    (mov 1073741824 eax)
    (and ebx eax)
    (jnz block25)
    (mov 2400 eax)
    (mov #rootTableCount ecx)
    (cmp eax (ecx))
    (jumpSignedGreaterEqualThan block26)
    (jmp block27)
   )
&lt;/pre&gt;

&lt;p&gt; The problem is instructions like &quot;(mov #nilObj eax)&quot;. The address
should be encoded in the memory access that uses it. There's no
need to move an address into a register before using it. There are
other problems besides not handling literal indirect addressing but
the literal indirect problem is the largest by a long shot.

&lt;p&gt; I'm going to add literal indirect addressing first as it's harder to
estimate what it'll do to overall performance but it is a problem for
almost all the benchmarks.

&lt;p&gt; It would also be worthwhile improving the profiling tools. It should
be relatively easy to get oprofile to show the compiled method names
instead of lumping all compiled code into the &quot;anon&quot; memory bucket.
It would also be worthwhile and easy to write some code to read the
oprofile files and compute the ratios rather than calculate them by
hand. 
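
&lt;p&gt; A first cut at that ratio script could look like the Python sketch
below. It works on rows already copied out of the oprofile listing above
rather than reading oprofile's files, so the parsing step is left out.
The Row type and helper names are hypothetical. Because both counters
were sampled every 100000 events, instructions per cycle for a symbol is
just its instruction samples divided by its cycle samples.

&lt;pre&gt;
```python
# Sketch: compute the ratios quoted above from oprofile sample counts.
# Rows are (symbol, cycle samples, instruction samples), copied by hand
# from the compilerBenchmark profile; parsing oprofile output is not shown.
from collections import namedtuple

Row = namedtuple("Row", "symbol cycles instructions")

rows = [
    Row("interpret", 4122385, 4860169),
    Row("native code (anon)", 447635, 715498),
    Row("exuperyCreateContext", 224375, 412666),
]

def instructions_per_cycle(row):
    # Valid only because both events use the same sample count (100000).
    return row.instructions / row.cycles

def cycle_ratio(a, b):
    return a.cycles / b.cycles

by_symbol = {row.symbol: row for row in rows}
native = by_symbol["native code (anon)"]
create = by_symbol["exuperyCreateContext"]

native_ipc = instructions_per_cycle(native)      # about 1.6
native_to_create = cycle_ratio(native, create)   # about 2.0
```
&lt;/pre&gt;

&lt;p&gt; On these numbers the native code comes out at roughly 1.6
instructions per cycle, matching the figure worked out by hand.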
</description>
    </item>
    <item>
      <title>30 Jul 2008</title>
      <pubDate>Wed, 30 Jul 2008 21:32:12 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=86</link>
      <description>&lt;b&gt; [ANN] Exupery 0.14 is released &lt;/b&gt;

&lt;p&gt; The major improvement is to the speed of register allocation. I've
fixed a couple of major performance problems so register allocation
takes 50% of compilation time even for the largest methods. Register
allocation now appears to take roughly linear time.

&lt;p&gt; Installation instructions are &lt;a href=&quot;http://wiki.squeak.org/squeak/3945&quot; &gt;here.&lt;/a&gt;

&lt;p&gt; There's still plenty of room to improve compilation time. Every stage
copies the entire intermediate graph to produce the input to the next
stage, which is redundant as most stages only change a few places. The
register allocator's liveness analyser still uses Sets to represent
which variables are live rather than bit vectors.

&lt;p&gt; This release can compile cascades, the last missing core language
feature. Exupery can still only compile a handful of the core
primitives; it only compiles #at: for pointer objects. Cascades were
added because they were used more in 3.10. The choice was either to
delete 2 system tests or to add cascades. I delayed the release to add
cascades.

&lt;p&gt; There's a few bug fixes of old bugs but this release is not noticeably
more reliable than the previous release. It should be a bit more
reliable though, especially when running the new Exupery VM. The new
Exupery VM just has a single bug fix in it.

&lt;p&gt; Here's the benchmarks:
&lt;pre&gt;
  arithmaticLoopBenchmark 414 compiled  94 ratio: 4.404
  bytecodeBenchmark       726 compiled 264 ratio: 2.750
  sendBenchmark           707 compiled 454 ratio: 1.557
  doLoopsBenchmark        388 compiled 398 ratio: 0.975
  pointCreation           433 compiled 423 ratio: 1.024
  largeExplorers          257 compiled 258 ratio: 0.996
  compilerBenchmark       248 compiled 249 ratio: 0.996 
  Cumulative Time         419 compiled 275 ratio  1.519

  ExuperyBenchmarks&amp;gt;&amp;gt;arithmeticLoop              105ms
  SmallInteger&amp;gt;&amp;gt;benchmark                        362ms
  InstructionStream&amp;gt;&amp;gt;interpretExtension:in:for: 6051ms
  Average 612.691
&lt;/pre&gt;

&lt;p&gt; The key benchmark for this release is the last one, compiling
interpretExtension:in:for: which now only takes 6 seconds. With the
previous version of the compiler it used to take over 2 minutes
to compile.

&lt;p&gt; The changes in times in the other benchmarks are mostly due to
me upgrading from an Athlon 64 2.2GHz to a Core 2 3.0GHz. The
register allocator should be slightly more efficient, especially
when compiling send heavy code.

&lt;p&gt; Reliability and compile time performance have dominated the
last few releases. Now run time performance and reliability are
the biggest issues. Exupery still can crash after about an hour's
active use depending on what's being done.
</description>
    </item>
    <item>
      <title>15 Jun 2008</title>
      <pubDate>Sun, 15 Jun 2008 21:41:40 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=85</link>
      <description>&lt;b&gt;  Exupery update preparing for the 0.14 release&lt;/b&gt;

&lt;p&gt; The Exupery 0.14 release is in final testing now. The major
improvement is to register allocator performance. The register
allocator should also produce slightly better code due to the
changes. There's also now support for cascades, the last major missing
language feature.

&lt;p&gt; The main gain in this release is that register allocation is much
faster on large methods. Now it takes about 50% of the compilation time
for all methods; it used to take almost all the time for large methods.
This means compilation time is now roughly linear with method size.

&lt;p&gt; The major improvements for 0.14 were done by mid April. The release
has been delayed due to upgrading to 3.10. Exupery's testing uncovered
two bugs in 3.10 and four tests were failing due to changes in the
base image.

&lt;p&gt; Two tests have been deleted as they were trying to compile methods
that have been removed. Those methods were involved in bugs found by
the stress test. There should be unit tests that cover any changes to
the compiler required to fix them.

&lt;p&gt; Two tests were failing because of cascades. Cascades have been added
so they pass again. The other options were to delete the tests or to
release with failing tests on 3.10.

&lt;p&gt; All that's left to do before releasing is run the stress test again
and do some Seaside development with Exupery running. Both were done
successfully before upgrading to the latest squeak-dev 3.10 images.
</description>
    </item>
    <item>
      <title>3 Mar 2008</title>
      <pubDate>Mon, 03 Mar 2008 23:51:57 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=84</link>
      <description>&lt;b&gt; London Smalltalk Meeting this Saturday&lt;/b&gt;

&lt;p&gt; The meeting is this Saturday. There's a good list of people attending.
If you'd like to attend, please add your name to the wiki page so 
building security will know to let you in.

&lt;p&gt; If there's anything you'd like to present, please feel free to add
your name to the list.

&lt;p&gt; The wiki page is &lt;a href=&quot;http://www.xpdeveloper.net/xpdwiki/Wiki.jsp?page=SmalltalkUK20080308&quot; &gt;here&lt;/a&gt;.

&lt;p&gt; Discussion and our monthly meetings are organised on the UK Smalltalk
mailing list &lt;a href=&quot;http://lists.squeakfoundation.org/mailman/listinfo/uksmalltalk&quot; &gt;here&lt;/a&gt;.
</description>
    </item>
    <item>
      <title>18 Feb 2008</title>
      <pubDate>Mon, 18 Feb 2008 22:46:26 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=83</link>
      <description>&lt;b&gt;Reminder London Smalltalk Meeting Saturday 8th of March&lt;/b&gt;

&lt;p&gt; There will be a Smalltalk meeting in London on Saturday the 8th of
March. We plan to start at 10am with the morning focussed on
beginners. Feel free to show up later, normally the afternoon
will start around 1pm. 

&lt;p&gt; There's already about 20 people signed up on the &lt;a href=&quot;http://www.xpdeveloper.net/xpdwiki/Wiki.jsp?page=SmalltalkUK20080308&quot; &gt;wiki page&lt;/a&gt;. Please add
your name if you're interested in coming or presenting something so
building security will know to let you in.

&lt;p&gt; RSVP required as it's being hosted in corporate offices.

&lt;p&gt; Discussion and suggestions are best placed on the UK Smalltalk &lt;a href=&quot;http://lists.squeakfoundation.org/mailman/listinfo/uksmalltalk&quot; &gt;mailing list&lt;/a&gt;.</description>
    </item>
    <item>
      <title>27 Jan 2008</title>
      <pubDate>Sun, 27 Jan 2008 12:03:24 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=82</link>
      <description>&lt;b&gt;Smalltalk UK meeting on Saturday 8th March&lt;/b&gt;

&lt;p&gt; We're organising a Smalltalk event in London. This is going to be on a
Saturday afternoon, we'll probably head to a pub afterwards. We'll aim
to get a room with a projector.

&lt;p&gt; Details are &lt;a href=&quot;http://www.xpdeveloper.net/xpdwiki/Wiki.jsp?page=SmalltalkUK20080308&quot; &gt;here.&lt;/a&gt;
	
&lt;p&gt; RSVP required as it's being hosted in corporate offices. Please add
your name if you're interested in either attending or presenting.

&lt;p&gt; Discussion will be on the UK Smalltalk &lt;a href=&quot;http://lists.squeakfoundation.org/mailman/listinfo/uksmalltalk&quot; &gt;mailing list&lt;/a&gt;.</description>
    </item>
    <item>
      <title>3 Jan 2008</title>
      <pubDate>Thu, 03 Jan 2008 21:58:12 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=81</link>
      <description>&lt;b&gt; The next Saturday London Smalltalk Gathering &lt;/b&gt;

&lt;p&gt; We're thinking about organising the next Saturday London Smalltalk
gathering on the 8th of March. If there's any good reason to pick
a different date please tell us before we start booking rooms. We'll
aim to have a room with a projector so people can give talks. As
always we'll probably migrate to a pub for dinner then a few beers
afterwards.

&lt;p&gt; Discussion will be on the UK Smalltalk &lt;a href=&quot;http://lists.squeakfoundation.org/mailman/listinfo/uksmalltalk&quot; &gt;mailing list&lt;/a&gt;.</description>
    </item>
    <item>
      <title>17 Dec 2007</title>
      <pubDate>Mon, 17 Dec 2007 22:47:48 -0700</pubDate>
      <link>http://people.squeakfoundation.org/person/willembryce/diary.html?start=80</link>
      <description>&lt;b&gt; Thinking about Exupery 0.14&lt;/b&gt;

&lt;p&gt; The primary goal for the next releases will be making the following
benchmarks more compelling. I've added a compile time benchmark as
there are a few performance bugs in the compiler that should be
removed.

&lt;p&gt; &lt;pre&gt;
  arithmaticLoopBenchmark 1396 compiled  128 ratio: 10.906
  bytecodeBenchmark       2111 compiled  460 ratio:  4.589
  sendBenchmark           1637 compiled  668 ratio:  2.451
  doLoopsBenchmark        1081 compiled  715 ratio:  1.512
  pointCreation           1245 compiled 1317 ratio:  0.945
  largeExplorers           728 compiled  715 ratio:  1.018
  compilerBenchmark        483 compiled  489 ratio:  0.988
  Cumulative Time         1125 compiled  537 ratio   2.093

  ExuperyBenchmarks&amp;gt;&amp;gt;arithmeticLoop                249ms
  SmallInteger&amp;gt;&amp;gt;benchmark                         1112ms
  InstructionStream&amp;gt;&amp;gt;interpretExtension:in:for: 113460ms
  Average                                         3155.360
&lt;/pre&gt;

&lt;p&gt; First, I'll get the register allocator to allocate each section of
a method separately. After that, I'll probably do some work on further
optimising the register allocator but I might work on improving the
generated native code.

&lt;p&gt; Register allocating each section separately will allow for both better
and faster allocation. It will make it easy to avoid dealing with
registers and interference from other sections of the code and will
reduce the size of the problem. Colouring register allocation written
well should be on average n log n time but the performance bugs will
raise that to probably n^2.

&lt;p&gt; It's possible that just allocating each section of the method
separately will be enough to bring allocation down to a reasonable
time. It should definitely help for the larger methods but is unlikely
to do anything for the arithmaticLoop and will only help the bytecode
benchmark slightly. Compiling quicker will make it easier to run more
extensive tests.
</description>
    </item>
  </channel>
</rss>
