J2ME Game Optimization Secrets (part 4)

Low-Level Optimization


All programmers are familiar with the concepts of the subroutine and the function: separate out common code to avoid duplicating it in several places around your app. Unfortunately, this “good” programming practice can hurt performance, because method calls involve a certain amount of overhead. The easiest way to reduce the time it takes to call a method is to choose its declaration modifiers carefully. Our programmer has been very cautious, and has made his work() and workMore() methods synchronized, in case some other thread might call them simultaneously. This is all well and good, but if we’re serious about performance, we often have to make sacrifices, and today we will be sacrificing safety.


Well, not really. We know for a fact that no-one else is going to be calling these methods, so we can de-synchronize them without too much worry. What else can we do? Take a look at this list of method types:


synchronized methods are the slowest, since an object lock has to be obtained
interface methods are the next slowest
instance methods are in the middle
final methods are faster
static methods are fastest
So we definitely shouldn’t be making anything synchronized, and it looks like we can even mark work() and workMore() as final static methods. Doing this cut 1% from the time the emulator spent inside run().
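To make that concrete, here is a before-and-after sketch of the declaration change (the method bodies are unchanged and elided here; only the modifiers matter):

  // Before: cautious, but an object lock must be obtained on every call.
  public synchronized int work( int[] n ) { /* body as before */ return r; }

  // After: no lock to obtain, and the call can be bound statically
  // rather than dispatched virtually.
  public final static int work( int[] n ) { /* body as before */ return r; }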


Another factor that affects the performance of method calls is the number of parameters passed into the method. We are calling workMore() 51712 times, passing an int array and two ints into the method and returning an int each time. In this somewhat trivial example, it’s going to be easy to collapse the workMore() method into the body of work() to avoid the call entirely. In the real world, that decision might be harder to make, especially if it means that code will be duplicated around your application. Test with the profiler and on the device to see how much difference it actually makes before taking that step. If you can’t remove the method altogether, try to reduce the number of parameters you’re passing into it. The more parameters, the greater the overhead.
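For reference, the pre-collapse pair presumably looked something like this (a sketch reconstructed from the description above; the exact code appeared earlier in this series):

  public final static int work( int[] n ) {
    divisor = 1;
    r = 0;
    for ( int j = 0 ; j < DIVISOR_COUNT ; j++ ) {
      // One call per divisor: an int array and two ints in, an int out.
      r = workMore( n, divisor, r );
      divisor *= 2;
    }
    return r;
  }

  private final static int workMore( int[] n, int divisor, int r ) {
    for ( int i = 0 ; i < n.length ; i++ ) {
      r += n[i] * n[i] / divisor + n[i];
    }
    return r;
  }

Collapsed into the body of work(), the call disappears entirely: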


  public final static int work( int[] n ) {
    divisor = 1;
    r = 0;
    for ( int j = 0 ; j < DIVISOR_COUNT ; j++ ) {
      for ( int i = 0 ; i < n.length ; i++ ) {
        r += n[i] * n[i] / divisor + n[i];
      }
      divisor *= 2;
    }
    return r;
  }


Wow! Removing the call to workMore() slashed the amount of time spent in run() to 9.96%. From here on it will be uphill all the way. Let’s look at two general optimization techniques: strength reduction and loop unrolling.


Strength reduction is where you replace a relatively slow operation with a faster one that produces the same result. The most common example is using the bitshift operators, which are equivalent to multiplication and division by a power of two; for division, the equivalence is exact only for non-negative values, because >> rounds toward negative infinity while / truncates toward zero. For example, x >> 2 is equivalent to x / 4 ( 2 to the power of 2 ) and x << 10 is equivalent to x * 1024 ( 2 to the power of 10 ). By an amazing coincidence, our divisor is always a power of 2 (isn’t that lucky!) so we can use bitshifting in place of division.
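A quick illustration of the equivalence, and of the caveat with negative values (the numbers are arbitrary):

  int a = 1000 / 4;    // 250
  int b = 1000 >> 2;   // 250, the same result without a division
  int c = -7 / 2;      // -3: division truncates toward zero
  int d = -7 >> 1;     // -4: the shift rounds toward negative infinity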


Unrolling loops reduces the overhead of flow control code by performing more than one operation each time through the loop, executing fewer iterations, or even removing the loop entirely. As our DIVISOR_COUNT is only 8, it’s going to be easy to unroll our outer loop:


public final static int work( int[] n ) {
  r = 0;
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  n[i] * n[i]  + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 1) + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 2) + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 3) + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 4) + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 5) + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 6) + n[i]; }
  for ( int i = 0 ; i < n.length ; i++ ) { r +=  (n[i] * n[i] >> 7) + n[i]; }
  return r;
}


Two important points. First, you’ll notice that unrolling our loop requires us to duplicate some code. This is the last thing you want in J2ME, where developers are often fighting JAR bloat, but remember that the JARing process involves compression, and compression works best on repetitive data, so the above code might not make as big an impact on your JAR size as you might think. Again, it’s all about the trade-off: your code can be small, fast, easy to read. Pick any two. The second point is that the bitshift operators have a lower precedence than + and -, so you will often have to put parentheses around expressions that otherwise would not need them. Without the parentheses, n[i] * n[i] >> 1 + n[i] would parse as (n[i] * n[i]) >> (1 + n[i]), which is not what we want.


Unrolling the loop and using the bitshift operators shaved off slightly more than 1%. Not bad. Now let’s turn our attention to array access. Arrays are fast data structures in C, but for that reason they can also be dangerous: if your code starts accessing array elements beyond the end of the array, you’re overwriting sections of memory you shouldn’t be messing with, and the consequences will most likely be dire.


In contrast, Java is a very safe language: running off the end of an array like that will simply throw an ArrayIndexOutOfBoundsException. The system checks for an invalid index every time you access an array, which makes array access slower than in C. Again, there’s nothing we can do about the internals of array handling in Java, but we can work around them by making smart decisions. In the code above, for example, we’re accessing n[i] 24 times. We can eliminate most of those array accesses by storing the value of n[i] in a variable. A little high-level thought also reveals that we can rearrange things in a much smarter way, like this…


  private static int divisor;
  private static int r;
  private static int ni;
  public final static int work( int[] n ) {
    r = 0;
    for ( int i = 0 ; i < n.length ; i++ )  {
      ni = n[i];
      r +=  ni * ni + ni;
      r +=  (ni * ni >> 1) + ni;
      r +=  (ni * ni >> 2) + ni;
      r +=  (ni * ni >> 3) + ni;
      r +=  (ni * ni >> 4) + ni;
      r +=  (ni * ni >> 5) + ni;
      r +=  (ni * ni >> 6) + ni;
      r +=  (ni * ni >> 7) + ni;
    }
    return r;
  }


We’ve eliminated a whole lot of looping and array accessing. Let’s now eliminate a whole lot of squaring using a technique known as common sub-expression elimination. We’re calculating the square of n[i] eight times each time through our loop. We can eliminate those repetitive calculations by working out the square once at the beginning:


  private static int r;
  private static int ni;
  private static int nis;
  public final static int work( int[] n ) {
    r = 0;
    for ( int i = 0 ; i < n.length ; i++ )  {
      ni = n[i];
      nis = ni * ni;
      r +=  nis  + ni;
      r +=  (nis >> 1) + ni;
      r +=  (nis >> 2) + ni;
      r +=  (nis >> 3) + ni;
      r +=  (nis >> 4) + ni;
      r +=  (nis >> 5) + ni;
      r +=  (nis >> 6) + ni;
      r +=  (nis >> 7) + ni;
    }
    return r;
  }


…which cuts the time spent in run() down to 6.18%. Not bad at all. Before we move on, let me say a couple more things about arrays. A little high-level optimization (i.e. “thought”) might reveal that an array isn’t necessarily the right data structure to use. Think about using a linked list or some other structure if that would increase performance. Secondly, if you are going to use an array and you ever need to copy its contents into another array, always use System.arraycopy(). It’s far faster than writing your own routine to do the same thing. Finally, arrays generally perform better than the java.util.Vector class. If you require the kind of functionality a Vector offers, think about writing it yourself, and run tests to make sure your code is faster.
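To illustrate the copying point, here is a minimal sketch (the array names are hypothetical):

  int[] src = new int[ 256 ];
  int[] dest = new int[ 256 ];

  // Slow: loop overhead plus a bounds check on every element.
  for ( int i = 0 ; i < src.length ; i++ ) {
    dest[i] = src[i];
  }

  // Fast: a single call copies the whole block natively.
  System.arraycopy( src, 0, dest, 0, src.length );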


Okay, we’re really running out of things to optimize. We just introduced another couple of variables to cache the array element and the value of that element squared. You may have been wondering why those variables were declared outside the body of the method. They’re outside because even declaring ints takes a little time, and we should keep that out of our loop, right? Wrong!


Assume nothing. It’s true that we’re saving time by avoiding the int declarations, but this code may actually be slower than if those variables were declared locally, inside the method. That’s because local variables perform better: they live in the method’s stack frame and are accessed with a single bytecode (iload/istore), whereas a static field access (getstatic/putstatic) has to be resolved through the class’s constant pool, which takes the interpreter longer to execute. So let’s turn them into local variables.
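Here is a minimal sketch of the difference (the field f and both methods are hypothetical, purely for illustration):

  private static int f;

  // Field version: getstatic + putstatic on every iteration.
  public static int sumField( int[] n ) {
    f = 0;
    for ( int i = 0 ; i < n.length ; i++ ) {
      f += n[i];
    }
    return f;
  }

  // Local version: iload + istore, cheap stack-frame access.
  public static int sumLocal( int[] n ) {
    int local = 0;
    for ( int i = 0 ; i < n.length ; i++ ) {
      local += n[i];
    }
    return local;
  }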


Finally, we can tweak our for() loop slightly. Processors are generally better at comparing a number against zero than against any other value, and counting down also saves us from re-reading n.length on every iteration. That means we can turn our loop upside down and rewrite the method like this, so we’re comparing against zero:


  public final static int work( int[] n ) {
    int r = 0;
    int ni;
    int nis;
    int i;
for ( i = n.length ; --i >= 0 ; ) {
      ni = n[i];
      nis = ni * ni;
      r +=  nis  + ni;
      r +=  (nis >> 1) + ni;
      r +=  (nis >> 2) + ni;
      r +=  (nis >> 3) + ni;
      r +=  (nis >> 4) + ni;
      r +=  (nis >> 5) + ni;
      r +=  (nis >> 6) + ni;
      r +=  (nis >> 7) + ni;
    }
    return r;
  }


And that’s it! This code may be a little faster, but the profiler results are less conclusive. What is clear is that this method is tougher to read. There may be further room for improvement, but let’s instead take another look at our paint() method, to see if any of what we learned can be introduced there.


Remember what we learned about local variables? If you are forced to use an instance variable and you reference that variable multiple times inside a method, it may be worth introducing a local variable so that the JVM only has to resolve the reference once. You’ll introduce an extra declaration and assignment, which will slow things down slightly, but as a rule of thumb, use this technique whenever an instance variable is referenced more than twice.


We can also use strength reduction in our paint() method, replacing the modulo operator with a circular counter that uses a bitwise AND. This is only possible because the length of our random numbers cache array is (amazingly!) a power of two. Finally, we can juggle our comparisons to always compare with zero.
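The counter trick works like this (a minimal sketch; the variable name index is hypothetical, and 256 is the length of our random-number arrays):

  // Before: modulo forces a division on every pass.
  index = ( index + 1 ) % 256;

  // After: a bitwise AND gives the same result for non-negative values,
  // because 256 is a power of two ( 255 == 256 - 1 ).
  index++;
  index &= 255;

Here’s our new and improved paint() method: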


  public void paint(Graphics g) {
    StringBuffer sb = stringBuffer;
    Graphics ig = imageGraphics;
    char[] sc = stringChars;
    int sl;
    int ml = messageLength;
    int ril = ri;
    int iw = 0;
    g.setColor( COLOR_BG );
    g.fillRect( 0, 0, dw, dh );
if ( frameTime - oldFrameTime != 0 ) {
      sb.delete( ml, stringLength );
      sb.append( (int)frameTime );
      sl = stringLength = sb.length();
      sb.getChars( ml, sl, sc, ml );
      iw = font.charsWidth( sc, 0, sl );
      ig.setColor( COLOR_BG );
      ig.fillRect( 0, 0, iw, ih );
      ig.setColor( COLOR_FG );
      ig.drawChars( sc, 0, sl, 0, 0, textAnchor );
    }
for ( int i = DRAW_COUNT ; --i >= 0 ; ) {
      g.setClip( randomNumbersX[ril], randomNumbersY[ril], iw, ih );
      g.drawImage( stringImage, randomNumbersX[ril],
                   randomNumbersY[ril], textAnchor );
      ril++;
      ril &= 255;
    }
    ri = ril;
    oldFrameTime = frameTime;
  }


Again, we are at the end of the road with respect to profiler results. These changes don’t affect the number of times the graphics routines are called, so the difference will be minor at best. But when we combine all the changes we made to our work() method and load the new JAR onto the device, the difference is huge. My Motorola i85s now completes the test in 14030ms: over twice as fast!


There is one last change to make to the code. I have left it to the end because it uses a method that is not particularly well documented, and my experience has been that its behavior varies across implementations. Looking at the start() and run() methods on OCanvas, you can see that I’ve been using a separate animation thread. This is the traditional way of doing animation in Java. An issue with using this technique for games is that we are forced to pause every time we iterate through the loop, to allow system events, such as key presses and command actions, to be delivered. We call the wait() method in a synchronized block to make this happen. This is hardly optimized code: after all our hard work optimizing everything else, we’re actually stopping to do nothing right in the thick of things. What’s worse, it’s not easy to arrive at a good figure for WAIT_TIME. If we wait() too long, the game slows down. If we don’t wait() for long enough, key presses are missed and the game stops being responsive to user input.
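For reference, the wait()-based loop we are replacing looks something like this (a sketch of the pattern; OCanvas’s actual code appeared earlier in the series):

  public void run() {
    while ( midlet.running && !finished ) {
      frameTime = System.currentTimeMillis() - frameStarted;
      frameStarted = System.currentTimeMillis();
      result += work( numbers );
      repaint();
      // Stop and do nothing, so that system events
      // ( key presses, command actions ) can be delivered.
      synchronized ( this ) {
        try { wait( WAIT_TIME ); }
        catch ( InterruptedException e ) {}
      }
      loopCounter++;
      finished = ( loopCounter > LOOP_COUNT );
    }
    elapsed = System.currentTimeMillis() - started;
    midlet.exitCanvas( exitStatus );
  }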


J2ME provides a solution to this problem in the Display.callSerially() method. The API states that callSerially( Runnable r ) “causes the Runnable object r to have its run() method called later, serialized with the event stream, soon after completion of the repaint cycle”. By using callSerially() we can skip the call to wait() entirely. The system will ensure that our work() and paint() methods are called in sync with the user input routines, so our game will remain responsive. Here are the new methods…


  public void start() {
    started = frameStarted = System.currentTimeMillis();
    loopCounter = result = 0;
    finished = false;
    exitStatus = EXIT_DONE;
    run();
  }
  public void run() {
frameTime = System.currentTimeMillis() - frameStarted;
    frameStarted = System.currentTimeMillis();
    if ( midlet.running && !finished ) {
      result += work( numbers );
      repaint();
      display.callSerially(this);
      loopCounter++;
      finished = ( loopCounter > LOOP_COUNT );
    }
    else {
elapsed = System.currentTimeMillis() - started;
      midlet.exitCanvas( exitStatus );
    }
  }


…plus we have to declare and get a handle on the Display:


    Display display = Display.getDisplay( midlet );


Without the call to wait(), my i85s now runs this code in 10180ms, making it almost 40% faster again. You might expect an even greater improvement, given that we just eliminated 100 calls that each wait() for 50ms (5000ms of deliberate sleeping, versus the 3850ms we actually saved), but remember that this technique is also about responsiveness to user input.


Again, let me stress that this method of animation should be used with caution. I had trouble getting it to work on Nokia handsets, and all of Nokia’s example code (even game code) uses the wait() technique instead. Even on Motorola I had problems using callSerially() if I added Command objects to the animated Canvas. Test carefully before you try this at home.

