Articles

Reflecting on the JVM Class File Format

The latest release of Whiley (v0.3.16) includes (finally) a binary file format for the Whiley Intermediate Language (WYIL).  You can think WYIL files are to Whiley, as class files are to Java, or CIL files are to C#.  Furthermore, the existence of WYIL files implies there is a corresponding (albeit currently hypothetical) Whiley Virtual Machine (WYVM) which can execute them directly.  Developing this VM file format has been an interesting experience, and caused me to consider: what makes a good VM file format?

Without doubt, the JVM class file format was my main influence when developing the binary WYIL file format (rather than e.g. CIL or LLVM BitCode).  This is simply because I’ve spent a lot of time working with JVM class files and Java Bytecode in the past. But, there seem to be some obvious goals for any modern VM file format:
  1. Quick to decode and verify in order that the VM can start up as quickly as possible.
  2. Compact to minimise costs associated with e.g. network transportation.
  3. Scalable so that we are not restricted to some maximum file size.
  4. Extensible so that we can embed additional information as the language evolves, or for other non-computational purposes (e.g. embedding documentation or licenses).

In my opinion, the JVM class file format does a reasonable job — but, it clearly falls short in a number of areas.  The presence of attribute blocks in the format means it is very extensible.  However, many important values are encoded using fixed-width unsigned integer values (e.g. as u8, u16, etc).  This has led to annoying limitations on the maximum size of various things (e.g. that a code attribute cannot exceed 64KB).

Anyway, here are a few more interesting thoughts I had about JVM class files:

  • The format needed to be more resilient to large class file sizes. In particular, using fixed-width integers at key positions was an inherently bad idea.  LLVM Bitcode has a better solution where arbitrary sized integers are encoded using 4-bit nibbles where the high bit signals if another nibble follows or not.
  • The constant pool should have been stratified.  That is, where pool items of the same kind are located together rather than being potentially located anywhere in the pool.  This is useful as, in many situations, we know what kind of pool item to expect (e.g. a getfield bytecode always refers to a CONSTANT_Fieldref_Info pool item).  In such situations, smaller index values could be used (i.e. which are offset from the start of the group, rather than the whole pool).
  • Types should have been fully embedded into bytecodes.  Since only partial type information is embedded in JVM class files, the bytecode verifier must recover the missing information (which is expensive, and the reason why the StackMapTable is now a required attribute).
  • Bytecode branch offsets should not be in bytes. Storing branch offsets in bytes (e.g. for bytecodes such as goto) makes sense when bytecodes are interpreted “in place” — that is, when the interpreter does not expand them into an array of e.g. Bytecode objects, but instead interprets the raw bytes directly. But, I don’t think many interpreters would do this. That’s because decoding bytecodes is relatively expensive, and so it makes sense to carry this cost once and expand them into more easily managed objects. In such case, it makes much more sense for branch offsets to be given in terms of whole bytecodes, rather than individual bytes. This increases the range of a given branch, and also simplifies encoding since we avoid any difficult algorithms for computing branch offsets
  • Register-based bytecodes instead of stack-based bytecodes. More modern bytecode formats (e.g. LLVM BitCode and Dalvik Bytecode) tend to prefer register-based bytecodes over stack-based ones. This is partly because it’s generally easier to generate more efficient low-level machine code when starting from a register-based format. Personally, I also find that code for manipulating register-based bytecodes is simpler and more robust. This is because, with a stack-based format, one must always be very careful to maintain stack consistency.
  • Issues of architecture and implementation should not have leaked into the file format.  The most obvious example is the need for two register/stack slots when storing long and double primitives. Likewise, the need to word-align switch bytecodes is just plain wierd.

Anyway, those are my thoughts! And, I’m sure there are plenty of other design issues people have encountered …

6 comments to Reflecting on the JVM Class File Format

  • Kannan Goundan

    Why do you say that register-based bytecode formats lets you generate better native code? The translation between register-based and stack-based is straightforward.

    The linked article says that it made things easier for their peephole optimizer to see that two instructions didn’t interfere, but they didn’t get into much detail. I wonder if they could have gotten the same speedups by making a slightly smarter algorithm to detect whether two stack-based bytecode instructions interfered or not.

  • Hey Kannan,

    Stack-based bytecodes make it harder to apply certain optimisations. Consider constant propagation, for example. This is one I had problems with. If I want to replace one bytecode with another (because e.g. I know its result is actually a constant), I have to be very careful to maintain stack height consistency. In the special case of bytecodes compiled from Whiley source, this isn’t too hard as I know certain invariants which are guaranteed. But, WYIL should exist in its own right and solving constant propagation for a stack-based architecture in the general case is challenging. It’s not impossible though … it’s just harder.

  • André Pankraz

    Hi,

    converting Stack-based to register-based is trivial, introduce stack-registers s0…s_maxStack and push sets stack register s(top) and you are done. Thats really a none-argument.

    Is it possible to write more compact byte-code with register-based bytecode? yes…compare with Dalvik bytecode.
    for small register-numbers (normal case for stack-operations) you find highly compact representations like “ADD8 reg0 reg1 regTarget”, Java is more like “LOAD 0, LOAD 1, ADD, STORE target” – 1 byte to 4 bytes.

    The JVM Bytecode is terrible, the class files are often larger than the Source code, but the reason is not the stack-based code attribute, but the very high redundancy because all information are stored in separate class files and not in a package format like dalvik DEX.

    I would say the single innovative stuff in Dalvik is the DEX format. All other things are copied and adapted from the JVM. Even the register based commands are nearly 1:1 from JVM, mechanically translated into register-based and optimizations for size.

    Best regards,
    André

  • Hi Andre,

    stack-registers s0…s_maxStack and push sets stack register s(top) and you are done. Thats really a none-argument.

    I think you miss the point. Sure, I can convert from stack-based to register-based. But, if I prefer the register-based form, then I should start in that form rather than forcing my system to convert to it whenever a bytecode file is loaded.

    Also, there are advantages of the register-based approach which you don’t one doesn’t see in this conversion. For example, bytecodes can read registers in a non-sequential order. With a stack-based format, you must always read stuff of the stack in the order you put it there.

  • André Pankraz

    You can prefer what you want…I’m just saying that “converting” stack-presentation to register-presentation is like zero cost for the bytecode parser, it’s just a different interpretation of the stack. And because of this, the JVM can carry out all optimizations like with register-based bytecode.

    It’s more complicated to convert register based to stack based because of non-sequential and multiple-read. Then you need a register-alive analysis and some local rules.

    Why shouldn’t I see the advantages for non-sequential? I mentioned operations like ADD8 etc., that are possible because of this.

    Again – the major problem is not alone the code attribute but the separated class files. The shared constant pool, the shared inner classes information etc. etc. – they make DEX more compact.

    StackMapTable is a new addition because the bytecode verifier can use pre-computed stack frames to validate the types much faster (instead of a type-access-emulation loop). This information is seperated in Dalvik too.
    Why should it be much faster to embed this? It’s for bytecode verification, CPU machine code doesn’t even have this info.

  • Why shouldn’t I see the advantages for non-sequential?

    Sorry, I wasn’t being specific to you. I should have said: “which one doesn’t see in the conversion”.

    Yeah, Dalvik is one of the motivations behind going to a register based bytecode. And, definitely, the DEX are really good and I have already been thinking of my own similar notion. Just need a good name for it …

    D

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>