An Overview of Whiley's Testing Framework

Sunday, June 13th, 2010

Testing that the Whiley-to-Java Compiler (wyjc) functions correctly is something of a challenge. We must, as usual, check that the correct (JVM) code is generated, and that the type checker is accepting/rejecting the right programs. However, we must also test that the theorem prover is accepting / rejecting the right programs as well. This is complicated because the logic of the theorem prover is undecidable — meaning there are valid Whiley programs that cannot be proved as definitely correct. To deal with this, the compiler uses a three-valued approach: a whiley program can be shown as definitely incorrect, definitely correct or that it’s correctness is unknown. In the latter case, runtime checks are inserted at one or more places and, hence,the generated code may or may not fail at runtime.

Overview

To test the Whiley-to-Java Compiler, I use small input programs which are known to be either valid or invalid. Currently, I have 336 such programs, which are evenly divided across the two cases. Each is typically only a few lines long, although some are more complex. Every valid test produces output, and the expected output is provided for comparison purposes. Thus, the idea is fairly simple: firstly, check the valid programs compile and produce the right output: secondly, check the invalid programs are rejected. However, in practice, it’s a bit more complicated than this, for several reasons:

There are invalid programs which will compile (because of limitations in the theorem prover). We cannot check such programs are rejected at compile time and, hence, we must separate them from those that we expect should be rejected. This division represents the (current) undecidable boundary. As the theorem prover is improved, we would expect this boundary to shift somewhat (although there should always be some programs in this camp).
The default assumption is that it must be possible to compile every Whiley program with the theorem prover turned off. This is because, in theory at least, we cannot tell when the theorem prover will fail to make a decision and, hence, when the compiler will be required to insert a runtime check. So, it must be possible to turn every constraint in a Whiley program into a runtime check. In turn, this means we must test that every generated runtime check is correct. Therefore, we need to test both valid and invalid Whiley programs with the theorem prover turned off, to ensure the checks produced are correct.
There are a number of decidable aspects of checking whether a Whiley program is valid or not. For example, the Whiley compiler employs a traditional type checker — so we can always tell when a Whiley program uses types in an invalid manner. Similarly, the Whiley compiler employs a simple dataflow analysis for checking definite assignment. Again, this is guaranteed to identify Whiley programs which may use variables that have not been defined (similar to the approach used in Java).

So, as a result of all this, I categorise each valid/invalid test program into one of five categories:

Simple Rejection. This program should be rejected by the compiler without using the theorem prover. For example, it exhibits a simple type error, or accesses a variable which is not declared, or was not defined first.
Complex Rejection. This program should be rejected by the compiler, but requires use of the theorem prover. In this case, it is required that the theorem prover can prove the program incorrect.
Runtime Rejection. This program is invalid, but will not be rejected by the compiler because the theorem prover cannot make a decision. Instead, it is expected to produce an exception at runtime.
Definite Acceptance. This program should be accepted by the compiler without producing any runtime checks. In this case, it is required that the theorem prover can prove the program is correct.
Runtime Acceptance. This program is accepted by the compiler, but only with the generation of one or more runtime checks. This occurs because one or more constraints are too complex for the theorem prover to make a definite decision. Thus, one or more runtime checks will be included in the generated code. In this case, it is expected that these checks will pass at runtime (since the program is known to be correct).

Furthermore, for each program in categories 2 + 4, we can generate two actual tests. The first being the expected acceptance/rejection by the compiler using the theorem prover. The second has the theorem prover turned off, and aims to test the runtime check generation is working correctly. In total then, I currently have 554 individual tests spread across the five categories given above.

Discussion

One the main challenges which becomes evident above, is that it is difficult to know when the theorem prover will fail to produce an answer. In fact, the issue is further complicated by the use of a time out. The time out is necessary to prevent the theorem prover from looping forever looking for an answer (which it may or may not be able to find). Thus, splitting my input programs into the above five categories is not entirely trivial. If one is placed in either of the Runtime categories, then it may be that using a slightly faster machine, or a slightly longer time out, would mean the theorem prover does make a decision and, hence, it was in the wrong category. For the moment, I have split them based on how the theorem prover behaves on my various test machines. But, I suspect a better system will be needed eventually!

A second problem caused by the lack of any guarantee as to when the theorem prover will produce an answer is that of compliance. That is, how can we specify what a “compliant” Whiley compiler should, and should not be able to compile? Clearly, the boundary between what the theorem prover definitely will answer, versus what it will not, is going to change. Indeed, it’s changing at the moment on an almost daily basis as I extend and improve it. Given my own personal experiences with compilers (particularly C++) which are not compliant, this is a concern for me. There’s nothing more fustrating than a situation where a program compiles without problem under one compiler, but doesn’t under another. For the moment, I’m simply going to ignore this issue. But, at some point, it will need to be addressed. I expect we’ll end up with a solution along the following lines: a compliant compiler must use a compliant theorem prover; a compliant theorem prover is guaranteed to produce an answer for a well-defined, and fairly large, class of problems. This will represent a compromise, but it’s probably the best we can do. It will mean that, in some very rare cases, one compiler will reject a program, where another accepts it and produces a runtime check.

Finally, this represents the most test cases I have ever had on a project and, almost certainly, the number will steadily increase. This starts to become a problem, as it takes a long time to get through them all, which limits the compile-test cycle. Of course, one approach is to have two (or more) test suites. One aimed for the developer, which is smaller but takes less time; and, the complete suite, which is run nightly. This makes me wonder how best to select the developer’s suite from the complete suite. Some kind of tool which employs statistical methods to ensure a good sample would be useful. Similarly, one which which can (somehow) determine when two tests are really testing the same thing would be very handy in reducing the total number of tests under consideration. In the end, these issues are really for the future … but already i’m starting to think about how best to manage the test suite. I guess we’ll see how it plays out!