Assumptions
Assumptions can get you into all kinds of trouble. Making assumptions about people based on superficial attributes is called prejudice. The term means pre-judging: making a judgment without the facts. Prejudicial attitudes have wreaked havoc over the ages, ranging from the subtle to the violent. Perhaps if people -- all of us -- spent some time gathering facts, replacing assumptions with knowledge, before making decisions or taking actions, the world might be a better place.
Although not nearly as socially (and legally) unacceptable as prejudices aimed at human beings, assumptions in the world of verification can cause significant problems. I am confident that every day, in every company engaged in electronic design and verification, engineers spend time dealing with the consequences of assumptions or trying to unwind them -- whether they recognize them as such or not.
An assumption is a dependency. More specifically, it's an unwanted or unintended dependency. It's code, whether a single line, a small fragment, or an entire subsystem, that depends on attributes or behaviors of other code outside its own scope.
The simplest kind of assumption is a hardcoded value. While seemingly innocuous, hardcoded values are a rich source for bugs and time sunk in debugging. We've all seen code like this:
reg_write(32'hffff7200, 16'h602c);
Write a value to an address. It seems harmless enough, and it probably worked just fine when it was written. However, code like this embodies assumptions, and it can be the cause of bugs, subtle and not so subtle, that consume hours or weeks of debugging.
Even though it worked initially, code that contains hard-coded constants may cease to be correct as the environment around it changes. For example, the address map may change: the register previously located at 32'hffff7200 may have moved elsewhere. Or the sixteen-bit register may have been split into two eight-bit registers, so a sixteen-bit write silently discards the high-order eight bits. Or 32'hffff7200 may still be a valid address in the new address map, but writing to it no longer causes the same activity as in the previous design, so the test that performs this register write silently does the wrong thing. Or the value 16'h602c may set and clear bits that determine the behavior of the DUT; if the behaviors or modes associated with those bits change, the testbench again silently does the wrong thing. The consequences range from the subtle, such as reduced functional coverage, to obviously incorrect results. In any case, an engineer could spend serious debugging time just tracing the problem back to this one statement.
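One remedy is to give the address and the value a single, named home. The sketch below is illustrative: the package, the register name CTRL_REG_ADDR, and the field masks are hypothetical names invented here, not from the original design; only reg_write and the two literal values come from the example above.

```systemverilog
// Hypothetical package acting as the single source of truth for the
// address map and field layout. If either changes, only this package
// needs to be updated.
package dut_regs_pkg;
  localparam logic [31:0] CTRL_REG_ADDR = 32'hffff7200;
  // Field masks documenting what the bits mean (names are invented).
  localparam logic [15:0] CTRL_ENABLE   = 16'h0020;
  localparam logic [15:0] CTRL_MODE_A   = 16'h600c;
endpackage

// The write now states its intent instead of burying it in magic numbers.
// (16'h0020 | 16'h600c == 16'h602c, the original value.)
reg_write(dut_regs_pkg::CTRL_REG_ADDR,
          dut_regs_pkg::CTRL_ENABLE | dut_regs_pkg::CTRL_MODE_A);
```

A full register abstraction layer goes further, but even simple named constants turn an invisible assumption into something a reviewer can see and question.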
Timing assumptions are even more dangerous and harder to locate. Consider as an example a fragment of a sequence that puts a delay between transactions.
for (i = 0; i < MAX; i++) begin
  generate_transaction();
  #17;
end
What does the #17 delay mean? Is it related to the clock period? Does it have to do with the time required for the DUT to generate a response? There's no way to know from the code. Like the register write, this delay probably worked fine when it was written. As things evolve it may cease to be correct. How will we know? Most likely we won't. The testbench will once again silently do the wrong thing. A scoreboard failure may occur, or the design may simply appear not to work, either of which can trigger hours, days, or weeks of debugging time.
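The fix is to state what the delay actually represents. The sketch below assumes the delay was meant to be a couple of cycles of settling time on a clock named clk; both the signal name and the rationale are hypothetical, so substitute whatever the real requirement is.

```systemverilog
// Hypothetical names: clk is the DUT clock, SETTLE_CYCLES documents
// why we wait at all.
localparam int SETTLE_CYCLES = 2;

for (int i = 0; i < MAX; i++) begin
  generate_transaction();
  // Synchronize to the clock rather than an absolute time, so the
  // code survives a clock-period change and the intent is explicit.
  repeat (SETTLE_CYCLES) @(posedge clk);
end
```

If the delay instead depends on DUT response time, waiting on the response itself (a handshake signal or transaction callback) removes the timing assumption entirely.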
An even more subtle and insidious assumption is the race condition. A race condition occurs when two threads operate at the same time and the order in which the threads execute affects the result. Consider the following simple example:
a = -1;
fork
  begin
    a += 3;
  end
  begin
    if (a < 0) begin
      // will this branch be taken?
    end
  end
join
When the fork is entered, "a" has the value -1. There are two processes in the fork: one adds 3 to "a", and the other tests whether "a" is less than zero. The order in which these two processes run changes the result. If the adding process runs first, "a" is no longer less than zero and the (a < 0) branch is not taken. If the processes run in the other order, the (a < 0) branch is taken.
SystemVerilog does not specify which process starts first. The order is arbitrary, based on low-level implementation details that are not visible to the user writing SystemVerilog code and running simulations. The processes could start in any order. Let's say, by happy coincidence, that the processes start in exactly the order you were expecting, and the program works. Yay! Check it in, on to the next thing.
Of course, our example is quite trivial. The race condition is quite obvious to anyone who has spent any time writing SystemVerilog. In the real world the code will likely be more complex and the race not so obvious.
Some months later the vendor releases a new version of the simulator and it is installed at your site. In this new version the processes happen to run in the other order, for some unknown (and unknowable) reason. Suddenly there is a failure: code that used to work no longer works. This is difficult to debug. Process ordering is not specified in the code, nor explicitly visible in any debugging tool. Here we can see the problem because we are discussing race conditions; when the failure actually occurs, all you have are the symptoms. Now find the root cause. Good luck.
The fundamental problem is an assumption about the order in which processes execute.
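One way to remove the ordering assumption is to make the dependency explicit. A minimal sketch, using a named event (the name a_updated is invented for illustration):

```systemverilog
int a;
event a_updated;

initial begin
  a = -1;
  fork
    begin
      a += 3;
      -> a_updated;  // announce that "a" has been updated
    end
    begin
      // wait(ev.triggered) succeeds even if the trigger fired earlier
      // in the same time step, so this works regardless of which
      // process the simulator happens to start first.
      wait (a_updated.triggered);
      if (a < 0) begin
        // deterministically NOT taken: "a" is now 2
      end
    end
  join
end
```

Note the use of wait(a_updated.triggered) rather than @a_updated: the latter would reintroduce a race, since a plain event control misses a trigger that fires before the process starts waiting.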
Assumptions are not always bad
Sometimes assumptions are warranted. It's not possible to remove all of them; the testbench must know something about the design. It may be that a reset must occur before applying stimulus so that the device starts from a known state. Perhaps it's OK to make assumptions about the location of registers or the number of peripherals. Perhaps the address phase always precedes the data phase. Maybe, in the device architecture, it is always illegal to write to address 32'h00000000.
If you must make an assumption, it is important to make it explicitly, not inadvertently. Further, you must identify and isolate your assumptions. (More about identifying and isolating assumptions in another blog post.) By making your assumptions explicit you eliminate them as a source of bugs, make them available for review, and make it possible to change them later if necessary. The important thing is to make explicit any assumptions that are required. An explicit assumption is a constraint, and constraints can be identified, documented, and, most importantly, controlled.
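As a sketch of what "explicit and controlled" can look like in practice, here is one of the assumptions above (writes to address 0 are illegal) turned into a checked constraint. The wrapper task and its name are hypothetical; reg_write is the routine from the earlier example.

```systemverilog
// Hypothetical wrapper that documents and enforces an architectural
// assumption in exactly one place.
task checked_reg_write(logic [31:0] addr, logic [15:0] data);
  // Explicit assumption: address 0 is never a legal write target.
  assert (addr != 32'h00000000)
    else $fatal(1, "Assumption violated: write to illegal address 0");
  reg_write(addr, data);
endtask
```

If the architecture later changes and address 0 becomes legal, there is one obvious place to update, and until then any violation fails loudly instead of silently doing the wrong thing.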
Anything that is fixed represents an assumption. Hard-coded addresses, values, and delays are sources of assumptions, as are file locations,
device knowledge, and implicitly resolved race conditions.
Take care to identify and remove assumptions to keep your weekends free for more fun things than rooting out obscure bugs.