The Lost Art of Cycle Counting

Let’s assume we want a 1/2 second delay inserted into our program (at least as close as we can get given crystal accuracy). We could simply rely on the built-in delay functions, or we could roll our own. So let’s write our own delay function and examine the assembly code with the goal of determining how long it takes to execute.

On a 16MHz clock, a ½ second delay is 8,000,000 machine cycles. Starting with an inner loop of 5 machine cycles, that loop would need to execute 1,600,000 times to fill the ½ second period. Yikes! Since 1,600,000 is far too large a value to fit into one 8-bit register for loop counting, we’ll nest three loops:

32 x 200 x 250 = 1,600,000.
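The arithmetic above can be checked with a few lines of C on a desktop machine. This is only a sketch: the names `F_CPU_HZ`, `CYCLES_PER_PASS`, `passes_for_delay` and `counters_ok` are made up for illustration, and the 16MHz clock and 5-cycle inner loop are the values assumed in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values from the article: 16 MHz clock, 5-cycle inner loop. */
#define F_CPU_HZ        16000000UL
#define CYCLES_PER_PASS 5UL

/* Number of inner-loop passes needed to fill a delay of num/den seconds. */
static uint32_t passes_for_delay(uint32_t num, uint32_t den)
{
    assert(den != 0);
    return (uint32_t)(((uint64_t)F_CPU_HZ * num) / den / CYCLES_PER_PASS);
}

/* Nonzero if each counter fits in an 8-bit register (1..255 here)
   and the three counters multiply out to the required pass count. */
static int counters_ok(uint32_t outer, uint32_t middle, uint32_t inner,
                       uint32_t passes)
{
    return outer >= 1 && outer <= 255
        && middle >= 1 && middle <= 255
        && inner >= 1 && inner <= 255
        && outer * middle * inner == passes;
}
```

Calling `passes_for_delay(1, 2)` gives 1,600,000, and `counters_ok(32, 200, 250, 1600000)` confirms that the three counters chosen here fit their registers and multiply out to that figure.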

So here’s our very basic delay code:

asm volatile (
  "    ldi r20, 32  \n\t"  //1 machine cycle
                           //outer loop
  "1:  ldi r21, 200 \n\t"  //1
                           //middle loop
  "2:  ldi r22, 250 \n\t"  //1
                           //inner loop
  "3:  nop          \n\t"  //1
  "    nop          \n\t"  //1
  "    dec r22      \n\t"  //1
  "    brne 3b      \n\t"  //2 on branch
                           //end inner loop
  "    dec r21      \n\t"  //1
  "    brne 2b      \n\t"  //2 on branch
                           //end middle loop
  "    dec r20      \n\t"  //1
  "    brne 1b      \n\t"  //2 on branch
                           //end outer loop
  ::: "r20", "r21", "r22"
);

The comments give the number of machine cycles for each instruction. This data can be found in the Atmel 8-bit AVR Instruction Set document. A cursory look at the above code suggests it should take about 8,000,000 cycles to execute:

5 x 250 x 200 x 32 = 8,000,000 cycles

On a 16MHz clock, 8,000,000 cycles is precisely ½ second. On closer examination, however, this is not the case. The delay function takes approximately 0.501 seconds to execute, which is about 0.001 seconds longer than expected. At 16MHz, 0.001 seconds is about 16,000 machine cycles, so where did they come from?

We neglected to account for the cycles in the middle and outer loops!
5 x 250 x 200 x 32 = 8,000,000 (inner loop)
4 x 200 x 32 = 25,600 (middle loop)
4 x 32 = 128 (outer loop)
8,000,000 + 25,600 + 128 = 8,025,728 cycles

That accounts for the missing cycles, but it’s still not an exact accounting. It neglects the fact that the BRNE instruction uses a variable number of cycles depending upon the result of the comparison (2 cycles when the branch is taken vs. 1 cycle to fall through).

So here is the nitty-gritty evaluation of our delay:
1. Load the registers (3 cycles)
2. Inner loop: 5 x 250 = 1,250 – 1 (branch fall-through) = 1,249 x 200 x 32 = 7,993,600
3. Middle loop: 4 x 200 = 800 – 1 = 799 x 32 = 25,568
4. Outer loop: 4 x 32 = 128 – 1 = 127
5. 7,993,600 + 25,568 + 127 + 3 = 8,019,298 total cycles

For a 16MHz AVR (like an Arduino), a machine cycle is 1/16,000,000 of a second in duration, or 62.5ns, which results in our delay lasting 0.5012 seconds.
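The conversion from cycles to time is a single division. Here is a small sketch in C; the function names are illustrative, and the clock frequency is passed in as a parameter so that a crystal other than the 16MHz one assumed here can be tried.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of one machine cycle in nanoseconds for a given clock. */
static double cycle_time_ns(double f_cpu_hz)
{
    assert(f_cpu_hz > 0.0);
    return 1e9 / f_cpu_hz;
}

/* Time in seconds taken by a given number of machine cycles. */
static double cycles_to_seconds(uint32_t cycles, double f_cpu_hz)
{
    assert(f_cpu_hz > 0.0);
    return (double)cycles / f_cpu_hz;
}
```

With `f_cpu_hz` set to `16000000.0`, `cycle_time_ns` gives 62.5 and `cycles_to_seconds(8019298, 16000000.0)` gives roughly 0.5012.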

So, how does an actual Arduino compare with all this cycle counting? The following screen capture from a Saleae Logic Analyzer shows our delay lasts 0.5042 seconds. I guess my clock crystal is a little slow.

Saleae Capture

We will leave it up to you as an exercise to construct a more accurate ½ second delay function.

About Jim Eli

µC experimenter

1 Response to The Lost Art of Cycle Counting

  1. Christian says:

    Thanks for the article, nice experiment. Just a comment to improve it.

    I think it is much easier, because the 1 fall-through cycle and the 1 load per total loop cancel each other. Or in other words, in the life of the loop there is only one loading of the register and one fall-through.

    Each total loop takes n*(3+x) cycles, where n is the repetitions and x the number of cycles of the operations inside the loop.

    c_outer = 32*(3 + c_middle)
    c_middle = 200*(3 + c_inner)
    c_inner = 250*(3 + 2)

    total time = 32*(200*(250*(2 + 3) + 3) + 3) = 8019296

    This is pretty much the same as your result but off by 2. The 2 come because you count the loading of the registers in the inner and middle loop twice. They are already in the 3 cycles under (1.).

    By the way, please don’t use = if you don’t mean to write an equation; use => or something like that.
