## The Lost Art of Cycle Counting

Let’s assume we want a 1/2 second delay inserted into our program (at least as close as we can get given crystal accuracy). We could simply rely on the built-in delay functions, or we could roll our own. So let’s write our own delay function and examine the assembly code with the goal of determining how long it takes to execute.

Starting with an inner loop that takes 5 machine cycles per pass, we can easily determine that it needs to run 1,600,000 times to give us a 1/2 second period (on a 16MHz clock, 1/2 second is 8,000,000 cycles, and 8,000,000 / 5 = 1,600,000). Yikes! Since 1,600,000 is far too large a value to fit into one 8-bit register for loop counting, we’ll nest three loops:

32 x 200 x 250 = 1,600,000.
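The arithmetic behind that factorization can be sketched in a few lines of host-side C. This is only an illustration, not part of the delay code itself: the helper names `iterations_needed` and `nested_product` are made up for this example, and the 16MHz clock is an assumption you would replace with your own board's frequency.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many passes of a loop costing cycles_per_pass
 * cycles are needed to fill delay_ms milliseconds at f_cpu_hz. */
uint32_t iterations_needed(uint32_t f_cpu_hz, uint32_t delay_ms,
                           uint32_t cycles_per_pass)
{
    assert(cycles_per_pass != 0);
    /* 64-bit intermediate so f_cpu_hz * delay_ms cannot overflow. */
    uint64_t total_cycles = (uint64_t)f_cpu_hz * delay_ms / 1000u;
    return (uint32_t)(total_cycles / cycles_per_pass);
}

/* Hypothetical helper: total passes of the innermost loop for three nested
 * counters. LDI loads an 8-bit immediate, so each count is limited to
 * 1..255 here (a count of 0 would give 256 passes with dec/brne, which
 * this sketch does not model). */
uint32_t nested_product(uint32_t outer, uint32_t middle, uint32_t inner)
{
    assert(outer >= 1 && outer <= 255);
    assert(middle >= 1 && middle <= 255);
    assert(inner >= 1 && inner <= 255);
    return outer * middle * inner;
}
```

With these helpers, `iterations_needed(16000000, 500, 5)` and `nested_product(32, 200, 250)` should agree, which is the check we did by hand above.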

So here’s our very basic delay code:

```c
asm volatile (
"    ldi r20, 32  \n\t"  //1 machine cycle
//outer loop
"1:  ldi r21, 200 \n\t"  //1
//middle loop
"2:  ldi r22, 250 \n\t"  //1
//inner loop
"3:  nop          \n\t"  //1
"    nop          \n\t"  //1
"    dec r22      \n\t"  //1
"    brne 3b      \n\t"  //2 on branch
//end inner loop
"    dec r21      \n\t"  //1
"    brne 2b      \n\t"  //2 on branch
//end middle loop
"    dec r20      \n\t"  //1
"    brne 1b      \n\t"  //2 on branch
//end outer loop
::: "r20", "r21", "r22"
);
```

The comments give the number of machine cycles for each instruction. This data can be found in the Atmel 8-bit AVR Instruction Set document. A cursory look at the above code suggests it should take about 8,000,000 cycles to execute:

5 x 250 x 200 x 32 = 8,000,000 cycles

On a 16MHz clock, 8,000,000 cycles is precisely 1/2 second. However, closer examination shows this is not the case: the delay function takes approximately 0.501 seconds to execute, which is 0.001 seconds longer than expected. At 16MHz, 0.001 seconds is about 16,000 cycles, so where did they come from?

We neglected to account for the cycles in the middle and outer loops!
5 x 250 x 200 x 32 = 8,000,000 (inner loop)
4 x 200 x 32 = 25,600 (middle loop)
4 x 32 = 128 (outer loop)
8,000,000 + 25,600 + 128 = 8,025,728 cycles

That accounts for the missing cycles, but it’s still not an exact accounting. It neglects the fact that the BRNE instruction takes a variable number of cycles depending on whether the branch is taken (2 cycles when it branches vs. 1 cycle when it falls through).

So here is the nitty-gritty evaluation of our delay:
1. Load the outer counter: 1 cycle (the other two LDI instructions sit inside the loops and are already part of the 4-cycle overheads below)
2. Inner loop: (5 x 250 - 1) x 200 x 32 = 1,249 x 200 x 32 = 7,993,600 (the final BRNE of each run falls through, saving 1 cycle)
3. Middle loop: (4 x 200 - 1) x 32 = 799 x 32 = 25,568
4. Outer loop: 4 x 32 - 1 = 127
5. Total: 1 + 7,993,600 + 25,568 + 127 = 8,019,296 cycles
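A tally like this is easy to get wrong by a cycle or two, so it can help to cross-check it mechanically. The following is a minimal host-side sketch, not something that runs on the AVR: it walks the same three-loop structure and charges each instruction its cycle cost at the point where it executes, using the costs from the code comments (1 cycle for LDI, NOP and DEC, 2 cycles for a taken BRNE, 1 for a fall-through). The function names are invented for this example. Because every LDI is charged where it runs, the two loads inside the loops count as loop overhead rather than as setup.

```c
#include <assert.h>
#include <stdint.h>

/* Cost of one "dec rN / brne label" pair: DEC is 1 cycle, BRNE is
 * 2 cycles when taken (counter still non-zero) and 1 when it falls through. */
static uint32_t dec_brne(uint8_t *counter)
{
    *counter = (uint8_t)(*counter - 1);
    return 1u + (*counter != 0 ? 2u : 1u);
}

/* Hypothetical simulator for the delay routine's loop structure. */
uint32_t simulate_delay_cycles(uint8_t outer, uint8_t middle, uint8_t inner)
{
    uint32_t cycles = 0;
    uint8_t r20, r21, r22;

    r20 = outer; cycles += 1;                 /*    ldi r20 */
    do {
        r21 = middle; cycles += 1;            /* 1: ldi r21 */
        do {
            r22 = inner; cycles += 1;         /* 2: ldi r22 */
            do {
                cycles += 2;                  /* 3: nop, nop */
                cycles += dec_brne(&r22);     /*    dec r22, brne 3b */
            } while (r22 != 0);
            cycles += dec_brne(&r21);         /*    dec r21, brne 2b */
        } while (r21 != 0);
        cycles += dec_brne(&r20);             /*    dec r20, brne 1b */
    } while (r20 != 0);

    return cycles;
}

/* Convert a cycle count to seconds for a given clock frequency. */
double cycles_to_seconds(uint32_t cycles, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz != 0);
    return (double)cycles / (double)f_cpu_hz;
}
```

Calling `simulate_delay_cycles(32, 200, 250)` and passing the result to `cycles_to_seconds` with a 16,000,000Hz clock works out to roughly 0.5012 seconds. Trying other counter values is a quick way to explore the exercise of tuning the delay closer to 1/2 second.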

For a 16MHz AVR (like an Arduino), a machine cycle is 1/16,000,000 of a second in duration, or 62.5ns, which results in our delay lasting 0.5012 seconds.

So, how does an actual Arduino compare with all this cycle counting? The following screen capture from a Saleae Logic Analyzer shows our delay lasts 0.5042 seconds. I guess my clock crystal is a little slow. We will leave it up to you as an exercise to construct a more accurate 1/2 second delay function.
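If the whole gap between the predicted 0.5012 seconds and the measured 0.5042 seconds is blamed on the clock, as guessed above, the effective clock frequency can be estimated with a short host-side sketch. That is a simplifying assumption: measurement overhead and any code surrounding the delay are ignored, and the helper names are invented for this example.

```c
#include <assert.h>

/* Hypothetical helper: the clock frequency that would make a delay
 * predicted at predicted_s (for nominal_hz) actually last measured_s.
 * Assumes the oscillator is the only source of the discrepancy. */
double effective_clock_hz(double nominal_hz, double predicted_s,
                          double measured_s)
{
    assert(measured_s > 0.0);
    return nominal_hz * predicted_s / measured_s;
}

/* Relative clock error in percent; negative means the clock runs slow. */
double clock_error_percent(double nominal_hz, double effective_hz)
{
    assert(nominal_hz > 0.0);
    return 100.0 * (effective_hz - nominal_hz) / nominal_hz;
}
```

For the figures above, `effective_clock_hz(16e6, 0.5012, 0.5042)` comes out at roughly 15.9MHz, or about 0.6% slow. A delay tuned by cycle counting alone can only be as accurate as that clock.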