Arduino Inline Assembly Tutorial (Tables)


Often, the fastest way to compute something on an Arduino is to not compute it at all.

Huh?

For example, trigonometric functions are costly operations and can slow your application to a crawl. Often the result is computed with far more precision than the situation needs. Most often you just want the periodic, wave-like characteristics of sine or cosine, which can easily be approximated. With a trigonometric function, it’s easy to substitute a lookup table populated with pre-computed values at discrete steps. If your program can tolerate the loss of precision yet requires as much speed as possible, this alternative is a good option.

The Ivy League Microcontroller

Since the Arduino’s Atmel AVR μC is based on a modified Harvard architecture, data and program instructions are stored in different memories. Program instructions are stored in flash, while data is stored in SRAM. These separate pathways are implemented primarily to enhance performance, but they also prohibit executing program instructions from data memory. Although it may seem paradoxical, data is allowed to be stored inside program memory (see this information on the use of the PROGMEM attribute).
Placing a table in SRAM is simple and shouldn’t present problems for an inline programmer (especially at our stage!). Consequently, in this tutorial we will store a table inside program memory.

Did He Say Frogmen?

Placing the table into program memory is easy. It is accomplished via a C language floating-point array, incorporating the special keyword, “PROGMEM”. PROGMEM instructs the compiler to place this data into flash memory:

static const float PROGMEM SineTable[91] = {
  0.0, 0.017452, 0.034899, 0.052336, 0.069756,
. . .
  0.997564, 0.998630, 0.999391, 0.999848, 1.0
};

Previously, when accessing SRAM (data memory) we used the LDS instruction. However, accessing program memory requires the use of the LPM instruction. LPM is the mnemonic for Load from Program Memory, and it loads a data byte from flash program memory into a register.

Details

The flash program memory is organized as 16-bit words, while the registers and SRAM are organized as 8-bit bytes. The Z-register is used to access program memory: this 16-bit register pair acts as a 16-bit pointer into program memory. The 15 most significant bits select the word address, and the least significant bit selects the low or high byte of that word. Because of this, the word address is multiplied by two before it is put in the Z-register. The good news is that in the code presented below all of these details are transparent.
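The word-to-byte address relationship described above can be sketched in plain C. This is only an illustration: `z_from_word` is a hypothetical helper name, and in the tutorial's code the compiler and linker take care of this arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper (not part of the tutorial code): build the value
   that would go into the Z register pair for LPM from a flash word
   address. Bits 15..1 carry the word address; bit 0 picks the low (0)
   or high (1) byte of that word. Assumes word_addr < 0x8000. */
static uint16_t z_from_word(uint16_t word_addr, int high_byte) {
    return (uint16_t)((word_addr << 1) | (high_byte ? 1u : 0u));
}
```

For instance, flash word 0x0100 occupies byte addresses 0x0200 (low byte) and 0x0201 (high byte), so `z_from_word(0x0100, 1)` evaluates to 0x0201.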

Table Legs

The function below first limits the input value to the range 0 to 90. If the input is out of range, it returns the floating-point Not-A-Number (NAN) value. Otherwise it multiplies the input by 4 to produce a byte offset into our table. We multiply by four because the table is populated with floating-point numbers, each of which is 4 bytes long. The offset is simply added to the (PROGMEM) address of the start of the table. The function finishes by retrieving the 4-byte float value and returning it.

Note, floating point support inside the inline assembler is scarce. In this function we treat the float variable transparently, like any 32-bit variable. We get away with this because we’re not performing any operation on the value.
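To make the "treat the float like any 32-bit variable" idea concrete, here is a small host-side sketch that rebuilds a float from four raw bytes without any floating-point arithmetic. The helper name `float_from_bytes` is made up for illustration, and the sketch assumes IEEE 754 32-bit floats whose byte order matches `uint32_t` (true on the AVR and on typical desktop machines).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: assemble a float from four raw bytes, least
   significant byte first, the same order the lo8/hi8/hlo8/hhi8
   operators split a 32-bit constant. */
static float float_from_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    uint32_t bits = (uint32_t)b0
                  | ((uint32_t)b1 << 8)
                  | ((uint32_t)b2 << 16)
                  | ((uint32_t)b3 << 24);
    float value;
    memcpy(&value, &bits, sizeof value);  /* reinterpret the bits, no conversion */
    return value;
}
```

With these assumptions, the bytes 0x00, 0x00, 0xC0, 0x7F (the pattern 0x7fc00000) come back as NaN, and 0x00, 0x00, 0x80, 0x3F (0x3f800000) come back as 1.0.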

float _Sine(uint16_t angle) {
  float tmp;
  const float *ptr = SineTable; //Z pointer, advanced by the asm

  asm (
    //validate angle >= 0 && angle <= 90
    "cpi  %A1, 90+1 \n" //cpi needs r16-r31, hence "d"
    "cpc  %B1, __zero_reg__ \n"
    "brcc 1f        \n" //out of range

     //calculate table index
    "lsl  %A1       \n" //float is 4 bytes wide
    "rol  %B1       \n" //index = angle * 4
    "lsl  %A1       \n"
    "rol  %B1       \n"

    //add index to start of SineTable
    "add  r30, %A1  \n" 
    "adc  r31, %B1  \n"

    //get sine value (4-bytes)
    "lpm  %A0, Z+   \n" 
    "lpm  %B0, Z+   \n"
    "lpm  %C0, Z+   \n"
    "lpm  %D0, Z    \n"
    "rjmp 2f        \n" //skip the NAN code (no "ret" inside inline asm)
    
    //return NAN
    "1:                 \n" 
    "ldi  %A0, lo8(%3)  \n" //NAN = 0x7fc00000
    "ldi  %B0, hi8(%3)  \n"
    "ldi  %C0, hlo8(%3) \n"
    "ldi  %D0, hhi8(%3) \n"
    "2:                 \n"
    //angle and ptr are changed by the asm, so both are "+" operands
    : "=a" (tmp), "+d" (angle), "+z" (ptr) : "F" (NAN)
  );
  return tmp;
}
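For readers who want to follow the logic without the assembly, here is a host-side model of the same steps in plain C. It is a sketch under stated assumptions: `model_table` is a shortened RAM stand-in holding only the first five SineTable values, and `model_sine` and `MODEL_ENTRIES` are names invented for this illustration. On the AVR itself, avr-libc's pgm_read_float() macro from <avr/pgmspace.h> is the usual C route to the same flash read.

```c
#include <math.h>    /* NAN */
#include <stdint.h>
#include <string.h>

#define MODEL_ENTRIES 5  /* shortened stand-in; the real table has 91 entries */

/* RAM stand-in for the flash table: the first five SineTable values. */
static const float model_table[MODEL_ENTRIES] = {
    0.0f, 0.017452f, 0.034899f, 0.052336f, 0.069756f
};

static float model_sine(uint16_t angle) {
    float result;
    if (angle >= MODEL_ENTRIES)                 /* range check (cpi/cpc/brcc) */
        return NAN;
    uint16_t offset = (uint16_t)(angle << 2);   /* angle * 4 (lsl/rol twice) */
    const uint8_t *p = (const uint8_t *)model_table + offset; /* add/adc on Z */
    memcpy(&result, p, sizeof result);          /* four byte loads (lpm x4) */
    return result;
}
```

Calling `model_sine(2)` returns the third table entry, while any angle past the end of the stand-in table returns NaN, mirroring the out-of-range branch.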

The Full Table

#include <avr/pgmspace.h>

//max error ~0.017452 [91*4=364 bytes]
static const float PROGMEM SineTable[91] = {
  0.0, 0.017452, 0.034899, 0.052336, 0.069756, 0.087156, 
  0.104528, 0.121869, 0.139173, 0.156434, 0.173648, 0.190809, 
  0.207912, 0.224951, 0.241922, 0.258819, 0.275637, 0.292372, 
  0.309017, 0.325568, 0.34202, 0.358368, 0.374607, 0.390731, 
  0.406737, 0.422618, 0.438371, 0.45399, 0.469472, 0.48481, 
  0.5, 0.515038, 0.529919, 0.544639, 0.559193, 0.573576, 
  0.587785, 0.601815, 0.615661, 0.62932, 0.642788, 0.656059, 
  0.669131, 0.681998, 0.694658, 0.707107, 0.71934, 0.731354, 
  0.743145, 0.75471, 0.766044, 0.777146, 0.788011, 0.798636, 
  0.809017, 0.819152, 0.829038, 0.838671, 0.848048, 0.857167, 
  0.866025, 0.87462, 0.882948, 0.891007, 0.898794, 0.906308, 
  0.913545, 0.920505, 0.927184, 0.93358, 0.939693, 0.945519, 
  0.951057, 0.956305, 0.961262, 0.965926, 0.970296, 0.97437, 
  0.978148, 0.981627, 0.984808, 0.987688, 0.990268, 0.992546, 
  0.994522, 0.996195, 0.997564, 0.99863, 0.999391, 0.999848, 1.0
};

What Are You Doing For Me?

After a cursory comparison test between the table-based _Sine() function and the Arduino floating-point sin() function, we can draw some basic conclusions. Even though the table itself consumes 364 bytes of flash (91 x 4 = 364), on top of the function code, the Arduino library sin() function (and its required floating-point support) uses approximately 900 bytes more flash memory.

However, saving space wasn’t the goal of this exercise; speed was the primary concern. Comparing 1,000 calls to both functions yielded an average duration of 121.7 µs per sin() call vs. 2.92 µs per _Sine() call. One final but obvious concern is the precision of the result, which you will need to evaluate to determine whether it is sufficient for your application.
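One rough way to put a number on the precision question: if a fractional angle is truncated to the whole degree below it, the result can be off by at most the largest gap between neighboring table entries. The sketch below measures that gap. `max_step` is a hypothetical helper, and it reads an ordinary RAM array, so it is meant for a desktop check rather than for the flash-resident table.

```c
#include <stddef.h>

/* Hypothetical helper: largest increase between neighboring entries of
   an ascending table. Returns 0 for tables with fewer than two entries. */
static float max_step(const float *table, size_t count) {
    float worst = 0.0f;
    for (size_t i = 1; i < count; i++) {
        float step = table[i] - table[i - 1];
        if (step > worst)
            worst = step;
    }
    return worst;
}
```

For a sine table in whole degrees the largest gap sits at the start, between the entries for 0 and 1 degree, which is where the 0.017452 figure noted alongside the full table comes from.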

Bigger, Better, More

Various modifications can expand and improve the accuracy of the table code, but are beyond the scope of this tutorial. However, here are some basic ideas.

The obvious method is to expand the table to decrease the step interval. Another technique is to incorporate interpolation, similar to the following sketch:

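float _iSine(float angle) {
  //assumes 0 <= angle <= 90
  uint16_t x1 = (uint16_t)angle; //whole degrees (floor)
  if (x1 >= 90)
    return _Sine(x1); //no upper neighbor to interpolate with
  float y1 = _Sine(x1);
  float y2 = _Sine(x1 + 1);
  return y1 + ( y2 - y1 ) * ( angle - x1 );
}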

For full 0-360 angle coverage, do something like:

float Sine(uint16_t i) {
  while (i > 359)
    i -= 360;

  if (i <= 90)
    return _iSine(i);
  else if (i <= 180)
    return _iSine(180 - i); //sin(x) = sin(180 - x)
  else if (i <= 270)
    return (-1*_iSine(i - 180)); //sin(x) = -sin(x - 180)
  else
    return (-1*_iSine(360 - i)); //sin(x) = -sin(360 - x)
}
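The quadrant symmetry used for full-circle coverage can be checked on its own, separately from the table. The sketch below reduces any whole-degree angle to a first-quadrant index plus a sign. `FoldedAngle` and `fold_angle` are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: a first-quadrant table index and a sign. */
typedef struct {
    uint8_t index;  /* 0..90, usable as a sine table index */
    int8_t  sign;   /* +1 or -1 */
} FoldedAngle;

static FoldedAngle fold_angle(uint16_t deg) {
    FoldedAngle f;
    deg %= 360;                      /* wrap to 0..359 */
    f.sign = (deg > 180) ? -1 : 1;   /* sine is negative for 181..359 */
    if (deg > 180)
        deg -= 180;                  /* sin(x) = -sin(x - 180) */
    if (deg > 90)
        deg = 180 - deg;             /* sin(x) = sin(180 - x) */
    f.index = (uint8_t)deg;
    return f;
}
```

As a spot check, 91 degrees folds to index 89 with a positive sign, and 270 degrees folds to index 90 with a negative sign, matching sin(91°) = sin(89°) and sin(270°) = -1.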

Another easy expansion with the sine table is to calculate cosine and tangent values:

float _Cosine(uint16_t a) {
  return _Sine( 90 - a ); //cos(a) = sin(90 - a), valid for 0-90
}
 
float _Tangent(uint16_t a) {
  return ( _Sine(a) / _Cosine(a) ); //division by zero at a = 90
}

References

AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Further information on addressing modes can be found in Section 2 of the AVR Instruction Set Manual
AVR108: Setup and Use of the LPM Instruction
Sine Lookup Table Generator


Code (error) updated: 1/25/2017

6 Responses to Arduino Inline Assembly Tutorial (Tables)

hoda says:

January 24, 2017 at 1:39 pm

hello again:)
I am running this code in arduino and when I try to Serial.printIn() the value of sin after it’s calculated I get the error message : register number above 15 required.
which is very odd for me since I copy pasted your code and no register below 15 is being used. Do you perhaps know the problem?

- Jim Eli says:
  
  January 24, 2017 at 5:41 pm
  Something odd is occurring here with the Serial code. You are correct, my inline assembly code is valid and compiles just fine by itself. But when Serial print is added it chokes. I will need to investigate and get back to you.
  
  While you wait, here is some working test code I used:
```
volatile float x;
volatile uint32_t t0, t1, t2;

void setup( void ) {
  uint16_t i,j;
  float k;

  Serial.begin(9600);
  delay(1000);
  
  t0 = millis();

  for (j=0; j<1000; j++)
    for (i=0; i<91; i++)
      x = _Sine(i);

  t1 = millis();      

  for (j=0; j<1000; j++) 
    for (k=0; k<91; k++) 
      x = sin(k);

  t2 = millis();
  Serial.print("_Sine: "); Serial.println(t1 - t0);
  Serial.print("sin: "); Serial.println(t2 - t1);
}

void loop( void ) { }
```
hoda says:

January 25, 2017 at 1:12 am

Thank you! I am looking forward!

- Jim Eli says:
  
  January 25, 2017 at 10:33 am
  hoda,
  
  I’ve updated the code to reflect a small change in the registers used by the _Sine function. It should work properly now.
  See this line:
```
    : "=a" (tmp) : "r" (angle), "z" (SineTable), "F" (NAN)
```
John Bandy says:

April 23, 2019 at 12:42 pm

In addition to changing the “=r” in the _Sine function to “=a”, I had to change the ret before the NaN processing code to an rjmp to a label just after the NaN code, in order for the function to return the right value in tmp.

- Jim Eli says:
  
  April 24, 2019 at 9:41 am
  
  @John Bandy the “rjmp” to a label would be the proper method to code this, especially since we don’t know what the optimizer might do. However, the example ran fine when I compiled it because the float value is placed into the proper registers for return even if the “return tmp” statement is skipped.
