My Cup Overflows


When performing math (even basic addition and subtraction) with signed numbers, an overflow problem sometimes arises. The Arduino microcontroller indicates an overflow error by setting the overflow flag in the status register (SREG). Here’s a demonstration of the overflow problem with a simple addition operation:

volatile int8_t n1=0x70; //112
volatile int8_t n2=0x35; //53
volatile int8_t answer;

void setup() {
  Serial.begin(9600);
  
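  asm(
    "mov %0, %1 \n" //copy n1 into the result register
    "add %0, %2 \n" //result += n2 (sets the V flag on signed overflow)
  
    : "=&r" (answer) : "r" (n1), "r" (n2)
  );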
  
  Serial.print("answer = "); Serial.println(answer);
}

The result of the above addition is answer = -91, or 0xA5 hexadecimal. That’s wrong! The true sum, 165, is larger than a signed 8-bit register can hold.

The largest “signed 8-bit number” is +127, or 0x7f hexadecimal. However, the Status Register Overflow Flag (V flag) was set during this addition to warn us that the result is erroneous. It’s completely up to us, the programmers, to deal with this issue.

What’s Your Sign?

In “8-bit signed number” operations, the overflow flag is set when either of the following two conditions occur:

• There is a carry from bit 6 to bit 7, but no carry out of bit 7 (C flag not set).
• There is a carry out of bit 7 (C flag set), but no carry from bit 6 to bit 7.

I bring these two cases to your attention because the carry flag alone does not reveal a signed overflow. For example, when adding -2 (0xFE) and -128 (0x80), there is a carry out of bit 7 but no carry from bit 6 to bit 7, and the result becomes 0x7E (+126), which again is incorrect.

When adding two numbers with different signs, the absolute value of the result can never exceed the larger absolute value of the two operands. In this case, an overflow is impossible.

Therefore, overflow is only possible when adding two numbers with the same sign. Furthermore, when adding two “same-signed numbers”, the sign of a valid result must match the sign of the operands. The conclusion here is: for signed addition, if the overflow flag is set, the result is invalid; for unsigned addition, if the carry flag is set, the result is invalid. In signed number operations, an overflow corrupts the result and inverts the sign bit.
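The two flag rules above can be checked on a desktop compiler with a small model. This is only a sketch in plain C, not AVR code: the names add8, add8_result and as_signed8 are made up for illustration, and the flags are computed by hand rather than read from the SREG.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  uint8_t sum;    // low 8 bits of the result
  bool carry;     // models the C flag: carry out of bit 7
  bool overflow;  // models the V flag: signed result does not fit
} add8_result;

// Interpret an 8-bit pattern as a two's complement signed value.
static int as_signed8(uint8_t v) {
  return (v & 0x80) ? (int)v - 256 : (int)v;
}

// Add two 8-bit values and derive C and V the way the rules above describe.
static add8_result add8(uint8_t a, uint8_t b) {
  add8_result r;
  unsigned wide = (unsigned)a + (unsigned)b;
  // carry from bit 6 into bit 7: add the low 7 bits only
  bool carry_into_7 = (((a & 0x7f) + (b & 0x7f)) & 0x80) != 0;
  r.sum = (uint8_t)wide;
  r.carry = (wide & 0x100) != 0;
  // overflow when exactly one of the two carries occurred
  r.overflow = carry_into_7 != r.carry;
  return r;
}
```

Calling add8(0x70, 0x35) models the first example (overflow set, carry clear), while add8(0xFE, 0x80) models the negative-number case (both carry and overflow set).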

See my tutorial on Arduino Inline Assembly Math here.


Arduino Inline Assembly Port & Pin Compendium


The following is a compendium of inline assembly functions dealing with ports and pins. Use these at your own risk. These functions have been trimmed of most bounds checking, so they can easily be abused. The Arduino Inline Assembly Tutorial explains most of the details starting here.

analogWrite

This inline code writes an analog value (in the form of a PWM wave) to a particular pin. After executing, the pin will generate a steady square wave of the specified duty cycle until the next call (or call to digitalRead() or digitalWrite() on the same pin). The frequency of the PWM signal on most pins is approximately 490 Hz. On the Uno and similar boards, pins 5 and 6 have a frequency of approximately 980 Hz. On Arduino boards with the ATmega168/328, this function works on pins 3, 5, 6, 9, 10, and 11. The analogWrite function has nothing to do with the analog pins or the analogRead function.

A pinMode() call is included inside this function, so there is no need to set the pin as an output before executing this code.

This no-frills version of analogWrite saves ~542 bytes over the built-in function:

//analogWrite requires a PWM pin 
//PWM pin/timer table:
//3:  (TIMER2B) PD3/TCCR2A/COM2B1/OCR2B
//5:  (TIMER0B) PD5/TCCR0A/COM0B1/OCR0B
//6:  (TIMER0A) PD6/TCCR0A/COM0A1/OCR0A
//9:  (TIMER1A) PB1/TCCR1A/COM1A1/OCR1A
//10: (TIMER1B) PB2/TCCR1A/COM1B1/OCR1B
//11: (TIMER2A) PB3/TCCR2A/COM2A1/OCR2A
//set below 6 defines per above table
#define ANALOG_PORT         PORTB
#define ANALOG_PIN          PORTB3
#define ANALOG_DDR          DDRB
#define TIMER_REG           TCCR2A
#define COMPARE_OUTPUT_MODE COM2A1
#define COMPARE_OUTPUT_REG  OCR2A

volatile uint8_t val = 128; //0-255

  asm (
    "sbi  %0, %1   \n" //DDR set to output (pinMode)

    "cpi  %6, 0    \n" //if full low (0)
    "breq _SetLow  \n"
    "cpi  %6, 0xff \n" //if full high (0xff)
    "brne _SetPWM  \n"

    "sbi  %2, %1   \n" //set high
    "rjmp _SkipPWM \n"

  "_SetLow:        \n"
    "cbi  %2, %1   \n" //set low
    "rjmp _SkipPWM \n"

  "_SetPWM:        \n"
    "ld   r24, X   \n"
    "ori  r24, %3  \n"
    "st   X, r24   \n" //connect pwm pin timer# & channel
    "st   Z, %6    \n" //set pwm duty cycle (val)

  "_SkipPWM:       \n"
    : : "I" (_SFR_IO_ADDR(ANALOG_DDR)), "I" (ANALOG_PIN),
    "I" (_SFR_IO_ADDR(ANALOG_PORT)), "M" (_BV(COMPARE_OUTPUT_MODE)),
    "x" (_SFR_MEM_ADDR(TIMER_REG)), "z" (_SFR_MEM_ADDR(COMPARE_OUTPUT_REG)),
    "d" (val) //cpi needs an upper register (r16-r31)
    : "r24"
  );

analogRead

The Arduino board contains a 6 channel, 10-bit analog to digital converter, which is the brains beneath the analogRead function. It maps input voltages between 0 and 5 volts into integer values between 0 and 1023, yielding a resolution of 5/1024 volts, or about 0.0049 volts (4.9 mV), per unit. The input range and resolution can be changed through the ANALOG_V_REF define. This code reads the value from the specified analog channel (0-7), which corresponds to the analog pin number (note, do NOT use A0-A7 for the channel number in this code). Further information about the underlying ADC can be found here.
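To make the resolution arithmetic concrete, here is a minimal sketch in plain C that converts a raw reading back to millivolts. The function name adc_to_millivolts and the vref_mv parameter are hypothetical; the 5000 mV figure assumes the default 5 volt reference described above.

```c
#include <stdint.h>

// Convert a 10-bit ADC reading (0-1023) to millivolts.
// vref_mv is the reference voltage in millivolts (assumed 5000 by default).
static uint32_t adc_to_millivolts(uint16_t reading, uint32_t vref_mv) {
  // each unit is vref/1024; multiply first to keep integer precision
  return ((uint32_t)reading * vref_mv) / 1024u;
}
```

With a 5000 mV reference, one unit comes out as 4 mV after integer truncation, which matches the roughly 4.9 mV per unit quoted above.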

While this version of analogRead (aRead) saves a few bytes (~50), it also gives the option of changing the speed via the ADC prescaler. However, don’t arbitrarily change the prescale without understanding the consequences. ATMEL advises that the slowest prescale (PS128) should be used. A higher speed (smaller prescale) reduces the accuracy of the AD conversion. The arduino sets the prescale to 128 during initialization, just as the code below does.

//Define various ADC prescales (ADC clock shown for a 16MHz system clock)
#define PS2   (1<<ADPS0)                             //8000kHz ADC clock freq
#define PS4   (1<<ADPS1)                             //4000kHz
#define PS8   ((1<<ADPS0) | (1<<ADPS1))              //2000kHz
#define PS16  (1<<ADPS2)                             //1000kHz
#define PS32  ((1<<ADPS2) | (1<<ADPS0))              //500kHz
#define PS64  ((1<<ADPS2) | (1<<ADPS1))              //250kHz
#define PS128 ((1<<ADPS2) | (1<<ADPS1) | (1<<ADPS0)) //125kHz
#define ANALOG_V_REF     DEFAULT //INTERNAL, EXTERNAL, or DEFAULT
#define ADC_PRESCALE     PS128   //PS16, PS32, PS64 or PS128 (default)

uint16_t aRead(uint8_t channel) {
  uint16_t result;
  
  asm volatile (
    "andi %1, 0x07    \n" //force pin==0 thru 7
    "ori  %1, (%6<<6) \n" //(pin | ADC Vref)
    "sts  %2, %1      \n" //set ADMUX

    "lds  r18, %3             \n" //get ADCSRA
    "andi r18, 0xf8           \n" //clear prescale bits
    "ori  r18, ((1<<%5) | %7) \n" //(new prescale | ADSC)
    "sts  %3, r18             \n" //set ADCSRA

    "1:           \n" //loop until ADSC cleared
    "lds  r18, %3 \n"
    "sbrc r18, %5 \n"
    "rjmp 1b      \n"

    "lds  %A0, %4   \n" //result = ADCL 
    "lds  %B0, %4+1 \n" //ADCH

    //channel is modified (andi/ori need r16-r31), so it is a "+d" operand
    : "=r" (result), "+d" (channel) : "M" (_SFR_MEM_ADDR(ADMUX)),
    "M" (_SFR_MEM_ADDR(ADCSRA)), "M" (_SFR_MEM_ADDR(ADCL)),
    "I" (ADSC), "I" (ANALOG_V_REF), "M" (ADC_PRESCALE)
    : "r18"
  );
  
  return result;
}
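The clock frequencies in the prescale comments follow directly from the system clock and the three ADPS bits. Here is a small sketch of that arithmetic in plain C; adc_clock_hz is a made-up helper, and the 16 MHz figure is an assumption (the usual Uno clock), not something the listing itself states.

```c
#include <stdint.h>

// ADC clock for a given system clock and ADPS2:0 value (0-7).
// The division factor is 2^ADPS, except that ADPS=0 also divides by 2.
static uint32_t adc_clock_hz(uint32_t f_cpu_hz, uint8_t adps_bits) {
  adps_bits &= 0x07;  // only three prescale bits exist
  uint32_t divisor = (adps_bits == 0) ? 2u : (1u << adps_bits);
  return f_cpu_hz / divisor;
}
```

For example, adc_clock_hz(16000000, 7) corresponds to PS128, and adc_clock_hz(16000000, 4) corresponds to PS16.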

pinMode(OUTPUT)

The arduino pinMode function configures pin behavior. The code presented from here on has been previously explained in the Arduino Inline Tutorial Series.

asm (
  "sbi %0, %1 \n" //1=OUTPUT
    : : "I" (_SFR_IO_ADDR(DDRB)), "I" (DDB5)
);

pinMode (INPUT PULLUP)

asm (
  "cbi %0, %2 \n"
  "sbi %1, %2 \n"
    : : "I" (_SFR_IO_ADDR(DDRB)), "I" (_SFR_IO_ADDR(PORTB)), "I" (DDB5)
);

pinMode (INPUT)

asm (
  "cbi %0, %2 \n"
  "cbi %1, %2 \n"
    : : "I" (_SFR_IO_ADDR(DDRB)), "I" (_SFR_IO_ADDR(PORTB)), "I" (DDB5)
);

pinMode with Multiple Pins

#define PIN_DIRECTION 0b00101000 //PB3 & PB5 (digital pins 11 & 13) OUTPUT
//#define PIN_DIRECTION ((1<<DDB3) | (1<<DDB5))
asm (
  "out %0, %1 \n"
  : : "I" (_SFR_IO_ADDR(DDRB)), "r" (PIN_DIRECTION)
);

digitalWrite HIGH

If a pin has been configured as an OUTPUT, its voltage will be set to the corresponding value: 5V (or 3.3V on 3.3V boards) for HIGH, 0V (ground) for LOW. However, if the pin is configured as an INPUT, digitalWrite enables (HIGH) or disables (LOW) the internal pullup on the input pin.

asm (
  "sbi %0, %1 \n"
  : : "I" (_SFR_IO_ADDR(PORTB)),"I" (PORTB5)
);

digitalWrite LOW

asm (
  "cbi %0, %1 \n"
  : : "I" (_SFR_IO_ADDR(PORTB)), "I" (PORTB5) 
);

digitalWrite(output)

volatile uint8_t output = HIGH; //LOW or HIGH
asm (
  "cpi %2, 0     \n"
  "breq 1f       \n"
  "sbi %0, %1    \n"
  "rjmp 2f       \n"
  "1: cbi %0, %1 \n"
  "2:            \n"
  : : "I" (_SFR_IO_ADDR(PORTB)), "I" (PORTB5), "d" (output) //cpi needs r16-r31
);

digitalToggle

Try to find this one in the Arduino wiring code:

//toggle pin
asm (
  "in r24, %0  \n"
  "eor r24, %1 \n"
  "out %0, r24 \n"
  : : "I" (_SFR_IO_ADDR(PORTB)), "r" ((uint8_t)_BV(PORTB5)) : "r24"
);

digitalRead

digitalRead simply reads the value from a specified digital pin, either HIGH or LOW.

volatile uint8_t status;
 
asm (
  "in __tmp_reg__, __SREG__  \n"
  "cli                       \n"                     
  "ldi %0, 1                 \n" //high 
  "sbis %1, %2               \n" //skip next if pin high
  "clr %0                    \n" //low
  "out __SREG__, __tmp_reg__ \n"
  : "=d" (status) : "I" (_SFR_IO_ADDR(PINB)), "I" (PINB5) //ldi needs r16-r31
);

digitalRead Alternative

This is a generic alternative, which can be called programmatically. Note it must be called using a pointer to the PIN (&PINB), otherwise the compiler emits incorrect code:

//call like so:
//uint8_t status = dRead(&PINB, PINB5);

__attribute__ ((noinline)) uint8_t dRead(volatile uint8_t *port, uint8_t pin) {
  uint8_t result, mask=1;

  asm (
    "movw  r30, %1 \n" //port reg addr in Z
  "1:              \n"
    "cpi  %2, 0    \n" //loop until pin==0
    "breq 2f       \n" //leave loop
    "lsl  %3       \n" //shift (mask) left 1 position
    "dec  %2       \n" //decrement loop counter
    "rjmp 1b       \n" //repeat
  "2:              \n"
    "in   __tmp_reg__, __SREG__ \n" //preserve sreg
    "cli           \n" //disable interrupts
    "ld   r18, Z   \n" //fetch port data
    "and  r18, %3  \n" //compare pin with mask
    "ldi  %0, 1    \n" //set return high
    "brne 3f       \n" 
    "clr  %0       \n" //set return low
  "3:              \n"
    "out  __SREG__, __tmp_reg__ \n"
    : "=&d" (result) : "r" (port), "a" (pin), "r" (mask) : "r18", "r30", "r31" //ldi needs r16-r31
  );

  return result;
}

Example of turning off PWM for arduino digital pin #11

//digital PWM pin registers:
//3:  (TIMER2B) PD3/TCCR2A/COM2B1/OCR2B
//5:  (TIMER0B) PD5/TCCR0A/COM0B1/OCR0B
//6:  (TIMER0A) PD6/TCCR0A/COM0A1/OCR0A
//9:  (TIMER1A) PB1/TCCR1A/COM1A1/OCR1A
//10: (TIMER1B) PB2/TCCR1A/COM1B1/OCR1B
//11: (TIMER2A) PB3/TCCR2A/COM2A1/OCR2A

asm (
  "ld  r16, Z \n"
  "ldi r17, 0xff \n"
  "eor r17, %1 \n"
  "and r16, r17 \n"
  "st  Z, r16 \n"
  : : "z" (_SFR_MEM_ADDR(TCCR2A)), "d" ((uint8_t)_BV(COM2A1)) : "r16", "r17" //bit mask, not bit number
);
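The instruction sequence above is the usual "clear one bit" idiom: build a mask, invert it by XOR-ing with 0xff, then AND it into the register. Here is the same logic as a desktop-testable sketch; clear_bit is a hypothetical helper, and the register is just a byte value rather than the real TCCR2A.

```c
#include <stdint.h>

// Clear a single bit in an 8-bit register value.
// 'bit' is a bit number (0-7), such as COM2A1 (bit 7 of TCCR2A).
static uint8_t clear_bit(uint8_t reg, uint8_t bit) {
  uint8_t mask = (uint8_t)(1u << bit);         // same idea as _BV(bit)
  uint8_t inverted = (uint8_t)(mask ^ 0xffu);  // ldi 0xff + eor
  return (uint8_t)(reg & inverted);            // and
}
```

Note that the mask must be built from the bit number first; AND-ing with the inverted bit number itself would clear the wrong bits.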

Also available as a book, with greatly expanded coverage!


Arduino Inline Assembly Tutorial (Examples)


As the final tutorial in this series, we present four example inline assembly functions for the arduino. Specifically, these cover the conversion of a byte to a hexadecimal string, SPI Mode 0 hardware transfer, SPI Mode 0 Bit-banging, and the C library atoi function. Do not take these functions as archetypical examples of high-quality coding practice or brilliantly efficient inline code. They are neither.

Most of the previous examples in this series were simple “snippets of code”, and as such gave a myopic view of inline assembly. The goal here is to show complete and working demonstrations of how to include inline assembly into the typical arduino program. Each example includes explanatory comments covering the key portions of code.

In addition to these examples, have a look at the Arduino Inline Assembly Blink Program.

Stringing Hexadecimals

The following code converts a byte value into a hexadecimal string. Notice at the start of the code, that the constraint #0 value (val) is temporarily saved in the r25 register. The function then converts the first nibble. When the conversion process is complete, the function loops back and converts the second nibble. Note how the code uses the SREG T-bit to flag the first vs. second nibble.

void ByteToHexStr(uint8_t val, char *str) {
  asm volatile (
    "set           \n" //flag first nibble
    "mov r25, %0   \n" //save val
    "swap %0       \n" //swap so the upper nibble converts first
  "1:              \n"
    "andi %0, 0xf  \n" //mask a nibble
    "cpi  %0, 0xa  \n" //>=10?
    "brcc 2f       \n" //yes
    "subi %0, 0xd0 \n" //convert numeral (0-9), same as adding '0'
    "rjmp 3f       \n" //skip next
  "2:              \n"
    "subi %0, 0xc9 \n" //convert letter (A-F), same as adding 'A'-10
  "3:              \n"
    "st Z+, %0     \n" //put into string
    "brtc 4f       \n" //second nibble done?
    "clt           \n" //clear nibble flag
    "mov %0, r25   \n" //get lower nibble
    "rjmp 1b       \n" //repeat conversion
  "4:              \n" //exit
    //val and str are modified, so they are read-write operands
    : "+d" (val), "+z" (str) : : "r25", "memory"
  );
}
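For comparison, the same nibble-by-nibble conversion can be written in plain C. This is only a reference sketch for checking results on a desktop compiler; the names nibble_to_hex and byte_to_hex are made up. Like the inline version, it writes exactly two characters and does not append a terminating null, so the caller is assumed to handle that.

```c
#include <stdint.h>

// Convert the low nibble of a value to an uppercase hex character.
static char nibble_to_hex(uint8_t nibble) {
  nibble &= 0x0f;
  return (char)(nibble < 10 ? '0' + nibble : 'A' + (nibble - 10));
}

// Write the two hex characters for val into str (upper nibble first).
// No terminating null is written.
static void byte_to_hex(uint8_t val, char *str) {
  str[0] = nibble_to_hex((uint8_t)(val >> 4));
  str[1] = nibble_to_hex(val);
}
```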

I SPI With My Little Eye…

Serial Peripheral Interface (SPI) is a synchronous serial data protocol used by microcontrollers for communicating with one or more peripheral devices, or for communication between two microcontrollers. The SPI standard is loose and each device implements it a little differently, which means you must pay close attention to the device’s datasheet when implementing the protocol. Generally speaking, there are four modes of transmission, defined by the clock phase and polarity.

Here are two versions of the SPI transfer function. The first of these programs incorporates the arduino hardware SPI. The second is a bit-bang version using different pins. More information on SPI can be found here and here.
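As a quick reference for the four modes, the conventional numbering packs clock polarity (CPOL) and clock phase (CPHA) into two bits. The sketch below shows that mapping in plain C; spi_mode_bits and spi_mode_to_bits are illustrative names, not part of any Arduino library.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  bool cpol;  // clock polarity: true means the clock idles high
  bool cpha;  // clock phase: true means data is sampled on the second edge
} spi_mode_bits;

// Split a conventional SPI mode number (0-3) into CPOL and CPHA.
static spi_mode_bits spi_mode_to_bits(uint8_t mode) {
  spi_mode_bits m;
  m.cpol = (mode & 0x02) != 0;
  m.cpha = (mode & 0x01) != 0;
  return m;
}
```

Mode 0, used by both transfer functions here, is the case where both bits are clear.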

SPI Mode 0 Hardware Transfer

static __attribute__ ((noinline)) uint8_t SpiXfer(uint8_t data) {
  asm volatile (
    "out  %1, %0          \n" //put data out SPDR register
    "nop                  \n" //pause
  "1:                     \n"
    "in   __tmp_reg__, %2 \n" //check xmit complete
    "sbrs __tmp_reg__, %3 \n"
    "rjmp 1b              \n"
    "in   %0, %1          \n" //get incoming data
    : "+r" (data) : "M" (_SFR_IO_ADDR(SPDR)),
    "M" (_SFR_IO_ADDR(SPSR)), "I" (SPIF)
  );

  return data;
}

SPI Bit-Bang

#define MOSI_PORT  PORTD
#define MOSI_BIT   PORTD5
#define MISO_PORT  PIND
#define MISO_BIT   PIND6
#define CLOCK_PORT PORTD
#define CLOCK_BIT  PORTD7

static __attribute__ ((noinline)) uint8_t SpiBitBang(uint8_t data) {
  register uint8_t tmp, i=8;
  
  //save and restore sreg because t-bit is utilized
  asm volatile (
    "in __tmp_reg__, __SREG__ \n"
  "1:               \n"
    "sbrs %0, 0x07  \n" //is output data bit high?
    "rjmp 2f        \n" //no
    "sbi  %3, %4    \n" //output a high bit
    "rjmp 3f        \n"
  "2:               \n"
    "cbi  %3, %4    \n" //output a low bit
  "3:               \n"
    "lsl  %0        \n" //shift to next bit
    "in   %1, %5    \n" //get input
    "tst  %1        \n" //anything here?
    "breq 4f        \n" //nope
    "bst  %1, %6    \n" //set t-bit if input bit is high
    "clr  %1        \n" //zeroize register
    "bld  %1, 0     \n" //set bit 0
    "or   %0, %1    \n" //or low bit with data for return value
  "4:               \n"
    "sbi  %7, %8    \n" //toggle clock bit high
    "nop            \n" //pause
    "cbi  %7, %8    \n" //toggle clock bit low
    "subi %2, 1     \n" //more bits?
    "brne 1b        \n" //do next bit
    "out __SREG__, __tmp_reg__ \n"
    : "+r" (data), "=&r" (tmp): "a" (i),
    "M" (_SFR_IO_ADDR(MOSI_PORT)), "I" (MOSI_BIT),
    "M" (_SFR_IO_ADDR(MISO_PORT)), "I" (MISO_BIT),
    "M" (_SFR_IO_ADDR(CLOCK_PORT)),  "I" (CLOCK_BIT)
  );

  return data;
}
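The shifting scheme in the bit-bang routine (send the top bit, shift left, OR the received bit into bit 0) can be modeled without any hardware. The following plain C sketch simulates one 8-bit exchange; spi_shift_model and its parameters are hypothetical, with miso_bits standing in for whatever the peripheral would drive on the MISO line.

```c
#include <stdint.h>

// Model of an MSB-first 8-bit SPI exchange using a single shift register.
// data:      byte to send
// miso_bits: byte the (imaginary) peripheral sends back, MSB first
// mosi_seen: receives the bit stream that appeared on MOSI
// returns:   the byte received
static uint8_t spi_shift_model(uint8_t data, uint8_t miso_bits,
                               uint8_t *mosi_seen) {
  uint8_t sent = 0;
  for (uint8_t i = 0; i < 8; i++) {
    sent = (uint8_t)((sent << 1) | (data >> 7));  // output the top bit
    data = (uint8_t)(data << 1);                  // shift to next bit
    data |= (uint8_t)(miso_bits >> 7);            // input bit into bit 0
    miso_bits = (uint8_t)(miso_bits << 1);
  }
  *mosi_seen = sent;
  return data;
}
```

After eight clocks the outgoing byte has been shifted out completely and the same register holds the incoming byte, which is why the routine can return data directly.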

A Toy

Atoi is a function in the C programming language that converts a string into an integer numerical representation (atoi stands for ASCII to integer). It is declared in the C standard library header file stdlib.h, and is prototyped as follows:

int atoi(const char *str);

The str argument is a string, represented by an array of characters, containing the characters of a signed integer number. The string must be null-terminated.

Here is the basic idea of the atoi function implemented in C language:

int16_t atoi(char s[]) {
  uint8_t i, sign;
  int16_t n;
  
  //skip white space
  for (i=0; s[i]!='\0' && s[i]<=' '; i++); //stop at the terminating null
  
  //sign
  sign = 0;
  if (s[i] == '-') {
    sign = 1;
    i++;
  }
  
  //convert
  for (n=0; s[i]>='0' && s[i]<='9'; i++)
    n = 10*n + s[i] - '0';
  
  if (sign)
    return (-1*n);
  else
    return n;
}

Atoi Inline

Here is our implementation, which is only 64 bytes in length. By comparison, the arduino AVR libc atoi() function is 76 bytes long. This version is functionally equivalent for the most part, but a few details differ (for example, this function steps over all leading ASCII characters 0x2F and below, not just whitespace):

int16_t _atoi(const char *s) {
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wuninitialized"
  //c is loaded inside the inline asm code; sign must start out FALSE
  register uint8_t sign = 0, c;
#pragma GCC diagnostic pop
  //force result into return registers
  register int16_t result asm("r24"); 
  
  asm (
    "ldi  %A0, 0x00         \n" //result = 0
    "ldi  %B0, 0x00         \n"

  "1:                       \n"
    "ld   %2, Z+            \n" //fetch char
    "cpi  %2, '-'           \n" //negative sign?
    "brne 2f                \n"
    "ldi  %3, 0x01          \n" //sign = TRUE

  "2:                       \n"
    "cpi  %2, '/' + 1       \n" //step over whitespace/garbage
    "brcc 3f                \n"
    "rjmp 1b                \n"

  "3:                       \n"
    "rjmp 5f                \n"

  "4:                       \n"
    "ldi  r23, 10           \n" //result *= 10
    "mul  %B0, r23          \n"
    "mov  %B0, r0           \n"
    "mul  %A0, r23          \n"
    "mov  %A0, r0           \n"
    "add  %B0, r1           \n"
    "clr  __zero_reg__      \n" //r1 trashed by mul
    "add  %A0, %2           \n" //result += new digit
    "adc  %B0, __zero_reg__ \n"
    "ld   %2, Z+            \n" //fetch next digit char
  
  "5:                       \n"
    "subi %2, '0'           \n" //convert char to 0-9
    "cpi  %2, 10            \n" //end of string?
    "brlo 4b                \n"

    "cpi  %3, 0             \n" //negative?
    "breq 6f                \n"
    "com  %B0               \n" //negate result
    "neg  %A0               \n"
    "sbci %B0, -1           \n"
  
  "6:                       \n"
    : "+r" (result), "+z" (s), "+a" (c), "+a" (sign) : : "r23", "memory"
  );

  return result;
}
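The "result *= 10" step is worth a closer look, because the AVR mul instruction only multiplies 8 bits by 8 bits. The routine multiplies each byte of the 16-bit result separately and then folds the carry byte of the low product into the high byte. Here is a plain C model of that arithmetic; times10_bytewise is a made-up name, and the comments tie each line to the corresponding instructions.

```c
#include <stdint.h>

// Multiply a 16-bit value by 10 using two 8x8 multiplies,
// keeping only the low 16 bits of the product.
static uint16_t times10_bytewise(uint16_t n) {
  uint8_t lo = (uint8_t)(n & 0xff);
  uint8_t hi = (uint8_t)(n >> 8);
  uint16_t hi_product = (uint16_t)(hi * 10u);  // mul %B0, r23
  uint16_t lo_product = (uint16_t)(lo * 10u);  // mul %A0, r23
  // mov %B0, r0 keeps the low byte of hi_product;
  // add %B0, r1 adds the high byte of lo_product
  uint8_t new_hi = (uint8_t)((hi_product & 0xff) + (lo_product >> 8));
  return (uint16_t)(((uint16_t)new_hi << 8) | (lo_product & 0xff));
}
```

Anything that would spill past 16 bits is silently dropped, so very long digit strings wrap around rather than saturate.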

Conclusion

While there are countless more topics to cover, and many more rabbit-holes to dive down, I believe I have covered enough of the basics in this series. I sure enjoyed researching and writing these tutorials. And, hopefully you gained a few insights into the funky world of arduino (AVR) inline assembly programming. Now, get inline with your programming!

[updated: 4.11.16]


Arduino Inline Assembly Tutorial (Interrupts)


Pardon The Interruption

The previous tutorial covered the basics of writing inline functions. A close relative of the function is the Interrupt Service Routine (ISR), which is the topic here. Portions of this tutorial may pertain to functions as well.

As a warning, this tutorial assumes an understanding of the basic concepts of interrupts in general, and specifically interrupt handlers on the arduino (AVR μC). Hopefully, you have already written a few arduino interrupts in C, using the internal arduino functionality. If not, you may want to study some of the links given in the reference section of this tutorial before continuing.

The Deck is Stacked

Basic knowledge of the stack is essential to understanding functions and interrupt handlers. The basic purpose of the stack is to support function calls and interrupts. Whenever a program makes a function call or whenever an interrupt occurs, the stack is used to store critical information which will be restored upon completion of the function or interrupt. Additional information on the stack can be found here and here.

First and foremost, during a function call or interrupt, the hardware places the return address on the stack. The saving and restoration of the return address is accomplished transparently by the CALL and RET instructions (or, for an interrupt, by the interrupt hardware and the RETI instruction). It is not necessary to perform any special instruction(s) to make this occur.

Second, if any “call-saved” registers will be “clobbered” inside the function, these registers are “pushed” onto the stack. In the case of an interrupt service routine, all of the registers used inside the ISR (and always the temporary and zero registers, r0 and r1) get pushed onto the stack. Additionally, during an ISR the SREG is saved and restored.

Finally, if the compiler deems it necessary, space is reserved for any local variables on the stack. Many times the compiler will place local variables into specific registers, and therefore doesn’t use the stack for temporary storage.

Here is an example of how the compiler uses the stack to store local variables inside of a function. This is sometimes referred to as “setting up a stack frame.” We will reserve 16 bytes for a character array (note: unrelated code has been removed for the purpose of clarity). The compiler performs all of this stack manipulation for us behind the scenes, so-to-speak:

void example(void) {
  char buffer[16]; //space will be reserved on the stack
 
  //
  //do something here. . .
  //
 
}

Results in this machine code:

;prologue
  PUSH r28          ;save registers on stack 
  PUSH r29 
  IN   r28, SPL     ;get stack pointer    
  IN   r29, SPH   
  SBIW r28, 16      ;reserve 16 bytes space on stack
                    ;the stack grows downward, hence the subtraction
  OUT  SPH, r29     ;update new stack pointer
  OUT  SPL, r28 
 
;
;do something here. . .
;
 
;epilogue
  ADIW r28, 16      ;remove the 16 bytes from the stack
  OUT  SPH, r29     ;restore stack pointer
  OUT  SPL, r28 
  POP  r29          ;restore registers from stack
  POP  r28 
  RET 

Upon return from the interrupt or function, all the preserved values are restored, or “popped”, from the stack. During the prologue and epilogue code, the order of the push and pop instructions is critical: values must be popped in the reverse of the order in which they were pushed.

Interrupt Before and After

Below, I wrote a very basic interrupt routine that simply increments a byte so we can examine the prologue and epilogue code generated by the compiler:

//here is an example ISR coded in C:
volatile uint8_t a;
 
ISR(INT0_vect) {
  a++;
}
 
//this is the generated assembly code:
;prologue
0000027F 1f.92                PUSH r1       ;save r1 register
00000280 0f.92                PUSH r0       ;save r0 register
00000281 0f.b6                IN r0, SREG   ;get status register
00000282 0f.92                PUSH r0       ;save sreg 
00000283 11.24                CLR r1        
00000284 8f.93                PUSH r24      ;save r24 register
;increment byte (a) here
00000285 80.91.c3.01          LDS r24, (a) 
00000287 8f.5f                SUBI r24, 0xFF     
00000288 80.93.c3.01          STS (a), r24 
;epilogue
0000028A 8f.91                POP r24       ;restore r24 register
0000028B 0f.90                POP r0        ;restore status register
0000028C 0f.be                OUT SREG, r0
0000028D 0f.90                POP r0        ;restore r0 register
0000028E 1f.90                POP r1        ;restore r1 register
0000028F 18.95                RETI          ;return from interrupt

As you can see, the meat of the ISR is only 10 bytes long. However, together the prologue and epilogue add another 24 bytes, for a total of 34. It might be possible to save a few bytes and program cycles by tightly writing your own ISR pro and epilogue. GCC has a provision which allows writing your own pro and epilogues, which will be covered later.

We Interrupt This Program to Blink

It is now time to write an interrupt handler, or ISR in inline assembler. I can’t think of a better example than to adapt the basic Blink sketch to use the Timer #1 Overflow interrupt. Please note, because this code alters the Timer #1 registers, it will render any use of the arduino Timer #1 as nonfunctional (i.e. analogWrite pins 9 & 10, the Servo Library, etc.).

Handle It

The first order of business is to write the interrupt handler for the Timer #1 Overflow. This is the routine that is called when the Timer #1 counter (TCNT1) rolls over from 0xffff to zero. Our ISR is very basic, and as always, it should be kept as short as possible. Inside the handler we perform two functions:

  • Reload the counter (TCNT1) so that the next overflow occurs 1 second later.
  • Toggle the LED.

An ISR can be coded using inline assembler just as in a “C Stub Function”, relying upon the compiler to insert the necessary prologue and epilogue code. I suggest you use this stub technique at first before graduating to writing the entire “naked” ISR. Here is a stub version of our ISR:

#define TCNT_BASE   0x0bdc
#define TCNT_BASE_H (((TCNT_BASE)>>8)&0xff)
#define TCNT_BASE_L ((TCNT_BASE)&0xff)

ISR(TIMER1_OVF_vect) {
  asm (
    //reload TCNT1 counter for 1sec interrupt
    //(16-bit timer registers: write the high byte first)
    "ldi r24, %4           \n"
    "std Z+1, r24          \n" //TCNT1H
    "ldi r24, %3           \n"
    "st  Z, r24            \n" //TCNT1L
    //toggle LED
    "in   __tmp_reg__, %0  \n" //read port
    "ldi  r24, %1          \n" //LED bit mask
    "eor  __tmp_reg__, r24 \n" //toggle LED bit
    "out  %0, __tmp_reg__  \n" //write port
    : : "I" (_SFR_IO_ADDR(PORTB)), "I" (_BV(PORTB5)),
    "z" (_SFR_MEM_ADDR(TCNT1)), "M" (TCNT_BASE_L), "M" (TCNT_BASE_H) : "r24"
  );
}
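The TCNT_BASE value of 0x0bdc is not arbitrary: the counter has to overflow after exactly one second's worth of timer ticks. Here is the arithmetic as a small plain C sketch. The function name timer1_preload is hypothetical, and the 16 MHz clock is an assumption (the usual Uno clock); the 256 prescale is the one this sketch sets up.

```c
#include <stdint.h>

// Preload value for a 16-bit timer so it overflows after interval_ms.
// Assumes the required tick count fits in 16 bits (ticks <= 65536).
static uint16_t timer1_preload(uint32_t f_cpu_hz, uint32_t prescale,
                               uint32_t interval_ms) {
  // timer ticks per second, scaled to the requested interval
  uint32_t ticks = (f_cpu_hz / prescale) * interval_ms / 1000u;
  // the counter overflows at 65536, so start that many ticks short
  return (uint16_t)(65536u - ticks);
}
```

With 16 MHz and a 256 prescale the timer runs at 62500 ticks per second, so a one second interval needs a starting count of 65536 - 62500 = 3036, which is 0x0bdc.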

Having said all that, the boilerplate code the compiler inserts is not always the most efficient, and many times inadequate. For these reasons, and for the academic exercise, we will also select the “ISR_NAKED” attribute when defining the ISR. This gives us full control over all of the code inside the ISR. Full control is a good thing:

ISR(TIMER1_OVF_vect, ISR_NAKED)

Eleven instructions encompass the prologue and epilogue, which is more than the code required for the main purpose of the interrupt. Notice inside the handler, we utilize 3 registers, r24, r30 and r31. This means we need to preserve the content of these registers since the interrupt could be triggered at any time, even precisely when these registers may be in use. Additionally we need to preserve the status register (SREG). The SREG holds critical information on the state of the program when the interrupt fired. Neglecting to preserve any of this information would probably cause the program to crash.

Don’t forget to include the terminating RETI instruction also. By comparison, this ISR_NAKED version is 10 bytes shorter than the “Stub” version:

#include "k328p.h"

#define TCNT_BASE   0x0bdc
#define TCNT_BASE_H (((TCNT_BASE)>>8)&0xff)
#define TCNT_BASE_L ((TCNT_BASE)&0xff)

ISR(TIMER1_OVF_vect, ISR_NAKED) {
  asm (
    "push r31           \n" //save r30, r31 contents
    "push r30           \n"
    "push r24           \n"
    //preserve SREG
    "in   r24, __SREG__ \n"
    "push r24           \n"

    //reload TCNT1 counter for 1sec interrupt
    //(16-bit timer registers: write the high byte first)
    "clr r31            \n"
    "ldi r30, %2        \n"
    "ldi r24, %4        \n"
    "std Z+1, r24       \n" //TCNT1H
    "ldi r24, %3        \n"
    "st  Z, r24         \n" //TCNT1L
    //toggle LED
    "in   r30, %0       \n" //read port
    "ldi  r31, %1       \n" //LED bit mask
    "eor  r30, r31      \n" //toggle LED bit
    "out  %0, r30       \n" //write port

    //restore old SREG
    "pop  r24           \n"
    "out  __SREG__, r24 \n"
    //restore r24, r30, r31
    "pop r24            \n"
    "pop  r30           \n"
    "pop  r31           \n"
    "reti               \n"
    : : "I" (kPORTB), "I" (_BV(PORTB5)), 
    "M" (kTCNT1), "M" (TCNT_BASE_L), "M" (TCNT_BASE_H)
  );
}

The initialization code required for the Timer #1 interrupt (setting the prescaler, loading the counter and enabling the overflow interrupt) is completely contained inside the setup function. Obviously, it is not necessary to write this in inline assembly; it’s just good practice:

#include "k328p.h"

#define TCNT_BASE   0x0bdc
#define TCNT_BASE_H (((TCNT_BASE)>>8)&0xff)
#define TCNT_BASE_L ((TCNT_BASE)&0xff)

void setup() {
  uint16_t TNCTBase = TCNT_BASE;

  asm (
    "cli                  \n" //disable global interrupts 
    "sbi %0, %1           \n" //pinMode(13, OUTPUT);

    //set 256 prescale (CS12)
    "st  Z+, __zero_reg__ \n" //zero TCCR1A
    "ldi r24, %3          \n"
    "st  Z+, r24          \n" //TCCR1B = CS12
    "st  Z, __zero_reg__  \n" //zero TCCR1C
    //load counter for 1sec interrupt (high byte first)
    "ldi r30, %4          \n"
    "std Z+1, %B5         \n" //TCNT1H
    "st  Z, %A5           \n" //TCNT1L
    //enable overflow interrupt
    "ldi r30, %6          \n"
    "ldi r24, %7          \n"
    "st  Z, r24           \n" //TIMSK1

    "sei                  \n" //enable global interrupts 
    : : "I" (_SFR_IO_ADDR(DDRB)), "I" (PORTB5),
    "z" (_SFR_MEM_ADDR(TCCR1A)), "I" (_BV(CS12)),
    "M" (kTCNT1), "r" (TNCTBase),
    "M" (kTIMSK1), "I" (_BV(TOIE1)) : "r24", "memory"
  );
}

void loop() { }

Finally, we are introducing a new header file “k328p.h” (contents listed below), which contains all of the IO register defines in a form that we can use inside our inline assembly routines. The definitions in this file use the standard ATMEL mnemonics for the IO registers with the letter ‘k’ prepended. Each value is the low byte of the IO register address, which allows greater flexibility in inline assembler code when referring to the IO registers (for example, when loading a pointer register for the LD/ST instructions). A close examination of the above code will reveal the method of use.
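One detail of the header deserves a quick illustration. The registers in the 0-0x3f range have two addresses: the IO address used by IN/OUT (and SBI/CBI for the first 32), and a data-memory address, 0x20 higher, used by LD/ST. The header lists both forms, with a "k2" prefix for the second. A minimal sketch of the relationship, using a made-up helper name:

```c
#include <stdint.h>

// Offset between an IO address (IN/OUT) and the data-memory
// address of the same register (LD/ST) on the ATmega328P.
#define IO_TO_MEM_OFFSET 0x20

// Convert an IO address in the range 0-0x3f to its memory address.
static uint8_t io_to_mem_addr(uint8_t io_addr) {
  return (uint8_t)(io_addr + IO_TO_MEM_OFFSET);
}
```

For instance, PORTB is 0x05 as an IO address and 0x25 as a memory address. The extended registers (0x60 and up) have only a memory address, so they appear once in the header.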

Arduino IO Register Defines

//k328p.h - definitions for ATmega328P
//4.4.2016
#ifndef _k328P_H_
#define _k328P_H_ 

//standard registers 
//0-0x1f: bit addressable
//0-0x3f: IN/OUT compatible 
//0-0x3f: add 0x20 when using LD/ST
#define kPINB   0x03
#define kDDRB   0x04
#define kPORTB  0x05
#define kPINC   0x06
#define kDDRC   0x07
#define kPORTC  0x08
#define kPIND   0x09
#define kDDRD   0x0A
#define kPORTD  0x0B

#define kTIFR0  0x15
#define kTIFR1  0x16
#define kTIFR2  0x17

#define kPCIFR  0x1B
#define kEIFR   0x1C
#define kEIMSK  0x1D
#define kGPIOR0 0x1E
#define kEECR   0x1F
//end bit addressable

#define kEEDR   0x20
#define kEEAR   0x21
#define kEEARL  0x21
#define kEEARH  0x22
#define kGTCCR  0x23
#define kTCCR0A 0x24
#define kTCCR0B 0x25
#define kTCNT0  0x26
#define kOCR0A  0x27
#define kOCR0B  0x28

#define kGPIOR1 0x2A
#define kGPIOR2 0x2B
#define kSPCR   0x2C
#define kSPSR   0x2D
#define kSPDR   0x2E

#define kACSR   0x30

#define kMCUSR  0x34
#define kMCUCR  0x35

#define kSPMCSR 0x37

#define kSPL    0x3D
#define kSPH    0x3E
#define kSREG   0x3F
//end IN/OUT compatible

//extended registers begin
#define kWDTCSR 0x60
#define kCLKPR  0x61

#define kPRR    0x64

#define kOSCCAL 0x66

#define kPCICR  0x68
#define kEICRA  0x69

#define kPCMSK0 0x6B
#define kPCMSK1 0x6C
#define kPCMSK2 0x6D
#define kTIMSK0 0x6E
#define kTIMSK1 0x6F
#define kTIMSK2 0x70

#define kADC    0x78
#define kADCW   0x78
#define kADCL   0x78
#define kADCH   0x79
#define kADCSRA 0x7A
#define kADCSRB 0x7B
#define kADMUX  0x7C

#define kDIDR0  0x7E
#define kDIDR1  0x7F

#define kTCCR1A 0x80
#define kTCCR1B 0x81
#define kTCCR1C 0x82

#define kTCNT1  0x84
#define kTCNT1L 0x84
#define kTCNT1H 0x85
#define kICR1   0x86
#define kICR1L  0x86
#define kICR1H  0x87
#define kOCR1A  0x88
#define kOCR1AL 0x88
#define kOCR1AH 0x89
#define kOCR1B  0x8A
#define kOCR1BL 0x8A
#define kOCR1BH 0x8B

#define kTCCR2A 0xB0
#define kTCCR2B 0xB1
#define kTCNT2  0xB2
#define kOCR2A  0xB3
#define kOCR2B  0xB4
#define kASSR   0xB6

#define kTWBR   0xB8
#define kTWSR   0xB9
#define kTWAR   0xBA
#define kTWDR   0xBB
#define kTWCR   0xBC
#define kTWAMR  0xBD

#define kUCSR0A 0xC0
#define kUCSR0B 0xC1
#define kUCSR0C 0xC2

#define kUBRR0  0xC4
#define kUBRR0L 0xC4
#define kUBRR0H 0xC5
#define kUDR0   0xC6
//end extended registers

//0-0x3f registers as data-space addresses (IO address + 0x20) for LD/ST instructions
#define k2PINB   0x23
#define k2DDRB   0x24
#define k2PORTB  0x25
#define k2PINC   0x26
#define k2DDRC   0x27
#define k2PORTC  0x28
#define k2PIND   0x29
#define k2DDRD   0x2A
#define k2PORTD  0x2B
#define k2TIFR0  0x35
#define k2TIFR1  0x36
#define k2TIFR2  0x37
#define k2PCIFR  0x3B
#define k2EIFR   0x3C
#define k2EIMSK  0x3D
#define k2GPIOR0 0x3E
#define k2EECR   0x3F
#define k2EEDR   0x40
#define k2EEAR   0x41
#define k2EEARL  0x41
#define k2EEARH  0x42
#define k2GTCCR  0x43
#define k2TCCR0A 0x44
#define k2TCCR0B 0x45
#define k2TCNT0  0x46
#define k2OCR0A  0x47
#define k2OCR0B  0x48
#define k2GPIOR1 0x4A
#define k2GPIOR2 0x4B
#define k2SPCR   0x4C
#define k2SPSR   0x4D
#define k2SPDR   0x4E
#define k2ACSR   0x50
#define k2MCUSR  0x54
#define k2MCUCR  0x55
#define k2SPMCSR 0x57
#define k2SPL    0x5D
#define k2SPH    0x5E
#define k2SREG   0x5F

#endif //_k328P_H_
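
The relationship between the plain ‘k’ definitions and the ‘k2’ definitions can be checked on any desktop compiler. The following is only a minimal sketch in plain C, not part of k328p.h: the helper names (k_is_bit_addressable, k_is_in_out, k_data_address) are hypothetical, and the few defines are copied from the header above so the example stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

// Values copied from k328p.h
#define kPORTB  0x05
#define k2PORTB 0x25
#define kSREG   0x3F
#define k2SREG  0x5F
#define kTIMSK0 0x6E

// Bit instructions (SBI/CBI/SBIS/SBIC) reach only IO addresses 0x00-0x1F
bool k_is_bit_addressable(uint8_t io) { return io <= 0x1F; }

// IN/OUT reach IO addresses 0x00-0x3F
bool k_is_in_out(uint8_t io) { return io <= 0x3F; }

// LD/ST see the same register at IO address + 0x20
uint8_t k_data_address(uint8_t io) { return (uint8_t)(io + 0x20); }
```

For example, k_data_address(kPORTB) gives 0x25, which is the value the header lists as k2PORTB, while kTIMSK0 (0x6E) lies outside the IN/OUT range and is already a data-space address.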

References

Arduino Interrupts
Newbie’s Guide to AVR Interrupts
PJRC Guide to Interrupts
AVR Libc Information on Interrupts
University of Maryland, BC, C Programming and Embedded Systems Course, Interrupt Information
AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Mixing C and Assembly Language
ATMEL ATmega328P Datasheet

Also available as a book, with greatly expanded coverage!


Arduino Inline Assembly Tutorial (Functions)


At first consideration, the topic of functions seems simple and trite. Just discuss how to “CALL” and “RETURN” to and from a function, right? However, there are many subtopics involved as well. For example, passing and returning parameters, prologue and epilogue code, the stack frame and mixing assembly and C are topics deserving of separate tutorials. Hopefully, we can do all of these justice, but first, the basics…

Convert Snippet Into a Function

How about a simple demonstration of turning an inline code snippet into a function? In a previous tutorial on indirect addressing, several inline pieces of code were developed to perform various string operations. One such operation determined the character length of a string. The code is below.

String Length, Sounds Like strlen

const char src[4] = "abc";
const char *p = src; //Z: walks the string
volatile uint8_t len;
 
asm volatile (
  "1:                   \n"
  "ld   __tmp_reg__, Z+ \n"
  "tst  __tmp_reg__     \n"
  "brne 1b              \n"
  //Z points one character past the terminating NUL
  "subi %A1, 1          \n" //subtract post-increment
  "sbci %B1, 0          \n"
  "sub  %A1, %A2        \n" //length = end - start
  "sbc  %B1, %B2        \n"
  "mov  %0, %A1         \n" //save len (uint8_t)
  //Z is changed by the code, so it is an in/out ("+z") operand
  : "=r" (len), "+z" (p) : "x" (src) : "memory"
);

While this code could easily be included “inline”, it certainly would be more useful if it were defined as a general function. This would make it much easier to use throughout a program, and would also reduce overall program size by incorporating only one instance of the code. So how is this accomplished?

Stub Your Code

The official Cookbook refers to this technique as a “C Stub Function,” which is nothing more than a function definition containing only inline assembler code. Typically, in a “C Stub Function”, the function parameters and local variables define the data used in, and the value returned (if any) by, the function. This is an easy way to pass data to and from the inline code without needing to understand the underlying details of how it’s done, which eliminates the need to write additional code.

The above “string length” snippet easily becomes a full-blown function, _strlen(), using this method. Notice the transformed function below receives a string (s) as a parameter, and returns the length, which is defined as a local variable. We refer to these same variables in the input and output constraints:

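inline uint8_t _strlen(const char *s) {
  uint8_t len;
  const char *end = s;   //Z: advanced by the loop
  const char *start = s; //X: complemented by the math

  asm volatile (
    "1:                  \n" //local label, safe when inlined repeatedly
    "ld  __tmp_reg__, Z+ \n"
    "tst __tmp_reg__     \n"
    "brne 1b             \n"
    //len = Z - 1 - s = (-1 - s) + Z = ~s + Z
    "com %A2             \n"
    "com %B2             \n"
    "add %A2, %A1        \n"
    "adc %B2, %B1        \n"
    "mov %0, %A2         \n" //save len (uint8_t)
    //registers changed by the code are in/out ("+") operands
    : "=r" (len), "+z" (end), "+x" (start) : : "memory"
  );

  return len;
}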

Placing a Call

An extension to the “C Stub Function” technique is calling another C function from inside inline assembly code. The following bit of code demonstrates the CALL instruction. This instruction “calls” a subroutine located within the program memory (if we remember to properly define the function to avoid linkage errors). The C Stub Function even handles the return (RET) for us.

An additional detail required here is the need to wrap the “called” function inside an extern “C” { } declaration (see the example below). Arduino sketches are compiled as C++, and the C++ compiler would otherwise “mangle” the function name, which prevents the linker from locating the called function. The extern “C” linkage specification keeps the name unmangled.

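extern "C" {
  //"used" keeps the function even though only the asm refers to it
  void __attribute__((used)) foo() {
    // do something here...
  }
}

void test() {
  asm volatile (
    "call foo \n"
    //a called C function may change any call-used register
    : : : "r18", "r19", "r20", "r21", "r22", "r23", "r24", "r25",
          "r26", "r27", "r30", "r31", "memory"
  );
}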

Playing Catch

Next, we present a basic example of passing and returning parameters to and from C Stub Functions. The purpose of the following code is to convert an upper case ASCII character into its lower case equivalent. We’ve created two functions here, _isupper and _tolower, which validate the input character and then perform the conversion.

Take a look at the code below.

Notice, the first thing _tolower does is call the function _isupper. Since _tolower hasn’t done anything yet, the C Stub Function simply hands the input character (c), the parameter to _tolower, directly on to the _isupper function. Neat!

Next, _isupper checks the character to confirm it’s actually an upper case character. If so, it returns the character; otherwise it returns zero. Upon returning to _tolower, the next instruction executed is “tst r24”, a test of the contents of register r24. If r24 is not zero, the character is converted and the function returns. If r24 is zero, that zero is what _tolower returns, since r24 is also its return register.

Again, notice the use of the extern “C” { } declaration here:

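extern "C" {
  //"used" keeps the function even though only the asm refers to it
  unsigned char __attribute__((used)) _isupper(unsigned char c) {
    //bind scratch variable to a specific register r18
    register unsigned char ch asm("r18");
    
    asm volatile (
      "mov  %1, %0 \n" //copy input to scratch
      "subi %1, 'A'\n" //subtract 0x41
      "brmi 2f     \n" //branch if minus (below 'A')
      "subi %1, 26 \n" //26 letters
      "brmi 3f     \n" //branch if minus: c==upper, keep it
      "2: clr  %0  \n" //false
      "3:          \n" //fall through to the C return
      //ch is written by the code, so it is an early-clobber output
      : "+r" (c), "=&r" (ch) 
    );
    
    return c;
  }
}

char _tolower(unsigned char c) {
  //keep the character in r24, where _isupper expects and returns it
  register unsigned char ch asm("r24") = c;

  asm volatile (
    "call _isupper \n" //validate char
    "tst r24       \n" //0 = not an upper case char
    "breq 1f       \n" //not an upper case char
    "ori %0, 0x20  \n" //make lower
    "1:            \n"
    : "+r" (ch)
    //a called C function may change any call-used register
    : : "r18", "r19", "r20", "r21", "r22", "r23", "r25",
        "r26", "r27", "r30", "r31"
  );
  
  return ch; //0 if c was not upper case
}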
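
The range check in _isupper is compact but not obvious at first glance. As a rough aid, here is the same logic written as ordinary C that can be run on a desktop compiler. This is only a model of the assembly above, not a replacement for it, and the names ref_isupper and ref_tolower are made up for this illustration.

```c
#include <stdint.h>

// Returns c when c is 'A'..'Z', otherwise 0 (same contract as _isupper)
uint8_t ref_isupper(uint8_t c) {
  uint8_t ch = (uint8_t)(c - 'A');  // subi ch, 'A' (wraps when c < 'A')
  if (ch & 0x80) return 0;          // brmi: result is negative
  ch = (uint8_t)(ch - 26);          // subi ch, 26
  if (!(ch & 0x80)) return 0;       // still non-negative: past 'Z'
  return c;                         // 0..25 past 'A': upper case
}

// Returns the lower case letter, or 0 when c is not upper case
uint8_t ref_tolower(uint8_t c) {
  uint8_t r24 = ref_isupper(c);     // call _isupper, result in r24
  if (r24 != 0) r24 |= 0x20;        // tst r24 / ori: set bit 5
  return r24;
}
```

Note that the “negative” tests look only at bit 7 of the 8-bit result, exactly as the BRMI instruction does, which is why characters above 0x7F also fail the first check.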

Insider Information

Why did function _tolower choose to test register #24 (r24)? The above two functions relied on “insider” information when using register r24. These routines knew that an 8-bit, byte-sized value is passed to and from a function via the r24 register. The C Compiler always passes function arguments and return values in specific register locations. Knowing these locations is essential to writing efficient inline assembly code, especially when interfacing with the C language.

This is a good time to review the data type sizes: a char is 8 bits, an int is 16 bits, a long is 32 bits, a long long is 64 bits, a float is 32 bits, and a pointer is 16 bits (function pointers are word addresses). Arguments are allocated left to right, starting in register r25 and descending through register r8. All arguments are aligned to start in even-numbered registers, so an odd-sized argument, like a char, has one free register above it. For example, a single 8-bit value is passed via the r24 register (r25 is left unused), a single 16-bit value is passed via the r25:r24 register pair, and a 32-bit value is passed via the r25:r24:r23:r22 register combination.

Return values are passed in a similar fashion. An 8-bit value is returned in r24, a 16-bit value in r25:r24, and a 32-bit value in r25:r24:r23:r22. An 8-bit return value may be zero- or sign-extended to 16 bits by the called function.
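
To make the allocation rule concrete, the following desktop C sketch works out which register each argument of a call would start in. It is only a model of the rule described above, not compiler code: the function name avr_arg_regs is invented here, and the handling of arguments that no longer fit (reported as -1, along with every argument after them) is an assumption about the stack fallback rather than something stated in this tutorial.

```c
#include <stddef.h>

// For each argument size (in bytes), store the lowest register number it
// occupies, or -1 once the registers r25..r8 are used up.
void avr_arg_regs(const size_t *sizes, size_t count, int *low_reg) {
  int next = 26;        // one past r25, the first register handed out
  int on_stack = 0;     // assumption: later arguments follow onto the stack
  for (size_t i = 0; i < count; i++) {
    int bytes = (int)((sizes[i] + 1) & ~(size_t)1);  // round up to even
    if (on_stack || next - bytes < 8) {
      on_stack = 1;
      low_reg[i] = -1;
    } else {
      next -= bytes;    // registers are handed out downwards
      low_reg[i] = next;
    }
  }
}
```

For a call such as f(char, int, long), the sizes 1, 2 and 4 come out as r24, r22 and r18: the char in r24, the int in r23:r22, and the long in r21:r20:r19:r18.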

What’s the Use of a Register?

Function “call-used” registers are r18-r27, and r30-r31. Any, or all of these registers may be allocated by the compiler for local data. However, we may use them freely in assembler subroutines. Calling C subroutines can clobber any of them, and the caller is responsible for saving and restoring before and after use.

Function “call-saved” registers are r2-r17, and r28-r29. They may also be allocated by the compiler for local data, but C subroutines leave them unchanged. Assembler subroutines are responsible for saving and restoring any of these registers, if changed. The Y register pair (r29:r28) is used as a frame pointer (pointing to local data placed on the stack) if necessary.

Fixed registers, r0, and r1 are never allocated by the compiler for local data. The temporary register, r0 can be clobbered by any C code (except interrupt handlers which save it), and may be used freely. The zero register is r1, and assumed to be always zero in any C code. It may be used for other purposes within a piece of assembler code, but must then be cleared after use (clr r1). Interrupt handlers save and clear r1 on entry, and restore r1 on exit (in case it was non-zero).
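
The three groups are easy to mix up when writing a clobber list, so here is the same information as a small lookup function in plain C. It is merely a restatement of the three paragraphs above; the enum and the function name avr_classify are hypothetical, and the function assumes a register number from 0 to 31.

```c
typedef enum { REG_FIXED, REG_CALL_SAVED, REG_CALL_USED } avr_reg_class;

// Classify register r (0-31) according to the rules described above
avr_reg_class avr_classify(int r) {
  if (r == 0 || r == 1)
    return REG_FIXED;        // r0 = temporary, r1 = zero register
  if ((r >= 2 && r <= 17) || r == 28 || r == 29)
    return REG_CALL_SAVED;   // must be saved and restored if changed
  return REG_CALL_USED;      // r18-r27 and r30-r31: free to clobber
}
```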

References

AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Mixing C and Assembly Language


Arduino Inline Assembly Tutorial (Tables)

Often, the fastest way to compute something on an Arduino is to not compute it at all.

Huh?

For example, trigonometric functions are costly operations and can abruptly slow your application to a crawl. And many times, the result is computed with far more precision than the situation needs. Most often you just want the periodic wave-like characteristics of sine or cosine, which can easily be approximated. With a trigonometric function, it’s easy to substitute a lookup table populated with pre-computed values at discrete steps. If your program can handle the loss of precision, yet requires as much speed as possible, this alternative is a good option.

The Ivy League Microcontroller

Since the Arduino’s ATMEL AVR μC is based upon the modified Harvard architecture, data and program instructions are stored in different memories. The program instructions are stored in flash, while data is stored in SRAM. These separate pathways are primarily implemented to enhance performance, but they also prohibit executing program instructions from data memory. Although it may seem paradoxical, data is allowed to be stored inside program memory (see this information on the use of the PROGMEM attribute).

Placing a table in SRAM is simple, and shouldn’t present problems for an inline programmer (especially at our stage!). Consequently, in this tutorial, we will store a table inside program memory.

Did He Say Frogmen?

Placing the table into program memory is easy. It is accomplished via a C language floating-point array, incorporating the special keyword, “PROGMEM”. PROGMEM instructs the compiler to place this data into flash memory:

static const float PROGMEM SineTable[91] = {
  0.0, 0.017452, 0.034899, 0.052336, 0.069756,
. . .
  0.997564, 0.998630, 0.999391, 0.999848, 1.0
};

Previously, when accessing SRAM (data memory) we used the LDS instruction. However, accessing program memory requires the use of the LPM instruction. LPM is the mnemonic for Load from Program Memory, and it loads a data byte from flash program memory into a register.

Details

The Flash program memory is organized as 16-bit words, while the registers and SRAM are organized as 8-bit bytes. The Z register is used to access the program memory. This 16-bit register pair is used as a 16-bit pointer into the program memory. The 15 most significant bits select the word address in program memory, and the least significant bit selects the low or high byte of that word. Because of this, a word address is multiplied by two before it is put in the Z register. However, the good news is that in the code presented below all of these details are transparent.
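
The word-versus-byte arithmetic can be tried out on a desktop compiler. The helpers below are only an illustrative sketch in plain C, with invented names, showing how a Z value for LPM splits into a word address and a byte-select bit.

```c
#include <stdint.h>

// Build the Z pointer for LPM from a flash word address.
// high_byte = 0 selects the low byte of the word, 1 the high byte.
uint16_t z_from_word_address(uint16_t word_addr, uint8_t high_byte) {
  return (uint16_t)((word_addr << 1) | (high_byte & 1));
}

// The upper 15 bits of Z are the word address
uint16_t z_word_address(uint16_t z) { return (uint16_t)(z >> 1); }

// The lowest bit of Z picks the byte within the word
uint8_t z_byte_select(uint16_t z) { return (uint8_t)(z & 1); }
```

For instance, the high byte of flash word 0x0100 is reached with Z = 0x0201.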

Table Legs

The function below first checks that the input value lies in the range 0-90. If the input is out of range, it returns the floating-point Not-A-Number (NAN) value. Otherwise it multiplies the input by 4 to produce an index into our table. We multiply by four because our table is populated with floating-point numbers, each of which is 4 bytes long. The index is simply added to the (PROGMEM) address of the start of the table. The function finishes by retrieving the 4-byte float value and returning it.

Note, floating point support inside the inline assembler is scarce. In this function we treat the float variable transparently, like any 32-bit variable. We get away with this because we’re not performing any operation on the value.

float _Sine(uint16_t angle) {
  float tmp;
  const float *p = SineTable; //Z: start of the table

  asm (
    //validate angle >= 0 && angle <= 90
    "cpi  %A1, 90+1 \n" 
    "cpc  %B1, __zero_reg__ \n"
    "brcc 1f        \n" //out of range

    //calculate table index
    "lsl  %A1       \n" //float is 4 bytes wide
    "rol  %B1       \n" //index = angle * 4
    "lsl  %A1       \n"
    "rol  %B1       \n"

    //add index to start of SineTable
    "add  r30, %A1  \n" 
    "adc  r31, %B1  \n"

    //get sine value (4-bytes)
    "lpm  %A0, Z+   \n" 
    "lpm  %B0, Z+   \n"
    "lpm  %C0, Z+   \n"
    "lpm  %D0, Z    \n"
    "rjmp 2f        \n" //done, let the C code return
    
    //return NAN
    "1:                 \n" 
    "ldi  %A0, lo8(%3)  \n" //NAN = 0x7fc00000
    "ldi  %B0, hi8(%3)  \n"
    "ldi  %C0, hlo8(%3) \n"
    "ldi  %D0, hhi8(%3) \n"
    "2:                 \n"
    //cpi/ldi need r16-r31 ("d"); changed registers are in/out ("+")
    : "=d" (tmp), "+d" (angle), "+z" (p) : "F" (NAN)
  );
  return tmp;
}
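
For comparison, here is roughly what the lookup does, written as plain C that runs on a desktop compiler. It is a sketch under a few assumptions: the function name table_lookup and the four-entry demo_table (sines of 0, 30, 60 and 90 degrees) are stand-ins invented for this example, a float is assumed to be 4 bytes, and ordinary memory takes the place of flash.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for SineTable: sin of 0, 30, 60 and 90 degrees
const float demo_table[4] = { 0.0f, 0.5f, 0.866025f, 1.0f };

float table_lookup(const float *table, uint16_t count, uint16_t index) {
  if (index >= count)
    return NAN;                                // brcc: out of range
  uint16_t offset = (uint16_t)(index << 2);    // lsl/rol twice: index * 4
  const uint8_t *z = (const uint8_t *)table + offset;  // add/adc into Z
  float result;
  memcpy(&result, z, sizeof result);           // four byte loads (lpm)
  return result;                               // bytes moved, never computed on
}
```

The memcpy mirrors the point made above: the four bytes of the float are copied as opaque data, so no floating-point arithmetic is needed to fetch a table entry.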

The Full Table

#include <avr/pgmspace.h>

//max error ~0.017452 [91*4=364 bytes]
static const float PROGMEM SineTable[91] = {
  0.0, 0.017452, 0.034899, 0.052336, 0.069756, 0.087156, 
  0.104528, 0.121869, 0.139173, 0.156434, 0.173648, 0.190809, 
  0.207912, 0.224951, 0.241922, 0.258819, 0.275637, 0.292372, 
  0.309017, 0.325568, 0.34202, 0.358368, 0.374607, 0.390731, 
  0.406737, 0.422618, 0.438371, 0.45399, 0.469472, 0.48481, 
  0.5, 0.515038, 0.529919, 0.544639, 0.559193, 0.573576, 
  0.587785, 0.601815, 0.615661, 0.62932, 0.642788, 0.656059, 
  0.669131, 0.681998, 0.694658, 0.707107, 0.71934, 0.731354, 
  0.743145, 0.75471, 0.766044, 0.777146, 0.788011, 0.798636, 
  0.809017, 0.819152, 0.829038, 0.838671, 0.848048, 0.857167, 
  0.866025, 0.87462, 0.882948, 0.891007, 0.898794, 0.906308, 
  0.913545, 0.920505, 0.927184, 0.93358, 0.939693, 0.945519, 
  0.951057, 0.956305, 0.961262, 0.965926, 0.970296, 0.97437, 
  0.978148, 0.981627, 0.984808, 0.987688, 0.990268, 0.992546, 
  0.994522, 0.996195, 0.997564, 0.99863, 0.999391, 0.999848, 1.0
};

What Are You Doing For Me?

After a cursory comparison test between the table _Sine() function and the Arduino floating-point sin() function, we can draw some basic conclusions. Even though the table itself consumes 364 (91 x 4 = 364) bytes of flash (on top of the function code), the Arduino library sin() function (and its required peripheral floating-point support) uses approximately 900 bytes more flash memory.

However, saving space wasn’t necessarily the goal of this exercise; speed was the primary concern. Comparing 1,000 calls to both functions yielded an average duration of 121.7 µs per sin() vs. 2.92 µs for _Sine(). One final but obvious concern is the precision of the result. This will need to be evaluated to determine if it is sufficient for your application.

Bigger, Better, More

Various modifications can expand and improve the accuracy of the table code, but are beyond the scope of this tutorial. However, here are some basic ideas.

The obvious method is to expand the table to decrease the step interval. Another technique is to incorporate linear interpolation between neighbouring table entries, similar to the following sketch:

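float _iSine(float angle) {  //angle in degrees, 0.0 to 90.0
  if (angle < 0.0 || angle > 90.0)
    return NAN;
  if (angle == 90.0)
    return 1.0;  //last entry has no upper neighbour
  uint8_t x1 = (uint8_t)angle;  //whole degrees (floor)
  //the table lives in flash, so read it with pgm_read_float()
  float y1 = pgm_read_float(&SineTable[x1]);
  float y2 = pgm_read_float(&SineTable[x1 + 1]);
  return y1 + ( y2 - y1 ) * ( angle - x1 );
}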

For full 0-360 angle coverage, do something like:

float Sine(uint16_t i) {
  i %= 360;  //wrap to 0-359

  if (i <= 90)
    return _iSine(i);
  else if (i <= 180)
    return _iSine(180 - i);      //sin(180 - x) = sin(x)
  else if (i <= 270)
    return -_iSine(i - 180);     //sin(180 + x) = -sin(x)
  else
    return -_iSine(360 - i);     //sin(360 - x) = -sin(x)
}
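
The quadrant folding is easy to get wrong by one degree, so it can be worth checking on its own, separately from the table. The sketch below is plain desktop C with invented names (folded_angle, fold_angle). It relies on the identities sin(180 - x) = sin(x) and sin(180 + x) = -sin(x), which fold with 180 and 360 so that 90 and 270 degrees land exactly on table entry 90.

```c
#include <stdint.h>

typedef struct {
  uint8_t index;  // table entry to use, 0-90
  int8_t  sign;   // +1 or -1, applied to the table value
} folded_angle;

// Map any whole-degree angle onto the first quadrant plus a sign
folded_angle fold_angle(uint16_t deg) {
  folded_angle f;
  deg = (uint16_t)(deg % 360);
  if (deg <= 90)       { f.index = (uint8_t)deg;         f.sign = 1;  }
  else if (deg <= 180) { f.index = (uint8_t)(180 - deg); f.sign = 1;  }
  else if (deg <= 270) { f.index = (uint8_t)(deg - 180); f.sign = -1; }
  else                 { f.index = (uint8_t)(360 - deg); f.sign = -1; }
  return f;
}
```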

Another easy expansion with the sine table is to calculate cosine and tangent values:

float _Cosine(uint16_t a) {
  return _Sine( 90 - a );  //cos(a) = sin(90 - a), valid for 0-90
}
 
float _Tangent(uint16_t a) {
  return ( _Sine(a) / _Cosine(a) );  //no finite result at 90, where cos is 0
}

References

AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Further information on addressing modes can be found in Section 2 of the AVR Instruction Set Manual
AVR108: Setup and Use of the LPM Instruction
Sine Lookup Table Generator


Arduino Inline Assembly Tutorial (Strings)


Addressing Modes

When loading and storing data, there are several addressing methods available for use. The Arduino’s AVR microcontroller supports 13 addressing modes for accessing the Program memory (Flash) and Data memory (SRAM, Register file, I/O Memory, and Extended I/O Memory). Six modes use “direct addressing”, and as such are very basic. The direct modes are generally inherent in the assembly instruction. The good news is that we covered all six in past tutorials, so there is no need to address them here (pun intended). Four additional modes incorporate indirect addressing, and will be the focus of this tutorial.

ADDRESSING MODES
*Register Direct, Single Register
*Register Direct, Two Registers
*IO Direct
*Data Direct
Data Indirect
Data Indirect w/Displacement
Data Indirect w/Pre-Decrement
Data Indirect w/Post-Increment
Program Memory Constant
Program Memory w/Post-Inc
*Direct Program
Indirect Program
*Relative Program
---
* denotes previously covered.

String Theory

Indirect addressing can be said to involve “pointers”. In the C language, the word “pointer” scares people. Hopefully we can calm these irrational fears, by coding an assortment of string routines using simple indirect addressing modes. By the end of this tutorial, we should have a good basis for a library of string functions.

X, Y AND Z

The six registers, r26 through r31, can be paired together and referenced using the letters X, Y and Z (the X register is r27:r26, the Y register is r29:r28, and the Z register is r31:r30). When combined, these registers are 16-bit “address pointers” for indirect addressing of the data space. In use, the X, Y and Z register pairs are loaded with an address of interest.

The three indirect address registers X, Y, and Z are defined as follows:

X register: r27 (high byte), r26 (low byte)
Y register: r29 (high byte), r28 (low byte)
Z register: r31 (high byte), r30 (low byte)

Speaking Indirectly

Previously we used the LDS instruction to load the value stored inside SRAM memory. For example, this code loads the number 42 into register r24:

  volatile uint8_t x=42;

  asm (
    "lds r24, (x) \n"
    : : : "r24" //tell the compiler r24 is overwritten
  );

But with the X, Y and Z pointer registers, we load the SRAM address into the register pairs (not the value stored there). Hence, we use the term “indirect addressing”. For example, the following code loads the “address” of the string (src) into the X register pair via the constraint “x” (src). When we want the first character of the string, in this case ‘a’, we load it “indirectly” from the X register pair (address) like so:

const char src[4] = "abc";

asm (
  "ld __tmp_reg__, X \n"
  : : "x" (src)
);

Fetch Me Z Pointer

Here is an example directly out of the AVR Inline Assembler Cookbook involving a true C pointer. In this code snippet, ptr is a pointer to the variable number. The ‘e’ constraint requests that ptr (which holds the address of the variable number) be loaded into one of the X, Y or Z register pairs, at the compiler’s choice.

Then, the value at the “address” inside the pointer register pair (or 0x11) is loaded into the temporary register (__tmp_reg__). It is incremented, and finally stored back through the pointer ptr into the variable number. At the completion of this inline code, number = 0x12, and of course, the value of ptr hasn’t changed.

volatile uint8_t number=0x11, *ptr = &number;

asm volatile(
  "ld __tmp_reg__, %a0 \n"
  "inc __tmp_reg__     \n"
  "st %a0, __tmp_reg__ \n"
  : : "e" (ptr) : "memory"
);

If you don’t have a good grasp of C pointers, this could be slightly confusing. It might be helpful to examine the assembler code produced to see exactly what is happening here (note the compiler selected the Z register pair for the pointer, ptr):

0000029E e0.91.00.01   LDS R30, 0x0100 //load ptr (0x0102) into Z
000002A0 f0.91.01.01   LDS R31, 0x0101
000002A2 00.80         LDD R0, Z+0     //load number into r0 (0x11)
000002A3 03.94         INC R0          //increment r0 to 0x12
000002A4 00.82         STD Z+0, R0     //store back into number (0x0102)

Address locations:
ptr:	0x0102	uint8_t* @0x0100
number:	0x11	uint8_t  @0x0102 
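
If the disassembly is hard to follow, it may help to see the same three steps in ordinary C. The function below is only an illustrative equivalent with a made-up name (increment_through), written so it can be run on a desktop compiler.

```c
#include <stdint.h>

// Same three steps as the inline code: load, increment, store
void increment_through(volatile uint8_t *ptr) {
  uint8_t tmp = *ptr;   // ld  __tmp_reg__, Z  (fetch the value ptr points to)
  tmp++;                // inc __tmp_reg__
  *ptr = tmp;           // st  Z, __tmp_reg__  (write it back through ptr)
}
```

The pointer itself is never modified, only the byte it points to, which is why number changes from 0x11 to 0x12 while ptr keeps its value.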

How Long is a String?

Now, on to strings. The following code calculates the length of the string src, not including the terminating NUL, or ‘\0’ character. It places the number of characters inside src into len:

const char src[4] = "abc";
const char *p = src; //Z: walks the string
volatile uint8_t len;

asm volatile (
  "1:                   \n"
  "ld   __tmp_reg__, Z+ \n"
  "tst  __tmp_reg__     \n"
  "brne 1b              \n"
  //Z points one character past the terminating NUL
  "subi %A1, 1          \n" //subtract post-increment
  "sbci %B1, 0          \n"
  "sub  %A1, %A2        \n" //length = end - start
  "sbc  %B1, %B2        \n"
  "mov  %0, %A1         \n" //save len (uint8_t)
  //Z is changed by the code, so it is an in/out ("+z") operand
  : "=r" (len), "+z" (p) : "x" (src) : "memory"
);

First, notice we hand the address of the string (src) to the assembler twice, once in the Z pair and once in the X pair. These constraints place the address of the string inside both the r31:r30 and r27:r26 register pairs. The reason for this will become clear in a moment.

Studying the code further, notice we load the first character of the string (pointed to by the “Z” register), placing it into the temp register (__tmp_reg__). Further, take note that the instruction has a plus sign ‘+‘ appended to the ‘Z’. This means the Z register is incremented by 1 after the load operation. It’s as if we combine two instructions into one! This is termed “Indirect Addressing with Post-Increment”.

Next, the temp register is tested (tst __tmp_reg__), and if it is NOT zero, execution will loop back and fetch another character. This repeats until finding the NUL character at the end of the string. This terminates the loop, however at this point, because of the post-increment operation, the Z register points one location past the end of the string.

We complete the routine by subtracting 1 for the extra post-increment, and then subtracting the start address of the string (still held in X) from the end address (in Z). The result of this math is the length of the string.
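
The same walk can be written as plain C, which makes the role of the two copies of the address easier to see. This is a sketch for a desktop compiler, not the tutorial’s code: ref_strlen is a hypothetical name, and like the assembly it keeps only the low 8 bits of the length.

```c
#include <stdint.h>

uint8_t ref_strlen(const char *start) {  // start plays the role of X
  const char *z = start;                 // z plays the role of Z
  char tmp;
  do {
    tmp = *z++;                          // ld __tmp_reg__, Z+
  } while (tmp != 0);                    // tst / brne
  z -= 1;                                // subi/sbci: undo the extra increment
  return (uint8_t)(z - start);           // sub/sbc, then keep the low byte
}
```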

Here is a variation that folds the “minus one” into the subtraction itself, but I will leave it to you to determine the details of the arithmetic (the embedded comment explains the math in cryptic fashion):

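const char src[4] = "abc";
const char *end = src;   //Z: advanced by the loop
const char *start = src; //X: complemented by the math
volatile uint8_t len;

asm volatile (
  "1:                  \n"
  "ld  __tmp_reg__, Z+ \n"
  "tst __tmp_reg__     \n"
  "brne 1b             \n"
  //len = Z - 1 - src = (-1 - src) + Z = ~src + Z
  "com %A2             \n"
  "com %B2             \n"
  "add %A2, %A1        \n"
  "adc %B2, %B1        \n"
  "mov %0, %A2         \n" //save len (uint8_t)
  //registers changed by the code are in/out ("+") operands
  : "=r" (len), "+z" (end), "+x" (start) : : "memory"
);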
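
For readers who want to check their answer: in two’s complement arithmetic, complementing a value gives its negative minus one (~x = -x - 1), so ~src + Z equals Z - 1 - src. The short C sketch below compares the two forms on 16-bit values, as the register pairs do. The function names are invented for this illustration.

```c
#include <stdint.h>

// Length the long way: undo the post-increment, then subtract the start
uint16_t len_subtract(uint16_t start, uint16_t z) {
  return (uint16_t)(z - 1 - start);
}

// Length the short way: complement the start (com), then add Z (add/adc)
uint16_t len_complement(uint16_t start, uint16_t z) {
  return (uint16_t)((uint16_t)~start + z);
}
```

With a string starting at 0x0100 and Z left at 0x0104 after the loop, both forms return 3.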

Zerox a String

Let’s do another one. This code copies the src string (including the terminating NUL character) to the array pointed to by dst. However, the strings may not overlap, and the dst string must be large enough to receive the copy. If the destination string is not large enough, anything could happen…

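const char src[4] = "abc";
char dst[4] = "   ";
const char *s = src; //Z: source pointer
char *d = dst;       //X: destination pointer

asm volatile (
  "1:                   \n"
  "ld   __tmp_reg__, Z+ \n" //load tmp reg w/src char
  "st   X+, __tmp_reg__ \n" //store tmp reg to dst 
  "tst  __tmp_reg__     \n" //check if 0 (end)
  "brne 1b              \n"
  //both pointers change and dst is written: "+" operands, "memory"
  : "+x" (d), "+z" (s) : : "memory"
);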

Wow, only 4 lines of inline assembly code can copy a string! As you can see, this is very straightforward and quite simple. It utilizes the X and Z register pairs, incorporating post-increment addressing with both.

Whose String is Bigger?

This code compares the two strings s1 and s2. It returns an integer (in result) less than, equal to, or greater than zero if s1 is found to be less than, to match, or be greater than s2. Again, it utilizes the X and Z register pairs, incorporating post-increment addressing with both. Hopefully you are starting to recognize the power of “indirect addressing” combined with “post-increment”.

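char s1[4] = "abc";
char s2[4] = "xyz";
const char *p1 = s1; //X: walks s1
const char *p2 = s2; //Z: walks s2
volatile int16_t result;

asm volatile (
  "1:                            \n"
  "ld   %A0, X+                  \n" //char from s1
  "ld   __tmp_reg__, Z+          \n" //char from s2
  "sub  %A0, __tmp_reg__         \n" //difference
  "cpse __tmp_reg__, __zero_reg__\n" //skip the branch at end of s2
  "breq 1b                       \n" //chars equal, keep going
  "sbc  %B0, %B0                 \n" //extend the sign into the high byte
  : "=&r" (result), "+x" (p1), "+z" (p2) : : "memory"
);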
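
The trickiest part of the comparison is how the 16-bit result is assembled from an 8-bit subtraction. The following plain C model (with the invented name ref_strcmp) follows the assembly step by step, so it can be experimented with on a desktop compiler; it is a sketch of the logic, not a drop-in replacement.

```c
#include <stdint.h>

int16_t ref_strcmp(const char *s1, const char *s2) {
  const uint8_t *x = (const uint8_t *)s1;  // X walks s1
  const uint8_t *z = (const uint8_t *)s2;  // Z walks s2
  uint8_t lo, tmp, borrow;
  do {
    lo  = *x++;                    // ld %A0, X+
    tmp = *z++;                    // ld __tmp_reg__, Z+
    borrow = (uint8_t)(lo < tmp);  // carry flag left behind by sub
    lo  = (uint8_t)(lo - tmp);     // sub %A0, __tmp_reg__
  } while (tmp != 0 && lo == 0);   // cpse skips breq at the NUL of s2
  // sbc %B0, %B0 makes the high byte 0xFF on borrow, 0x00 otherwise
  return borrow ? (int16_t)(lo - 256) : (int16_t)lo;
}
```

Comparing “abc” with “xyz” stops at the first character and yields ‘a’ - ‘x’ = -23.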

String Cat

This code appends the src string to the dst string overwriting the NUL character at the end of dst, and then adds a terminating NUL character. The strings may not overlap, and the dst string must have enough space for the result. This example is slightly more involved, but with a little study the details should become clear.

const char src[4] = "def";
char dst[7] = "abc";
char *d = dst;       //X: destination pointer
const char *s = src; //Z: source pointer

asm volatile (
  "1:                   \n" //find end of destination
  "ld   __tmp_reg__, X+ \n"
  "tst  __tmp_reg__     \n"
  "brne 1b              \n"
  "sbiw %A0, 1          \n" //undo post-increment
  "2:                   \n" //X==end of dst string
  "ld   __tmp_reg__, Z+ \n" //copy src to dst
  "st   X+, __tmp_reg__ \n"
  "tst  __tmp_reg__     \n" //test for 0 (end)
  "brne 2b              \n"
  //both pointers change, so they are in/out ("+") operands
  : "+x" (d), "+z" (s) : : "memory"
);

Charred String?

Finally, this code finds the first occurrence of the character val in the string s. Here “character” means “byte” (no wide or multi-byte characters allowed). The location of the matched character is placed in the pointer c, or c is set to NULL if the character is not found.

const char s[4] = "abc", *c;
const char *p = s; //Z: walks the string
volatile int16_t val = 0x63; //'c'

asm volatile (
  "1:            \n"
  "ld   %A0, Z+  \n" //fetch char from string
  "cp   %A0, %A2 \n" //compare char with val
  "breq 2f       \n" //found
  "tst  %A0      \n" //end of string (0)?
  "brne 1b       \n" //not at end
  "clr  %B0      \n" //not found, NULL pointer
  "rjmp 3f       \n"
  "2:            \n"
  "sbiw %A1, 1   \n" //undo post-increment
  "movw %A0, %A1 \n" //save pointer
  "3:            \n"
  //c is written before val is last read, so it is early-clobber ("&")
  : "=&x" (c), "+z" (p) : "r" (val) : "memory"
);
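
As with the earlier routines, the search can be mirrored in plain C to see what each branch does. The function name ref_strchr is made up for this sketch, which is meant for experimenting on a desktop compiler.

```c
#include <stddef.h>
#include <stdint.h>

const char *ref_strchr(const char *s, uint8_t val) {
  const char *z = s;              // Z walks the string
  for (;;) {
    uint8_t ch = (uint8_t)*z++;   // ld %A0, Z+
    if (ch == val)
      return z - 1;               // breq, then sbiw undoes the increment
    if (ch == 0)
      return NULL;                // tst finds the end: clr gives NULL
  }
}
```

Because the comparison with val happens before the end-of-string test, searching for the value 0 returns a pointer to the terminating NUL rather than NULL.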

References

C Programming and Strings
Further information on addressing modes can be found in Section 2 of the AVR Instruction Set Manual
AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
AVRLibc String Functions
