Arduino Inline Assembly Tutorial (Examples)

case study

As the final tutorial in this series, we present four example inline assembly functions for the arduino. Specifically, these cover the conversion of a byte to a hexadecimal string, SPI Mode 0 hardware transfer, SPI Mode 0 Bit-banging, and the C library atoi function. Do not take these functions as archetypical examples of high-quality coding practice or brilliantly efficient inline code. They are neither.

Most of the previous examples in this series were simple “snippets of code”, and as such gave a myopic view of inline assembly. The goal here is to show complete, working demonstrations of how to include inline assembly in a typical arduino program. Each example includes explanatory comments covering the key portions of code.

In addition to these examples, have a look at the Arduino Inline Assembly Blink Program.

Stringing Hexadecimals

The following code converts a byte value into a hexadecimal string. Notice that at the start of the code, the constraint #0 value (val) is temporarily saved in the r25 register. The function then converts the upper nibble. When that conversion is complete, the function loops back and converts the lower nibble. Note how the code uses the SREG T-bit to flag the first vs. second nibble.

void ByteToHexStr(uint8_t val, char *str) {
  asm (
    "set           \n" //flag first nibble
    "mov r25, %0   \n" //save val
    "swap %0       \n" //swap for correct nibble order
  "1:              \n"
    "andi %0, 0xf  \n" //mask a nibble
    "cpi  %0, 0xa  \n" //>=10?
    "brcc 2f       \n" //yes
    "subi %0, 0xd0 \n" //convert numeral (0-9), same as adding '0'
    "rjmp 3f       \n" //skip next
  "2:              \n"
    "subi %0, 0xc9 \n" //convert letter (A-F), same as adding 'A'-10
  "3:              \n"
    "st Z+, %0     \n" //put into string
    "brtc 4f       \n" //second nibble done?
    "clt           \n" //clear nibble flag
    "mov %0, r25   \n" //get lower nibble
    "rjmp 1b       \n" //repeat conversion
  "4:              \n" //exit (no terminating NUL is written)
    //val and Z are modified, so both are read-write operands;
    //"d" restricts val to r16-r31 as required by andi/cpi/subi
    : "+d" (val), "+z" (str) : : "r25", "memory"
  );
}
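For readers who want to check the assembly against something more familiar, here is a rough plain C sketch of the same conversion. The name byte_to_hex_ref is made up for this illustration and is not part of the original sketch. Like the assembly version, it writes exactly two characters and leaves the terminating NUL to the caller.

```c
#include <stdint.h>

// Hypothetical reference version of ByteToHexStr, for comparison only.
// Writes two hex characters (upper nibble first); does NOT write a NUL.
static void byte_to_hex_ref(uint8_t val, char *str) {
  uint8_t nibble = (uint8_t)(val >> 4);   // upper nibble first (the "swap")
  for (uint8_t pass = 0; pass < 2; pass++) {
    nibble &= 0x0f;                       // mask a nibble
    if (nibble < 10)
      str[pass] = (char)(nibble + '0');   // numeral 0-9
    else
      str[pass] = (char)(nibble + 'A' - 10); // letter A-F
    nibble = val;                         // second pass uses the lower nibble
  }
}
```

A caller would typically pass a three-byte buffer that is already zeroed, so the result can be printed as a normal C string.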

I SPI With My Little Eye…

Serial Peripheral Interface (SPI) is a synchronous serial data protocol used by microcontrollers for communicating with one or more peripheral devices, or for communication between two microcontrollers. The SPI standard is loose and each device implements it a little differently, which means you must pay close attention to the device’s datasheet when implementing the protocol. Generally speaking, there are four modes of transmission, defined by the clock phase and polarity.

Here are two versions of the SPI transfer function. The first of these programs incorporates the arduino hardware SPI. The second is a bit-bang version using different pins. More information on SPI can be found here and here.

SPI Mode 0 Hardware Transfer

static __attribute__ ((noinline)) uint8_t SpiXfer(uint8_t data) {
  asm volatile (
    "out  %1, %0          \n" //put data out SPDR register
    "nop                  \n" //pause
  "1:                     \n"
    "in   __tmp_reg__, %2 \n" //check xmit complete
    "sbrs __tmp_reg__, %3 \n"
    "rjmp 1b              \n"
    "in   %0, %1          \n" //get incoming data
    : "+r" (data) : "M" (_SFR_IO_ADDR(SPDR)),
    "M" (_SFR_IO_ADDR(SPSR)), "I" (SPIF)
  );

  return data;
}

SPI Bit-Bang

#define MOSI_PORT  PORTD
#define MOSI_BIT   PORTD5
#define MISO_PIN   PIND
#define MISO_BIT   PIND6
//assumption: the clock pin is not given in this listing,
//PORTD7 is only a placeholder, change it to suit your wiring
#define CLOCK_PORT PORTD
#define CLOCK_BIT  PORTD7

static __attribute__ ((noinline)) uint8_t SpiBitBang(uint8_t data) {
  register uint8_t tmp, i=8;
  //save and restore sreg because t-bit is utilized
  asm volatile (
    "in __tmp_reg__, __SREG__ \n"
  "1:               \n"
    "sbrs %0, 0x07  \n" //is output data bit high?
    "rjmp 2f        \n" //no
    "sbi  %3, %4    \n" //output a high bit
    "rjmp 3f        \n"
  "2:               \n"
    "cbi  %3, %4    \n" //output a low bit
  "3:               \n"
    "lsl  %0        \n" //shift to next bit
    "in   %1, %5    \n" //get input
    "tst  %1        \n" //anything here?
    "breq 4f        \n" //nope
    "bst  %1, %6    \n" //set t-bit if input bit is high
    "clr  %1        \n" //zeroize register
    "bld  %1, 0     \n" //set bit 0
    "or   %0, %1    \n" //or low bit with data for return value
  "4:               \n"
    "sbi  %7, %8    \n" //toggle clock bit high
    "nop            \n" //pause
    "cbi  %7, %8    \n" //toggle clock bit low
    "subi %2, 1     \n" //more bits?
    "brne 1b        \n" //do next bit
    "out __SREG__, __tmp_reg__ \n"
    //the loop counter is modified, so it is a read-write operand
    : "+r" (data), "=&r" (tmp), "+a" (i) :
    "I" (_SFR_IO_ADDR(MOSI_PORT)), "I" (MOSI_BIT),
    "I" (_SFR_IO_ADDR(MISO_PIN)), "I" (MISO_BIT),
    "I" (_SFR_IO_ADDR(CLOCK_PORT)), "I" (CLOCK_BIT)
  );

  return data;
}
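The shift-out/shift-in loop of the bit-bang routine can also be modelled in plain C, which makes it easy to test on a desktop machine. This is only a sketch: the spi_pins_t hooks, spi_bitbang_ref and the loopback helpers are all hypothetical names invented for this illustration. On real hardware the hooks would read and write the port registers.

```c
#include <stdint.h>

// Hypothetical pin hooks (not part of the original sketch).
typedef struct {
  void    (*write_mosi)(uint8_t level);
  uint8_t (*read_miso)(void);
  void    (*write_clock)(uint8_t level);
} spi_pins_t;

// Same sequence as the assembly loop: present the MSB, shift left,
// sample the input into bit 0, then pulse the clock.
static uint8_t spi_bitbang_ref(const spi_pins_t *pins, uint8_t data) {
  for (uint8_t i = 8; i != 0; i--) {
    pins->write_mosi((data & 0x80) != 0);
    data = (uint8_t)(data << 1);
    if (pins->read_miso())
      data |= 0x01;
    pins->write_clock(1);
    pins->write_clock(0);
  }
  return data;
}

// Loopback wiring for testing: MISO simply follows MOSI.
static uint8_t loop_wire;
static void    loop_write_mosi(uint8_t level)  { loop_wire = level; }
static uint8_t loop_read_miso(void)            { return loop_wire; }
static void    loop_write_clock(uint8_t level) { (void)level; }

static const spi_pins_t loopback_pins = {
  loop_write_mosi, loop_read_miso, loop_write_clock
};
```

With MISO tied to MOSI, every bit shifted out comes straight back in, so a transfer should return the byte that was sent.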

A Toy

Atoi is a function in the C programming language that converts a string into an integer numerical representation (atoi stands for ASCII to integer). It is included in the C standard library header file stdlib.h. It is prototyped as follows:

int atoi(const char *str);

The str argument is a string, represented by an array of characters, containing the characters of a signed integer number. The string must be null-terminated.

Here is the basic idea of the atoi function implemented in the C language:

int16_t atoi(char s[]) {
  uint8_t i, sign;
  int16_t n;
  //skip white space (but stop at the terminating NUL)
  for (i=0; s[i]!='\0' && s[i]<=' '; i++);
  sign = 0;
  if (s[i] == '-') {
    sign = 1;
    i++; //step over the sign
  }
  for (n=0; s[i]>='0' && s[i]<='9'; i++)
    n = 10*n + s[i] - '0';
  if (sign)
    return (-1*n);
  else
    return n;
}

Atoi Inline

Here is our implementation, which is only 64 bytes in length. By comparison, the arduino AVR libc atoi() function is 76 bytes long. This version is basically functionally equivalent; however, there are a few differences in detail. This function steps over all leading ASCII characters 0x2F and below, not just whitespace. That includes the NUL terminator, so the string must contain at least one character above 0x2F:

int16_t _atoi(const char *s) {
  //sign & c are scratch registers, written only inside the asm code
  register uint8_t sign, c;
  //force result into return registers
  register int16_t result asm("r24"); 
  asm (
    "ldi  %A0, 0x00         \n" //result = 0
    "ldi  %B0, 0x00         \n"
    "ldi  %3, 0x00          \n" //sign = FALSE

  "1:                       \n"
    "ld   %2, Z+            \n" //fetch char
    "cpi  %2, '-'           \n" //negative sign?
    "brne 2f                \n"
    "ldi  %3, 0x01          \n" //sign = TRUE

  "2:                       \n"
    "cpi  %2, '/' + 1       \n" //step over whitespace/garbage
    "brcc 3f                \n"
    "rjmp 1b                \n"

  "3:                       \n"
    "rjmp 5f                \n"

  "4:                       \n"
    "ldi  r23, 10           \n" //result *= 10
    "mul  %B0, r23          \n"
    "mov  %B0, r0           \n"
    "mul  %A0, r23          \n"
    "mov  %A0, r0           \n"
    "add  %B0, r1           \n"
    "clr  __zero_reg__      \n" //r1 trashed by mul
    "add  %A0, %2           \n" //result += new digit
    "adc  %B0, __zero_reg__ \n"
    "ld   %2, Z+            \n" //fetch next digit char
  "5:                       \n"
    "subi %2, '0'           \n" //convert char to 0-9
    "cpi  %2, 10            \n" //end of string?
    "brlo 4b                \n"

    "cpi  %3, 0             \n" //negative?
    "breq 6f                \n"
    "com  %B0               \n" //negate result
    "neg  %A0               \n"
    "sbci %B0, -1           \n"
  "6:                       \n"
    //everything the asm writes is declared as an output, r23 as a clobber
    : "=&r" (result), "+z" (s), "=&a" (c), "=&a" (sign)
    : : "r23", "memory"
  );

  return result;
}
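The multiply-by-ten step in the middle of the loop builds a 16-bit product out of two 8-bit MUL instructions. A plain C sketch of that arithmetic is shown here. The name mul10_bytes is hypothetical, and the sketch only models the byte handling; it is not a replacement for the inline code.

```c
#include <stdint.h>

// Hypothetical model of the "result *= 10" sequence in _atoi.
// Each 8x8 multiply yields a 16-bit product (r1:r0 on the AVR).
static uint16_t mul10_bytes(uint16_t value) {
  uint8_t lo = (uint8_t)(value & 0xff);
  uint8_t hi = (uint8_t)(value >> 8);
  uint16_t hi_prod = (uint16_t)(hi * 10u);   // mul %B0, r23
  uint16_t lo_prod = (uint16_t)(lo * 10u);   // mul %A0, r23
  // new high byte = low byte of hi*10 plus the carry-over byte of lo*10
  uint8_t new_hi = (uint8_t)((hi_prod & 0xff) + (lo_prod >> 8));
  uint8_t new_lo = (uint8_t)(lo_prod & 0xff);
  return (uint16_t)(((uint16_t)new_hi << 8) | new_lo);
}
```

As in the assembly, anything that overflows 16 bits is silently discarded, so the result is the product modulo 65536.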


While there are countless more topics to cover, and many more rabbit-holes to dive down, I believe I have covered enough of the basics in this series. I sure enjoyed researching and writing these tutorials. And, hopefully you gained a few insights into the funky world of arduino (AVR) inline assembly programming. Now, get inline with your programming!

[updated: 4.11.16]

Also available as a book, with greatly expanded coverage!


Arduino Inline Assembly Tutorial (Interrupts)


Pardon The Interruption

The previous tutorial covered the basics of writing inline functions. A close relative of the function is the Interrupt Service Routine (ISR), which is the topic here. Portions of this tutorial may pertain to functions as well.

As a warning, this tutorial assumes an understanding of the basic concepts of interrupts in general, and specifically interrupt handlers on the arduino (AVR μC). Hopefully, you have already written a few arduino interrupts in C, using the internal arduino functionality. If not, you may want to study some of the links given in the reference section of this tutorial before continuing.

The Deck is Stacked

Basic knowledge of the stack is essential to understanding functions and interrupt handlers. The basic purpose of the stack is to support function calls and interrupts. Whenever a program makes a function call or whenever an interrupt occurs, the stack is used to store critical information which will be restored upon completion of the function or interrupt. Additional information on the stack can be found here and here.

First and foremost, during a function call or interrupt, the hardware places the return address on the stack. The saving and restoration of the return address is accomplished transparently by the CALL and RET instructions (RETI in the case of an interrupt). It is not necessary to perform any special instruction(s) to make this occur.

Second, if any “call-saved” registers will be “clobbered” inside the function, these registers are “pushed” onto the stack. In the case of an interrupt service routine, all of the registers used inside the ISR (and always the temporary and zero registers, r0 and r1) get pushed onto the stack. Additionally, during an ISR the SREG is saved and restored.

Finally, if the compiler deems it necessary, space is reserved for any local variables on the stack. Many times the compiler will place local variables into specific registers, and therefore doesn’t use the stack for temporary storage.

Here is an example of how the compiler uses the stack to store local variables inside of a function. This is sometimes referred to as “setting up a stack frame.” We will reserve 16 bytes for a character array (note: unrelated code has been removed for the purpose of clarity). The compiler performs all of this stack manipulation for us behind the scenes, so to speak:

void example(void) {
  char buffer[16]; //space will be reserved on the stack
  //do something here. . .
}

This results in the following machine code:

  PUSH r28          ;save registers on stack 
  PUSH r29 
  IN   r28, SPL     ;get stack pointer    
  IN   r29, SPH   
  SBIW r28, 16      ;reserve 16 bytes space on stack
                    ;the stack grows downward, hence the subtraction
  OUT  SPH, r29     ;update new stack pointer
  OUT  SPL, r28 
;do something here. . .
  ADIW r28, 16      ;remove the 16 bytes from the stack
  OUT  SPH, r29     ;restore stack pointer
  OUT  SPL, r28 
  POP  r29          ;restore registers from stack
  POP  r28 

Upon return from the interrupt or function, all the preserved values are restored, or “popped” from the stack. During the prologue and epilogue code, the order of the push and pop instructions is critical: registers must be popped in the reverse of the order in which they were pushed.

Interrupt Before and After

Below, I wrote a very basic interrupt routine that simply increments a byte so we can examine the prologue and epilogue code generated by the compiler:

//here is an example ISR coded in C:
volatile uint8_t a;
ISR(INT0_vect) {
  a++;
}
//this is the generated assembly code:
0000027F 1f.92                PUSH r1       ;save r1 register
00000280 0f.92                PUSH r0       ;save r0 register
00000281 0f.b6                IN r0, SREG   ;get status register
00000282 0f.92                PUSH r0       ;save sreg 
00000283 11.24                CLR r1        
00000284 8f.93                PUSH r24      ;save r24 register
;increment byte (a) here
00000285 80.91.c3.01          LDS r24, (a) 
00000287 8f.5f                SUBI r24, 0xFF     
00000288 80.93.c3.01          STS (a), r24 
0000028A 8f.91                POP r24       ;restore r24 register
0000028B 0f.90                POP r0        ;restore status register
0000028C 0f.be                OUT SREG, r0
0000028D 0f.90                POP r0        ;restore r0 register
0000028E 1f.90                POP r1        ;restore r1 register
0000028F 18.95                RETI          ;return from interrupt

As you can see, the meat of the ISR is only 10 bytes long. However, together the prologue and epilogue add another 24 bytes, for a total of 34. It might be possible to save a few bytes and program cycles by carefully writing your own ISR prologue and epilogue. GCC has a provision that allows this, which will be covered later.

We Interrupt This Program to Blink

It is now time to write an interrupt handler, or ISR in inline assembler. I can’t think of a better example than to adapt the basic Blink sketch to use the Timer #1 Overflow interrupt. Please note, because this code alters the Timer #1 registers, it will render any use of the arduino Timer #1 as nonfunctional (i.e. analogWrite pins 9 & 10, the Servo Library, etc.).

Handle It

The first order of business is to write the interrupt handler for the Timer #1 Overflow. This is the routine that is called when the Timer #1 counter (TCNT1) rolls over from 0xffff to zero. Our ISR is very basic, and as always, it should be kept as short as possible. Inside the handler we perform two functions:

  • Reset the counter (TCNT1) allowing the next overflow to reoccur at 1 second intervals.
  • Toggle the LED.
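The reload value that appears in the listings as TCNT_BASE (0x0bdc) follows from simple arithmetic: the counter must overflow after one second's worth of timer ticks. The sketch below works it out in plain C. It assumes a 16 MHz system clock (as on an Arduino Uno) together with the 256 prescaler; the helper name timer1_preload is hypothetical.

```c
#include <stdint.h>

// Hypothetical helper: value to load into a 16-bit up-counter so that it
// overflows after period_ms milliseconds. Assumes the tick count for the
// period fits in 16 bits (true for 16 MHz / 256 and 1000 ms).
static uint16_t timer1_preload(uint32_t f_cpu_hz, uint32_t prescaler,
                               uint32_t period_ms) {
  uint32_t ticks_per_second = f_cpu_hz / prescaler;      // 62500 at 16 MHz / 256
  uint32_t ticks = ticks_per_second * period_ms / 1000u; // ticks in one period
  return (uint16_t)(65536u - ticks);                     // count up from here
}
```

With 16000000, 256 and 1000 as inputs, the timer ticks 62500 times per second, and 65536 - 62500 = 3036, which is 0x0bdc.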

An ISR can be coded using inline assembler just as in a “C Stub Function”, relying upon the compiler to insert the necessary prologue and epilogue code. I suggest you use this stub technique at first before graduating to writing the entire “naked” ISR. Here is a stub version of our ISR:

#define TCNT_BASE   0x0bdc
#define TCNT_BASE_H (((TCNT_BASE)>>8)&0xff)
#define TCNT_BASE_L ((TCNT_BASE)&0xff)

ISR(TIMER1_OVF_vect) {
  asm (
    //reload TCNT1 counter for 1sec interrupt
    //(16-bit timer registers: the high byte must be written first)
    "ldi r24, %4           \n"
    "std Z+1, r24          \n" //TCNT1H
    "ldi r24, %3           \n"
    "st  Z, r24            \n" //TCNT1L
    //toggle LED
    "in   __tmp_reg__, %0  \n" //read port
    "ldi  r24, %1          \n" //LED bit mask
    "eor  __tmp_reg__, r24 \n" //toggle LED bit
    "out  %0, __tmp_reg__  \n" //write port
    : : "I" (_SFR_IO_ADDR(PORTB)), "I" (_BV(PORTB5)),
    "z" (_SFR_MEM_ADDR(TCNT1)), "M" (TCNT_BASE_L), "M" (TCNT_BASE_H) : "r24"
  );
}

Having said all that, the boilerplate code the compiler inserts is not always the most efficient, and many times inadequate. For these reasons, and for the academic exercise, we will also select the “ISR_NAKED” attribute when defining the ISR. This gives us full control over all of the code inside the ISR. Full control is a good thing:


Eleven instructions encompass the prologue and epilogue, which is more than the code required for the main purpose of the interrupt. Notice inside the handler, we utilize 3 registers, r24, r30 and r31. This means we need to preserve the content of these registers since the interrupt could be triggered at any time, even precisely when these registers may be in use. Additionally we need to preserve the status register (SREG). The SREG holds critical information on the state of the program when the interrupt fired. Neglecting to preserve any of this information would probably cause the program to crash.

Don’t forget to include the terminating RETI instruction also. By comparison, this ISR_NAKED version is 10 bytes shorter than the “Stub” version:

#include "k328p.h"

#define TCNT_BASE   0x0bdc
#define TCNT_BASE_H (((TCNT_BASE)>>8)&0xff)
#define TCNT_BASE_L ((TCNT_BASE)&0xff)

ISR(TIMER1_OVF_vect, ISR_NAKED) {
  asm (
    "push r31           \n" //save r31, r30, r24 contents
    "push r30           \n"
    "push r24           \n"
    //preserve SREG
    "in   r24, __SREG__ \n"
    "push r24           \n"

    //reload TCNT1 counter for 1sec interrupt
    //(16-bit timer registers: the high byte must be written first)
    "clr r31            \n"
    "ldi r30, %2        \n"
    "ldi r24, %4        \n"
    "std Z+1, r24       \n" //TCNT1H
    "ldi r24, %3        \n"
    "st  Z, r24         \n" //TCNT1L
    //toggle LED
    "in   r30, %0       \n" //read port
    "ldi  r31, %1       \n" //LED bit mask
    "eor  r30, r31      \n" //toggle LED bit
    "out  %0, r30       \n" //write port

    //restore old SREG
    "pop  r24           \n"
    "out  __SREG__, r24 \n"
    //restore r24, r30, r31
    "pop  r24           \n"
    "pop  r30           \n"
    "pop  r31           \n"
    "reti               \n"
    : : "I" (kPORTB), "I" (_BV(PORTB5)), 
    "M" (kTCNT1), "M" (TCNT_BASE_L), "M" (TCNT_BASE_H)
  );
}

The initialization code required for the Timer #1 interrupt (setting the prescaler, loading the counter and enabling the overflow interrupt) is completely contained inside the Setup function. Obviously, it is not necessary to write this in inline assembly; it’s just good practice:

#include "k328p.h"

#define TCNT_BASE   0x0bdc
#define TCNT_BASE_H (((TCNT_BASE)>>8)&0xff)
#define TCNT_BASE_L ((TCNT_BASE)&0xff)

void setup() {
  uint16_t TNCTBase = TCNT_BASE;

  asm (
    "cli                  \n" //disable global interrupts 
    "sbi %0, %1           \n" //pinMode(13, OUTPUT);

    //set 256 prescale (CS12)
    "st  Z+, __zero_reg__ \n" //zero TCCR1A
    "ldi r24, %3          \n"
    "st  Z+, r24          \n" //TCCR1B = CS12
    "st  Z, __zero_reg__  \n" //zero TCCR1C
    //load counter for 1sec interrupt (high byte must be written first)
    "ldi r30, %4          \n"
    "std Z+1, %B5         \n" //TCNT1H
    "st  Z, %A5           \n" //TCNT1L
    //enable overflow interrupt
    "ldi r30, %6          \n"
    "ldi r24, %7          \n"
    "st  Z, r24           \n" //TIMSK1

    "sei                  \n" //enable global interrupts 
    : : "I" (_SFR_IO_ADDR(DDRB)), "I" (PORTB5),
    "z" (_SFR_MEM_ADDR(TCCR1A)), "I" (_BV(CS12)),
    "M" (kTCNT1), "r" (TNCTBase),
    "M" (kTIMSK1), "I" (_BV(TOIE1)) : "r24", "memory"
  );
}

void loop() { }

Finally, we are introducing a new header file “k328p.h” (contents listed below) which contains all of the IO register defines in such a way that we can use them inside our inline assembly routines. The definitions in this file use the same standard ATMEL mnemonics for the IO registers with the letter ‘k’ prepended. Each is the low byte of the IO register address, which allows greater flexibility in inline assembler code when referring to the IO registers (for example, when loading a pointer register for the LD/ST instructions). A close examination of the above code will reveal the method of use.

Arduino IO Register Defines

//k328p.h - definitions for ATmega328P
#ifndef _k328P_H_
#define _k328P_H_ 

//standard registers 
//0-0x1f: bit addressable
//0-0x3f: IN/OUT compatible 
//0-0x3f: add 0x20 when using LD/ST
#define kPINB   0x03
#define kDDRB   0x04
#define kPORTB  0x05
#define kPINC   0x06
#define kDDRC   0x07
#define kPORTC  0x08
#define kPIND   0x09
#define kDDRD   0x0A
#define kPORTD  0x0B

#define kTIFR0  0x15
#define kTIFR1  0x16
#define kTIFR2  0x17

#define kPCIFR  0x1B
#define kEIFR   0x1C
#define kEIMSK  0x1D
#define kGPIOR0 0x1E
#define kEECR   0x1F
//end bit addressable

#define kEEDR   0x20
#define kEEAR   0x21
#define kEEARL  0x21
#define kEEARH  0x22
#define kGTCCR  0x23
#define kTCCR0A 0x24
#define kTCCR0B 0x25
#define kTCNT0  0x26
#define kOCR0A  0x27
#define kOCR0B  0x28

#define kGPIOR1 0x2A
#define kGPIOR2 0x2B
#define kSPCR   0x2C
#define kSPSR   0x2D
#define kSPDR   0x2E

#define kACSR   0x30

#define kMCUSR  0x34
#define kMCUCR  0x35

#define kSPMCSR 0x37

#define kSPL    0x3D
#define kSPH    0x3E
#define kSREG   0x3F
//end IN/OUT compatible

//extended registers begin
#define kWDTCSR 0x60
#define kCLKPR  0x61

#define kPRR    0x64

#define kOSCCAL 0x66

#define kPCICR  0x68
#define kEICRA  0x69

#define kPCMSK0 0x6B
#define kPCMSK1 0x6C
#define kPCMSK2 0x6D
#define kTIMSK0 0x6E
#define kTIMSK1 0x6F
#define kTIMSK2 0x70

#define kADC    0x78
#define kADCW   0x78
#define kADCL   0x78
#define kADCH   0x79
#define kADCSRA 0x7A
#define kADCSRB 0x7B
#define kADMUX  0x7C

#define kDIDR0  0x7E
#define kDIDR1  0x7F

#define kTCCR1A 0x80
#define kTCCR1B 0x81
#define kTCCR1C 0x82

#define kTCNT1  0x84
#define kTCNT1L 0x84
#define kTCNT1H 0x85
#define kICR1   0x86
#define kICR1L  0x86
#define kICR1H  0x87
#define kOCR1A  0x88
#define kOCR1AL 0x88
#define kOCR1AH 0x89
#define kOCR1B  0x8A
#define kOCR1BL 0x8A
#define kOCR1BH 0x8B

#define kTCCR2A 0xB0
#define kTCCR2B 0xB1
#define kTCNT2  0xB2
#define kOCR2A  0xB3
#define kOCR2B  0xB4
#define kASSR   0xB6

#define kTWBR   0xB8
#define kTWSR   0xB9
#define kTWAR   0xBA
#define kTWDR   0xBB
#define kTWCR   0xBC
#define kTWAMR  0xBD

#define kUCSR0A 0xC0
#define kUCSR0B 0xC1
#define kUCSR0C 0xC2

#define kUBRR0  0xC4
#define kUBRR0L 0xC4
#define kUBRR0H 0xC5
#define kUDR0   0xC6
//end extended registers

//0-0x3f for LD/ST instructions
#define k2PINB   0x23
#define k2DDRB   0x24
#define k2PORTB  0x25
#define k2PINC   0x26
#define k2DDRC   0x27
#define k2PORTC  0x28
#define k2PIND   0x29
#define k2DDRD   0x2A
#define k2PORTD  0x2B
#define k2TIFR0  0x35
#define k2TIFR1  0x36
#define k2TIFR2  0x37
#define k2PCIFR  0x3B
#define k2EIFR   0x3C
#define k2EIMSK  0x3D
#define k2GPIOR0 0x3E
#define k2EECR   0x3F
#define k2EEDR   0x40
#define k2EEAR   0x41
#define k2EEARL  0x41
#define k2EEARH  0x42
#define k2GTCCR  0x43
#define k2TCCR0A 0x44
#define k2TCCR0B 0x45
#define k2TCNT0  0x46
#define k2OCR0A  0x47
#define k2OCR0B  0x48
#define k2GPIOR1 0x4A
#define k2GPIOR2 0x4B
#define k2SPCR   0x4C
#define k2SPSR   0x4D
#define k2SPDR   0x4E
#define k2ACSR   0x50
#define k2MCUSR  0x54
#define k2MCUCR  0x55
#define k2SPMCSR 0x57
#define k2SPL     0x5D
#define k2SPH     0x5E
#define k2SREG    0x5F

#endif //_k328P_H_


Arduino Interrupts
Newbie’s Guide to AVR Interrupts
PJRC Guide to Interrupts
AVR Libc Information on Interrupts
University of Maryland, BC, C Programming and Embedded Systems Course, Interrupt Information
AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Mixing C and Assembly Language
ATMEL ATmega328P Datasheet


Arduino Inline Assembly Tutorial (Functions)

func machine

At first consideration, the topic of functions seems simple and trite. Just discuss how to “CALL” and “RETURN” to and from a function, right? However, there are many subtopics involved as well. For example, passing and returning parameters, prologue and epilogue code, the stack frame and mixing assembly and C are topics deserving of separate tutorials. Hopefully, we can do all of these justice, but first, the basics…

Convert Snippet Into a Function

How about a simple demonstration of turning an inline code snippet into a function? In a previous tutorial on indirect addressing, several inline pieces of code were developed to perform various string operations. One such operation determined the character length of a string. The code is below.

String Length, Sounds Like strlen

const char src[4] = "abc";
volatile uint8_t len;
asm (
  "_loop:               \n"
  "ld   __tmp_reg__, Z+ \n"
  "tst  __tmp_reg__     \n"
  "brne _loop           \n"
  //Z points one character past the terminating NUL
  "subi %A1, 1          \n" //subtract post-increment
  "sbci %B1, 0          \n"
  "sub  %A1, %A2        \n" //length = end - start
  "sbc  %B1, %B2        \n"
  "mov  %0, %A1         \n" //save len (uint8_t)
  : "=r" (len) : "z" (src), "x" (src)
);

While this code could easily be included “inline”, it certainly would be more useful if it was defined as a general function. This would make it much easier to use throughout a program, and also reduce overall program size by incorporating only one instance of the code. So how is this accomplished?

Stub Your Code

The official Cookbook refers to this technique as a “C Stub Function,” which is nothing more than a function definition containing only inline assembler code. Typically, in a “C Stub Function”, the function parameters and local variables define the data used in, and the value returned (if any) by the function. This is an easy method to pass data to/from the inline function, without the need to understand the underlying details of how it’s done. Therefore, we eliminate the necessity of writing additional supporting code.

The above “string length” snippet easily becomes a full-blown function, _strlen(), using this method. Notice the transformed function below receives a string (s) as a parameter, and returns the length, which is defined as a local variable. We refer to these same variables in the input and output constraints:

inline uint8_t _strlen(const char *s) {
  uint8_t len;

  asm (
    "1:                  \n" //local label, safe if inlined more than once
    "ld  __tmp_reg__, Z+ \n"
    "tst __tmp_reg__     \n"
    "brne 1b             \n"
    //len = Z - 1 - src = (-1 - src) + Z = ~src + Z
    "com %A2             \n"
    "com %B2             \n"
    "add %A2, %A1        \n"
    "adc %B2, %B1        \n"
    "mov %0, %A2         \n" //save len (uint8_t)
    : "=r" (len) : "z" (s), "x" (s)
  );

  return len;
}

Here is a look at the code generated by the above C-Stub Function (notice the compiler/assembler doesn’t need to generate a lot of “stub” code):

  MOVW r30, r24
  MOVW r26, r24
  LD r0,Z+
  TST r0
  BRNE loop
  COM r26
  COM r27
  ADD r26, r30
  ADC r27, r31
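The COM/ADD/ADC sequence relies on the identity Z - 1 - src = ~src + Z, which holds in two's complement arithmetic. A plain C sketch of the same trick follows; strlen_ref is a hypothetical name used only here, and uintptr_t stands in for the 16-bit pointer registers.

```c
#include <stdint.h>

// Hypothetical model of the _strlen arithmetic: walk one past the NUL,
// then compute (~start + end), which equals end - 1 - start.
static uint8_t strlen_ref(const char *s) {
  uintptr_t start = (uintptr_t)s;
  const char *z = s;
  while (*z++ != '\0') { }        // like "ld r0, Z+" until r0 == 0
  uintptr_t end = (uintptr_t)z;   // one past the terminating NUL
  return (uint8_t)(~start + end); // truncated to 8 bits, as in the asm
}
```

Because the result is truncated to a uint8_t, strings longer than 255 characters wrap around, just as they would in the inline version.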

Placing a Call

An extension to the “C Stub Function” technique is calling another C function from inside inline assembly code. The following bit of code demonstrates the CALL instruction. This instruction “calls” a subroutine located within the program memory (if we remember to properly define the function to avoid linkage errors). The C Stub Function even handles the return (RET) for us.

An additional detail required here is the need to encapsulate the “called” function inside an extern “C” { } declaration (see the example below). The C++ extern “C” linkage specification prevents the function name from becoming “mangled”. A mangled name would prevent the linker from locating the called function.

extern "C" {
  void foo() {
    // do something here...
  }
}

void test() {
  asm (
    "call foo \n"
  );
}

Playing Catch

Next, we present a basic example of passing and returning parameters to and from C Stub Functions. The purpose of the following code is to convert an upper case ASCII character into its lower case equivalent. We’ve created two functions here, _isupper and _tolower, which validate the input character and then perform the conversion.

Take a look at the code below.

Notice, the first thing _tolower does is call the function, _isupper. Since _tolower hasn’t done anything yet, the C Stub Function simply hands the input character (c), the parameter to _tolower, directly on to the _isupper function. Neat!

Next, _isupper checks the character to confirm it’s actually an upper case character. If so, it returns the character, otherwise it returns a zero. Upon returning to _tolower, the next instruction which is executed is “tst r24”, a test of the contents of register r24. If register #24 (r24) is not zero, the character is converted and the function returns.

Again, notice the use of the C++ keyword “extern C {}” here:

extern "C" {
  unsigned char _isupper(unsigned char c) {
    //bind variable to a specific register r18
    register unsigned char ch asm("r18");
    asm (
      "mov  %1, %0 \n" //save input
      "subi %1, 'A'\n" //subtract 0x41
      "brmi 2f     \n" //branch if minus
      "subi %1, 26 \n" //26 letters
      "brpl 2f     \n" //branch if plus
      "ret         \n" //c==upper, return
      "2: clr  %0  \n" //false
      //ch is written by the asm, so it is declared as an output
      : "+r" (c), "=&r" (ch)
    );
    return c;
  }
}

char _tolower(unsigned char c) {
  asm (
    "call _isupper \n" //validate char
    "tst r24       \n" //0 = not alpha char
    "breq 1f       \n" //not alpha char
    "ori %0, 0x20  \n" //make lower
    "1:            \n"
    //"d" keeps c in r16-r31 for ori; r18 is used by _isupper
    : "+d" (c) : : "r18"
  );
  return c;
}
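The range check in _isupper (two subtractions and two sign tests) can be a little hard to follow at first. Here is a plain C sketch that mirrors it step by step. The names isupper_ref and tolower_ref are hypothetical. Note that, like the assembly pair, the sketch returns zero for anything that is not an upper case letter.

```c
// Hypothetical models of _isupper and _tolower, for comparison only.
static unsigned char isupper_ref(unsigned char c) {
  unsigned char t = (unsigned char)(c - 'A'); // subi %1, 'A'
  if (t & 0x80) return 0;                     // brmi: below 'A'
  t = (unsigned char)(t - 26);                // subi %1, 26
  if (!(t & 0x80)) return 0;                  // brpl: above 'Z'
  return c;                                   // upper case, return it
}

static unsigned char tolower_ref(unsigned char c) {
  unsigned char r = isupper_ref(c); // result comes back in "r24"
  if (r != 0)
    r |= 0x20;                      // ori: set bit 5 to make lower case
  return r;
}
```

Setting bit 5 works because each ASCII lower case letter is exactly 0x20 above its upper case counterpart.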

Insider Information

Why did function _tolower choose to test register #24 (r24)? The above two functions relied on “insider” information when using register r24. These routines knew that an 8-bit, byte-sized value is passed to and from a function via the r24 register. The C Compiler always passes function arguments and returns values in specific register locations. Knowing these locations is essential to writing efficient inline assembly code, especially when interfacing with the C language.

This is a good time to review the data type sizes: a char is 8 bits, an int is 16 bits, a long is 32 bits, a long long is 64 bits, floats are 32 bits, and pointers are 16 bits (function pointers are word addresses). Arguments are allocated left to right, starting in register r25 and descending through register r8. All arguments are aligned to start in even-numbered registers (odd-sized arguments, like char, have one free register above them). For example, a single 8-bit value is passed via the r24 register (r25 is left unused), a single 16-bit value is passed via the r25:r24 register pair, and a 32-bit value is passed via the r25:r24:r23:r22 register combination.

Return values are expected to be passed in a similar fashion. An 8-bit value is passed via r24, a 16-bit value in r25:r24, and a 32-bit value in r25:r24:r23:r22. An 8-bit return value may be zero/sign-extended to 16 bits by the called function.
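The allocation rule for arguments is mechanical enough to express in a few lines of C. The sketch below is a simplified model of the rule as described above: it only reports the lowest register of each argument, and it ignores the detail of what happens to later arguments once one has spilled to the stack. The helper name assign_arg_register is hypothetical.

```c
// Hypothetical helper: returns the lowest-numbered register assigned to an
// argument of size_bytes, or -1 if it no longer fits in r8..r25.
// *next_reg must be set to 26 (one past r25) before the first argument.
static int assign_arg_register(int *next_reg, unsigned size_bytes) {
  int rounded = (int)((size_bytes + 1u) & ~1u); // round size up to even
  if (*next_reg - rounded < 8)
    return -1;                                  // would go below r8
  *next_reg -= rounded;
  return *next_reg;                             // argument starts here
}
```

For a call such as f(char a, int b, long c), the model places a in r24, b in r23:r22 and c in r21:r20:r19:r18.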

What’s the Use of a Register?

Function “call-used” registers are r18-r27, and r30-r31. Any, or all of these registers may be allocated by the compiler for local data. However, we may use them freely in assembler subroutines. Calling C subroutines can clobber any of them, and the caller is responsible for saving and restoring before and after use.

Function “call-saved” registers are r2-r17, and r28-r29. They may also be allocated by the compiler for local data, but C subroutines leave them unchanged. Assembler subroutines are responsible for saving and restoring any of these registers, if changed. The Y register pair (r29:r28) is used as a frame pointer (pointing to local data placed on the stack) if necessary.

Fixed registers, r0, and r1 are never allocated by the compiler for local data. The temporary register, r0 can be clobbered by any C code (except interrupt handlers which save it), and may be used freely. The zero register is r1, and assumed to be always zero in any C code. It may be used for other purposes within a piece of assembler code, but must then be cleared after use (clr r1). Interrupt handlers save and clear r1 on entry, and restore r1 on exit (in case it was non-zero).


AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Mixing C and Assembly Language


Arduino Inline Assembly Tutorial (Tables)


Often, the fastest way to compute something on an arduino is to not compute it at all.


For example, trigonometric functions are costly operations and can abruptly slow your application to a crawl. And many times, the result is computed with far more precision than needed for the situation. Most often you just want the periodic wave-like characteristics of sine or cosine, which can easily be approximated. With a trigonometric function, it’s easy to substitute a lookup-table populated with pre-computed values at discrete steps. If your program can handle the loss of precision, yet requires as much speed as possible, this alternative is a good option.

The Ivy League Microcontroller

Since the arduino’s ATMEL AVR μC is based upon the modified Harvard architecture, the data and program instructions are stored in different memory. The program instructions are stored in flash, while data is stored in SRAM. These separate pathways are primarily implemented to enhance performance, but they also prohibit executing program instructions from data memory. Yet, paradoxical as it may seem, data is allowed to be stored inside program memory (see this information on the use of the PROGMEM attribute).

Placing a table in SRAM is simple, and shouldn’t present problems for an inline programmer (especially at our stage!). Consequently, in this tutorial, we will store a table inside program memory.

Did He Say Frogmen?

Placing the table into program memory is easy. It is accomplished via a C language floating-point array, incorporating the special keyword, “PROGMEM”. PROGMEM instructs the compiler to place this data into flash memory:

static const float PROGMEM SineTable[91] = {
  0.0, 0.017452, 0.034899, 0.052336, 0.069756,
. . .
  0.997564, 0.998630, 0.999391, 0.999848, 1.0
};

Previously, when accessing SRAM (data memory) we used the LDS instruction. However, accessing program memory requires the use of the LPM instruction. LPM is the mnemonic for Load from Program Memory, and it loads a data byte from flash program memory into a register.


The Flash program memory is organized as 16-bit words, while the registers and SRAM are organized as 8-bit bytes. The Z-register is used to access the program memory. This 16-bit register pair is used as a 16-bit pointer to the program memory. The 15 most significant bits select the word address in program memory, and the least significant bit selects the low or high byte of that word. Because of this, the word address is multiplied by two before it is put in the Z-register. However, the good news is that in the code presented below all of these details are transparent.
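To make the address arithmetic concrete, here is a minimal C sketch of how a word address becomes the byte address held in Z. The function name ZFromWordAddress is hypothetical and exists only for illustration; it relies on the least significant bit of Z selecting the low (0) or high (1) byte of the addressed word.

```c
#include <stdint.h>

//Hypothetical helper: build the byte address LPM expects in Z
//from a flash word address and a low/high byte selector.
uint16_t ZFromWordAddress(uint16_t word_addr, uint8_t high_byte) {
  return (uint16_t)((word_addr << 1) | (high_byte & 1));
}
```

For instance, word address 0x0100 maps to Z = 0x0200 for its low byte and Z = 0x0201 for its high byte.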

Table Legs

The function below first limits the input value to the range 0-90. If the input is out-of-range, it returns the floating-point Not-A-Number (NAN) value. It then multiplies the input by 4 to produce an index into our table. We multiply by four because our table is populated with floating point numbers, each of which is 4-bytes long. The index is simply added to the (PROGMEM) address of the start of the table. The function finishes by retrieving the 4-byte float value and returning.

Note, floating point support inside the inline assembler is scarce. In this function we treat the float variable transparently, like any 32-bit variable. We get away with this because we’re not performing any operation on the value.

float _Sine(uint16_t angle) {
  float tmp;

  asm (
    //validate angle >= 0 && angle <= 90
    "cpi  %A1, 90+1 \n" 
    "cpc  %B1, __zero_reg__ \n"
    "brcc _NaN      \n" //out of range

     //calculate table index
    "lsl  %A1       \n" //float is 4 bytes wide
    "rol  %B1       \n" //index = angle * 4
    "lsl  %A1       \n"
    "rol  %B1       \n"

    //add index to start of SineTable
    "add  r30, %A1  \n" 
    "adc  r31, %B1  \n"

    //get sine value (4-bytes)
    "lpm  %A0, Z+   \n" 
    "lpm  %B0, Z+   \n"
    "lpm  %C0, Z+   \n"
    "lpm  %D0, Z    \n"
    "rjmp _SineEnd  \n" //exit (a bare ret would skip the C epilogue)
    //return NAN
    "_NaN:              \n" 
    "ldi  %A0, lo8(%3)  \n" //NAN = 0x7fc00000
    "ldi  %B0, hi8(%3)  \n"
    "ldi  %C0, hlo8(%3) \n"
    "ldi  %D0, hhi8(%3) \n"
    "_SineEnd:          \n"
    : "=a" (tmp) : "d" (angle), "z" (SineTable), "F" (NAN)
  );
  return tmp;
}
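For readers who want to see the same steps without assembly, here is a rough portable C model of the lookup. It is only a sketch under stated assumptions: the table is a coarse stand-in held in ordinary memory with one entry per 10 degrees (values taken from the full table), a float is assumed to be 4 bytes, and the names CoarseSine and CoarseLookup are made up for this illustration.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

//Assumption: a coarse table in ordinary memory, one entry per 10 degrees
//(0, 10, ... 90), standing in for the 91-entry PROGMEM table.
static const float CoarseSine[10] = {
  0.0f, 0.173648f, 0.34202f, 0.5f, 0.642788f,
  0.766044f, 0.866025f, 0.939693f, 0.984808f, 1.0f
};

float CoarseLookup(uint16_t step) {
  float result;
  if (step > 9)
    return NAN;                            //out of range, like the _NaN branch
  uint16_t offset = (uint16_t)(step << 2); //two left shifts: step * 4 bytes
  //add the byte offset to the table start and copy 4 bytes untouched
  memcpy(&result, (const uint8_t *)CoarseSine + offset, sizeof result);
  return result;
}
```

The byte-wise copy mirrors the four LPM instructions: the value is never treated as a float until it is returned.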

The Full Table

#include <avr/pgmspace.h>

//max error ~0.017452 [91*4=364 bytes]
static const float PROGMEM SineTable[91] = {
  0.0, 0.017452, 0.034899, 0.052336, 0.069756, 0.087156, 
  0.104528, 0.121869, 0.139173, 0.156434, 0.173648, 0.190809, 
  0.207912, 0.224951, 0.241922, 0.258819, 0.275637, 0.292372, 
  0.309017, 0.325568, 0.34202, 0.358368, 0.374607, 0.390731, 
  0.406737, 0.422618, 0.438371, 0.45399, 0.469472, 0.48481, 
  0.5, 0.515038, 0.529919, 0.544639, 0.559193, 0.573576, 
  0.587785, 0.601815, 0.615661, 0.62932, 0.642788, 0.656059, 
  0.669131, 0.681998, 0.694658, 0.707107, 0.71934, 0.731354, 
  0.743145, 0.75471, 0.766044, 0.777146, 0.788011, 0.798636, 
  0.809017, 0.819152, 0.829038, 0.838671, 0.848048, 0.857167, 
  0.866025, 0.87462, 0.882948, 0.891007, 0.898794, 0.906308, 
  0.913545, 0.920505, 0.927184, 0.93358, 0.939693, 0.945519, 
  0.951057, 0.956305, 0.961262, 0.965926, 0.970296, 0.97437, 
  0.978148, 0.981627, 0.984808, 0.987688, 0.990268, 0.992546, 
  0.994522, 0.996195, 0.997564, 0.99863, 0.999391, 0.999848, 1.0
};

What Are You Doing For Me?

After a cursory comparison test between the table _Sine() function and the arduino floating point sin() function, we can draw some basic conclusions. Even though the table itself consumes 364 (91 x 4 = 364) bytes of flash (on top of the function code), the arduino library sin() function (and its required peripheral floating point support) uses approximately 900 bytes more flash memory.

However, saving space wasn’t necessarily the goal of this exercise; speed was the primary concern. Comparing 1,000 calls to both functions yielded an average duration of 121.7 µs per sin() vs. 2.92 µs for _Sine(). One final but obvious concern is the precision of the result. This will need to be evaluated to determine if it is sufficient for your application.

Bigger, Better, More

Various modifications can expand and improve the accuracy of the table code, but are beyond the scope of this tutorial. However, here are some basic ideas.

The obvious method is to expand the table to decrease the step interval. Another technique is to incorporate interpolation similar to the following pseudo code:

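float _iSine(float angle) {               //angle in degrees, 0.0 to 90.0
  uint16_t x1 = (uint16_t)angle;          //whole degrees (floor for angle >= 0)
  if (x1 >= 90) return 1.0;               //last entry has no upper neighbor
  float y1 = pgm_read_float(&SineTable[x1]);
  float y2 = pgm_read_float(&SineTable[x1 + 1]);
  return y1 + ( y2 - y1 ) * ( angle - x1 ); //linear interpolation
}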

For full 0-360 angle coverage, do something like:

float Sine(uint16_t i) {
  while (i > 359)
    i -= 360;

  if (i < 90)
    return _iSine(i);
  else if (i < 180)
    return _iSine(180 - i);     //sin(180 - x) = sin(x)
  else if (i < 270)
    return -_iSine(i - 180);    //sin(180 + x) = -sin(x)
  else
    return -_iSine(360 - i);    //sin(360 - x) = -sin(x)
}

Another easy expansion with the sine table is to calculate cosine and tangent values:

float _Cosine(uint16_t a) {
  return _Sine( 90 - a );  //cos(a) = sin(90 - a), valid for 0-90
}
float _Tangent(uint16_t a) {
  return ( _Sine(a) / _Cosine(a) );  //division by zero at a = 90
}


AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Further information on addressing modes can be found in Section 2 of the AVR Instruction Set Manual
AVR108: Setup and Use of the LPM Instruction
Sine Lookup Table Generator


Code (error) updated: 1/25/2017


Arduino Inline Assembly Tutorial (Strings)


Addressing Modes

When loading and storing data, there are several addressing methods available for use. The arduino’s AVR microcontroller supports 13 addressing modes for accessing the Program memory (Flash) and Data memory (SRAM, Register file, I/O Memory, and Extended I/O Memory). Six modes use “direct addressing”, and as such are very basic. The direct modes are generally inherent in the assembly instruction. The good news is that we covered all six in past tutorials, so there is no need to address them here (pun intended). Four additional modes incorporate indirect addressing, and will be the focus of this tutorial.

*Register Direct, Single Register
*Register Direct, Two Registers
*IO Direct
*Data Direct
Data Indirect
Data Indirect w/Displacement
Data Indirect w/Pre-Decrement
Data Indirect w/Post-Increment
Program Memory Constant
Program Memory w/Post-Inc
*Direct Program
Indirect Program
*Relative Program
* denotes previously covered.

String Theory

Indirect addressing can be said to involve “pointers”. In the C language, the word “pointer” scares people. Hopefully we can calm these irrational fears, by coding an assortment of string routines using simple indirect addressing modes. By the end of this tutorial, we should have a good basis for a library of string functions.


The six registers, r26 through r31 can be paired together and referenced using the letters X, Y and Z. (the X register is r27:r26, the Y register is r29:r28, and the Z register is r31:r30). When combined, these registers are 16-bit “address pointers” for indirect addressing of the data space. In use, the X,Y and Z register pairs are loaded with an address of interest.

The three indirect address registers X, Y, and Z are defined as described here:


Speaking Indirectly

Previously we used the LDS instruction to load the value stored inside SRAM memory. For example, this code loads the number 42 into register r24:

  volatile uint8_t x=42;

  asm (
    "lds r24, (x) \n"
    : : : "r24"
  );

But with the X, Y and Z pointer registers, we load the SRAM address into the register pairs (not the value stored there). Hence, we use the term “indirect addressing”. For example, the following code loads the “address” of the string, (src) into the X register pair via the constraint, “x” (src). When we want the first character of the string, or as in this case, ‘a’, we load it “indirectly” from the X register pair (address) like so:

const char src[4] = "abc";

asm (
  "ld __tmp_reg__, X \n"
  : : "x" (src)
);

Fetch Me Z Pointer

Here is an example directly out of the AVR Inline Assembler Cookbook involving a true C-pointer. In this code snippet, ptr is a pointer to variable number. The ‘e’ constraint requests that ptr (which is the address of variable number) be loaded into one of the X, Y or Z register pairs, at the compiler’s choice.

Then, the value at the “address” inside the pointer register pair (or 0x11) is loaded into the temporary register (__tmp_reg__). It is incremented, and finally stored back through the pointer ptr into the variable number. At the completion of this inline code, number = 0x12, and of course, the value of ptr hasn’t changed.

volatile uint8_t number=0x11, *ptr = &number;

asm volatile(
  "ld __tmp_reg__, %a0 \n"
  "inc __tmp_reg__     \n"
  "st %a0, __tmp_reg__ \n"
  : : "e" (ptr) : "memory"
);

If you don’t have a good grasp of C pointers, this could be slightly confusing. It might be helpful to examine the assembler code produced to see exactly what is happening here (note the compiler selected the Z register pair for the pointer, ptr):

0000029E e0.91.00.01   LDS R30, 0x0100 //load address into ptr (0x0102)
000002A0 f0.91.01.01   LDS R31, 0x0101
000002A2 00.80         LDD R0, Z+0     //load number into r0 (0x11)
000002A3 03.94         INC R0          //increment r0 to 0x12
000002A4 00.82         STD Z+0, R0     //store back into number (0x0102)

Address locations:
ptr:	0x0102	uint8_t* @0x0100
p:	0x11	uint8_t  @0x0102 
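In plain C, the same load, increment, and store sequence might be sketched as follows. IncrementThrough is a hypothetical name used only for this illustration; each statement is annotated with the instruction it stands in for.

```c
#include <stdint.h>

//C equivalent of: ld tmp, Z / inc tmp / st Z, tmp
void IncrementThrough(volatile uint8_t *ptr) {
  uint8_t tmp = *ptr;  //ld:  fetch the value the pointer refers to
  tmp++;               //inc: modify the copy held in the register
  *ptr = tmp;          //st:  write it back through the same pointer
}
```

The pointer itself is never modified; only the byte it refers to changes.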

How Long is a String?

Now, onto strings. The following code calculates the length of the string src, not including the terminating NUL, or ‘\0’ character. It places the number of characters inside src into len:

const char src[4] = "abc";
volatile uint8_t len;

asm (
  "_loop:               \n"
  "ld   __tmp_reg__, Z+ \n"
  "tst  __tmp_reg__     \n"
  "brne _loop           \n"
  //Z points one character past the terminating NUL
  "subi %A1, 1          \n" //subtract post-increment
  "sbci %B1, 0          \n"
  "sub  %A1, %A2        \n" //length = end - start
  "sbc  %B1, %B2        \n"
  "mov  %0, %A1         \n" //save len (uint8_t)
  : "=r" (len) : "z" (src), "x" (src) : "memory"
);

First, notice we define input constraints for the string (src) twice, using both the Z and X pairs. These constraints place the address of the string inside both the r31:r30 and r27:r26 register pairs. The reason for this will become clear in a moment.

Studying the code further, notice we load the first character of the string (pointed to by the “Z” register), placing it into the temp register (__tmp_reg__). Further, take note that the instruction has a plus sign ‘+‘ appended to the ‘Z’. This means the Z register is incremented by 1 after the load operation. It’s as if we combine two instructions into one! This is termed “Indirect Addressing with Post-Increment”.

Next, the temp register is tested (tst __tmp_reg__), and if it is NOT zero, execution will loop back and fetch another character. This repeats until finding the NUL character at the end of the string. This terminates the loop, however at this point, because of the post-increment operation, the Z register points one location past the end of the string.

We complete the routine by subtracting 1 for the extra post-increment, and then subtracting the start address of the string from the ending address. The result of this math is the length of the string.
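The walk-through above can be modeled in portable C roughly as follows. This is a sketch, not the routine itself: LengthOf is a hypothetical name, and the two local pointers play the roles of the Z and X register pairs.

```c
#include <stdint.h>

//Hypothetical C model of the length routine
uint8_t LengthOf(const char *src) {
  const char *z = src;        //Z register pair: walks the string
  const char *x = src;        //X register pair: remembers the start
  char tmp;
  do {
    tmp = *z++;               //ld __tmp_reg__, Z+
  } while (tmp != 0);         //tst / brne
  z -= 1;                     //undo the extra post-increment
  return (uint8_t)(z - x);    //length = end - start, low byte only
}
```

As in the assembly version, only the low byte of the difference is kept, so strings longer than 255 characters would wrap.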

Here is a slightly more efficient version, but I will leave it to you to determine the details of the shortened arithmetic (the embedded comment explains the math in cryptic fashion):

const char src[4] = "abc";
volatile uint8_t len;

asm (
  "_loop:              \n"
  "ld  __tmp_reg__, Z+ \n"
  "tst __tmp_reg__     \n"
  "brne _loop          \n"
  //len = Z - 1 - src = (-1 - src) + Z = ~src + Z
  "com %A2             \n"
  "com %B2             \n"
  "add %A2, %A1        \n"
  "adc %B2, %B1        \n"
  "mov %0, %A2         \n" //save len (uint8_t)
  : "=r" (len) : "z" (src), "x" (src)
);
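If the shortened arithmetic looks suspicious, the identity behind it can be checked on any machine. In two's complement, ~x equals -x - 1, so ~start + end is the same as end - 1 - start (modulo 65536). The two function names here are hypothetical, chosen only for this comparison.

```c
#include <stdint.h>

//Long form: subtract the post-increment, then subtract the start address
uint16_t LengthLong(uint16_t start, uint16_t end) {
  return (uint16_t)(end - 1u - start);
}

//Short form: complement the start address (com), then add the end (add/adc)
uint16_t LengthShort(uint16_t start, uint16_t end) {
  return (uint16_t)((uint16_t)~start + end);
}
```

For "abc" stored at 0x0100, Z ends at 0x0104 (one past the NUL), and both forms yield 3.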

Zerox a String

Let’s do another one. This code copies the src string (including the terminating NUL character) to the array pointed to by dst. However, the strings may not overlap, and the dst string must be large enough to receive the copy. If the destination string is not large enough, anything could happen…

const char src[4] = "abc";
char dst[4] = "   ";

asm (
  "_copy:               \n"
  "ld   __tmp_reg__, Z+ \n"  //load tmp reg w/src char
  "st   X+, __tmp_reg__ \n" //store tmp reg to dst 
  "tst  __tmp_reg__     \n" //check if 0 (end)
  "brne _copy           \n"
  : : "x" (dst) , "z" (src) : "memory"
);

Wow, only four lines of inline assembly code can copy a string! As you can see, this is straightforward and quite simple. It utilizes the X and Z register pairs, incorporating post-increment addressing with both.
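A C model of the same loop, with the hypothetical name CopyString, might look like this. The store happens before the test, which is why the terminating NUL ends up in the destination too.

```c
//Hypothetical C model of the copy loop
void CopyString(char *dst, const char *src) {
  char tmp;
  do {
    tmp = *src++;     //ld __tmp_reg__, Z+
    *dst++ = tmp;     //st X+, __tmp_reg__
  } while (tmp != 0); //tst / brne: the NUL is stored before the loop ends
}
```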

Whose String is Bigger?

This code compares the two strings s1 and s2. It returns an integer (in result) less than, equal to, or greater than zero if s1 is found to be less than, to match, or be greater than s2. Again, it utilizes the X and Z register pairs, incorporating post-increment addressing with both. Hopefully you are starting to recognize the power of “indirect addressing” combined with “post-increment”.

char s1[4] = "abc";
char s2[4] = "xyz";
volatile int16_t result;

asm (
  "_compare:                     \n"
  "ld   %A0, X+                  \n"
  "ld   __tmp_reg__, Z+          \n"
  "sub  %A0, __tmp_reg__         \n"
  "cpse __tmp_reg__, __zero_reg__\n"
  "breq _compare                 \n"
  "sbc  %B0, %B0                 \n"
  : "=&r" (result) : "x" (s1) , "z" (s2)
);
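The flag handling in this routine is compact, so a C model may help. This is a sketch with a hypothetical name, CompareStrings: the low byte of the result is the byte difference, and the high byte is filled from the borrow, which is what SBC of a register with itself produces.

```c
#include <stdint.h>

//Hypothetical C model of the compare routine
int16_t CompareStrings(const char *s1, const char *s2) {
  uint8_t a, b, diff;
  do {
    a = (uint8_t)*s1++;            //ld %A0, X+
    b = (uint8_t)*s2++;            //ld __tmp_reg__, Z+
    diff = (uint8_t)(a - b);       //sub: low byte of the result
  } while (b != 0 && diff == 0);   //cpse skips the breq once s2 hits NUL
  uint8_t high = (a < b) ? 0xff : 0x00; //sbc %B0, %B0: borrow becomes the sign
  return (int16_t)(uint16_t)((high << 8) | diff);
}
```

With "abc" and "xyz", the first bytes differ ('a' is 0x61, 'x' is 0x78), so the loop stops immediately and the result is negative.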

String Cat

This code appends the src string to the dst string overwriting the NUL character at the end of dst, and then adds a terminating NUL character. The strings may not overlap, and the dst string must have enough space for the result. This example is slightly more involved, but with a little study the details should become clear.

const char src[4] = "def";
char dst[7] = "abc";

asm (
  "_dst:                \n" //find end of destination
  "ld   __tmp_reg__, X+ \n"
  "tst  __tmp_reg__     \n"
  "brne _dst            \n"
  "sbiw %A0, 1          \n" //undo post-increment
  "_src:                \n" //X==end of dst string
  "ld   __tmp_reg__, Z+ \n"  //copy src to dst
  "st   X+, __tmp_reg__ \n"
  "tst  __tmp_reg__     \n" //test for 0 (end)
  "brne _src            \n"
  : : "x" (dst), "z" (src) : "memory"
);
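The two phases, finding the end of dst and then copying src, can be modeled in C like this. AppendString is a hypothetical name; the comments map each statement to the instruction it represents.

```c
//Hypothetical C model of the append routine
void AppendString(char *dst, const char *src) {
  char tmp;
  do {
    tmp = *dst++;     //ld __tmp_reg__, X+ : scan for the end of dst
  } while (tmp != 0); //tst / brne _dst
  dst -= 1;           //sbiw: step back onto the NUL
  do {
    tmp = *src++;     //ld __tmp_reg__, Z+
    *dst++ = tmp;     //st X+, __tmp_reg__ : overwrite the NUL, keep copying
  } while (tmp != 0); //tst / brne _src
}
```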

Charred String?

Finally, this code finds the first occurrence of the character val in the string s. Here “character” means “byte” (no wide or multi-byte characters allowed). A pointer to the matched character is placed in c, or NULL if the character is not found.

const char s[4] = "abc", *c;
volatile int16_t val = 0x63;

asm (
  "_loop:        \n"
  "ld   %A0, Z+  \n" //fetch char from string
  "cp   %A0, %A2 \n" //compare char with val
  "breq _found   \n"
  "tst  %A0      \n" //end of string (0)?
  "brne _loop    \n" //not at end
  "clr  %B0      \n" //not found, NULL pointer
  "rjmp _end     \n"
  "_found:       \n"
  "sbiw %A1, 1   \n" //undo post-increment
  "movw %A0, %A1 \n" //save pointer
  "_end:         \n"
  : "=x" (c) : "z" (s), "r" (val)
);
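Here is a C model of the search, under the hypothetical name FindChar. Because the comparison comes before the end-of-string test, searching for 0 finds the terminator itself.

```c
#include <stddef.h>

//Hypothetical C model of the character search
const char *FindChar(const char *s, char val) {
  char ch;
  do {
    ch = *s++;          //ld %A0, Z+
    if (ch == val)      //cp / breq _found
      return s - 1;     //undo post-increment, return the match
  } while (ch != 0);    //tst / brne _loop
  return NULL;          //not found
}
```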


C Programming and Strings
Further information on addressing modes can be found in Section 2 of the AVR Instruction Set Manual
AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
AVRLibc String Functions


Arduino Inline Assembly Tutorial (Branching)


Loop and Branch

Branching is a fundamental feature of computers. For example, branching allows a computer to repeat instruction sequences. One of the most basic forms of repetition is a “loop”, and the loop is probably the most widely used programming technique.

There are two types of branches, unconditional and conditional. An unconditional branch is basically a JUMP. We briefly discussed jumping here. In this tutorial we will examine loops utilizing conditional branches.

Not Equal

Basic loops in the C language use the “for” construct. Here is a very simple countdown for loop in C that repeats 8 times:

for (i=8; i>0; i--) {
  //repeat instructions located here...
}

Let’s duplicate the above C language loop with inline assembler:

volatile uint8_t counter = 8;

asm (
    "1:         \n"
    "nop        \n" //repeating code goes here
    "dec %0     \n" //counter is the only operand, so it is %0
    "brne 1b    \n"
    : : "r" (counter)
);

First, please note, we’re not performing any function(s) inside the body of this loop, with the exception of killing time with a NOP. Obviously, in an actual loop we would replace the NOP with some sort of functional code.

Second, notice how we decrement the counter value and then use the instruction BRNE (BRanch if Not Equal) to loop back to the label ‘1‘ location. We add ‘b’ to the label to inform the assembler we are branching to a label located “before” the current instruction. If the label was after the branch instruction, we would be branching “forward”, and use “1f” instead of “1b“.

BRNE is a conditional relative branch. It tests the Zero flag (Z) of the Status Register (SREG) and branches if the Z flag is cleared. When the counter value is decremented to 0, the Zero flag is set, and the branch doesn’t occur. Therefore, this loop executes 8 times.

Finally, the counter value (counter) must be pre-loaded with the number of loop iterations. If counter was zero when this inline code starts to execute, the first decrement wraps it around to 255 and the loop will iterate 256 times! Which leads us to ask, how can we get a loop to repeat even more times than that?
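The behavior of an 8-bit decrement-and-branch loop is easy to model in C. In this sketch (CountPasses is a hypothetical name) the body runs before the test, exactly like the DEC/BRNE pair, so a start value of 0 wraps to 255 on the first decrement and the model counts 256 passes.

```c
#include <stdint.h>

//Hypothetical model: count how often the loop body runs
uint16_t CountPasses(uint8_t counter) {
  uint16_t passes = 0;
  do {
    passes++;             //loop body (the NOP)
    counter--;            //dec: 0 wraps around to 255
  } while (counter != 0); //brne: repeat while the Zero flag is clear
  return passes;
}
```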


One possible solution is to nest loops. Placing a loop inside of another loop would allow up to 255 * 255, or  65,025 iterations. Alternatively, more iterations of the loop could use a 16-bit or 32-bit integer as the loop counter (see our delay example at the end of this post).

Here is an example of a nested loop:

volatile uint8_t outer = 0xff;
volatile uint8_t inner = 0xff;

asm (

  "1:                  \n" //outer loop
  "mov __tmp_reg__, %0 \n" //(re)load inner loop counter

  "2:                  \n" //inner loop
                           //perform stuff here...

  "dec __tmp_reg__     \n" //DEC inner loop counter
  "brne 2b             \n" //branch to '2' if tmp_reg not 0

  "dec %1              \n" //DEC outer loop counter
  "brne 1b             \n" //branch to '1' if "outer" not 0

  : : "r" (inner), "r" (outer)
);

More Branches

Prior to a branch instruction, there must be an operation which sets a flag in the Status Register (SREG). In the above examples we used the DEC instruction to perform this operation. However, there are several methods for setting flags; for example, INC, DEC, ADD and SUB work just as well. We can also conduct simple comparisons and tests. CP, CPI, CPC and TST are explicitly designed for this purpose, and CP and CPI are probably the two most widely used. (CPSE also compares, but instead of setting flags it skips the next instruction when its operands are equal.)

As you may have guessed, there are many more branch instructions besides just BRNE. We can branch if equal (BREQ), if same or higher (BRSH), if lower (BRLO), if carry is clear (BRCC), or carry set (BRCS). Also, one must be careful to use the appropriate instruction when comparing signed (vs. unsigned) values.

As noted above, it’s important to realize that comparisons of signed and unsigned values require different branch instructions. Unsigned values are between 0 and 255, and signed values are between -128 and 127. The following two tables summarize the comparisons and branch instructions needed to accomplish a desired result. Notice that a few of the comparisons require the use of two branch instructions.

unsigned branches

The comparison instructions operate like a subtraction (without saving the result). Therefore, these tables are valid for subtraction operations as well.

signed branches
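To see why the choice of branch matters, consider the same pair of bytes read both ways. This small C sketch uses two hypothetical helper names; the comments name the AVR branch each comparison corresponds to (BRLT, branch if less than for signed values, is a real AVR instruction, although it has not appeared in this tutorial so far). The cast to int8_t assumes the usual two's complement behavior.

```c
#include <stdint.h>

//Same two bytes, two interpretations
int UnsignedLower(uint8_t a, uint8_t b) {
  return a < b;                  //what BRLO tests after CP a, b
}

int SignedLess(uint8_t a, uint8_t b) {
  return (int8_t)a < (int8_t)b;  //what BRLT tests after CP a, b
}
```

The byte 0x80 is 128 when unsigned but -128 when signed, so compared against 0x01 the two helpers disagree.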

Practically Speaking

Let’s look at a few practical, yet simple examples (isspace, isdigit, and isalpha functions). These are all standard C library functions defined inside the “ctype.h” header file. They all take an ASCII integer char as input and return an integer. Take note however, our perverted versions accept and return a char-sized parameter.

A Space

The standard C function isspace(c) returns true for the standard “white-space” characters listed below:
' ' (0x20)

'\t' (0x09)

'\n' (0x0a)

'\v' (0x0b)

'\f' (0x0c)

'\r' (0x0d)

Our isspace function only detects a space character, ‘ ‘ (0x20). In this code we demonstrate the use of the BREQ instruction:

uint8_t _isspace(unsigned char c) {
  uint8_t result;

  asm (
    "cpi  %1, ' '  \n"
    "breq 1f       \n" //branch if equal
    "clr  %0       \n" //false
    "rjmp 2f       \n"
    "1: ldi  %0, 1 \n" //true
    "2:            \n" //exit
    : "=d" (result) : "d" (c) //cpi and ldi need r16-r31
  );

  return result;
}

Going Digital

The standard C function isdigit(c) returns true for the characters ‘0’ (0x30) through ‘9’ (0x39). Unlike a number parser, it does not accept the negative sign ‘-‘ (0x2d), and neither does our function. This code demonstrates the use of the BRMI and BRPL instructions:

uint8_t _isdigit(unsigned char c) {
  uint8_t result;

  asm (
    "subi %1, 0x30 \n"
    "brmi 2f       \n" //branch if minus
    "subi %1, 10   \n"
    "brpl 2f       \n" //branch if plus
    "ldi  %0, 1    \n" //true
    "rjmp 3f       \n"
    "2: clr  %0    \n" //false
    "3:            \n" //exit
    : "=d" (result) : "d" (c) //subi and ldi need r16-r31
  );

  return result;
}
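The two-subtraction trick can be verified with a C model. In this sketch, IsDigitModel is a hypothetical name, and testing bit 7 of the 8-bit result stands in for the Negative flag that BRMI and BRPL examine.

```c
#include <stdint.h>

//Hypothetical C model of the digit test
uint8_t IsDigitModel(uint8_t c) {
  c = (uint8_t)(c - 0x30);   //subi %1, 0x30
  if (c & 0x80) return 0;    //brmi: bit 7 set means "minus"
  c = (uint8_t)(c - 10);     //subi %1, 10
  if (!(c & 0x80)) return 0; //brpl: still "plus" means c was above '9'
  return 1;                  //0-9 went negative on the second subtraction
}
```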

Alphabet Soup

The standard C function isalpha(c) returns true for the characters ‘a’ (0x61) through ‘z’ (0x7a), and ‘A’ (0x41) through ‘Z’ (0x5a). Our function does the same. This code demonstrates the use of the BREQ and BRPL instructions:

uint8_t _isalpha(unsigned char c) {
  uint8_t result;

  asm (
    "sbrs %1, 6     \n" //check bit 6
    "rjmp 1f        \n" //bit 6 is clear, cannot be alpha
    "andi %1, ~0x60 \n" //clear bit 5&6
    "breq 1f        \n" //0 cannot be alpha
    "subi %1, 27    \n" //26 letters
    "brpl 1f        \n" //>z cannot be alpha
    "ldi %0, 1      \n" //true
    "rjmp 2f        \n"
    "1: clr  %0     \n" //false
    "2:             \n" //exit
    : "=d" (result) : "d" (c) //andi, subi and ldi need r16-r31
  );

  return result;
}
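The bit manipulation here can also be modeled in C. This is a sketch with a hypothetical name, IsAlphaModel, and it assumes 7-bit ASCII input (0 to 127), which is the only range it has been checked against.

```c
#include <stdint.h>

//Hypothetical C model of the letter test (7-bit ASCII input assumed)
uint8_t IsAlphaModel(uint8_t c) {
  if (!(c & 0x40)) return 0;   //sbrs / rjmp: bit 6 clear, not a letter
  c &= (uint8_t)~0x60;         //andi: drop bits 5 and 6 (folds case)
  if (c == 0) return 0;        //breq: '@' and '`' both map to 0
  c = (uint8_t)(c - 27);       //subi %1, 27
  if (!(c & 0x80)) return 0;   //brpl: 27 or more is past 'z'
  return 1;                    //1-26 went negative: a letter
}
```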


Previously we showed how to use numbers as labels. There are other valid methods. A problem arises when a macro that contains a label is used more than once: the label name is duplicated. In such cases you may make use of the special pattern %=, which is replaced by a number unique to each asm statement. The following code was taken from avr/include/iomacros.h:

#define loop_until_bit_is_clear(port,bit) \
  asm (                                   \
    "L_%=:       \n"                      \
    "sbic %0, %1 \n"                      \
    "rjmp L_%=   \n"                      \
    : : "I" (_SFR_IO_ADDR(port)), "I" (bit))

For example, when used for the first time, L_%= may be translated to L_1404, the next usage might create L_1405 or whatever. In any case, the labels become unique.

Another option is to use actual names for the labels. The above example would then look like:

#define loop_until_bit_is_clear(port,bit) \
  asm (                                   \
    "start:      \n"                      \
    "sbic %0, %1 \n"                      \
    "rjmp start  \n"                      \
    : : "I" (_SFR_IO_ADDR(port)), "I" (bit))

Wait, One More Thing

Finally, an assembly language tutorial series without a home-grown version of a delay routine would be like a swimming pool without water. Here is where we present our version.

The following code produces approximately a 1 second delay on a 16 MHz Arduino. The included C language MACROs allow adapting this code for other delay periods. Take note, this code does not load the MSB of the 32-bit “DELAY_VALUE“, so a delay longer than ~5.24 seconds (DELAY_VALUE larger than 0x00ffffff, at 5 clock cycles per pass) would require a slight modification (by now, you should be able to handle that). Also, take note of the label name we use:

#define CLOCK_MHZ       16UL
#define DELAY_LENGTH_MS 1000UL
#define DELAY_VALUE     (uint32_t)((CLOCK_MHZ * 1000UL * DELAY_LENGTH_MS) / 5UL)

asm (
"loop:          \n"
  "subi %A0, 1  \n"
  "sbci %B0, 0  \n"
  "sbci %C0, 0  \n" //note: 1 byte short of full 32-bits
  "brcc loop    \n"
  : : "d" (DELAY_VALUE)
);
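The arithmetic behind DELAY_VALUE can be worked through in C. This sketch assumes 5 clock cycles per pass (one each for SUBI and the two SBCIs, two for the taken BRCC), which matches the divisor in the macro; the function names are hypothetical.

```c
#include <stdint.h>

#define CYCLES_PER_PASS 5UL  //subi + sbci + sbci (1 each) + brcc taken (2)

//Loop count for a given clock (MHz) and delay (ms)
uint32_t DelayValue(uint32_t clock_mhz, uint32_t delay_ms) {
  return (clock_mhz * 1000UL * delay_ms) / CYCLES_PER_PASS;
}

//Longest delay a 24-bit counter can express, in whole milliseconds
uint32_t MaxDelayMs(uint32_t clock_mhz) {
  return (uint32_t)((0x00ffffffULL * CYCLES_PER_PASS) / (clock_mhz * 1000UL));
}
```

At 16 MHz a 1000 ms delay needs a count of 3,200,000 (0x30D400), which fits in three bytes, and the three-byte counter tops out at a little over 5.2 seconds.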


AVR Branch Instructions
AVR 8-bit Instruction Set
AVR-GCC Inline Assembler Cookbook
Extended Asm – Assembler Instructions with C Expression Operands
Accurate Delay Code Example

[updated: 3/28/2016]


Towards a More General digitalRead

2 words

The arduino digitalRead function is a nice bit of code. However, it takes more than a cursory glance to determine exactly how it performs (see Yak Shaving). It also compiles into approximately 222 bytes of code, and it’s slow in comparison to a simplified inline routine:

void setup() {

void loop() { }


volatile uint8_t status;
void setup() {
  asm (
    "in __tmp_reg__, __SREG__ \n"
    "cli         \n"                     
    "ldi %0, 1   \n" 
    "sbis %1, %2 \n" //skip next if pin high
    "clr %0      \n"
    "out __SREG__, __tmp_reg__ \n"
    : "=d" (status) : "I" (_SFR_IO_ADDR(PINB)), "I" (PINB5)
  );
}

void loop() { }

The simplified function occupies only 16 bytes. However, since this is inline code vs. a function, every time a digital read is required in your program, it will consume another 16 bytes. Yet, it would take about 13 inline routines to consume about the same amount of code the standard arduino function uses.

I doubt you’ll write a program that uses over 13 read operations. However, you might.

Another minor factor with the inline routine is that the port and pin must be known at compile time. The port and pin are “hard coded” into the assembly instruction, like so:

000002A1 1d.9b  SBIS 0x03, 5	

In the above disassembly, 0x03 is the port (PINB), and 5 is the pin bit (PINB5). Not really a big issue, unless you want to address the pin and port location programmatically. In that regard, you can’t use the simplified routine, or any of the C language MACROs floating around the Internet similar to:

#define bit_get(p,m) ((p) & (m))

But here is a generic alternative, which occupies approximately 34 bytes. Note it must be called using a pointer to the PIN (&PINB), otherwise the compiler will emit incorrect code:

//call like so:
//uint8_t status = dRead(&PINB, PINB5);

__attribute__ ((noinline)) uint8_t dRead(volatile uint8_t *port, uint8_t pin) {
  uint8_t result, mask=1;

  asm (
    "movw  r30, %1 \n" //port reg addr in Z
  "1:              \n"
    "cpi  %2, 0    \n" //loop until pin==0
    "breq 2f       \n" //leave loop
    "lsl  %3       \n" //shift (mask) left 1 position
    "dec  %2       \n" //decrement loop counter
    "rjmp 1b       \n" //repeat
  "2:              \n"
    "in   __tmp_reg__, __SREG__ \n" //preserve sreg
    "cli           \n" //disable interrupts
    "ld   r18, Z   \n" //fetch port data
    "and  r18, %3  \n" //compare pin with mask
    "ldi  %0, 1    \n" //set return high
    "brne 3f       \n" 
    "clr  %0       \n" //set return low
  "3:              \n"
    "out  __SREG__, __tmp_reg__ \n"
    : "=&d" (result) : "r" (port), "a" (pin), "r" (mask) : "r18", "r30", "r31"
  );

  return result;
}
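The mask-building loop and the final test can be modeled in portable C. In this sketch ReadPinModel is a hypothetical name, and the port contents are passed in as a plain value rather than read from hardware.

```c
#include <stdint.h>

//Hypothetical C model of dRead's logic
uint8_t ReadPinModel(uint8_t port_value, uint8_t pin) {
  uint8_t mask = 1;
  while (pin != 0) {     //cpi / breq: loop until pin == 0
    mask <<= 1;          //lsl: move the mask one position left
    pin--;               //dec
  }
  return (port_value & mask) ? 1 : 0;  //and / brne
}
```

For pin 5 the loop shifts the mask five times, giving 0x20, so a port value of 0x20 reads as high and 0xdf reads as low.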
