A3 - Exploring the Assembler

Before we go any further, life will be a lot easier if we understand a little better how the assembler works: specifically, how it stores values when we use assembler directives, and how it turns our labels into addresses.

Requirements

We're still going to need the text editor, the Java runtime and dasm today. We're not going to be using the debugger.

However, it might help you a lot to have the Windows Calculator open, and use 'view' to set it to 'programmer'. It's very useful for doing quick conversions between binary, hexadecimal and decimal.

Assembler directives

We've already used a couple of assembler directives - we used ".start" to tell our program where to begin executing, and we used ".word" to store the arguments for our multiplication program. Put simply, all commands beginning with a period (".") tell the assembler to do something. They are not turned into instructions for the console to execute, but they may tell the assembler to insert data into our program.

The assembler is very simple - it more or less starts at the beginning of your program, and turns each line of assembly, in order, into an instruction for the CPU. Since each instruction is 32 bits long, it emits 4 bytes at a time as it goes. When it comes across a directive like ".word", all it does is take the argument in the next column and place it at the position we're currently up to. So if you write

        .word    5 

It will simply add the bytes

00 00 00 05 

into our program.
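
As a rough illustration (a Python sketch, not dasm's actual code), the effect of ".word" can be mimicked by packing a value into four bytes with the most significant byte first, which is the order shown above. The helper name emit_word is made up for this example.

```python
import struct

def emit_word(value):
    """Pack a value into 4 bytes, most significant byte first (hypothetical helper)."""
    # Mask to 32 bits so a negative value wraps to its two's complement form.
    return struct.pack(">I", value & 0xFFFFFFFF)

print(emit_word(5).hex(" ").upper())  # 00 00 00 05
```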

Labels & the Symbol Table

When an instruction or directive has a label (something written in the far left column), the assembler makes a note of what byte it's up to in the "symbol table". The symbol table is just a collection of addresses maintained by the assembler. So, if the first line of our program is

a       .word    5 

"a" will be "0" in the symbol table.

You can see the symbol table by adding "-l" when you assemble a program. Try assembling our program from last tutorial with the "-l" option added:

java -jar dasm.jar multiply.dls -a -l

The -l option outputs, among other things, the symbol table for our viewing pleasure. It should look something like this:

S Y M B O L   T A B L E
=======================
SymName          Attributes Value

a                           0
b                           4
end                         24
loop                        14
main                        8
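
The bookkeeping behind that table can be sketched in a few lines of Python. This is a simplified model rather than dasm's real implementation: it assumes every line (each ".word" and each instruction) takes exactly 4 bytes, and the names PROGRAM and build_symbol_table are invented for the example. The values are printed in hexadecimal, which appears to be how the listing shows them.

```python
# The multiply program as (label, mnemonic) pairs; operands are left out
# because only the labels and the line count matter for the symbol table.
PROGRAM = [
    ("a", ".word"),
    ("b", ".word"),
    ("main", "lw"),
    (None, "lw"),
    (None, "clr"),
    ("loop", "beqz"),
    (None, "add"),
    (None, "subi"),
    (None, "j"),
    ("end", "halt"),
]

def build_symbol_table(lines):
    """Record the current byte address for every labelled line (simplified sketch)."""
    symbols = {}
    address = 0
    for label, _mnemonic in lines:
        if label is not None:
            symbols[label] = address
        address += 4  # assumption: every line here occupies 4 bytes
    return symbols

for name, value in sorted(build_symbol_table(PROGRAM).items()):
    print(f"{name:<28}{value:X}")
```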

Now, open the "multiply.dlx" file with your text editor. DLX files are just plain text - you can read them and, if you're crazy, even write them by hand!

It looks like this:

.abs
00000000  00 00 00 05 00 00 00 04 8C 01 00 00 8C 02 00 04
00000010  20 03 00 00 10 40 00 0C 00 61 18 20 28 42 00 01
00000020  0B FF FF F0 00 00 00 01

.start 8

That's our whole multiply program. You can see that "a", which was "0" in the symbol table and had the value "5", does indeed exist at address zero (the first four bytes of the first row, 00 00 00 05):

.abs
00000000  00 00 00 05 00 00 00 04 8C 01 00 00 8C 02 00 04
00000010  20 03 00 00 10 40 00 0C 00 61 18 20 28 42 00 01
00000020  0B FF FF F0 00 00 00 01

.start 8

While "b", which had the value "4", is coincidentally at address 4 (the next four bytes, 00 00 00 04):

.abs
00000000  00 00 00 05 00 00 00 04 8C 01 00 00 8C 02 00 04
00000010  20 03 00 00 10 40 00 0C 00 61 18 20 28 42 00 01
00000020  0B FF FF F0 00 00 00 01

.start 8

You can see that "main", the label we gave the start of our program, begins at address 8 (the bytes 8C 01 00 00 on the first row). When we typed ".start main" to tell our program to start at the "main" label, the assembler replaced "main" with 8 from the symbol table, which is why the file ends with ".start 8". If it didn't do this replacing for us, we'd have to manually keep track of the location of every line of code.

.abs
00000000  00 00 00 05 00 00 00 04 8C 01 00 00 8C 02 00 04
00000010  20 03 00 00 10 40 00 0C 00 61 18 20 28 42 00 01
00000020  0B FF FF F0 00 00 00 01

.start 8
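
If you'd rather not count bytes by eye, the lookups above can be checked with a short Python sketch. It assumes the layout seen in this particular file (an address column followed by hex bytes, with rows contiguous from address 0), and is not a general DLX loader. The names load_image and word_at are made up for the example.

```python
DLX_TEXT = """\
.abs
00000000  00 00 00 05 00 00 00 04 8C 01 00 00 8C 02 00 04
00000010  20 03 00 00 10 40 00 0C 00 61 18 20 28 42 00 01
00000020  0B FF FF F0 00 00 00 01

.start 8
"""

def load_image(text):
    """Collect the bytes from the hex rows into one flat memory image."""
    memory = bytearray()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and directives such as .abs and .start.
        if not fields or fields[0].startswith("."):
            continue
        # fields[0] is the row address; the remaining fields are hex bytes.
        memory.extend(int(byte, 16) for byte in fields[1:])
    return bytes(memory)

def word_at(memory, address):
    """Read the 4-byte word at a byte address, most significant byte first."""
    return int.from_bytes(memory[address:address + 4], "big")

memory = load_image(DLX_TEXT)
print(word_at(memory, 0))        # value stored at label a
print(word_at(memory, 4))        # value stored at label b
print(hex(word_at(memory, 8)))   # first instruction, at label main
```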

The .equ directive

As a bonus, try putting the following command in the multiply program and assembling it with -l:

myval    .equ    9 

You should now see in the symbol table

myval                       9

Super simple, right? All .equ does is put a value in the symbol table, without making any changes to our program itself.

This can be useful when we want to refer to a value by a label, for example

BADGUY_DAMAGE   .equ   4 

And then, assuming we have the player's Health Points stored in register 1, we could say

                subi   r1,r1,BADGUY_DAMAGE

This subtracts BADGUY_DAMAGE from register 1 and stores the result back in register 1. Way easier to read than just having unlabeled numbers strewn about everywhere, right?
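
One way to picture the difference between ".equ" and an ordinary label is the following Python sketch. It is a simplified model under the same assumption as before (every other line takes 4 bytes), and symbols_with_equ is a hypothetical name, not part of dasm. An ".equ" line writes its operand straight into the table and emits nothing, so the addresses of the lines after it are unaffected.

```python
def symbols_with_equ(lines):
    """Build a symbol table from (label, mnemonic, operand) tuples (simplified sketch)."""
    symbols = {}
    address = 0
    for label, mnemonic, operand in lines:
        if mnemonic == ".equ":
            symbols[label] = operand  # the value itself goes into the table
            continue                  # no bytes emitted, address unchanged
        if label is not None:
            symbols[label] = address  # ordinary label: remember where we are
        address += 4                  # assumption: every other line is 4 bytes
    return symbols

lines = [
    ("BADGUY_DAMAGE", ".equ", 4),
    ("a", ".word", 5),
    ("main", "subi", None),
]
print(symbols_with_equ(lines))
```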

Opcode encoding

To make life really simple, I've put the source code of our multiply program next to the assembled code. This shows how truly one-to-one assembly programming is.

.abs
00000000  00 00 00 05   a      .word   5
00000004  00 00 00 04   b      .word   4
00000008  8C 01 00 00   main   lw      r1,a
0000000C  8C 02 00 04          lw      r2,b
00000010  20 03 00 00          clr     r3 
00000014  10 40 00 0C   loop   beqz    r2,end
00000018  00 61 18 20          add     r3,r3,r1
0000001C  28 42 00 01          subi    r2,r2,1
00000020  0B FF FF F0          j       loop  
00000024  00 00 00 01   end    halt          
.start 8                      .start  main 

Let's get really into the thick of things. Have a look at the beqz instruction:

10 40 00 0C   loop   beqz    r2,end

As per its definition in the list of opcodes, beqz is an I-format instruction. Meaning, if we convert it into binary, we can split it up like this:

1    0     4    0     0    0     0    C
0001 0000  0100 0000  0000 0000  0000 1100
OOOO OOii  iiij jjjj  KKKK KKKK  KKKK KKKK

So, we can see how our assembly has been turned into machine code. Firstly, 0001 00 (the six "O" bits) is the opcode - the part that tells the CPU what kind of instruction it is. In this case, it's a beqz. Next is 00 010 (the "i" section). This is the part that tells the CPU what register we're comparing to zero. You can see that, converted to decimal, this is simply "2", for "r2". We're not using the "j" section this time, so it is all zeros.

The K section has 0000 0000 0000 1100, which is 0xC in hex or 12 in decimal. Can you figure out how the assembler got this number? When the branch is taken, "beqz" jumps execution by adding the value of the K section to PC. We wrote "beqz r2,end". In the symbol table, the "end" label is at 0x24 (the table lists it as 24, without the 0x prefix). When the CPU executes the beqz command, the value of PC will be 0x18 (because it has already loaded the instruction at 0x14 and incremented PC by 4 in readiness for the next cycle). So, 0x18 + 0xC = 0x24 (in decimal, 24 + 12 = 36).
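
The same split can be done with a few shifts and masks. The Python sketch below follows the bit layout shown above (6 opcode bits, 5 "i" bits, 5 "j" bits, 16 "K" bits). The function names are invented for the example, and treating K as a signed value is an assumption; it makes no difference for this particular instruction.

```python
def decode_i_format(word):
    """Split a 32-bit I-format word into (opcode, i, j, K); hypothetical helper."""
    opcode = (word >> 26) & 0x3F   # top 6 bits (O)
    reg_i = (word >> 21) & 0x1F    # next 5 bits (i)
    reg_j = (word >> 16) & 0x1F    # next 5 bits (j)
    k = word & 0xFFFF              # low 16 bits (K)
    if k & 0x8000:                 # assumption: K is a signed 16-bit value
        k -= 0x10000
    return opcode, reg_i, reg_j, k

def branch_target(address, k):
    """PC has already advanced by 4 when the offset K is added."""
    return address + 4 + k

opcode, reg_i, reg_j, k = decode_i_format(0x1040000C)
print(f"{opcode:06b}", reg_i, reg_j, k)   # 000100 2 0 12
print(hex(branch_target(0x14, k)))        # 0x24
```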

The jump command at 0x20 works in the same way.

00000020  0B FF FF F0          j       loop  

Being an L-format instruction, it's organized as

0    B     F    F     F    F     F    0 
0000 1011  1111 1111  1111 1111  1111 0000
OOOO OOLL  LLLL LLLL  LLLL LLLL  LLLL LLLL

Where 0000 10 is the opcode that tells the CPU that it's a "j" instruction, and the rest of the instruction (the 26-bit L section) is the amount to jump by. Since it's a signed integer, the L section has the value -0x10 (decimal -16).

It should be simple to see how the assembler got this number - PC will be at 0x24 when the "j" instruction is executed. The "loop" label is at 0x14. 0x24 + (-0x10) = 0x24 - 0x10 = 0x14. We're jumping backwards through the program to get to the start of our loop.
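
The sign handling is the only new wrinkle compared with beqz. As a Python sketch (decode_l_format and jump_target are hypothetical names), the 26-bit L section can be extracted and converted to a signed value like this:

```python
def decode_l_format(word):
    """Split a 32-bit L-format word into (opcode, signed offset); hypothetical helper."""
    opcode = (word >> 26) & 0x3F      # top 6 bits (O)
    offset = word & 0x03FFFFFF        # low 26 bits (L)
    if offset & 0x02000000:           # top bit of L set means the value is negative
        offset -= 0x04000000          # subtract 2**26 to get the signed value
    return opcode, offset

def jump_target(address, offset):
    """PC has already advanced by 4 when the offset is added."""
    return address + 4 + offset

opcode, offset = decode_l_format(0x0BFFFFF0)
print(f"{opcode:06b}", offset)          # 000010 -16
print(hex(jump_target(0x20, offset)))   # 0x14
```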

Summary

If you're not experienced with hex and binary, there were probably enough numbers on this page to make your eyes spin. However, once you have your head around it, you'll see how simple the assembler's job really is!

All it does is 1) let us use some keywords that are a little easier to remember than raw hex values, 2) pack our arguments into our instructions for us, and 3) keep track of our labels so we don't have to memorise our whole program.

There's no scary deep magic or hidden trickery going on under the hood - what we write is what we get.

In the next tutorial, we'll get back to writing assembly with an article on saving and loading.