On this page:
7.1 Assignment Summary
7.2 Language Diagram
7.3 Preface
7.4 Exposing Memory in Paren-x64
7.5 Loc-lang
7.6 Values-lang
7.5.0.17

7 Compiler 2: Values-lang to x64

7.1 Assignment Summary

The goal of this assignment is to introduce the idea of designing languages by abstracting away from annoying details, and implementing those abstractions by compilation. We will (1) abstract away from instructions on locations to expressions on values, and (2) abstract away from a small number of physical locations and replace them with an essentially unboundedly large number of abstract names.

This assignment is due Friday January 24, 2020 at 11:59pm.

Assignment Checklist

You should find a new repository in your https://github.students.cs.ubc.ca account named a2_<userid> with a code skeleton and a support library. You should complete the assignment in that git repository and push it to the GitHub Students instance.

You should first copy your solution to Compiler 1: Paren-x64 to x64 into the starter code provided.

If we patch the code skeleton or library after the release, the most recent version will be available here at a2-skeleton.rkt and a2-compiler-lib.rkt. The name of the skeleton is a2-skeleton.rkt to avoid accidentally overwriting your files, but your file in the Git repository should be named a2.rkt.

7.2 Language Diagram

7.3 Preface

The language Paren-x64 from last week has two significant limitations we will address this week.

First, in x64, all operations act on locations, such as registers, and Paren-x64 is the same. In order to program, we must manually keep track of where a value is located in order to perform computation on it. We must move values into particular locations before computing. Often when programming, we think about acting on values, not locations. For example, when doing arithmetic, we simply write (+ 2 2), i.e., "Add the values 2 and 2", rather than "first move 2 into a register, then move 2 into a different register, now add the contents of the two locations and store it into one of them".

Second, there are exactly 16 physical locations in Paren-x64. This limits the way our computations can be written. We must, manually, order and collapse sub-computations to keep the number of locations that are in use small. We must keep track of which locations are still "live" and which are "dead" before we move a value into a location. In most programming languages, we abstract away from a small number of physical locations and introduce a large, arbitrary set of locations for values. If we ever need to name a value or sub-computation, we just invent a new location by creating a new name. Each fresh name, sometimes called a variable, can be thought of as naming an abstract location, which can hold a value, and we don’t need to bother about which physical location it represents.

In this assignment, we will abstract away from the small number of physical locations, and abstract away from operations on locations. We’ll do this in several steps, introducing several intermediate languages in the process.

7.4 Exposing Memory in Paren-x64

To implement Loc-lang, we first need to expose x64 features to access memory locations. In particular, we expose displacement mode operands for memory locations. The displacement mode operand is a new operand that can appear in some location positions as the argument or an instruction. These allow accessing memory locations using pointer arithmetic.

The displacement mode operand is written as QWORD [reg - integer] or QWORD [reg + integer] in x64, where reg is a register holding some memory address and the integer is an offset number of bytes from that address to access. The keyword QWORD, which is an unintuitive spelling of "four bytes", indicates that this operand is accessing 64 bits at a time.

For example, if rbp holds a memory address, we can move the value 42 to that memory address using the instruction mov QWORD [rbp - 0], 42. We can move the value from memory into the register rax using the instruction mov rax, QWORD [rbp - 0].

For now, our offsets will be multiples of 8. The offset is a number of bytes, and since we are dealing primarily with 64-bit values, we will increment pointers in values of 8. For example, the following snippet of code moves two values into memory, then pulls them out and adds them.

mov QWORD [rbp - 0], 21

mov QWORD [rbp - 8], 21

mov rax, QWORD [rbp - 8]

mov rbx, QWORD [rbp - 0]

add rax, rbx

These accesses grow downwards, subtracting from the base pointer rather than adding, following common conventions about how stack memory is used. This is an arbitrary choice, but we choose to follow the convention.

The new version of Paren-x64 v2 is below.
  p ::= (begin s ...)
     
  s ::= (set! loc triv)
  | (set! reg loc)
  | (set! reg_1 (binop reg_1 int32))
  | (set! reg_1 (binop reg_1 loc))
     
  triv ::= reg
  | int64
     
  loc ::= reg
  | addr
     
  reg ::= rsp
  | rbp
  | rax
  | rbx
  | rcx
  | rdx
  | rsi
  | rdi
  | r8
  | r9
  | r10
  | r11
  | r12
  | r13
  | r14
  | r15
     
  addr ::= (rbp - dispoffset)
  | (rbp + dispoffset)
     
  binop ::= *
  | +

(define (dispoffset? n)
  (and (integer? n)
       (zero? (remainder n 8))))

We add the new nonterminal addr to the language, and add addr as a production to loc. The addr nonterminal represents a displacement mode operand to an instruction. We limit it to access rbp, which is where the run-time system will put the base frame pointer, the pointer to the start of the current stack frame. The offset of each addr is restricted to be an integer that is divisible by 8, the number of bytes in a machine word in x64. This ensures all memory accesses are machine-word aligned, meaning we leave space for all bytes in the word between each access.

Implementing this will require a new run-time system to initialize the stack. The new a2-compiler-lib.rkt will do this for you. The run-time system provides a default stack of size 8 megabytes. This should be enough for now, but if it’s not, you can use the parameter? current-stack-size to increase it. Note that increasing the stack size will increase the binary size and compile time, due to how the run-time system currently allocates the stack.

The new run-time system also prints the value of rax to the screen, instead of returning it via an exit code, and does the work of converting numbers to ASCII strings. Now when you run your binary, you’ll get a more familiar output. The execute function will parse this output into a Racket number. If you’re interested in how this is done, you can read the definition of x86-64-runtime.

Exercise 1: Redesign and extend the implementation of check-paren-x64. You should be able to modify your type checker for the previous Paren-x64 to include cases for displacement mode operands and accept programs for Paren-x64 v2.

You should start by writing new examples and tests for the new specification, and adding new match clauses following the template. You’ll need at least 6 new tests, checking for success and failure of the new features. These examples in this assignment description don’t count; you’ll need 6 new tests.

Exercise 2: Redesign and extend your implementation of interp-paren-x64 to support Paren-x64 v2.

You should start by writing new examples and tests for the new specification, and adding new match clauses following the template. You’ll need at least 4 new tests, checking for success and failure of the new features.

Examples:
> (interp-paren-x64 '(begin (set! rax 42)))

42

> (interp-paren-x64 '(begin (set! rax 42) (set! rax (+ rax 0))))

42

> (interp-paren-x64
   '(begin
      (set! (rbp - 0) 0)
      (set! (rbp - 8) 42)
      (set! rax (rbp - 0))
      (set! rax (+ rax (rbp - 8)))))

42

> (interp-paren-x64
   '(begin
      (set! rax 0)
      (set! rbx 0)
      (set! r9 42)
      (set! rax (+ rax r9))))

42

> (interp-paren-x64
   '(begin
      (set! (rbp - 0) 42)
      (set! rax (rbp + 0))))

42

Exercise 3: Redesign and extend the implementation of generate-x64, a compiler from Paren-x64 v2 to x64. You should be able to modify your previous definition of generate-x64 to support displacement mode operands.

You’ll need at least 4 new tests, checking for success and failure of the new features.

Examples:
> (require racket/pretty)
> (pretty-display
   (generate-x64 '(begin (set! rax 42))))

  mov rax, 42

> (pretty-display
   (generate-x64 '(begin (set! rax 42) (set! rax (+ rax 0)))))

  mov rax, 42

  add rax, 0

> (pretty-display
   (generate-x64
   '(begin
      (set! (rbp - 0) 0)
      (set! (rbp - 8) 42)
      (set! rax (rbp - 0))
      (set! rax (+ rax (rbp - 8))))))

  mov QWORD [rbp - 0], 0

  mov QWORD [rbp - 8], 42

  mov rax, QWORD [rbp - 0]

  add rax, QWORD [rbp - 8]

> (pretty-display
   (generate-x64
   '(begin
      (set! rax 0)
      (set! rbx 0)
      (set! r9 42)
      (set! rax (+ rax r9)))))

  mov rax, 0

  mov rbx, 0

  mov r9, 42

  add rax, r9

> (current-pass-list
   (list
    check-paren-x64
    generate-x64
    wrap-x64-run-time
    wrap-x64-boilerplate))
> (execute '(begin (set! rax 42)))

42

> (execute '(begin (set! rax 42) (set! rax (+ rax 0))))

42

> (execute
   '(begin
      (set! (rbp - 0) 0)
      (set! (rbp - 8) 42)
      (set! rax (rbp - 0))
      (set! rax (+ rax (rbp - 8)))))

42

> (execute
   '(begin
      (set! rax 0)
      (set! rbx 0)
      (set! r9 42)
      (set! rax (+ rax r9))))

42

Exercise 4: Remove your definitions of wrap-x64-boilerplate and wrap-x64-run-time. These are now provided by a2-compiler-lib.rkt, since your run-time system is getting more complicated.

7.5 Loc-lang

Now we have nearly unlimited physical locations on the machine, and we want to abstract away from instructions that need to know where exactly a thing is (a register or in memory), and end up with instructions that operate on arbitrary abstract locations.

The first step is to abstract away the annoying details of x64 instructions, so that any instruction operates on any physical location. We do this by defining Paren-asm, a kind of "less-finicky" assembly.

  p ::= (begin s ... (halt loc))
     
  s ::= (set! loc int64)
  | (set! loc loc)
  | (set! loc (binop loc loc))
     
  binop ::= *
  | +
     
  loc ::= reg
  | addr
     
  reg ::= rsp
  | rbp
  | rbx
  | rcx
  | rdx
  | rsi
  | rdi
  | r8
  | r9
  | r10
  | r11
  | r12
  | r13
  | r14
  | r15
     
  addr ::= (rbp - dispoffset)
  | (rbp + dispoffset)

Note that rax has been removed from reg. Paren-asm assumes control of rax as a temporary register.

As a result, we introduce a halt instruction to encode returning a value. We introduce an instruction (halt loc), which halts the program and returns the value in loc to the run-time system. This instruction is valid only as the final statement of a program, and it must be present.

Exercise 5: Design and implement the function patch-instructions which implements instructions from Paren-asm that don’t exist in Paren-x64 v2 by translating them to instruction sequences that are valid Paren-x64 v2. The source language is Paren-asm and the target is Paren-x64 v2. You should use rax as a temporary register.

From this point on, it goes without saying that you need at least one test for each feature in the language. You should also include tests for edge cases of various features.

Examples:
> (patch-instructions '(begin (set! rbx 42) (halt rbx)))

'(begin (set! rbx 42) (set! rax rbx))

; The next example is not a valid Paren-asm program, but compiles correctly.
; The compiler is still considered correct. Since the signature for the
; compiler was violated, the compiler may generate whatever it likes.
> (patch-instructions
  '(begin
     (set! rbx 42)
     (set! rbx (+ rbx 0))
     (halt rbx)))

'(begin

   (set! rbx 42)

   (set! rax rbx)

   (set! rax (+ rax 0))

   (set! rbx rax)

   (set! rax rbx))

> (patch-instructions
   '(begin
      (set! (rbp - 0) 0)
      (set! (rbp - 8) 42)
      (set! (rbp - 0) (rbp + 8))
      (halt (rbp - 0))))

'(begin

   (set! (rbp - 0) 0)

   (set! (rbp - 8) 42)

   (set! rax (rbp + 8))

   (set! (rbp - 0) rax)

   (set! rax (rbp - 0)))

> (patch-instructions
   '(begin
      (set! rbx 0)
      (set! rcx 0)
      (set! r9 42)
      (set! rbx (+ rcx r9))
      (halt rbx)))

'(begin

   (set! rbx 0)

   (set! rcx 0)

   (set! r9 42)

   (set! rax rcx)

   (set! rax (+ rax r9))

   (set! rbx rax)

   (set! rax rbx))

Now we can introduce abstract locations, alocs. We do this by defining Loc-assigned-lang, below.

  p ::= (begin info s ... (halt aloc))
     
  info ::= ((assignment ((aloc rloc) ...)) any ...)
     
  s ::= (set! aloc int64)
  | (set! aloc aloc)
  | (set! aloc (binop aloc aloc))
     
  binop ::= *
  | +
     
  rloc ::= reg
  | addr
     
  reg ::= rsp
  | rbx
  | rcx
  | rdx
  | rsi
  | rdi
  | r8
  | r9
  | r10
  | r11
  | r12
  | r13
  | r14
  | r15
     
  addr ::= (rbp - integer)
  | (rbp + integer)

(define (aloc? s)
  (and (symbol? s)
       (not (register? s))
       (regexp-match-exact? #rx".+\\.[0-9]+" (symbol->string s))))

In Loc-assigned-lang, we introduce a definition of abstract location, aloc. An aloc is a symbol that is not a register, and is roughly of the form <name>.<number>. We assume all alocs are globally unique, and any reference to the same aloc is to the same location—these are not the names from Racket that you’re used to.

We assume someone else has come up with an assignment of abstract locations to physical locations. We model this in Loc-assigned-lang as an assignment tag in the info field, which comes before the instruction sequence. Note that rax and rbp are not valid in the assignment, since these are dedicated registers.

The assignment is stored in the info field of the program. The info field is a set of key-value pairs used to store information associated with programs. In general, we will only give a partial specification—it may contain arbitrary other key-value pairs, indicated by the any pattern in the nonterminal definition. This lax specification is useful for debugging: you may leave residual info from earlier languages, or include your own debugging info. The info field is also unordered, so do not assume the assignment is the first element.

share/a2-compiler-lib.rkt provides info-ref and info-set functions for working with the info field.

The info field is essentially a Racket dict?, but it uses an additional list around values to make it a little easier to read and write.

Examples:
> (info-set '() 'x.1 'rbx)

'((x.1 rbx))

> (info-set (info-set '() 'x.1 'rbx) 'y.2 'r9)

'((x.1 rbx) (y.2 r9))

> (info-set '() 'assignment '((x.1 rbx) (y.2 r9)))

'((assignment ((x.1 rbx) (y.2 r9))))

> (info-ref '((assignment ((x.1 rbx) (y.2 r)))) 'assignment)

'((x.1 rbx) (y.2 r))

> (info-ref
    (info-ref '((assignment ((x.1 rbx) (y.2 r)))) 'assignment)
    'x.1)

'rbx

Exercise 6: Design and implement the function replace-locations which replaces each abstract location with its assigned physical location. The source language is Loc-assigned-lang and the target language is Paren-asm.

Examples:
> (replace-locations
    '(begin ((assignment ((x.1 rax))))
       (set! x.1 0)
       (halt x.1)))

'(begin (set! rax 0) (halt rax))

> (replace-locations
    '(begin ((assignment ((x.1 rax) (y.1 rbx) (w.1 r9))))
       (set! x.1 0)
       (set! y.1 x.1)
       (set! w.1 (+ x.1 y.1))
       (halt w.1)))

'(begin (set! rax 0) (set! rbx rax) (set! r9 (+ rax rbx)) (halt r9))

Previously, we assumed someone else had come up with an assignment from alocs to physical locations. Now we must discharge that assumption. We are the person who comes up with the assignment. But we don’t actually want to write down the assignment manually—that’s a computer’s job.

We design the language Loc-locals-lang, in which we assume the program is annotated with information about all abstract locations used in the program. We then implement a function to assign each of those alocs a location.

  p ::= (begin info s ... (halt aloc))
     
  info ::= ((locals (aloc ...)) any ...)
     
  s ::= (set! aloc int64)
  | (set! aloc aloc)
  | (set! aloc (binop aloc aloc))
     
  binop ::= *
  | +

Exercise 7: Design and implement the function assign-homes which assigns each abstract location a physical location. The source language is Loc-locals-lang and the target language is Loc-assigned-lang.

It should assign each abstract location to a fresh frame location using displacement mode operands. Remember that the dispoffset must be a multiple of 8.

Examples:
> (assign-homes
    '(begin ((locals (x.1)))
       (set! x.1 0)
       (halt x.1)))

'(begin

   ((locals (x.1)) (assignment ((x.1 (rbp - 8)))))

   (set! x.1 0)

   (halt x.1))

> (assign-homes
   '(begin ((locals (x.1 y.1 w.1)))
      (set! x.1 0)
      (set! y.1 x.1)
      (set! w.1 (+ x.1 y.1))
      (halt w.1)))

'(begin

   ((locals (x.1 y.1 w.1))

    (assignment ((x.1 (rbp - 8)) (y.1 (rbp - 16)) (w.1 (rbp - 24)))))

   (set! x.1 0)

   (set! y.1 x.1)

   (set! w.1 (+ x.1 y.1))

   (halt w.1))

Finally, we can create Loc-lang, a location-oriented programming language. In this language, the programmer does not need to know about physical locations at all, nor about machine constraints. They simply issue instructions on abstract locations. The grammar of Loc-lang is below.

  p ::= (begin s ... (halt aloc))
     
  s ::= (set! aloc int64)
  | (set! aloc aloc)
  | (set! aloc (binop aloc aloc))
     
  binop ::= *
  | +

Exercise 8: Design and implement the function uncover-locals which adds a list of all locations used in a program to the info field of a program. The source language is Loc-lang and the target language is Loc-locals-lang.

For working with sets, you may want to use Sets.

Examples:
> (uncover-locals
   '(begin
      (set! x.1 0)
      (halt x.1)))

'(begin ((locals (x.1))) (set! x.1 0) (halt x.1))

> (uncover-locals
   '(begin
      (set! x.1 0)
      (set! y.1 x.1)
      (set! w.1 (+ x.1 y.1))
      (halt w.1)))

'(begin

   ((locals (w.1 y.1 x.1)))

   (set! x.1 0)

   (set! y.1 x.1)

   (set! w.1 (+ x.1 y.1))

   (halt w.1))

Exercise 9: Design and implement the function check-loc-lang, a type checker for Loc-lang.

Examples:
> (check-loc-lang
   '(begin
      (set! x.1 0)))

check-loc-lang: Invalid program; expected one of:

  something of the form `(begin ,s ... (halt ,aloc)), but

got '(begin (set! x.1 0))

> (check-loc-lang
   '(begin
      (set! x.1 0)
      (halt x.1)))

'(begin (set! x.1 0) (halt x.1))

> (check-loc-lang
   '(begin
      (set! x.1 0)
      (set! y.1 x.1)
      (set! w.1 (+ x.1 y.1))
      (halt w.1)))

'(begin (set! x.1 0) (set! y.1 x.1) (set! w.1 (+ x.1 y.1)) (halt w.1))

> (check-loc-lang
   '(begin
      (set! x.1 0)
      (set! y.1 x.1)
      (set! w.1 (+ x.1 5))
      (halt w.1)))

check-loc-lang: Invalid statement; expected one of:

  Expected a 64 bit integer, got (+ x.1 5)

  Expected an abstract location, got (+ x.1 5)

  Expected an abstract location, got 5

If you an unsure whether your compiler is correct, you may find it helpful to use the type checker for Paren-x64 v2 in your compiler pipeline.

7.6 Values-lang

Now we want to abstract further, away from locations and focus on values and operations on values.

Values-lang is the final source language for this week. It abstracts away from operations on locations and instead has operations on values. It also include a feature for binding values to names for reuse. The value of a Values-lang program is the final value of an expression.

First, we will assume that each name bound in a program is a globally unique aloc. This gives us the language Values-unique-lang, defined below.

  e ::= n
  | (let ([aloc n]) e)
     
  n ::= v
  | (binop v v)
     
  v ::= int64
  | aloc
     
  binop ::= *
  | +

Exercise 10: Design and implement the function select-instructions which compiles Values-unique-lang expressions into sequences of Loc-lang instructions. The source language is Values-unique-lang and the target language is Loc-lang.

You may find values, define-values, and let-values useful for returning multiple values from a function.

We recommend defining two helper functions, assign-tmp and assign-n. assign-tmp should take a v and two other arguments: the Loc-lang instruction sequence for assigning v to a fresh aloc, and the aloc that v gets assigned to. assign-n should take a aloc and an n and return the sequence of Loc-lang ss that implement assigning the n to the aloc. This design can help minimize the number of unnecessary move instructions your compiler generates.

You should use the function fresh from a2-compiler-lib.rkt.

Examples:
> (select-instructions '(+ 2 2))

'(begin

   (set! tmp.2 2)

   (set! tmp.3 2)

   (set! tmp.1 (+ tmp.2 tmp.3))

   (halt tmp.1))

> (select-instructions
   '(let ([x.1 5])
      x.1))

'(begin (set! x.1 5) (set! tmp.4 x.1) (halt tmp.4))

> (select-instructions
   '(let ([x.1 (+ 2 2)])
      x.1))

'(begin

   (set! tmp.5 2)

   (set! tmp.6 2)

   (set! x.1 (+ tmp.5 tmp.6))

   (set! tmp.7 x.1)

   (halt tmp.7))

> (select-instructions
   '(let ([x.1 2])
      (let ([x.2 2])
        (+ x.1 x.2))))

'(begin (set! x.1 2) (set! x.2 2) (set! tmp.8 (+ x.1 x.2)) (halt tmp.8))

Now we must discharge our assumption that all names are unique. Below we define Values-lang, a value-oriented language with simple binary expressions and lexical binding.

  e ::= n
  | (let ([x n]) e)
     
  n ::= v
  | (binop v v)
     
  v ::= int64
  | x
     
  x ::= name
     
  binop ::= *
  | +

(define (name? x) (symbol? x))

x is a short-hand nonterminal for names, which are arbitrary symbols. To ensure transform names into abstract location, it suffices to append a unique number to each.

Exercise 11: Design and implement the function uniquify. The source language is Values-lang and the target language is Values-unique-lang.

After uniquify, you can assume every name is bound exactly once.

You should use the function fresh from a2-compiler-lib.rkt.

Examples:
> (uniquify '(+ 2 2))

'(+ 2 2)

> (uniquify
   '(let ([x 5])
      x))

'(let ((x.9 5)) x.9)

> (uniquify
   '(let ([x (+ 2 2)])
      x))

'(let ((x.10 (+ 2 2))) x.10)

> (uniquify
   '(let ([x 2])
      (let ([y 2])
        (+ x y))))

'(let ((x.11 2)) (let ((y.12 2)) (+ x.11 y.12)))

> (uniquify
   '(let ([x 2])
      (let ([x 2])
        (+ x x))))

'(let ((x.13 2)) (let ((x.14 2)) (+ x.14 x.14)))

Exercise 12: Design and implement check-values-lang, a type checker for Values-lang.

Exercise 13: Design and implement interp-values-lang, an interpreter for Values-lang.

Examples:
> (interp-values-lang '(+ 2 2))

4

> (interp-values-lang
   '(let ([x 5])
      x))

5

> (interp-values-lang
   '(let ([x (+ 2 2)])
      x))

4

> (interp-values-lang
   '(let ([x 2])
      (let ([y 2])
        (+ x y))))

4

> (interp-values-lang
   '(let ([x 2])
      (let ([x 2])
        (+ x x))))

4

> (current-pass-list
   (list
    check-values-lang
    uniquify
    select-instructions
    check-loc-lang
    uncover-locals
    assign-homes
    replace-locations
    patch-instructions
    check-paren-x64
    generate-x64
    wrap-x64-run-time
    wrap-x64-boilerplate))
> (execute '(+ 2 2))

4

> (execute
   '(let ([x 5])
      x))

5

> (execute
   '(let ([x (+ 2 2)])
      x))

4

> (execute
   '(let ([x 2])
      (let ([y 2])
        (+ x y))))

4

> (execute
   '(let ([x 2])
      (let ([x 2])
        (+ x x))))

4

> (define (compile-pre-runtime p)
    (parameterize ([current-pass-list
                    (list
                     check-values-lang
                     uniquify
                     select-instructions
                     check-loc-lang
                     uncover-locals
                     assign-homes
                     replace-locations
                     patch-instructions
                     check-paren-x64
                     generate-x64)])
      (compile p)))
> (pretty-display
   (compile-pre-runtime '(+ 2 2)))

  mov QWORD [rbp - 8], 2

  mov QWORD [rbp - 24], 2

  mov rax, QWORD [rbp - 8]

  add rax, QWORD [rbp - 24]

  mov QWORD [rbp - 16], rax

  mov rax, QWORD [rbp - 16]

> (pretty-display
   (compile-pre-runtime
    '(let ([x 5])
       x)))

  mov QWORD [rbp - 8], 5

  mov rax, QWORD [rbp - 8]

  mov QWORD [rbp - 16], rax

  mov rax, QWORD [rbp - 16]

> (pretty-display
   (compile-pre-runtime
    '(let ([x 2])
       (let ([y 2])
         (+ x y)))))

  mov QWORD [rbp - 8], 2

  mov QWORD [rbp - 16], 2

  mov rax, QWORD [rbp - 8]

  add rax, QWORD [rbp - 16]

  mov QWORD [rbp - 24], rax

  mov rax, QWORD [rbp - 24]