6 Compiler 1: Paren-x64 to x64

7.5.0.17

6 Compiler 1: Paren-x64 to x64

6.1 Assignment Summary

The goal of this assignment is to introduce (1) the first language, x64 assembly, (2) the general structure of a compiler, and (3) methods for testing your compiler. In this assignment, you will implement a small abstraction layer on top of x64, called Paren-x64, which allows easily writing a subset of x64 programs while ignoring some of the boilerplate and operating-system-specific details.

This assignment is due Friday, January 17, 2020 at 11:59pm.

Assignment Checklist

Before you start, you should read:

These lay out important expectations regarding grading and sharing of code.

You should find a new repository in your https://github.students.cs.ubc.ca account named a1_<userid> with a code skeleton and a support library. You should complete the assignment in that git repository and push it to the GitHub Students instance.

If we patch the code skeleton or library after the release, the most recent version will be available here at a1-skeleton.rkt and a1-compiler-lib.rkt. A GitHub repo you can merge or rebase from will be availble here at https://github.students.cs.ubc.ca/cpsc411-2019w-t2/a1_wilbowma.

Design and implement check-paren-x64, a validator for Paren-x64.
Design and implement interp-paren-x64, an interpreter for Paren-x64.
Design and implement an x64 run-time system for Paren-x64.
Design and implement generate-x64, a compiler from Paren-x64 to x64.
Test your compiler by comparing the interpreter to the compiler.

6.2 Language Diagram

6.3 Preface: What’s bad about x64

In the previous assignment, we saw the first language: x64. The primary limitation in x64 is its concrete syntax—it has too few parens. This makes it hard write program transformations over it. String manipulation is very tedious as strings provide very little structure.

In this class, our focus is on designing and exploiting abstraction boundaries; and implementing, transforming, and optimizing programs. To focus on this, we will ignore low-level character parsing (yuk!), and instead assume all programs come to us in a format that is easy for our programming language to consume and transform—s-expressions.

The second limitation is that x64 includes many extraneous details, like details about which label starts execution, and provides no definition for the value of an x64 program. This makes interpreting the result of an x64 program difficult—we’re forced to do something like manually print the result to the screen.

In this lesson, we abstract away from each of these details. We start be defining a subset of of x64 that we will support, x64-411. We then design a language, Paren-x64, that supports all instructions from x64-411 and removes the two limitations of x64 described above.

Our compiler will compile from Paren-x64 to x64.

6.4 x64-411: Our x64 Subset

x64-411 is the subset of x64 instructions used in this assignment, represented as a string in Intel syntax. We do not give a precise grammar for x64-411; instead, we describe the instructions below and give some examples.

mov triv, triv
This string represents the move instruction Move a value from one place to another. This instruction will correspond to (set! triv triv) in Paren-x64.
x64 imposes further restrictions on mov (remember—x64 is, we must not ask why). We can only move a value into a register, or from a register to a register.
We can thus make the grammar of Paren-x64 more precise by describing three move statements—(set! reg integer), and (set! reg reg). As we are on a 64-bit machine, we must restrict the size of the integer to be a 64-bit integer, in the range -2^63 <= i <= 2^63 - 1.
add triv, triv
This string represents the add instruction, which intuitively adds two values. This will correspond to the Paren-x64 instruction (set! triv (+ triv triv)).
In fact, we cannot add values in x64: we can add a value (i.e.,, an integer), to a register add reg, integer, or add the values in two registers, add reg_1, reg_2.
We can make the definition of Paren-x64 more precise by separating out these instructions. The first form of the instruction is written (set! reg_1 (+ reg_2 integer)), where the reg_1 and reg_2 must be identical. The second form is written (set! reg_1 (+ reg_2 reg_3)) again where the first two registers reg_1 and reg_2 must be identical. For example, (set! rax (+ rax r9)) set! rax (+ rax 42) are valid, but (set! rax (+ rbx r9)) and (set! rax (+ rbx 42)) are not.
Note that we must duplicate the register in the syntax of the binary operator instructions. Paren-x64 uses a Racket-like syntax, separating the assignment operation set! from the add operation +. However, x64-411 does not separate these two concepts. Thus, for now, the instruction (set! reg (binop reg triv)) must have literally the same register in the positions named reg.
Furthermore, when using an integer directly in the add statement, it must be a 32-bit integer, in the range -2^31 <= i <= 2^31 - 1. Yes, that’s right, 32-bit not 64-bit; we do not ask "why" of x64. For example, (set! rax (+ rax 2147483647)) is valid, but (set! rax (+ rax 2147483648)) is not. Instead, we would first need to move 2147483648 to a register to add it to rax.
imul triv, triv
This string represents the add instruction, which intuitively multiplies two values. This will correspond to the Paren-x64 instruction (set! triv (* triv triv)).
Multiplication is further restricted by x64. Again, we cannot multiply values directly. We can only multiply the value of a register by a 32-bit integer imul reg_1, integer, or the value in a register by the value in a register imul reg_1, reg_2.
These correspond to the Paren-x64 instructions (set! reg_1 (* reg_2 reg_3)), or the value of a register by a 32-bit integer (set! reg_1 (* reg_2 integer)). Similar to addition, the registers reg_1 and reg_2 must be syntactically the same.

We represent simple instruction sequences in x64-411 as strings:

Example:

mov rax, 42

Example:

add rax, 42

Example:

mov rax, 170679
mov rdi, rax
add rdi, rdi
mov rsp, rdi
imul rsp, rsp
mov rbx, 8991

6.5 Paren-x64: The First Source Language

The first source language, Paren-x64, represents a sequence of x64-411 instructions in s-expression format. It also provides an interpretation of each program as a value. The formal definitions of Paren-x64 is given below.

6.5.1 Paren-x64 Definition

A rough grammar for Paren-x64 is given below. This grammar is too liberal and we will gradually refine it to build more restrictions into the syntax. It is not always possible to build restrictions into syntax directly, so most grammars will come with additional restrictions and specifications in English.

p	::=	(begin s ...)

s	::=	(set! triv triv)
	\|	(set! triv (binop triv triv))

triv	::=	reg
	\|	integer

binop	::=	+
	\|	*

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

Every Paren-x64 program, p, begins with begin, followed by a sequence of statements. Right now, we represent only 3 of the x64 instructions in Paren-x64.

To simplify defining languages as grammars without too much additional English specification, we create two conventions.

Any time we suffix a nonterminal by an underscore and a number, such as reg_1, this is a reference to a particular instance of a nonterminal, and it is restricted to be identical to any other instance of the same nonterminal with the same underscore suffix within the same expression. So (set! reg_1 (+ reg_1 integer)) represents an add instruction in Paren-x64, and both occurences of reg_1 must be the same register. However, two instances of nonterminals with different suffixes, like reg_1 and reg_2, are not necessarily different—they are only not guaranteed to be the same.
Any non-terminal that is not defined as syntax, such as int64, may be defined by a Racket predicate following the grammar, such as int64?.

Using these two conventions, we define the grammar of Paren-x64 with all the above restrictions below.

p	::=	(begin s ...)

s	::=	(set! reg int64)
	\|	(set! reg reg)
	\|	(set! reg_1 (binop reg_1 int32))
	\|	(set! reg_1 (binop reg_1 reg))

reg	::=	rsp
	\|	rbp
	\|	rax
	\|	rbx
	\|	rcx
	\|	rdx
	\|	rsi
	\|	rdi
	\|	r8
	\|	r9
	\|	r10
	\|	r11
	\|	r12
	\|	r13
	\|	r14
	\|	r15

binop	::=	*
	\|	+

(define (max-int word-size) (sub1 (expt 2 (sub1 word-size))))
(define (min-int word-size) (* -1 (expt 2 (sub1 word-size))))

(define (int-size? word-size i)
(and (integer? i)
(<= (min-int word-size) i (max-int word-size))))

(define (int32? i) (int-size? 32 i))
(define (int64? i) (int-size? 64 i))

Since Paren-x64 is an imperative language and does not return values like Racket does, we must decide how to interpreter a Paren-x64 program as a value. As discussed in lecture, we can do this by imposing some convention on what the value of the program is. We choose to interpreter the value of an Paren-x64 program as the value in rax, modulo 256 (modulo), when the program is finished executing (has no instructions left to execute). We discuss the reason for this interpretation shortly.

We also assume that any register is initialized before it is accessed. We inherit this restriction from x64. In x64, the value of an uninitialized register is undefined, that is, accessing an uninitialized register results in undefined behavior, behavior that has no specified behavior defined by the language specification.

Undefined behavior is common in low-level languages that lack much static or dynamic enforcement of assumptions. It’s good to eliminate undefined behavior by adding static or dynamic checks, as it makes the behavior of all programs in your language more predicatable, allowing programmers to better reason about their code. However, it is not always practical to achieve. It may be too difficult to statically check a property and still allow all the programs you want to allow, or too expensive to check a property dynamically. There, we will sometimes make assumptions that certain things never happens, injecting undefined behavior into our compiler. Note that because we assume this never happens in Paren-x64, technically speaking, a program that accesses an uninitialized register is not a valid Paren-x64 program.

Exercise 1: Design a function check-paren-x64 that takes an arbitrary value and either returns it, if it is an s-expression representing a valid Paren-x64 program, or raises a descriptive error message.
Write at least 10 unit tests exercising each feature of the language. The tests should check for failure and success. You should use check-equal?, check-exn, and check-not-exn from RackUnit: Unit Testing.
All tests should be inside the special test submodule. See Main and Test Submodules for more details.

Exercise 1 (Still, because implement really means design and implement): Implement the function check-paren-x64. Start by writing the template, following the instructions from The Compiler Design Recipe for templates for language processors.
Below are some examples. Your checker does not need to produces exactly the same error messages.

Examples:

> (check-paren-x64
   '(set! (+ rax rdi) 42))
check-paren-x64: Invalid program; expected one of:
  something of the form `(begin ,s ...), but got '(set! (+
rax rdi) 42)
> (check-paren-x64
  '(begin
     (set! (+ rax rdi) 42)))
check-paren-x64: Invalid statement; expected one of:
  a register, got '(+ rax rdi)
  a register, got '(+ rax rdi)
  something of the form `(set! ,reg_1 (,binop ,reg_2
,reg_3)), but got '(set! (+ rax rdi) 42)
  something of the form `(set! ,reg_1 (,binop ,reg_2
,int32)), but got '(set! (+ rax rdi) 42)
> (check-paren-x64
   '(begin
      (set! rax (+ rax 42))))
'(begin (set! rax (+ rax 42)))
> (check-paren-x64
   '(begin
      (set! rax (+ rdi 42))))
check-paren-x64: Invalid statement; expected one of:
  a 64 bit integer, got '(+ rdi 42)
  a register, got '(+ rdi 42)
  'rax to be identical to 'rdi
  'rax to be identical to 'rdi
> (check-paren-x64
    '(begin
       (set! rax 170679)
       (set! rdi rax)
       (set! rdi (+ rdi rdi))
       (set! rsp rdi)
       (set! rsp (* rsp rsp))
       (set! rbx 8991)))
'(begin
   (set! rax 170679)
   (set! rdi rax)
   (set! rdi (+ rdi rdi))
   (set! rsp rdi)
   (set! rsp (* rsp rsp))
   (set! rbx 8991))

The function check-paren-x64 is a validator for Paren-x64. It is in essence, a parser. It reads an arbitrary value, expected to represent a Paren-x64 program, and returns a valid program in the language Paren-x64, or raises an error.

It’s a simple parser, because it expects the input to be an s-expression, and we represents programs in our compiler as s-expressions. This means our parsers, like the rest of our compiler, are simple tree automata, rather than complex string automata.

You could view it as a type checker that checks for a single type: The-Paren-x64-Type, which every valid instruction has and has quite simple typing rules. Usually we think of parsers as transforming from one representations to another, but this parsers does not transform the input if it is valid. In this sense, it is a type checker: checking that the input program is already following the typing disciplines of the language of the language (which aren’t very restrictive).

Writing validators, like parsers or type checkers, for intermediate language programs produced by your compiler is a powerful debugging technique. By designing them in the same way as check-paren-x64, so that they return the input if it’s valid, you can easily add them as passes in your compiler and detect when an early pass produces an ill-typed program. This is a form of property-based testing, and will catch many more bugs than unit testing alone.

You are strongly encouraged to write and update validators for intermediate languages and use them for property-based testing.

Exercise 2: Design and implement an interpreter for Paren-x64, called interp-paren-x64. The input is a Paren-x64 expressions and the output is a Racket number.
Note that built into this signature is the fact that you may assume you always receives a valid Paren-x64 program. This means you do not have to write additional checks to validate the input; you have assumed the called has validated the input. You may also rely on any assumptions implied by being a valid Paren-x64 program. For example, in the instruction (set! reg_1 (+ reg_2 integer)), you can assume reg_1 and reg_2 are identical, and integer is a 32-bit integer, since otherwise the input would not have been a valid Paren-x64 program.
Start by writing examples of what you expect the behavior to be. These examples can serve as tests, but the first serve to help you understand the expected behavior of the program you will implement. One you understand the expected behavior, you will be better able to implement it.
You should write at least 10 unit tests for this function, to cover each case of the language, and to test edge cases like underflow and overflow.
To properly implement arithmetic operations, you need to handle two’s complement arithmetic, which overflows on large positive numbers and underflows on small negative numbers. You may want to use x64-add and x64-mul from a1-compiler-lib.rkt.
One thing to consider is how to implement access to uninitialized registers. Accessing an uninitiaized register undefined behavior in Paren-x64. Returning a random value here will make testing your interpreter difficult. However, picking a particular value will not make things easy either; when run on an undefiend program, the compiler and the interpreter may disagree about the result, but both the interpreter and the compiler could be correct.

Examples:

> (interp-paren-x64
   '(begin
      (set! rax 0)
      (set! rax (+ rax 42))))
42
> (interp-paren-x64
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
183

Exercise 5: Design and implement a racket function generate-x64 which compiles a Paren-x64 represented as an s-expression into a sequence of x64-411 instructions, represented as a tring. generate-x64 may assume it always receives a valid Paren-x64 program.
The output should be a sequence of x64-411 instructions represented as a string. x64-411 instructions are all valid x64 instructions, but the output of this pass will not yet be a valid x64 program. It will be missing x64 boilerplate, like the declaration of the start label, and a run-time system that communicate its result to the operating system.
You should start by writing examples and tests. You should write at least 7 unit tests, testing each feature of the source language.

Examples:

> (generate-x64
   '(begin
      (set! rax 0)
      (set! rax (+ rax 42))))
"\n  mov rax, 0\n  add rax, 42"
> (require racket/pretty)
> (pretty-display
   (generate-x64
    '(begin
       (set! rax 0)
       (set! rax (+ rax 42)))))
  mov rax, 0
  add rax, 42
> (pretty-display
   (generate-x64
    '(begin
       (set! rax 170679)
       (set! rdi rax)
       (set! rdi (+ rdi rdi))
       (set! rsp rdi)
       (set! rsp (* rsp rsp))
       (set! rbx 8991))))
  mov rax, 170679
  mov rdi, rax
  add rdi, rdi
  mov rsp, rdi
  imul rsp, rsp
  mov rbx, 8991

6.5.1.1 Run-time systems.

The abstractions provided by the operating system for running x64 are not the same as the convention we just created for Paren-x64. The operating system does not know it should "return" the result of rax, whatever "return" means. To implement this convention, we need to write some x64 code that will take any code implementing the Paren-x64 convention and communicate the result to the operating system. This is a very simple run-time system.

Our choice of run-time system depends on the abstractions provided by the language, the machine, the operating system, and the user interface we desire. Some languages print the final result. Some languages discard the final result, relying on the user program to print the result, or modify the filesystem, or modify the state of the machine in some other non-temporary way.

Our language does not provide the user with any way to modify the state of the machine, so our run-time system must do the job of communicating the return value to the user.

We could print the result, but as we saw in the factorial example above, the operating system’s definition of "print" does not match our intuition. When trying to print "120", we get the character "x". This would make for a very confusing user interface.

Instead, we opt for a very simple run-time system: communicate via the operating system exit code. This exit code is a number between 0 and 255 given to the exit system call, and is easily accessible in shells via the variable $? (or $status in some shells). In Racket, we can access the exit code of a subprocess using system/exit-code. This limits how much our programs can communicate; we will lift that restriction in later assignments.

Exercise 3: Design and implement a Racket function wrap-x64-run-time which installs the Paren-x64 run-time system. The input is same as the output for generate-x64: a sequence of x64-411 instructions as a string.
You will first need to write the run-time system, in x64, which expects a program to finish with a value in rax, and then calls the exit system call with the value of rax passed as the exit code.
a1-compiler-lib.rkt provides some helper definitions, such as sys-exit, that you may find useful. While this level of abstraction is not necessary in your compiler, it is useful if you want to try to support multiple operating systems.
For formatting strings in Racket, you may want to investigate format, ~a, and at-exp.

Exercise 4: Design and implement a Racket function wrap-x64-boilerplate that a sequence of x64 instructions, as a string, and wraps it in all the necessary boilerplate to make nasm compile. The input is a sequence of x64 instructions, represented as a string in Intel syntax, and the output is a fill x64 program, represented as a string in Intel syntax.
Note that the input is not necessary restricted to x64-411, since the input should be the same as the output of wrap-x64-run-time. The boilerplate added includes lines such as global start, and installing the start: label at the first instructions in the sequence.

Exercise 6: Test your compiler by running the same programs through your interpreter and through your compiler.
You should write at least 7 unit tests, testing each feature of the source language.

Examples:

> (interp-paren-x64
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
183
> (current-pass-list
   (list
    check-paren-x64
    generate-x64
    wrap-x64-run-time
    wrap-x64-boilerplate))
> (execute
   '(begin
      (set! rax 170679)
      (set! rdi rax)
      (set! rdi (+ rdi rdi))
      (set! rsp rdi)
      (set! rsp (* rsp rsp))
      (set! rbx 8991)))
183

For students eager for more compilers, we may include challenge problems in assignments. You do not have to complete challenge problems, and we will not give bonus points for completing them. However, completing them may give you additional instruction on various aspects of compilation.

Challenge 1: Design and implement a function check-registers-init that takes a valid Paren-x64 program and returns it only if the program uses only initialized registers, and raises an error if some register is used that might be uninitialized.
This function should not be used in check-paren-x64, but you may use it in your compiler pipeline by adding it to current-pass-list. Your interpreter may also assume it only receives programs that pass the output of this checker.

← prev up next →

1	CPSC 411 Syllabus
2	Course Calendar
3	The Compiler Design Recipe
4	Organizing Racket Code
5	Compiler 0: x64 to elf64
6	Compiler 1: Paren-x64 to x64
7	Compiler 2: Values-lang to x64
8	Compiler 3: Register Allocation (Original)
9	Compiler 3: Register Allocation (Corrected)
10	Compiler 4: Adding Control Flow (Original)
11	Compiler 4: Adding Control Flow (Corrected)
12	Assignment 5: The CPSC411 Compiler Recovery and Reinvestment Assignment of 2020.
13	Compiler 6: Calling Conventions
14	Compiler 7: Immediate Data Types
15	Compiler 8: Structured Data Types
16	Compiler 9: First-class Functions
17	Compiler 10: Recursive Data and Syntactic Sugar
18	Assignment 11: The CPSC411 Compiler Encore

6.1	Assignment Summary
6.2	Language Diagram
6.3	Preface: What’s bad about x64
6.4	x64-411: Our x64 Subset
6.5	Paren-x64: The First Source Language