17 Compiler 10: Recursive Data and Syntactic Sugar

7.5.0.17

17 Compiler 10: Recursive Data and Syntactic Sugar

17.1 Assignment Summary

The goal of this assignment is to add recursive data to our language. This allows us to create streams or cyclic graphs, which are useful in a variety of algorithms (such as writing control-flow graph algorithms for compilers). With this feature, we’ll finally be able to completely collapse distinctions in our surface syntax.

We’ll also add a little syntactic sugar to the surface language. This isn’t really related to recursive data, but the assignment is small if we don’t merge the two separate goals into a single assignment.

This assignment is due Friday, April 10, 2020 at 11:59pm.

You can use the reference solution here: https://www.students.cs.ubc.ca/~cs-411/2019w2/a10-interrogator.cgi

Assignment Checklist

You should find a new repository in your https://github.students.cs.ubc.ca account named a10_<team ids> with a code skeleton and a support library. You should complete the assignment in that git repository and push it to the GitHub Students instance.

You should first merge your solution to Compiler 9: First-class Functions with the starter code provided. The new starter code has the correct provides and includes a submodule to help you run your compiler on the command line if you want. The name of the skeleton is a10-skeleton.rkt to avoid accidentally overwriting your files, but your file in the Git repository should be named a10.rkt.

17.2 Language Diagram

17.3 Preface: What’s wrong with Exprs-Lang v9

Exprs-lang v9 gave us the ability to write first-class lexically scoped local functions, a powerful tool for abstracting over computation. However, without the ability to write recursive data, we could only write simple non-recursive local functions. This makes implementing local loops annoying; we had to go through top-level definitions instead. It also means we can’t directly implement other cyclic data structures, like streams or control-flow graphs! By adding recursive data, we can implement arbitrary cyclic data structures, and get first-class lexically-scoped recursive functions (generalized loops) as a happy side effect.

Many languages support cyclic data through imperative mutation, but as we’ll see in this assignment, that’s a low-level implementation technique that the user need not be required to use.

This week, we’ll add letrec to the surface language, defining Racketish, a tiny Racket-like core language. It’s missing a few key features from Racket, but it’s a substantial start.

Most of Racket is implemented using its macro system, thereby extending Racket in Racket, which all elaborates to a simple core language.

p	::=	(module b ... e)

b	::=	(define x e)

e	::=	v
	\|	(apply e e ...)
	\|	(let ([x e]) e)
	\|	(letrec ([x e] ...) e)
	\|	(if e e e)

v	::=	fixnum
	\|	prim-f
	\|	x
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (x ...) e)

prim-f	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=
	\|	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	procedure?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref
	\|	procedure-arity

We add letrec, which we saw in the last assignment, to the surface language. Unlike last week, it is not restricted to bind variables only to procedures. This apparently simple change has deep consequences.

Further, since we have names instead of abstract locations, letrec is restricted to only bind each name once in a single letrec.

17.4 Preliminary Passes

17.4.1 uniquify

First we modify uniquify. Below we define Racketish-Unique.

p	::=	(module b ... e)

b	::=	(define x aloc e)

e	::=	v
	\|	(apply e e ...)
	\|	(let ([x aloc e]) e)
	\|	(letrec ([x aloc e] ...) e)
	\|	(if e e e)

v	::=	fixnum
	\|	prim-f
	\|	x
	\|	aloc
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (x aloc ...) e)

prim-f	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=
	\|	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	procedure?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref
	\|	procedure-arity

As usual with uniquify, the only change is that all names x are replaced by abstract locations aloc.

Exercise 1: Redesign and extend the implementation of uniquify. The source language is Racketish and the target language is Racketish-Unique.
Be careful with the binding for letrec.

17.4.2 define->letrec

Next we elaborate define into letrec, as before. Below we define Just-Exprs-lang v10.

p	::=	(module b ... e)

b	::=	(define aloc e)

e	::=	v
	\|	(apply e e ...)
	\|	(letrec ([aloc e] ...) e)
	\|	(let ([aloc e]) e)
	\|	(if e e e)

v	::=	fixnum
	\|	prim-f
	\|	aloc
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (aloc ...) e)

prim-f	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=
	\|	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	procedure?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref
	\|	procedure-arity

Exercise 2: Redesign and extend the implementation of define->letrec. The source language is Racketish-Unique and the target language is Just-Exprs-lang v10.

17.5 Resolving Recursive Binding

The rest of our compiler assumes that letrec only binds procedures. We need to purify all letrecs so this is true. This is non-trivial, because we could have bound a recursive data structure. Consider following Racket example:

Example:

> (letrec ([x.1 (cons 1 (lambda () x.1))])
(+ (car x.1) (car ((cdr x.1)))))
2

x.1 is an encoding of an infinite stream of 1s.

To translate a recursive data structure into a let-bound data structure, we need mutable variables. We add mutable variables to the target language, and transform the above example into:

Example:

> (let ([x.1 (void)])
    (let ([x.2 (cons 1 (lambda () x.1))])
      (begin
        (set! x.1 x.2)
        (+ (car x.1) (car ((cdr x.1)))))))
2

We don’t want to do this transformation for every letrec, because set! will be more expensive to compile than what we can do for letrec-bound procedures.

We’ve actually already seen this transformation; we did it when implementing closures, although its details were somewhat obscured. Closures are a form of recursive data—they have a binding to themselves in their environment. When we implemented closures as procedures, we translated the cletrec form into a let followed by several unsafe-procedure-set! calls to tie the knot.

17.5.1 purify-letrec

We define Landin-knot-lang v10.

p	::=	(module e)

c	::=	(set! aloc e)

e	::=	v
	\|	(apply e e ...)
	\|	(letrec ([aloc e] ...) e)
	\|	(let (info ((assigned (aloc ...)))) ([aloc e] ...) e)
	\|	(if e e e)
	\|	(begin c ... e)

v	::=	fixnum
	\|	prim-f
	\|	aloc
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (aloc ...) e)

prim-f	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=
	\|	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	procedure?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref
	\|	procedure-arity

Now letrec only binds procedures. We also allow n-ary binding in let, to simplify the compiler pass. We add a single computation, (set! aloc e). Each let form is annotated with an info structure containing the list of alocs that are assigned in its body, and is used to keep track of mutable alocs.

To purify letrec to only bind procedures, we transform `(letrec ([,xs ,es] ...) ,e) into the following, partitioning the original bindings `([,xs ,es] ...).

`(let (info ((assigned ())))
   ([,xs_s ,es_s] ...)
   (let (info ((assigned (,xs_c ...))))
     ([,xs_c (void)] ...)
       (letrec ([,xs_l ,es_l] ...)
          (let (info ((assigned ())))
            ([,xs_t ,es_c] ...)
              (begin
                (set! ,xs_c ,xs_t) ...
                 ,e)))))

where

`([,xs_s ,es_s] ...) are all simple bindings
`([,xs_l ,es_l] ...) are all lambda bindings
`([,xs_c ,es_c] ...) are all complex bindings
`(,xs_t ...) are all fresh auxiliary alocs.

A binding `[,x ,e] is simple binding if the bound expression e contains no occurrences of the variables xs bound by the letrec expression and no applications unless the application is a prim-f with simple operands. We can also reduce compile time by considering letrec non-simple; otherwise, the simple check may be non-linear in the number of let-bindings. We may get better run-time performance by trying to check if a nested letrec is really simple, though.

The restriction on applications is to avoid problems if we ever add control-flow constructs, such as call/cc, to our language.

A binding `[,x ,e] is lambda if e is literally a lambda value.

Otherwise, the binding is complex.

Complex bindings `[,x ,e] are compiled as follows. First, we set up an initial let-binding to x to (void). Then, we introduce an auxiliary binding `[,xt ,e]. In it, e now has the correct binding, referring to x. But x does not have the right value. So we finally mutate x to have the value of xt, so now both xt and x have the same value, and the right recursive binding structure. This implements recursion using back-patching mutation, a technique called Landin’s Knot.

Exercise 3: Design and implement purify-letrec. The source language is Just-Exprs-lang v10 and the target language is Landin-knot-lang v10.
The reference implementation takes an optional second parameter specifying an optimization level. (curryr purify-letrec 0) (the default) does not treat nested letrec as simple. (curryr purify-letrec 1) tries to detect simple nested letrecs.

17.5.2 convert-assigned

Previously, we did not expose set! beyond the backend of the compiler, so we must implement set!. We cannot simply expose set! from the backend of the compiler, because this version of set! behaves differently. It acts much more like Racket’s set-box!.

This set! is forcing the variable to act like a heap-allocated location that is implicitly dereferenced. In the backend of our compiler, set! simply moved a value from one location to another. That can’t be what happens here. Consider the Racket example from earlier:

(let ([x.1 (void)])
  (let ([x.2 (cons 1 (lambda () x.1))])
    (begin
      (set! x.1 x.2)
      (+ (car x.1) (car ((cdr x.1)))))))

If this set! had a move semantics, i.e., it simply moved values around, then first the value (void) would be moved into x.1, and next the value (cons 1 (lambda () (void))) would be moved into x.2, so we would never get a recursive data structure.

Instead, this set! is forcing the variable x.1 to act as a box (essentially, as a pointer), and any reference to x.1 acts as an implicit dereference of the box. The example would make a great deal more sense if written as:

(let ([x.1 (box (void))])
  (let ([x.2 (cons 1 (lambda () (unbox x.1)))])
    (begin
      (set-box! x.1 x.2)
      (+ (car (unbox x.1)) (car ((cdr (unbox x.1))))))))

So now, we need to implement set! and the assigned locations to give them a box-like semantics, rather than a move semantics.

We do this using vectors.

Vectors use more memory than we need. We could reduce memory usage by adding a primitive box data type. However, this requires adding a new tag, modifying complex passes, and has little pedagogical value.

Below we define Landin-vec-lang v10. We no longer have set! or impure computation; everything must be let-bound once more.

p	::=	(module e)

c	::=	(set! aloc e)

e	::=	v
	\|	(apply e e ...)
	\|	(letrec ([aloc (lambda (aloc ...) e)] ...) e)
	\|	(let (info ((assigned (aloc ...)))) ([aloc e] ...) e)
	\|	(if e e e)
	\|	(begin c ... e)

v	::=	fixnum
	\|	prim-f
	\|	aloc
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (aloc ...) e)

prim-f	::=	_...

Remember that this pass happens before implement-safe-primops, and the safe version of vector-set! must be let-bound since it might return an error value. You might think about how to redesign the compiler to avoid the unnecessary dynamic checks we’re about to introduce.

We translate each assigned abstract location into a vector, and each reference to an assigned location to a dereference from the vector, and each set! to a vector-set!. For example:

`(let (info ((assigned (x.1))))
      ([x.1 (void)])
   (let (info ((assigned ())))
     ([tmp.2 (apply cons 1 (lambda () x.1))])
     (begin
       (set! x.1 tmp.2)
       x.1)))
=>
`(let ([x.1 (apply make-vector 1)])
   (let ([tmp.2 (apply cons 1 (lambda () (apply vector-ref x.1 0)))])
     (let ([tmp.3 (apply vector-set! x.1 0 tmp.2)])
       (apply vector-ref x.1 0))))

Exercise 4: Design and implement the function convert-assigned. The source language is Landin-knot-lang v10 and the target language is Landin-vec-lang v10.

17.5.3 dox-lambdas, implement-safe-primops, uncover-free, convert-closures

Now we just need to make interfaces match. We’ve moved multi-arity let up a few passes, so we need to adapt the languages between here and sequentialize-let.

Exercise 5: Redesign and extend the implementation of
dox-lambdas, whose source language is Landin-vec-lang v10. You need to add support for multi-arity let.
implement-safe-primops, you need to add support for multi-arity let in the source language.
uncover-free, you need to add support for multi-arity let in the source language.
convert-closures, you need to add support for multi-arity let in the source language.

17.6 Syntactic Sugar

17.6.1 Preface: What’s wrong with Racketish?

Racketish is a good start at a core language. You could start building standard libraries and get to work. But it’s missing a handful of nice features. Most languages support various complex data literals, such as Racket’s quote form for lists. They also support features you cannot encode as functions, like short-circuiting boolean and and or.

These get elaborated very early. In Racket, these are mostly implemented as macros, a user-extensible system for expressing syntactic rewrites. We’ll design a compiler pass that isn’t quite a macro system, but should give you an idea of how a macro system works, and give you the ability to add new syntactic sugar to the language.

Below, we define Racketish-Surface.

p	::=	(module b ... e)

b	::=	(define x e)

e	::=	v
	\|	’s-expr
	\|	#(e ...)
	\|	(apply e e ...)
	\|	(e e ...)
	\|	(macro-id e ...)
	\|	(let ([x e]) e)
	\|	(letrec ([x e] ...) e)
	\|	(if e e e)

v	::=	fixnum
	\|	prim-f
	\|	x
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (x ...) e)

s-expr	::=	#t
	\|	#f
	\|	fixnum
	\|	ascii-char-literal
	\|	()
	\|	(s-expr ...)

macro-id	::=	and
	\|	or
	\|	quote
	\|	vector
	\|	begin

prim-f	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=
	\|	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	procedure?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref
	\|	procedure-arity

We add two special syntaxes, ’s-expr for quoted lists, and #(e ...) for vector literals. Note that our language does not have symbols, so our s-expr differs from Racket’s. ’s-expr is a reader macro in Racket and is implicitly elaborated to (quote s-expr). #(e ...) is the notation for vector literals, and is equivalent to (vector e ...), but not syntactically identical in the same way ’s-expr and (quote s-expr) are.

Because Racket’s reader supports them, we do not have to do anything to add them! We avoid implementing a parser once more (huzzah!). This is one interesting feature of bootstrapping—defining a language in itself.

Parsing is often described as a solved problem, but it is not, and is easy to get wrong, and doing a bad job can summon the dark gods to our realm. We do not want to summon dark gods, so we worked directly with data in the host language, and it was fine.

You can see the two are equivalent in Racket by trying typing the following in to the REPL or in the interactions frame in DrRacket:

'(1 2 3)

(quote (1 2 3))

#(1 2 3)

(vector 1 2 3)

I can’t typeset these examples using my normal REPL typesetting feature since Racket implicitly renders them identically.

Using match on this notation requires a little care, since our program is itself a quoted datum in Racket, and match treats them slightly differently.

(match '(quote (1 2 3))
  [`(quote ,es) es])

(match ''(1 2 3)
  [`(quote ,es) es])

(match '#(1 2 3)
  [`#(,es ...) es])

(match '#(1 2 3)
  [(vector es ...) es])

;This one will fail to match
(match '#(1 2 3)
   [`(vector ,es ...) es])

We also introduce a new form for applying macros, and add a few macros. We also make function application implicit; essentially, anything that looks like an application form is either a macro application, or, is implicitly a function application.

17.6.2 uniquify

Macros can often introduce new variable binding, so it’s still important to have some mechanism for handling fresh variables. Before we expand, we once more uniquify.

Below we define Racketish-Surface-Unique, updating it to include the new forms and macros. We typeset changes with respect to Racketish-Surface.

p	::=	(module b ... e)

b	::=	(define x aloc e)

e	::=	v
	\|	’s-expr
	\|	#(e ...)
	\|	(e e ...)
	\|	(macro-id e ...)
	\|	(let ([x aloc e]) e)
	\|	(letrec ([x aloc e] ...) e)
	\|	(if e e e)

v	::=	fixnum
	\|	prim-f
	\|	x
	\|	aloc
	\|	#t
	\|	#f
	\|	()
	\|	(void)
	\|	(error uint8)
	\|	ascii-char-literal
	\|	(lambda (x aloc ...) e)

s-expr	::=	#t
	\|	#f
	\|	fixnum
	\|	ascii-char-literal
	\|	()
	\|	(s-expr ...)

macro-id	::=	and
	\|	or
	\|	quote
	\|	vector
	\|	begin

prim-f	::=	*
	\|	+
	\|	-
	\|	eq?
	\|	<
	\|	<=
	\|	>
	\|	>=
	\|	fixnum?
	\|	boolean?
	\|	empty?
	\|	void?
	\|	ascii-char?
	\|	error?
	\|	not
	\|	pair?
	\|	procedure?
	\|	vector?
	\|	cons
	\|	car
	\|	cdr
	\|	make-vector
	\|	vector-length
	\|	vector-set!
	\|	vector-ref
	\|	procedure-arity

Exercise 6: Redesign and extend the implementation of uniquify. The source language is Racketish-Surface and the target language is Racketish-Surface-Unique.
To support implicit application, you’ll probably want predicates for determining when something is a macro id, and when something is a value keyword.
To handle ’s-expr, remember that it is implicitly elaborated to (quote s-expr) by Racket.

17.6.3 expand-macros

Now we can expand macros. Ideally, this is designed to be extensible, so we can add new macros at will, or even allow the user to define macros. We don’t force you to design your compiler this way, though.

Our target language is the (uniquified) core language: Racketish-Unique. All macros will be elaborated to existing features. We also elaborate implicit function application into explicit apply.

The transformation we want to define are given below:

(and v ...)
`(and) => #t
`(and ,e) => e
`(and ,e ,es ...) => `(if ,e (and ,es ...) #f)
When and is given no arguments, we expand to #t, since #t is the multiplicative identity element on booleans (like 1 is for integers and multiplication). When given a single argument, we return it, since Racket is falsey. Otherwise, we elaborate to branch on the first argument, and recursively expand and applied to the rest of its arguments.
(or v ...)
`(or) => #f
`(or ,e) => e
`(or ,e ,es ...) => `(let ([,tmp ,e]) (if ,tmp ,tmp (or ,es ...)))
When or is given no arguments, we expand to #f, since #f is the additive identity element on booleans (like 0 is for integers and addition). When given a single argument, we return it, since Racket is falsey. Otherwise, we bind the first argument to an auxiliary variable, return it if its true, and otherwise return the or of the rest of the arguments.
Because we introduce an auxiliary variable, we see why it’s important for macros to work with unique variables. Otherwise, this could accidentally capture any use of a free variable in es ..., just like any compiler pass that introduces auxiliary variables.

(vector e ...) and #(e ...)

`(vector ,es ...)
=>
`(let ([,tmp (make-vector ,(length es))])
   (begin
     (vector-set! ,tmp 0 ,(list-ref es 0))
     ...
     (vector-set! ,tmp ,(sub1 (length es)) ,(list-ref es (sub1 (length es))))
     tmp))

(quote s-expr)
`',v => v
`'(,s-expr ,s-exprs ...)
=>
`(cons ',s-expr '(,s-exprs ...))
(begin e ...)
`(begin) => '(void)
`(begin ,e) => e
`(begin ,e ,es ...)
=>
`(let ([,tmp ,e])
   (if (error? ,tmp)
       ,tmp
       `(begin ,es ...)))
We add a surface-level begin form that handles propagating errors correctly. This is essentially an implementation of do for the monad.

These rewrites are essentially syntax-case macros, recursive pattern-based macros that can perform computation at compile time.

Notice that each macro can generate more uses of macros. As a result, the recursive structure of this elaborate pass is going to be slightly different from other passes.

Exercise 7: Design and implement the function expand-macros. The source language is Racketish-Surface-Unique and the target language is Racketish-Unique.
Try designing the pass so that each of the above macros is defined in its own helper function, and make expand-macros handle macros completely generically. You’ll be part of the way to an implementation of a macro system that supports user-defined macros. If you exposed an interface in Racketish-Surface and used eval...

17.7 )

The end.

Exercise 8: Good luck. Have fun.

← prev up next →

1	CPSC 411 Syllabus
2	Course Calendar
3	The Compiler Design Recipe
4	Organizing Racket Code
5	Compiler 0: x64 to elf64
6	Compiler 1: Paren-x64 to x64
7	Compiler 2: Values-lang to x64
8	Compiler 3: Register Allocation (Original)
9	Compiler 3: Register Allocation (Corrected)
10	Compiler 4: Adding Control Flow (Original)
11	Compiler 4: Adding Control Flow (Corrected)
12	Assignment 5: The CPSC411 Compiler Recovery and Reinvestment Assignment of 2020.
13	Compiler 6: Calling Conventions
14	Compiler 7: Immediate Data Types
15	Compiler 8: Structured Data Types
16	Compiler 9: First-class Functions
17	Compiler 10: Recursive Data and Syntactic Sugar
18	Assignment 11: The CPSC411 Compiler Encore

17.1	Assignment Summary
17.2	Language Diagram
17.3	Preface: What’s wrong with Exprs-Lang v9
17.4	Preliminary Passes
17.5	Resolving Recursive Binding
17.6	Syntactic Sugar
17.7	)