Andrey Listopadov

On Scheme's Dots

@article @programming rust grscheme ~16 minutes read

This is my first public post that I wrote because I’ve finally thought that I’m clever enough to do so. Yet I must mention that I have never actually studied Programming Language Theory (PLT), so I’m not a specialist on the topic I gonna talk here.

Year ago I’ve decided that I should expand my knowledge in programming languages, and software development in general. I’ve already programmed for several years in C for bare metal, and in C++, as well as used some scripting languages, like Awk and Perl. But all these languages are mostly imperative style, so I wanted something different.

I had a friend talked about Lisp, and he showed me this post by Paul Graham and I was inspired by things that it describes. I’ve became interested in Lisp, so I got my digital copy of SICP and started reading it.

That was a year ago.

I’ve just finished chapter one.

Not that I’m stupid or something, but some exercises there are brutal. I remember that I’ve set 3 days on one task where you have to complete two missing lines. It may not be that hard for others, but certainly was very hard for me. So it took me a full year to go through the first chapter, but that only thing already changed how I look on the code today.

Additionally to SICP, I’ve studied Rust by reading TRPL. Because I wanted to learn both Rust and Scheme, I’ve decided that it will be great exercise to write my own Scheme in Rust! Except, I’ve only finished first chapter of SICP, and I don’t know how language works. And, actually, I don’t know anything about language design.

So how do you write a language in Rust, when you never studied PLT, and only programmed in C, barely know Rust, only toyed with different interpreted languages, finished first chapter of SICP, and wrote some very basic scripts in Scheme during it?

Well, that’s a good question, I guess. You may think that I decided to read about Scheme and Lisp, something like this, but no, actually I’ve decided that I’ll just throw some code to the Scheme REPL and see how it behaves.

And that’s where the fun started.

Reverse-Engineering Scheme

I’m not qualified reverse engineer, and this doesn’t actually have that much with real reverse engineering. I called it this way because the process of defining rules by behavior observation is somewhat related to reverse-engineering process, but does not go as deep as real reverse engineering does. I’ve took two languages, one was GNU Guile, which implements R5RS standard, and the other was Racket, which is not strictly a Scheme, but a Programming-Language Programming Language, and started throwing different code in theirs REPLs. The observed behavior of such concept as dot is what motivated me to write this post.

The good thing about Scheme, and actually one of the main reasons why it is so powerful, is that it is very homoiconic, and code directly represents the underlying data structure, which is a list. Lisp itself stands for List Processor, and I never gave this enough thought until I’ve started working on implementation of a language. You take most of information for granted (or at least I do) when it is about something you do not interact with too often.

So what is a list in Scheme? Well, anything that is enclosed inside parentheses. And we know that mostly everything is enclosed with parentheses in Lisp, so what’s the point of this information? Why everything should be a list anyway?

Because lists are easy to manipulate.

We need to know that Scheme has two main data structures: list and atom. Lists consist of atoms or another lists, and can be destructed to individual atoms, and atoms can be combined to lists. For example, digits like 1, 2, 3, are atoms. List of these digits is represented as '(1 2 3).

In Scheme there’s three primitive procedures that work with lists: cons, car, and cdr. Names may be not very representative, but that’s historical names (and Lisp has very long history) so just accept it for now. What these procedures do? cons stands for construction, and it is used to construct lists. Here’s what you will see if you start Racket REPL and input this code:

> (cons 1 '(2 3))
'(1 2 3)

Note: I’ll be using Racket’s REPL format to represent things in this post, because Racket’s default REPL is minimal and only consists out of > character for prompt and output is typically on the next line, compared to Guile’s REPL which is more busy looking with all of it’s additional info. The Guile was taken as reference language though. But in terms of this text, Guile and Racket behave the same.

car retrieves first element of the list and cdr returns all elements of the list after first one:

> (car '(1 2 3))
> (cdr '(1 2 3))
'(2 3)

Combined with recursion and functional style of programming we can traverse list by taking out first element, modify it, and create new lists, leaving the rest of the list for next iteration. That’s may sound like basic thing, but it is really powerful concept, because everything in Scheme is a list.

But I’ve actually lied to you. cons, car, and cdr doesn’t work with lists. Those work with pairs.

Like, what? What is a pair? Previously I’ve said everything is a list or atom, and now I say that there’s something other than that? Well, yes, pair is a pair of atoms, it is written as list, but values are separated by the dot: (1 . 2). Pair is not a list, but a more fundamental thing - a cons cell.

> (cons 1 2)
'(1 . 2)

And lists actually consist out of pairs.

If you have ever wrote a linked list in C or some language that has pointers, you know that any list consists of nodes. Node stores information and pointer to the next node. If there’s no node after the node it stores a pointer to a special location. It may point to itself, or to void or somewhere else, so you could distinguish the end of the list and stop traversing it. How many nodes do linked list has to have to be called list? The answer is one, but without pointer to the next special node you can’t use it, so actually minimal list consists of two elements: the element with the data and empty node or void. So that thing also can be called a pair.

Well, at leas this can be thought that way. And this actually aligns with what happens in Scheme! Look:

> (cons 1 '())

Because lists are essentially nested pairs, and the last pair in the list has '() (empty list) as it’s second atom. We can nest cons to produce longer lists:

> (cons 1 (cons 2 (cons 3 '())))
'(1 2 3)

Why does this work? Because, given that cons produces a pair, if we evaluate the code ourselves, we will get this:

'(1 . (2 . (3 . ())))

Let’s explain why '(1 . (2 . (3 . ()))) evaluates to '(1 2 3). My observations lead to me to think that dot has two rules. The rules are as follows:

  1. Dot must be the second to last element in the list,
  2. Dot can be removed if after the dot comes open parenthesis, alongside with open parenthesis and matching closing parenthesis.

Because (cons 3 '()) produces a pair of dot separated values, and the first one is 3, and the other is '() we get the '(3 . ()) result (because of how quote works we remove ' from '() and get () in the resulting pair). Now let’s apply the dot’s second rule to this code. First let’s remove dot in the most nested example: '(3 . ()) becomes '(3), so now we have '(1 . (2 . (3))). We can repeat this to (2 . (3), producing (2 3). Repeating this the last dot produces the simple list: '(1 2 3).

Note: The rules above may not be actual thing though, because dot is just syntax on top of internal data structure, and it’s not stored in the list, but we’ll talk about it later.

So why does car and cdr work on lists, if those are actually working on pairs? Because car removes the element before the first dot, and cdr returns everything after it!

> (car '(1 . (2 . (3 . ()))))
1 ;; 1 is before the `. (2 . (3 . ()))'
> (cdr '(1 . (2 . (3 . ()))))
'(2 . (3 . ())) ; or '(2 3) for short

So every time we see something like (a b c), the interpreter actually sees (a . (b . (c . ()))). That’s why car and cdr work both with pairs and lists. Because lists consist out of pairs! And that’s really beautiful.

And then I saw pure beauty

Yet, before I’ve fully understood what I seen, I’ve thought that my beautiful observation is not correct at all. But first we need to know how to define procedures (functions) in Scheme:

> (define (sum a b) (+ a b))

This constructs a procedure sum that accepts two arguments a and b and produces the sum of those as a result with the following body: (+ a b). We can call this procedure like this:

> (sum 1 3)

Because I was figuring out the rules behind the dot, which I’ve stated before, I was wondering what would happen if I use dot in this syntax, like this:

(define (foo . bar) bar)

Not only this did not produce any errors, this also works in a strange way:

> (foo 22)

And at first glance this looks like nonsense (well at leas to me, remember, I didn’t finished SICP yet) and I’ve expected it to fail, because dot seem to be in the wrong place but it worked just fine.

“What the hell?” - I’ve thought. I’ve tried to pass two arguments to the function and it still worked:

> (foo 1 2)
'(1 2)

At this point I was thinking that everything I’ve though I understand is wrong, but after some long thinking process I’ve understood this and was shocked how beautiful this thing is.

So what’s the problem here? I’ve shown you the syntax for defining sum procedure: (define (sum a b) (+ a b), but what’s actually going underneath is that anonymous procedure with arguments (a b) is being bound to the name sum like so:

(define sum (lambda (a b) (+a b))

We can write all our procedures this way, but it is tedious to do. So previous form with parentheses around sum a b is a shorthand for this form.

So what’s happening when we put the dot into the function definition? To understand this we have to understand why everything is a list, and why we need primitive operations on lists.

Note: The following part of this post mostly relies on what I’ve already did with my own implementation of scheme-like language and how it works, but I think that the real scheme works somewhat similar. I will write more about my language later this post and in the following post that is yet to come.

To convert procedure from form (define (foo bar) bar) to (define foo (lambda (bar) bar)) we need to know two things: procedure’s name, to put after define and it’s arguments to put into lambda. So we car the list (foo bar) and receive foo. Now we need the argument list and we cdr the (foo bar), and get (bar). Now we can put foo after define and (bar) after lambda: (define foo (lambda (bar) bar).

What will happen if we cdr the list which has dot?

> (cdr '(foo . bar))

Which means that we getting bar without parentheses around it. So doing all previous steps for (define (foo . bar) bar) expression gives us (define foo (lambda bar bar)) where lambda has no parentheses around it’s argument. Let’s try to use such form of lambda directly:

> ((lambda x x) 1)
> ((lambda x x) 1 2)
'(1 2)

This is a procedure with variable argument count! Why? Let’s car and cdr this code!

First we evaluate (lambda x x), which provides #<procedure> in Racket:

(#procedure 1)

Now we can car what we will apply, and cdr it’s arguments:

(car (#procedure 1))
(cdr (#procedure 1))

So the arguments of the function is '(1). Let’s recap what our lambda does. It was defined as (lambda x x), which means that it takes 1 thing and returns it immediately. The thing it takes is this '(1), that’s why we receive '(1) if we call this lambda with 1.

Doing the same for ((lambda x x) 1 2) provides '(1 2) as applied argument.

Why this doesn’t work for lambdas which have their argument in the parentheses? My guess is that it is because we provide list of certain length, and since we’re passing a list, the items in one list gets bound to the items in another, in lambda body.

OK, let’s recap what we know about Scheme so far:

  • Everything is a list;
  • Scratch that, everything is a pair, and every list is a sequence of nested pairs, ending with an empty list;
  • pair is a cons cell, which uses dot to separate elements;
  • car and cdr use dot to decide what to take out of the cons cell.

Applying the Concepts Above In the Interpreter.

When I’ve started writing this post I had no idea that I still will be writing it after I’ll have working interpreter prototype. It is located here, and it uses concepts developed above to evaluate the code. It’s not very good though.

Whole language is implemented around these base functions:

  • parse - parses text into GRScheme data structure,
  • eval - which is recursively traverses the tree,
  • apply - which applies procedure to the tree,
  • get_first_subexpression - which takes first subtree from current tree,
  • get_rest_subexpression - which takes the rest subtrees as a tree.

Other functions in evaluator source code provide builtin facilities like car, +, lambda, define etc.

First thing I have to say, that GRScheme doesn’t have such thing as dot. It is emulated to behave the same way as in real Scheme. I suppose that regular Scheme does not have dot either, but in other sense.

After parsing the code we receive tree structure that has nodes with data, and tree directly represents original code. This allows me to think about the Scheme code in a very native way, because it is already a tree, and it actually simplifies parsing a lot. I store parentheses as tree items, atoms are also tree items, but typically only parentheses have child nodes. This allows later traversing algorithm to check if current node is a procedure call by just checking if current node is an open parenthesis.

The get_first_subexpression and get_rest_subexpression functions are not exactly car and cdr as it may seem, but a more general ones that operate directly on the program AST. But because the dot is emulated in my implementation those currently also check for the . in the tree. If it is presented, get_rest_subexpression returns the subtrees without it.

This allows us destruct pairs and list with these two functions, but we can’t use those directly. Instead car and cdr are implemented in terms of these functions, also applying quoting on the result, but that’s a topic for a different post.

I’ve mentioned that dot is emulated in my implementation of the interpreter. What this means is that I store the dot directly inside the tree as a node. Let’s have a look at how '(a . b) looks in my language’s data structure:

Figure 1: '(a . b) Pair representation in GRScheme

Figure 1: '(a . b) Pair representation in GRScheme

Note: because I write quoted lists in the short form, there is a possibility of misunderstanding why the tree in the figures doesn’t directly match the input, e.g. has more parentheses. For instance '(a . b) is actually (quote (a . b)) and it is represented in the tree this way.

We have a tree, with a root node holding ( value, which has two nodes: quote and open parenthesis. This second ( has three nodes, which are a, ., and b.

In real Scheme, this isn’t stored this way. Because pair is really a pair in scheme, cons cell actually stores two pointers to the data like this:

Figure 2: '(a . b) Pair in Scheme

Figure 2: '(a . b) Pair in Scheme

Lists are a bit different. In my implementation, the structure to which I parse the source code is a tree with any amount of child nodes per node, as seen in the pair example above That’s how '(1 2 3) list is stored in my implementation:

Figure 3: '(1 2 3) List in GRScheme

Figure 3: '(1 2 3) List in GRScheme

This is a tree, which is similar to the pair one, except instead of the dot we have 2. Let’s write list as a sequence of pairs '(1 . (2 . (3 . ()))) and see how it will be stored in my implementation:

Figure 4: Dotted list in GRScheme

Figure 4: Dotted list in GRScheme

We can see, that each subtree has exactly three nodes in it. However in a real scheme, it’s two nodes per tree, and the very same sequence of nested pairs is stored like this:

Figure 5: '(1 2 3) List in Scheme

Figure 5: '(1 2 3) List in Scheme

Because any list can be written as sequence of pairs, we can see that any list in real scheme is actually a binary tree. Not in case of my language though.

The parser that I’ve wrote for GRScheme knows how to deal with dots, thanks to the dot rules I’ve described above, and these two constructions are parsed to exactly same form of:

Figure 6: Both notations in GRScheme

Figure 6: Both notations in GRScheme

And lists that don’t end with empty list, like '(1 2 . 3) and '(1 . (2 . 3)) are parsed like this:

Figure 7: List with pair at the end in GRScheme

Figure 7: List with pair at the end in GRScheme

Which in real scheme will look like this:

Figure 8: List with pair at the end in Scheme

Figure 8: List with pair at the end in Scheme

So in case of car in real scheme we get 1, and in case of cdr we get the subtree '(2 . 3) when we apply it to this list, and 3 when we apply it again to previous result. This works the same way in my implementation, because we analyze the tree when we do cdr to see if there’s a dot in the second position, which may not be optimal, but works.

Dot in GRScheme

I’ve thought on dots during this post, and I think it is a very beautiful syntax solution to describe binary trees that do not end with null pointer. However this also means that dots are actually needed only for that case. Though we can emulate dots by storing those inside tree directly, I don’t think that this is a good idea to do so.

Because the underlying structure in GRScheme is a graph, and not a binary tree, I think pairs are also meaningless, because those describe particular tree case in Schemes, and in GRScheme we do not have this case. So one major difference from real scheme is that in GRScheme there’s no cons cells, and some algorithms may depend on extracting the second element from the cons cell directly.

I’ve thought that it may be good to have special procedure for that case, named along with car and cdr - Contents of Cons cell Representation. ccr should return second element directly if it is the last element in the list, and return the rest of the list otherwise. This way it will work mostly the same as cdr but cdr always returns the list. I’m not sure on how good this solution is, because this may be ambiguous when to use such procedure, so there’s another procedure that may be used instead, which is cadr, that always takes second element from the list.

I’m planning to write second post about GRScheme diving more deeply into the details behind language model I’ve developed. Current implementation is slightly wrong and has many problems, so I have to fix those first.