This is my first public post that I wrote because I’ve finally thought that I’m clever enough to do so. Yet I must mention that I have never actually studied Programming Language Theory (PLT), so I’m not a specialist on the topic I gonna talk here.
Year ago I’ve decided that I should expand my knowledge in programming languages, and software development in general. I’ve already programmed for several years in C for bare metal, and in C++, as well as used some scripting languages, like Awk and Perl. But all these languages are mostly imperative style, so I wanted something different.
I had a friend talked about Lisp, and he showed me this post by Paul Graham and I was inspired by things that it describes. I’ve became interested in Lisp, so I got my digital copy of SICP and started reading it.
That was a year ago.
I’ve just finished chapter one.
Not that I’m stupid or something, but some exercises there are brutal. I remember that I’ve set 3 days on one task where you have to complete two missing lines. It may not be that hard for others, but certainly was very hard for me. So it took me a full year to go through the first chapter, but that only thing already changed how I look on the code today.
Additionally to SICP, I’ve studied Rust by reading TRPL. Because I wanted to learn both Rust and Scheme, I’ve decided that it will be great exercise to write my own Scheme in Rust! Except, I’ve only finished first chapter of SICP, and I don’t know how language works. And, actually, I don’t know anything about language design.
So how do you write a language in Rust, when you never studied PLT, and only programmed in C, barely know Rust, only toyed with different interpreted languages, finished first chapter of SICP, and wrote some very basic scripts in Scheme during it?
Well, that’s a good question, I guess. You may think that I decided to read about Scheme and Lisp, something like this, but no, actually I’ve decided that I’ll just throw some code to the Scheme REPL and see how it behaves.
And that’s where the fun started.
I’m not qualified reverse engineer, and this doesn’t actually have that much with real reverse engineering. I called it this way because the process of defining rules by behavior observation is somewhat related to reverse-engineering process, but does not go as deep as real reverse engineering does. I’ve took two languages, one was GNU Guile, which implements R5RS standard, and the other was Racket, which is not strictly a Scheme, but a Programming-Language Programming Language, and started throwing different code in theirs REPLs. The observed behavior of such concept as dot is what motivated me to write this post.
The good thing about Scheme, and actually one of the main reasons why it is so powerful, is that it is very homoiconic, and code directly represents the underlying data structure, which is a list. Lisp itself stands for List Processor, and I never gave this enough thought until I’ve started working on implementation of a language. You take most of information for granted (or at least I do) when it is about something you do not interact with too often.
So what is a list in Scheme? Well, anything that is enclosed inside parentheses. And we know that mostly everything is enclosed with parentheses in Lisp, so what’s the point of this information? Why everything should be a list anyway?
Because lists are easy to manipulate.
We need to know that Scheme has two main data structures: list and atom.
Lists consist of atoms or another lists, and can be destructed to individual atoms, and atoms can be combined to lists.
For example, digits like
3, are atoms.
List of these digits is represented as
'(1 2 3).
In Scheme there’s three primitive procedures that work with lists:
Names may be not very representative, but that’s historical names (and Lisp has very long history) so just accept it for now.
What these procedures do?
cons stands for construction, and it is used to construct lists.
Here’s what you will see if you start Racket REPL and input this code:
> (cons 1 '(2 3)) '(1 2 3)
Note: I’ll be using Racket’s REPL format to represent things in this post, because Racket’s default REPL is minimal and only consists out of
>character for prompt and output is typically on the next line, compared to Guile’s REPL which is more busy looking with all of it’s additional info. The Guile was taken as reference language though. But in terms of this text, Guile and Racket behave the same.
car retrieves first element of the list and
cdr returns all elements of the list after first one:
> (car '(1 2 3)) 1 > (cdr '(1 2 3)) '(2 3)
Combined with recursion and functional style of programming we can traverse list by taking out first element, modify it, and create new lists, leaving the rest of the list for next iteration. That’s may sound like basic thing, but it is really powerful concept, because everything in Scheme is a list.
But I’ve actually lied to you.
cdr doesn’t work with lists.
Those work with pairs.
What is a pair?
Previously I’ve said everything is a list or atom, and now I say that there’s something other than that?
Well, yes, pair is a pair of atoms, it is written as list, but values are separated by the dot:
(1 . 2).
Pair is not a list, but a more fundamental thing - a cons cell.
> (cons 1 2) '(1 . 2)
And lists actually consist out of pairs.
If you have ever wrote a linked list in C or some language that has pointers, you know that any list consists of nodes.
Node stores information and pointer to the next node.
If there’s no node after the node it stores a pointer to a special location.
It may point to itself, or to
void or somewhere else, so you could distinguish the end of the list and stop traversing it.
How many nodes do linked list has to have to be called list?
The answer is one, but without pointer to the next special node you can’t use it, so actually minimal list consists of two elements: the element with the data and empty node or void.
So that thing also can be called a pair.
Well, at leas this can be thought that way. And this actually aligns with what happens in Scheme! Look:
> (cons 1 '()) '(1)
Because lists are essentially nested pairs, and the last pair in the list has
'() (empty list) as it’s second atom.
We can nest
cons to produce longer lists:
> (cons 1 (cons 2 (cons 3 '()))) '(1 2 3)
Why does this work?
Because, given that
cons produces a pair, if we evaluate the code ourselves, we will get this:
'(1 . (2 . (3 . ())))
Let’s explain why
'(1 . (2 . (3 . ()))) evaluates to
'(1 2 3).
My observations lead to me to think that dot has two rules.
The rules are as follows:
- Dot must be the second to last element in the list,
- Dot can be removed if after the dot comes open parenthesis, alongside with open parenthesis and matching closing parenthesis.
(cons 3 '()) produces a pair of dot separated values, and the first one is
3, and the other is
'() we get the
'(3 . ()) result (because of how
quote works we remove
'() and get
() in the resulting pair).
Now let’s apply the dot’s second rule to this code.
First let’s remove dot in the most nested example:
'(3 . ()) becomes
'(3), so now we have
'(1 . (2 . (3))).
We can repeat this to
(2 . (3), producing
Repeating this the last dot produces the simple list:
'(1 2 3).
Note: The rules above may not be actual thing though, because dot is just syntax on top of internal data structure, and it’s not stored in the list, but we’ll talk about it later.
So why does
cdr work on lists, if those are actually working on pairs?
car removes the element before the first dot, and
cdr returns everything after it!
> (car '(1 . (2 . (3 . ())))) 1 ;; 1 is before the `. (2 . (3 . ()))' > (cdr '(1 . (2 . (3 . ())))) '(2 . (3 . ())) ; or '(2 3) for short
So every time we see something like
(a b c), the interpreter actually sees
(a . (b . (c . ()))).
cdr work both with pairs and lists.
Because lists consist out of pairs!
And that’s really beautiful.
And then I saw pure beauty
Yet, before I’ve fully understood what I seen, I’ve thought that my beautiful observation is not correct at all. But first we need to know how to define procedures (functions) in Scheme:
> (define (sum a b) (+ a b))
This constructs a procedure
sum that accepts two arguments
b and produces the sum of those as a result with the following body:
(+ a b).
We can call this procedure like this:
> (sum 1 3) 4
Because I was figuring out the rules behind the dot, which I’ve stated before, I was wondering what would happen if I use dot in this syntax, like this:
(define (foo . bar) bar)
Not only this did not produce any errors, this also works in a strange way:
> (foo 22) '(22)
And at first glance this looks like nonsense (well at leas to me, remember, I didn’t finished SICP yet) and I’ve expected it to fail, because dot seem to be in the wrong place but it worked just fine.
“What the hell?” - I’ve thought. I’ve tried to pass two arguments to the function and it still worked:
> (foo 1 2) '(1 2)
At this point I was thinking that everything I’ve though I understand is wrong, but after some long thinking process I’ve understood this and was shocked how beautiful this thing is.
So what’s the problem here?
I’ve shown you the syntax for defining
(define (sum a b) (+ a b), but what’s actually going underneath is that anonymous procedure with arguments
(a b) is being bound to the name
sum like so:
(define sum (lambda (a b) (+a b))
We can write all our procedures this way, but it is tedious to do.
So previous form with parentheses around
sum a b is a shorthand for this form.
So what’s happening when we put the dot into the function definition? To understand this we have to understand why everything is a list, and why we need primitive operations on lists.
Note: The following part of this post mostly relies on what I’ve already did with my own implementation of scheme-like language and how it works, but I think that the real scheme works somewhat similar. I will write more about my language later this post and in the following post that is yet to come.
To convert procedure from form
(define (foo bar) bar) to
(define foo (lambda (bar) bar)) we need to know two things: procedure’s name, to put after
define and it’s arguments to put into
car the list
(foo bar) and receive
Now we need the argument list and we
(foo bar), and get
Now we can put
(bar) after lambda:
(define foo (lambda (bar) bar).
What will happen if we
cdr the list which has dot?
> (cdr '(foo . bar)) 'bar
Which means that we getting
bar without parentheses around it.
So doing all previous steps for
(define (foo . bar) bar) expression gives us
(define foo (lambda bar bar)) where
lambda has no parentheses around it’s argument.
Let’s try to use such form of lambda directly:
> ((lambda x x) 1) '(1) > ((lambda x x) 1 2) '(1 2)
This is a procedure with variable argument count!
cdr this code!
First we evaluate
(lambda x x), which provides
#<procedure> in Racket:
Now we can
car what we will apply, and
cdr it’s arguments:
(car (#procedure 1)) #procedure (cdr (#procedure 1)) '(1)
So the arguments of the function is
Let’s recap what our lambda does.
It was defined as
(lambda x x), which means that it takes 1 thing and returns it immediately.
The thing it takes is this
'(1), that’s why we receive
'(1) if we call this lambda with
Doing the same for
((lambda x x) 1 2) provides
'(1 2) as applied argument.
Why this doesn’t work for lambdas which have their argument in the parentheses? My guess is that it is because we provide list of certain length, and since we’re passing a list, the items in one list gets bound to the items in another, in lambda body.
OK, let’s recap what we know about Scheme so far:
Everything is a list;
- Scratch that, everything is a pair, and every list is a sequence of nested pairs, ending with an empty list;
- pair is a cons cell, which uses dot to separate elements;
cdruse dot to decide what to take out of the cons cell.
Applying the Concepts Above In the Interpreter.
When I’ve started writing this post I had no idea that I still will be writing it after I’ll have working interpreter prototype. It is located here, and it uses concepts developed above to evaluate the code. It’s not very good though.
Whole language is implemented around these base functions:
parse- parses text into GRScheme data structure,
eval- which is recursively traverses the tree,
apply- which applies procedure to the tree,
get_first_subexpression- which takes first subtree from current tree,
get_rest_subexpression- which takes the rest subtrees as a tree.
Other functions in evaluator source code provide builtin facilities like
First thing I have to say, that GRScheme doesn’t have such thing as dot. It is emulated to behave the same way as in real Scheme. I suppose that regular Scheme does not have dot either, but in other sense.
After parsing the code we receive tree structure that has nodes with data, and tree directly represents original code. This allows me to think about the Scheme code in a very native way, because it is already a tree, and it actually simplifies parsing a lot. I store parentheses as tree items, atoms are also tree items, but typically only parentheses have child nodes. This allows later traversing algorithm to check if current node is a procedure call by just checking if current node is an open parenthesis.
get_rest_subexpression functions are not exactly
cdr as it may seem, but a more general ones that operate directly on the program AST.
But because the dot is emulated in my implementation those currently also check for the
. in the tree.
If it is presented,
get_rest_subexpression returns the subtrees without it.
This allows us destruct pairs and list with these two functions, but we can’t use those directly.
cdr are implemented in terms of these functions, also applying quoting on the result, but that’s a topic for a different post.
I’ve mentioned that dot is emulated in my implementation of the interpreter.
What this means is that I store the dot directly inside the tree as a node.
Let’s have a look at how
'(a . b) looks in my language’s data structure:
Note: because I write quoted lists in the short form, there is a possibility of misunderstanding why the tree in the figures doesn’t directly match the input, e.g. has more parentheses. For instance
'(a . b)is actually
(quote (a . b))and it is represented in the tree this way.
We have a tree, with a root node holding
( value, which has two nodes:
quote and open parenthesis.
( has three nodes, which are
In real Scheme, this isn’t stored this way. Because pair is really a pair in scheme, cons cell actually stores two pointers to the data like this:
Lists are a bit different.
In my implementation, the structure to which I parse the source code is a tree with any amount of child nodes per node, as seen in the pair example above That’s how
'(1 2 3) list is stored in my implementation:
This is a tree, which is similar to the pair one, except instead of the dot we have
Let’s write list as a sequence of pairs
'(1 . (2 . (3 . ()))) and see how it will be stored in my implementation:
We can see, that each subtree has exactly three nodes in it. However in a real scheme, it’s two nodes per tree, and the very same sequence of nested pairs is stored like this:
Because any list can be written as sequence of pairs, we can see that any list in real scheme is actually a binary tree. Not in case of my language though.
The parser that I’ve wrote for GRScheme knows how to deal with dots, thanks to the dot rules I’ve described above, and these two constructions are parsed to exactly same form of:
And lists that don’t end with empty list, like
'(1 2 . 3) and
'(1 . (2 . 3)) are parsed like this:
Which in real scheme will look like this:
So in case of
car in real scheme we get
1, and in case of
cdr we get the subtree
'(2 . 3) when we apply it to this list, and
3 when we apply it again to previous result.
This works the same way in my implementation, because we analyze the tree when we do
cdr to see if there’s a dot in the second position, which may not be optimal, but works.
Dot in GRScheme
I’ve thought on dots during this post, and I think it is a very beautiful syntax solution to describe binary trees that do not end with null pointer. However this also means that dots are actually needed only for that case. Though we can emulate dots by storing those inside tree directly, I don’t think that this is a good idea to do so.
Because the underlying structure in GRScheme is a graph, and not a binary tree, I think pairs are also meaningless, because those describe particular tree case in Schemes, and in GRScheme we do not have this case. So one major difference from real scheme is that in GRScheme there’s no cons cells, and some algorithms may depend on extracting the second element from the cons cell directly.
I’ve thought that it may be good to have special procedure for that case, named along with
cdr - Contents of Cons cell Representation.
ccr should return second element directly if it is the last element in the list, and return the rest of the list otherwise.
This way it will work mostly the same as
cdr always returns the list.
I’m not sure on how good this solution is, because this may be ambiguous when to use such procedure, so there’s another procedure that may be used instead, which is
cadr, that always takes second element from the list.
I’m planning to write second post about GRScheme diving more deeply into the details behind language model I’ve developed. Current implementation is slightly wrong and has many problems, so I have to fix those first.