Andrey Listopadov

C Needs Better Syntax and Macros

@rant c ~24 minutes read

Today I’ve decided to write a little rant on the topic of language design. Since I’m a C programmer and use the language daily, I’ve accumulated many thoughts about C syntax over the past few years. I like C; it is a good language that defined modern operating systems and software. But I think C still needs a better syntax and macro system. Not because C syntax is inconvenient to use, but because syntax is a tool of expression, and macros can extend this concept even further.

You know, we programmers use a programming language to communicate with computers. When people communicate with other people, their language constantly improves: new constructions are added, and inconvenient, outdated ones are removed. This keeps a language up to date with current realities, so it doesn’t die like, for example, Latin. No one speaks Latin, not because it is inconvenient, but because it is poor. And C syntax is poor as well.

Let’s have a little test. Imagine that you’re a C compiler, I’ll give you the beginning of the line, and you will try to guess what is happening in it:

int something

What could it be? It is certainly a declaration of something, but what exactly? Maybe an integer variable declaration, like int something = something_else;. Or maybe a function declaration, like int something(something_else);. Or maybe something is actually a macro? Then something may really be something_else. And don’t assume that if you see something(something_else); it will always be a function, because it can actually be a function-like macro that expands to whatever it likes! Ah, it can also be an array declaration - int something[].

There’s no way to tell… unless we look ahead. Sure, this is not a problem for the compiler, and things like functions or arrays have obvious, well-known syntax. In the case of macros, well, you can’t know without checking yourself or using some tool that shows whether the thing under the cursor is a macro or not. This may not be a problem for compilers, but for us humans it is an issue, because we see code but can’t immediately see the connection between the code and what will actually happen. It may seem like a minor issue, but when dealing with large codebases that use tons and tons of complex macros, it becomes really tedious. Most of the time good tooling can help with that, but more certainty would be an improvement.

Of course, we can complain all we want, so instead let’s discuss how to improve this. I’ve programmed in C for more than 5 years at work, mostly writing bare-metal software for SoCs, so my interest lies in the niche of compiled/system languages. I’ve also occasionally used C++ at work, seen some Java syntax, and recently got into Rust. Rust is somewhat known as a language with quite ugly and verbose syntax, but I don’t think so - its syntax improves much upon C and C++. And I’m sorry, C++ guys, but C++ syntax is a pile of trash1.

Let’s compare the example code above with Rust variants for each of our guesses:

  • Function:

    fn something
    
  • Variable:

    let something
    
  • Pointer and array:

    let something
    let something
    

Uh oh, pointers and arrays are declared exactly like variables, but that’s not a problem, because in Rust variables can hold pointers, and the type of a variable is specified after its name and before the assignment. So it is not an issue, because we know that a plain variable, an array, a pointer, a hash map, and everything else will have the same syntax.

  • Macro:

    something!
    

Bang! Macro invocations always end with !, which makes them easy to spot and tells you to expect something to happen. This means Rust covers three out of four complaints I had, and the remaining one is meaningful in terms of the language itself, so it’s really four out of four. These are not the only problems with C syntax, though, so let’s dive more in-depth.

Type Syntax

Let’s discuss the type syntax. C features these default types:

  • char - for at least 8 bit sized data.
  • short - for at least 16 bit sized data.
  • int - for at least 16 bit sized data.
  • long - for at least 32 bit sized data.
  • float - for single precision floating point data.
  • double - for double precision floating point data.

So far so good. Well, the sizes are a bit strange - int and short are both at least 16 bits - but this is architecture dependent, and that’s another problem and another topic. There’s one more type that exists but doesn’t quite fit the previous list:

  • long long - for at least 64bit sized data.

Did you notice that? No? Then let me point it out: it uses two words to describe a single idea, while all the previous types were single words representing a single concept.

Well, actually, all the types listed above can be written with two words, e.g. unsigned char, but there the second word adds meaning: the type is not only 8-bit but also has no sign. Two concepts, described with two words. Not so for unsigned long long though.

To be honest, all those integer types except int and char can be written using two words, where the first word specifies the length. We can rewrite the previous list as follows:

  • char
  • short int
  • int
  • long int

It makes sense for char not to be short or long, because it is widely used to store ASCII characters, which need 256 variants, and that always fits in 8 bits. It also makes sense for int to be short or long, to fit 16 or 32 bits. Not for long long though! See:

  • long long int

Again, three words to describe two concepts. Anyway, I’m nitpicking, because there’s a better approach available, and most C programmers use it, I believe:

typedef signed   long long int int64;
typedef unsigned long long int uint64;

And we’re now describing the whole concept with a single word: uint64. Look how easy it is now to get everything we need to know. Size? 64. Has sign? Check for u at the start of the word. This can be repeated for all other types:

typedef signed char   int8;
typedef unsigned char uint8;

typedef signed short   int16;
typedef unsigned short uint16;

typedef signed int   int32;
typedef unsigned int uint32;

typedef signed long long   int64;
typedef unsigned long long uint64;

Notice the common pattern? int repeats in every type name we defined. Rust went further and removed it:

  • i8, u8
  • i16, u16
  • i32, u32
  • i64, u64

I don’t think that i for signed is the best choice, but I’m fine with it. The types are now very consistent and short. Compound multi-word types do have one advantage over a predefined set of types - we have more control over whether we want our type to be signed or long - but they are also error-prone, because someone may assume that long long long int is a valid type, while it’s not.

While we’re on types, I think that placing the type before the object makes little sense too. In Rust the type is specified after the object’s name, and it reads well, e.g.:

   let    name  :  &str       =       "John";
// |      |     |  |          |       |
// define thing of some type, holding something

Compared to C:

   char *       name   =       "John";
// |    |       |      |       |
// type pointer thing, holding something

And it is a known practice in C to read declarations in the opposite direction, e.g. name is a pointer to char. There’s also a clockwise spiral reading rule that can be applied to functions and arrays:

str is an array of 10 pointers to a char.

Although for the function pointer it seems that counterclockwise fits better:

fp is a pointer to a function that accepts an int and a pointer to a char and returns a pointer to a char.

English is not read right to left, so why would we need that in C? Rust uses one more construct to define this, and it is :, which in C is used in the ?: operator and in case labels. Now, case _: is well-defined syntax, since case is a keyword, so we can’t mix it up with a type annotation. And ?: is used only after =, so it’s also well defined:

          a     :  int   =       42;
//        |     |  |     |       |
// define thing of type, holding 42

   case 0 :           a     :  int  = cond  ?      1 :  0;
// |    | |           |     |  |    |       |      | |  |
// case 0 then define thing of type holding either 1 or 0, depending on cond

The case with case is an edge case, and quite uncommon even with the current syntax, so I don’t think it is a problem. There is another use for :, which I’ve never actually seen in my practice - it can be used for bit-field size specification in structs - so unfortunately we can’t reuse it for types the same way we use * both for pointers and multiplication:

struct item {
    int f : 28; // 28-bit field
};

But this just means that for bit fields we can specify that it is valid to use : for the second time:

struct item {
    f: int : 28;
};

Just as in the official syntax : as a bit-field specifier is only valid inside a struct, the second use of : could also be valid only inside a struct. This way we could use a uniform syntax for type declaration. Here’s a side-by-side comparison of the standard syntax and what I think is a better alternative:

typedef struct my_IEEE754 {
    union {
        struct sf_t {
            int fraction : 23;
            int exp : 8;
            int sign : 1;
        } sf;
        float hf;
    } f;
} my_float;

int main(int argc, char * argv[])
{
    my_float pi = { .f.hf = 3.1415f };
}

Now the imaginary syntax. I’ve kept the constructs that I’m fine with, such as the struct initializer, and I also changed char*[] to [*char], so it reads as array of pointers to char without falling back to right-to-left reading:

typedef struct my_IEEE754 {
    f: union {
        sf: struct sf_t {
            fraction: int : 23;
            exp: int : 8;
            sign: int : 1;
        };
        hf: float;
    };
} my_float;

main(argc: int, argv: [*char]) -> int {
    pi: my_float = { .f.hf = 3.1415f }
}

I understand that not everyone will agree with me here. I also prefer a trailing return type, as in Rust, and even though -> is reserved for accessing struct fields through pointers - which is also true for C++ - C++11 still manages to use it for specifying the return type of a function, so I think C could do that too.

Although it may seem that without a type at the beginning it’s harder to tell when we’re declaring something and when we’re using an already defined variable, Rust solves this with a special keyword, let, since types can be omitted:

let mut a = 42; // declaration
        a = 27; // later use

However, in C we always specify type when declaring a variable so let can be omitted:

a: int = 42; // declaration
     a = 27; // usage

Pointers in C also have one interesting error-prone syntax case. Look at this:

int * a, b;

If we read this right to left, or use the clockwise rule, we will see that b is of type int and a is a pointer to int. But in my practice I’ve seen many cases when other programmers read this left to right, e.g. “declaration of the integer pointers a and b”. Unless you really know what’s going on, you can spend a lot of time wondering why your code misbehaves. And one reason for it to misbehave could be that someone forgot to add another asterisk:

int * a, * b;

Because of this problem, there are three different conventions on where to place the asterisk - near the type, near the variable, or in between. I, personally use the last one, as you can see, but this code can also be rewritten in this form:

int* a, * b; // quite ugly
int *a, *b;  // more concise

The first one is quite ugly and error-prone, because it may seem that we’ve specified an int* type and all declared objects will have that type - and the second asterisk looks really out of place because of that. The second one is a bit better, but I think it fails to convey the idea that we’re pointing to a type. I rarely write such constructions anyway, because they get messy once you specify values, so I split this into two lines:

int * a; // easy to refactor
int * b;

This is less error-prone and easier to refactor. And I think that I would rework this to be

a: *int;
b: *int;

There’s another interesting point. C supports multi-word types, but users can declare only single-word types. That’s kind of a double standard to me. And even if we could create multi-word user types, they would be a total mess - and they ruin one major aspect: their use in macros.

Macros

Macros in C are not like macros in languages that feature a macro system as part of the language, like Lisp, or even Rust. In C, macros are simple text substitutions, expanded before compilation by the preprocessor. So if you write #define five 5 and then use five in several places, right before compilation the preprocessor will substitute five with 5. five is not a variable, but simply a placeholder for specific data. This is the simplest kind of macro, but that’s not all the preprocessor offers. We can use things like #if, #ifndef, #else, etc. to alter how our source code will be compiled. And this is a source of some problems as well.

The first problem, as stated earlier, is that macros can mimic code too well. Because macros in C differ from actual C code and are a foreign thing, there should be a way of knowing what is a macro. The other problem with C macros is that they are text substitutions. Consider the following macro:

#define vec(T)          \
    struct vector_##T { \
        T * data;       \
        size_t size;    \
    }

We can use such a pattern to create generics or templates if you’re familiar with C++. This macro then can be used like this:

int main()
{
    vec(int) vector_of_ints = { .data = NULL, .size = 0 };
    vec(float) vector_of_floats = { .data = NULL, .size = 0 };
}

To understand what’s going on here, let’s substitute the macro call by hand. First, we substitute vec(int), with macro body and remember int:

struct vector_##T {
    T * data;
    size_t size;
} vector_of_ints = { .data = NULL, .size = 0 };

We’ve passed int as a parameter to our macro so we substitute T with it:

struct vector_##int {
    int * data;
    size_t size;
} vector_of_ints = { .data = NULL, .size = 0 };

Next, concatenation will happen at each ##, producing vector_int:

struct vector_int {
    int * data;
    size_t size;
} vector_of_ints = { .data = NULL, .size = 0 };

At this point, we’ve got macro fully expanded, and that’s how main function looks:

int main()
{
    struct vector_int {
        int * data;
        size_t size;
    } vector_of_ints = { .data = NULL, .size = 0 };
    vec(float) vector_of_floats = { .data = NULL, .size = 0 };
}

Then this repeats to vec(float), producing the final code:

int main()
{
    struct vector_int {
        int * data;
        size_t size;
    } vector_of_ints = { .data = NULL, .size = 0 };
    struct vector_float {
        float * data;
        size_t size;
    } vector_of_floats = { .data = NULL, .size = 0 };
}

Pretty cool, right? Wrong. We can’t do the same with long long:

int main()
{
    vec(long long) vector_of_long_longs = { .data = NULL, .size = 0 };
}

Because vector_##long long will produce vector_long long, and we can’t have multi-word user types. We can fix this by creating a typedef, but this means that we have to create a typedef for every type that we want to pass into such macros. Pointers are also going to need typedefs:

typedef char * string;

int main()
{
    vec(string) vector_of_strings = { .data = NULL, .size = 0 };
}

This is one side of the problem. The other side is that we’re operating with text, not AST. To understand the difference let’s look at some Lisp macros, for instance, Clojure time macro:

(defmacro time
  [expr]
  `(let [start# (. System (nanoTime))
         ret# ~expr]
     (prn (str "Elapsed time: "
               (/ (double (- (. System (nanoTime)) start#)) 1000000.0)
               " msecs"))
     ret#))

I’m not going to teach you Lisp here, but still, let’s quickly understand what’s going on here.

This macro does three things: first, it creates a local binding start# (the # generates a unique suffix for the name), which holds the current system time in nanoseconds; then it creates a local binding ret#, which holds the result of the expression that we pass to time; and finally it prints a line showing the number of msecs it took to evaluate the expression. In the end, ret# is returned out of this macro.

To understand this, let’s expand the macro in the expression (time (+ 1 2 3)):

(let*
  [start1 (. System (nanoTime))
   ret1 (+ 1 2 3)] ;; our sum expression
  (prn (str "Elapsed time: "
             (/ (double (- (. System (nanoTime)) start1)) 1000000.0)
             " msecs"))
  ret1)

So what the macro did is: it took the given AST, took the (+ 1 2 3) expression out of it, and generated a new AST where this expression is used in a let* form and the result is returned after the time is measured. It’s a simple macro, but it shows some neat things about Lisp macros. Of course, it doesn’t represent the Lisp macro system as a whole, but the general idea of Lisp macros is that you take one AST and transform it into another AST at compile time. And note that we did not manipulate text here, but the code itself.

This macro is simple enough, so let’s write it in C:

#include <stdio.h>
#include <time.h>

#define time(code)                                           \
    __extension__({                                          \
        clock_t _begin      = clock();                       \
        __typeof(code) _res = __extension__({ code; });      \
        printf("Elapsed: %g sec.\n",                         \
               (double)(clock() - _begin) / CLOCKS_PER_SEC); \
        _res;                                                \
    })

int main(void)
{
    printf("%d\n", time(1 + 2 + 3));
    return 0;
}

Sure, it works, and it uses some compile-time magic, like __typeof and the GCC statement expression extension __extension__({ … }), but in the end this is still textual work. It will fail whenever __typeof can’t be applied, for example if we want to measure a for loop:

time(
    for (volatile int i = 0; i < 0x1000; i++) {
        printf("%d\n", i);
    }
);

This will error out, because for has no type. The problem is that we can’t return something if there’s nothing to return. If we could analyze the AST and produce valid code accordingly, it would work.

Another example of a C macro, one that disables the evaluation of code, is this log macro. The idea is that the function we pass to log may be slow, because it needs to gather a huge amount of data, so we need a way to disable logging while still keeping all of our code in place. Before looking at C, here’s a Ruby variant:

$debug = true

def log(&code)
  if $debug
    $stderr.puts code.yield
  end
end

log {"message: #{expensive_logger}"}

So the idea here is that we’re passing a block, and it will be evaluated only if $debug is true. Here’s the C version:

#include <stdio.h>
#include <stdbool.h>

bool debug = true;

#define log(fmt, ...)                          \
    do {                                       \
        if (debug)                             \
            fprintf(stderr, fmt, __VA_ARGS__); \
    } while (0)

int main(void)
{
    log("message: %s\n", expensive_logger());
    return 0;
}

Here we can see ... and __VA_ARGS__. These are the special ways to declare and access a variadic number of arguments in a macro. This may look like a little bit of AST manipulation, but again, it is just textual work: __VA_ARGS__ expands to the sequence of arguments separated by commas, which means we can’t do much with it.

To understand why direct access to variadic arguments can be needed in macros, let’s look at another more complex macro from Clojure - thread first macro ->:

(defmacro ->
  [x & forms]
  (loop [x x, forms forms]
    (if forms
      (let [form (first forms)
            threaded (if (seq? form)
                       (with-meta `(~(first form) ~x ~@(next form)) (meta form))
                       (list form x))]
        (recur threaded (next forms)))
      x)))

This is a more complex macro, that transforms AST, as shown in the time example, but in a different way. To understand it, we have to see what form of code it takes:

(-> 1
    (+ 2)
    (* 3 4)
    (/ 5))

And what code it produces:

(/ (* (+ 1 2) 3 4) 5)

What essentially happens here is that we pass a list of expressions to ->, which is (1 (+ 2) (* 3 4) (/ 5)) (lists in Lisp are delimited with parentheses), then we take the first element out, which is 1, and put it in the second position in (+ 2), thus creating (+ 1 2). Then we repeat this until we produce the final expression. Here’s a table of steps, where x, form, and forms show the values these variables hold in the macro above at each step of the loop. When forms turns empty, we return x:

x                     | form    | forms
----------------------|---------|----------------
1                     | (+ 2)   | ((* 3 4) (/ 5))
(+ 1 2)               | (* 3 4) | ((/ 5))
(* (+ 1 2) 3 4)       | (/ 5)   | ()
(/ (* (+ 1 2) 3 4) 5) | ()      | ()

What’s interesting in this macro, and in Lisp in general, is that your code is your AST. This means that you can take it and change it however you want to, with the language itself. There are no other facilities for this, but those you use when you write regular code. This is not suitable for C, because the relationship between syntax and AST is not 1 to 1 as in Lisp, so we have to use preprocessors.

In the example above, we saw iteration over forms that we are passing to macro. You can do something similar to that in C:

#include <stdio.h>

#define first(x, ...) #x
#define rest(x, ...)  #__VA_ARGS__

#define destructive(...)                              \
    do {                                              \
        printf("first is: %s\n", first(__VA_ARGS__)); \
        printf("rest are: %s\n", rest(__VA_ARGS__));  \
    } while (0)

int main(void)
{
    destructive(1, 2, 3);
    return 0;
}

Because __VA_ARGS__ is just a comma-separated list of arguments, we can pass it to the macro, which takes 1 explicit argument x and a variadic amount of rest arguments. The example above converts arguments to strings, but you can spot the general idea. Executing this code will print:

first is: 1
rest are: 2, 3

But this is not really useful in the context of the language, because there’s no way to continuously iterate over such argument lists without defining a finite number of iterator macros for the maximum possible number of arguments2. So such a thing as the thread-first macro isn’t really possible. Which is a shame, because if there were a way to loop in a macro at compile time, something like printf could easily be defined as a macro.

I’ve written printing macros based on this technique, which used assembler primitives to display strings and numbers, and they performed way better than the classic printf. This may seem like a strange point, because usually the performance of printf is not something you really want to improve, but it depends on the task, and in my domain a serious performance drop is noticeable when using printf while working with bare-metal emulation, which I do. Also, Rust doesn’t support variadic arguments in functions, but its println! macro surely does, because it is implemented in terms of iteration over the argument list and produces the final printing code:

macro_rules! println {
    () => ($crate::print!("\n"));
    ($($arg:tt)*) => ({
        $crate::io::_print($crate::format_args_nl!($($arg)*));
    })
}

What I’m trying to say is that defining Domain Specific Language (DSL) for your project is good both for you and your project. It will make you more productive, and your code will be cleaner and usually much more compact and readable.

And that’s the point of a macro system in a language - it changes the language to fit your concrete problem. Most C programmers say that macros are bad and should be avoided, but I think they are just scared of them, because the implementation is really bad and you can make your code very tangled up with all the #if #else #endif stuff. So what can we do about that?

Macro processor alternatives

While we can’t really change C, it is still possible to fix some issues outside. Let’s take a look at other macro implementations for C:

  • PHP - Use PHP to produce C code. I dare you.
  • Perl Preprocessor - use Perl to produce text or generate code. Not strictly related to C, but it can be used with C as well as any other language. While I think Perl is a fine choice, many people probably won’t agree. But Perl is excellent at text processing, so I think it may be a good fit for C macros.
  • cmacro - macro processor for C written in Common Lisp. This is the most interesting discovery so far.
  • m4 - the GNU M4 macro processor. It should improve on the current state of the art in C macros, but I’ve never seen anyone use it.
  • Probably tons of other projects.

Cmacro got my attention. It seems to provide good AST manipulation primitives for C code and seems like a complete solution. It also comes with a set of usable macros in a separate package.

The problem is that if you’re going to use any of the solutions listed above, you will add an additional development dependency to your project. This is not a huge issue, but it may cause problems or inconvenience for other developers who decide to contribute to your project. That’s why built-in support for a good macro system is needed.

While macro processors can be chosen and integrated on a per-project basis, it’s not that easy to change the syntax. Yes, good macro preprocessors can extend syntax, as Cmacro does, but this will not solve all the problems of C syntax, such as pointers to functions that return a pointer to a function and take several pointers to functions, which take some arguments, as arguments. Just try to write this, and any Lisp code will look like candy to you. Look here3 if you don’t want to waste your time, as I’ve wasted mine already.

Alternative syntax?

Although I’ve listed some different preprocessors that can fix or enhance the macro system in C, who said that we need to solve the problems of C syntax and macro system independently?

Instead, we can create a language that transpiles to valid C. Many languages took this approach, and while I think most of them change the syntax too radically, it still may work for us. To name a few:

  • Gambit Scheme - Gambit-C is a compiler for Scheme that generates portable C code.
  • Nim - a language with Python-like syntax that can be compiled into C.
  • Vala and Genie - GNOME projects aimed toward C with improved syntax and support for classes through GObject.

So C as a host language is not a new idea, but which of these languages expands macro abilities while still staying close to C syntax, just improving it in different areas? None. Vala is more like C++, but its macro system is not enhanced compared to C. Genie and Nim are Python-like, though Nim has pretty good macro support, because you can actually change the AST. Gambit also greatly enhances the macro system, because it is a Scheme.

However, all these languages, except maybe Vala, differ from C dramatically, so they may not be a good choice for everyone. I’m sure there may be a language that suits this role better, but I’m unaware of it as of today.

There’s one more language, though, that catches my eye as a C developer, and it is ZZ. It features a Rust-like syntax, which is good for me, and procedural macros. Though it’s not compiled down to C, it uses the C ABI, so it may actually be a good transition point for your codebase. The language is quite young, and I haven’t used it personally, so I can’t really recommend it.

Conclusion

To summarize my thoughts through this rant:

  • C is an old language, its syntax is ugly, and it doesn’t have to be this way;
  • The macro system in C is very lacking, and it doesn’t have to be this way either;
  • There are solutions to these problems, but none of them can fix everything without caveats;
  • New languages, such as Rust and ZZ, are trying to fix C’s problems, but the amount of legacy code written in C will not make adoption easy.

So, in my opinion, the best solution to all of this is for the C standard committee to provide a new standard with improved syntax and macro system, keeping the name of the language. New code could then be written in the new style, while old code keeps working via a compiler flag that specifies the standard, obviously. While this is not optimal, and a lot of work, I think the future of C depends on it.


  1. C++ in my opinion has the worst syntax, because it has so many variants of it for a single thing. There’s too much different initializer syntax, two ways of specifying return types, a lot of implicit things that are syntactic sugar, and so on. When I was trying to get into C++ it was a mess - where do I use this initialization, and where do I use that one? For professional C++ developers that might be a good thing, but do we really need this kind of thing at all? It is a balance between lines of code and the amount of different syntax for the same thing, and I would prefer more lines of code wrapped into a macro for the particular task I’m working on over several syntax variants that are available all the time. ↩︎

  2. This SO answer describes how to iterate over variadic arguments in a C macro. The general idea is that you define a finite number of reduction macros and call one of them depending on the overall number of parameters passed initially. E.g. you define helper1, helper2, … helperN, and when you call your macro with 5 arguments it selects helper5 and calls it; helper5 then calls helper4, and so on until it reaches the last one. ↩︎

  3. void(*f0(void(*f1)(void(*f2(void(*f3)(void)))(void))))(void(*f4(void(*f5)(void)))(void)) {return f1;} ↩︎