C Needs Better Syntax and Macros

Today I’ve decided to write a little rant on the topic of language design. Since I’m a C programmer and use it on a daily basis, I’ve had many thoughts about C syntax over the past few years. I like C, and it is a good language that defined modern operating systems and software. But I think that C still needs a better syntax and macro system. Not because C syntax is inconvenient to use, but because syntax is a tool of expression. And macros can extend this concept even further.

You know, we programmers use a programming language to communicate with a computer. When people communicate with other people, the language is constantly being improved by adding new constructions and removing old and inconvenient ones. This helps a language stay in touch with current realities and not die, like, for example, Latin. No one speaks Latin, not because it is inconvenient, but because it is poor. And C syntax is poor as well.

Let’s have a little test. Imagine that you’re a C compiler. I’ll give you the beginning of a line, and you will try to guess what is happening in it:

int something...

What could it be? This is certainly a declaration of something, but what exactly? Maybe it is an integer variable declaration, like int something = something_else;. Or maybe it is a function declaration, like int something(something_else);. Or maybe something is actually a macro? Then something may actually be something_else. And don’t think that if you see something(something_else); it will always be a function, because it can actually be a function-like macro that expands to whatever it likes! Ah, it can also be an array - int something[].

There’s no way to tell… unless we look ahead. Sure, it is not a problem for the compiler, and things like functions or arrays have obvious and well known syntax. In the case of macros, well, you can’t know without checking it yourself, or using some sort of tool that can show whether the thing under the cursor is a macro or not. This may not be a problem for the compiler, but for us humans it is an issue, because we see code but can’t immediately see the connection between the code and what will actually happen. This may seem like a minor issue, but when dealing with large codebases that use tons and tons of complex macros, it becomes really tedious. Most of the time good tooling can help with that, but more certainty would be an improvement.

Of course we can complain all we want, so instead let’s discuss how to improve this. I’ve programmed in C for more than 5 years at work, mostly writing bare metal software for SoCs, so my interest lies in the niche of compiled/system languages. I’ve also occasionally used C++ at work, seen some Java syntax, and recently got into Rust. Rust is somewhat known as a language with quite ugly and verbose syntax, but I don’t think so, and its syntax improves much upon C and C++. And I’m sorry, C++ guys, but C++ syntax is a pile of trash [1].

Let’s compare the example code above with Rust variants for each of our guesses:

  • Function:

    fn something...
    
  • Variable:

    let something...
    
  • Pointer and array:

    let something...
    let something...
    

    Uh oh, pointers and arrays are declared exactly how variables are, but that’s not a problem, because in Rust variables hold pointers, and the type of the variable is specified after the variable name and before the assignment. So it is not an issue, because we know that a plain variable, array, pointer, hash-map, and everything else will have the same syntax.

  • Macro:

    something!
    

Bang! Macros always end with !, which makes them easy to spot, so you expect something to happen. This means that Rust covers 3 out of the 4 complaints I had, and the remaining one is meaningful in terms of the language itself, so this is kind of 4 out of 4. These are not the only problems with C syntax though, so let’s dive more in depth.

Type Cyntax

Let’s discuss the type syntax. C features these default types:

  • char - for at least 8 bit sized data (exactly 8 on most platforms).
  • short - for at least 16 bit sized data.
  • int - for at least 16 bit sized data.
  • long - for at least 32 bit sized data.
  • float - for single precision floating point data.
  • double - for double precision floating point data.

So far so good. Well, the sizes are a bit strange, like int and short both being at least 16 bit, but this is architecture dependent, and that’s another problem and another topic. There’s one more type that exists but doesn’t quite fit into the previous list:

  • long long - for at least 64bit sized data.

Did you notice that? No? Then let me point it out. It uses two words to describe single idea while previously we had single-worded types, that represented single concept.

Well, actually all the types listed above can use two words, e.g. unsigned char, but that says that this type is not only 8 bit, but also has no sign. Two concepts are described with two words. Not for unsigned long long though, where three words describe two concepts.

To be honest, all those types except int, and char, can be written using two words that specify length. We can rewrite previous list as follows:

  • char
  • short int
  • int
  • long int

It makes sense for char not to be short or long, because it is widely used to store ASCII characters, which need 128 values and comfortably fit into 8 bits. It also makes sense for int to be short or long, to fit 16 or 32 bits. Not for long long though! See:

  • long long int

Again three words to describe concept of two. Anyway, I’m nitpicking because there’s a better approach available. And most C programmers use it, I believe:

typedef signed   long long int int64;
typedef unsigned long long int uint64;

And we’re now describing whole concept with single word: uint64. Look how easy it is now to get everything we need to know. Size? 64. Has sign? Check for u at the start of the word. This can be repeated for all other types:

typedef signed char   int8;
typedef unsigned char uint8;

typedef signed short   int16;
typedef unsigned short uint16;

typedef signed int   int32;
typedef unsigned int uint32;

typedef signed long long   int64;
typedef unsigned long long uint64;

Notice the common pattern? int repeats in every type name we defined. Rust went further and removed it:

  • i8, u8
  • i16, u16
  • i32, u32
  • i64, u64

I don’t think that i for signed is the best choice, but I’m fine with this. And the types are now very consistent and short. Compound multi-word types do have one advantage over a predefined set of types, because we have more control over whether we want our type to be signed or long, but this is also error prone, because someone may assume that long long long int is a valid type, while it’s not.

While we’re on types, I think that putting the type before the object makes little sense too. In Rust the type is specified after the object name and it reads well, e.g.:

   let    name  :  &str       =       "John";
// |      |     |  |          |       |
// define thing of some type, holding something

Compared to C:

   char *       name   =       "John";
// |    |       |      |       |
// type pointer thing, holding something

And it is known practice in C to read declarations in the opposite direction, e.g. name is pointer to char. There’s also clockwise spiral reading rule that can be applied to functions and arrays:

str is an array of 10 pointers to a char.

Although for function pointer it seems that counter clockwise fits better:

fp is a pointer to a function that accepts an int and a pointer to a char and returning a pointer to a char.

English is not RTL, so why would we need that stuff in C? Rust uses one more construct to define this, and it is :, which in C is used in the ?: operator and in case statements. And case _: is a well defined syntax, since case is a keyword, so we can’t mix it up with a type annotation. ?: is used only in expressions, so it’s also well defined:

          a     :  int   =       42;
//        |     |  |     |       |
// define thing of type, holding 42

   case 0 :           a     :  int  = cond  ?      1 :  0;
// |    | |           |     |  |    |       |      | |  |
// case 0 then define thing of type holding either 1 or 0, depending on cond

The case with case is an edge case, and quite uncommon even with the current syntax, so I don’t think that this is a problem. There’s another use for :, which I’ve never actually seen in my practice: it can be used to indicate bit size in structs, so unfortunately we can’t simply reuse it for types in the same way we use * both for pointers and multiplication:

struct item {
    int f : 28; // 28-bit field
};

But this just means that for bit fields we can specify that it is valid to use : for the second time:

struct item {
    f: int : 28;
};

Just like in the official syntax : as a bit-field specifier is only valid inside a struct, the second use of : could also be valid only inside a struct. This way we could use a uniform syntax for type declarations. Here’s a side by side comparison of the standard syntax, and what I think is a better alternative:

typedef struct my_IEEE754 {
    union {
        struct sf_t {
            int fraction : 23;
            int exp : 8;
            int sign : 1;
        } sf;
        float hf;
    } f;
} my_float;

int main(int argc, char * argv[])
{
    my_float pi = { .f.hf = 3.1415f };
}

Now with the imaginary syntax. I’ve kept the constructs that I’m fine with, such as the struct initializer, and I also changed char*[] to [*char] so it could be read as array of pointers to char without falling back to RTL:

typedef struct my_IEEE754 {
    f: union {
        sf: struct sf_t {
            fraction: int : 23;
            exp: int : 8;
            sign: int : 1;
        };
        hf: float;
    };
} my_float;

main(argc: int, argv: [*char]) -> int {
    pi: my_float = { .f.hf = 3.1415f }
}

I understand that not everyone will agree with me here. I also prefer a trailing return type, as in Rust. Even though -> is reserved for accessing struct fields through pointers, the same is true for C++11, and C++11 can still use it for specifying the return type of a function, so I think C could do that too.

It seems, though, that without the type at the beginning it’s harder to tell when we’re declaring something and when we’re using an already defined variable, so Rust uses the special keyword let, since types can be omitted:

let mut a = 42; // declaration
        a = 27; // later use

However in C we always specify type when declaring a variable so let can be omitted:

a: int = 42; // declaration
     a = 27; // usage

Pointers in C also have one interesting error prone syntax case. Look at this:

int * a, b;

If we read this right to left or use clockwise rule, we will see, that b is of type int and a is a pointer to int. But during my practice I’ve seen many cases when other programmers read this left to right, e.g. “declaration of the integer pointers a and b”. Unless you really know what’s going on, you can spend a lot of time wondering why your code misbehaves. And one reason for it to misbehave could be that someone forgot to add another asterisk:

int * a, * b;

Because of this problem there are three different conventions on where to place the asterisk - near the type, near the variable, or in between. I personally use the last one, as you can see, but this code can also be rewritten in these forms:

int* a, * b; // quite ugly
int *a, *b;  // more concise

The first one is quite ugly and error prone, because it may seem that we’ve specified the int* type and all objects will have that type. And the second asterisk looks really out of place because of that. The second one is a bit better, but I think that it fails to convey the idea that we’re pointing to a type. I rarely write such constructions, because it gets messy when you specify values, so I split this into two lines:

int * a; // easy to refactor
int * b;

This is less error prone, and easier to refactor. And I think that I would rework this to be

a: *int;
b: *int;

There’s another interesting point. C supports multi-word types, but the user can declare only single-word types. This is kind of a double standard to me. And even if we could create multi-word types, it would be a total mess! Multi-word types also ruin one major aspect, and that is usage in macros.

Macros

Macros in C are not like macros in languages that feature a macro system as part of the language, like Lisp, or even Rust. In C, macros are simple text substitutions, expanded by the preprocessor before compilation. So if you write #define five 5 and then use five in several places, right before compilation the preprocessor will substitute five with 5. five is not a variable, but simply a placeholder for specific data. This is the simplest macro, but that’s not all the preprocessor offers. We can use many things like #if, #ifndef, #else, etc. to alter how our source code will be compiled. And this is a source of some problems as well.

The first problem we stated was that macros can mimic the code too well. Because macros in C differ from actual C code and are a foreign thing, there should be a way of knowing what is a macro. The other problem with C macros is that they are text substitutions. Consider the following macro:

#define vec(T)          \
    struct vector_##T { \
        T * data;       \
        size_t size;    \
    }

We can use such pattern to create generics or templates if you’re familiar with C++. This macro then can be used like this:

int main()
{
    vec(int) vector_of_ints = { .data = NULL, .size = 0 };
    vec(float) vector_of_floats = { .data = NULL, .size = 0 };
}

To understand what’s going on here, let’s substitute macro call by hand. First, we substitute vec(int), with macro body and remember int:

struct vector_##T {
    T * data;
    size_t size;
} vector_of_ints = { .data = NULL, .size = 0 };

We’ve passed int as a parameter to our macro so we substitute T with it:

struct vector_##int {
    int * data;
    size_t size;
} vector_of_ints = { .data = NULL, .size = 0 };

Next, concatenation will happen at each ##, producing vector_int:

struct vector_int {
    int * data;
    size_t size;
} vector_of_ints = { .data = NULL, .size = 0 };

At this point we’ve got macro fully expanded, and that’s how main function looks:

int main()
{
    struct vector_int {
        int * data;
        size_t size;
    } vector_of_ints = { .data = NULL, .size = 0 };
    vec(float) vector_of_floats = { .data = NULL, .size = 0 };
}

Then the same steps repeat for vec(float), producing the final code:

int main()
{
    struct vector_int {
        int * data;
        size_t size;
    } vector_of_ints = { .data = NULL, .size = 0 };
    struct vector_float {
        float * data;
        size_t size;
    } vector_of_floats = { .data = NULL, .size = 0 };
}

Pretty cool, right? Wrong. We can’t do the same with long long:

int main()
{
    vec(long long) vector_of_long_longs = { .data = NULL, .size = 0 };
}

Because vector_##long long will produce vector_long long, and we can’t have multi-word user types. We can fix this by creating typedef, but this means that we have to create typedef for any type that we want to pass into such macros. Meaning pointers are also gonna need typedefs:

typedef char * string;

int main()
{
    vec(string) vector_of_strings = { .data = NULL, .size = 0 };
}

This is one side of the problem. The other side is that we’re operating with text not AST. To understand the difference let’s look at some Lisp macros, for instance Clojure time macro:

(defmacro time
  [expr]
  `(let [start# (. System (nanoTime))
         ret# ~expr]
     (prn (str "Elapsed time: "
               (/ (double (- (. System (nanoTime)) start#)) 1000000.0)
               " msecs"))
     ret#))

I’m not going to teach you Lisp here, but still, let’s quickly understand what’s going on.

This macro does three things: first it creates a local binding start# (# will generate a unique suffix for the name) that holds the current system time in nanoseconds, then it creates a local binding ret# that holds the result of the expression that we pass to time, and then it prints a line which shows the amount of msecs that it took to evaluate the expression. At the end, ret# is returned out of this macro.

To understand this, let’s expand the macro in this expression (time (+ 1 2 3)):

(let*
  [start1 (. System (nanoTime))
   ret1 (+ 1 2 3)] ;; our sum expression
  (prn (str "Elapsed time: "
             (/ (double (- (. System (nanoTime)) start1)) 1000000.0)
             " msecs"))
  ret1)

So what the macro did is: it took the given AST, took the (+ 1 2 3) expression out of it, and generated a new AST, where this expression is used in a let* form, and the result is returned after the time is measured. It’s a simple macro, but it shows some neat things about Lisp macros. Of course it doesn’t represent the Lisp macro system as a whole, but the general idea of Lisp macros is that you take one AST and change it to produce another AST at compile time. And you need to understand that we did not manipulate text here, but the code itself.

This macro is simple enough, so let’s write it in C:

#include <stdio.h>
#include <time.h>

#define time(code)                                           \
    __extension__({                                          \
        clock_t _begin      = clock();                       \
        __typeof(code) _res = __extension__({ code; });      \
        printf("Elapsed: %g sec.\n",                         \
               (double)(clock() - _begin) / CLOCKS_PER_SEC); \
        _res;                                                \
    })

int main(void)
{
    printf("%d\n", time(1 + 2 + 3));
    return 0;
}

Sure it works, and it uses some compile time magic, like __typeof, and GCC statement expression extension __extension__({ ... }), but in the end this is still textual work. It will fail if we can’t get __typeof correctly, for example if we want to measure our for loop:

time(
    for (volatile int i = 0; i < 0x1000; i++) {
        printf("%d\n", i);
    }
);

This will error, because for has no type. The problem is that we can’t return something if there’s nothing to return. If we could analyze AST and produce valid code it would work.

Another example of a C macro, one that disables the evaluation of code, is this log macro. The idea here is that the function whose result we pass to log may be slow, because it needs to gather a huge amount of data, so we need a way to disable logging while still keeping all of our code. Before looking at C, here’s a Ruby variant:

$debug = true

def log(&code)
  if $debug
    $stderr.puts code.yield
  end
end

log {"message: #{expensive_logger}"}

So the idea here is that we’re passing a function, and it will be evaluated only if $debug is true. Here’s C version:

#include <stdio.h>
#include <stdbool.h>

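bool debug = true;

/* Stand-in for a slow call, added so the example compiles. */
static const char *expensive_logger(void)
{
    return "expensive data";
}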

#define log(fmt, ...)                          \
    do {                                       \
        if (debug)                             \
            fprintf(stderr, fmt, __VA_ARGS__); \
    } while (0)

int main(void)
{
    log("message: %s\n", expensive_logger());
    return 0;
}

Here we can see ... and __VA_ARGS__. These are special ways to pass and access variadic amount of arguments to the macro. This may seem like a little bit of AST manipulation, but again, this is just textual work, as __VA_ARGS__ expands to the sequence of arguments separated by commas, which means that we can’t do much with it.

To understand why direct access to variadic arguments can be needed in macros, let’s look at another more complex macro from Clojure - thread first macro ->:

(defmacro ->
  [x & forms]
  (loop [x x, forms forms]
    (if forms
      (let [form (first forms)
            threaded (if (seq? form)
                       (with-meta `(~(first form) ~x ~@(next form)) (meta form))
                       (list form x))]
        (recur threaded (next forms)))
      x)))

This is more complex macro, that transforms AST, as shown in time example, but in a different way. To understand it, we have to see what form of code it takes:

(-> 1
    (+ 2)
    (* 3 4)
    (/ 5))

And what code it produces:

(/ (* (+ 1 2) 3 4) 5)

What essentially happens here is that we pass a list of expressions to ->, which is (1 (+ 2) (* 3 4) (/ 5)) (lists in Lisp are delimited with parentheses), then we take the first one out, which is 1, and put it into the second position in (+ 2), thus creating (+ 1 2). Then we repeat it until we produce the final expression. Here’s a table of steps, where x, form and forms illustrate the values that these variables are holding in the macro above at each step of the loop. When form turns empty, we return x:

x                       | form     | forms
1                       | (+ 2)    | ((* 3 4) (/ 5))
(+ 1 2)                 | (* 3 4)  | ((/ 5))
(* (+ 1 2) 3 4)         | (/ 5)    | ()
(/ (* (+ 1 2) 3 4) 5)   | ()       | ()

What’s interesting in this macro, and in Lisp in general, is that your code is your AST. Which means that you can take it and change it however you want, with the language itself. There are no other facilities for this, except those you use when you write regular code. This is not suitable for C, because the relationship between syntax and AST is not 1 to 1 as in Lisp, so we have to use preprocessors.

In the example above, we saw iteration over forms that we are passing to macro. You can do something similar to that in C:

#include <stdio.h>

#define first(x, ...) #x
#define rest(x, ...)  #__VA_ARGS__

#define destructive(...)                              \
    do {                                              \
        printf("first is: %s\n", first(__VA_ARGS__)); \
        printf("rest are: %s\n", rest(__VA_ARGS__));  \
    } while (0)

int main(void)
{
    destructive(1, 2, 3);
    return 0;
}

Because __VA_ARGS__ is just comma separated list of arguments, we can pass it to the macro, that takes 1 explicit argument x and variadic amount of rest arguments. The example above converts arguments to strings, but you can spot general idea. Executing this code will print:

first is: 1
rest are: 2, 3

But this is not really useful in the context of the language, because there’s no way to continuously iterate over such lists of arguments without defining a finite amount of iterator macros for a given maximum number of arguments [2]. So such a thing as the thread first macro isn’t really possible. Which is a shame, because if there was a way to loop inside a macro at compile time, something like printf could easily be defined as a macro. I’ve written printing macros based on this technique that used primitive functions written in assembly to display strings and numbers, and they performed way better than classic printf. This may seem like a strange point, because usually the performance of printf is not something you really want to improve, but it depends on the task, and in my domain a serious performance drop is noticeable when using printf while working with bare metal emulation, which I do. Also, Rust doesn’t support variadic arguments in functions, but its println! macro surely does, because the macro matches a repeated list of argument tokens and produces the final printing code:

macro_rules! println {
    () => ($crate::print!("\n"));
    ($($arg:tt)*) => ({
        $crate::io::_print($crate::format_args_nl!($($arg)*));
    })
}

What I’m trying to say is that defining Domain Specific Language (DSL) for your project is good both for you and your project. It will make you more productive, and your code will be cleaner and usually much more compact and readable.

And that’s the point of a macro system in a language - it changes the language to fit your concrete problem. Most C programmers say that macros are bad and should be avoided, but I think they are just scared of them, because the implementation is really bad and you can make your code very tangled up with all the #if #else #endif stuff. So what can we do about that?

Macro processor alternatives

While we can’t really change C, it is still possible to fix some issues from the outside. Let’s take a look at other macro implementations for C:

  • PHP - Use PHP to produce C code. I dare you.
  • Perl Preprocessor - use Perl to produce text or generate code. Not strictly related to C, but can be used with C, as well as with any other language. While I think that Perl is a fine choice, many people probably will not think alike. But Perl is excellent for text processing, so I think it may be a good one for C macros.
  • cmacro - macro processor for C written in Common Lisp. This is the most interesting discovery so far.
  • m4 - GNU M4 macro processor. It should improve the current state of the art in C macros, but I’ve never seen anyone using it for that.
  • Probably tons of other projects.

Cmacro got my attention. It seems to provide good AST manipulation primitives for C code, and seems like a complete solution. It also comes with a set of usable macros in a separate package.

The problem is, that if you’re going to use any of the solutions listed above, you will add additional development dependency for your project. This is not a huge issue, but may cause problems or inconvenience for other developers that decided to contribute to your project. That’s why builtin support for good macro system is needed.

While macro processors can be changed and integrated on a per project basis, it’s not that easy to change syntax. Yes, good macro preprocessors can extend syntax, like cmacro does, but this will not solve all problems of C syntax, such as pointers to functions that return a pointer to a function and take, as arguments, several pointers to functions that take some arguments. Just try to write this, and any Lisp code will look like candy to you. Look here [3] if you don’t want to waste your time, as I’ve wasted mine already.

Alternative syntax?

Although I’ve listed some different preprocessors that can fix or enhance macro system in C, who said that we need to solve the problems of C syntax and macro system independently?

Instead we can create a language that transpiles to valid C. There are many languages that took this approach, and while I think that most of them change the syntax too radically, it may still be good for us. To name a few:

  • Gambit Scheme - CGambit is a compiler for Scheme, that generates portable C code.
  • Nim - a language with Python-like syntax that can be compiled to C.
  • Vala and Genie - GNOME projects aimed towards C with improved syntax and support for classes through GObject.

So C as a host language is not a new idea, but which of these languages expand macro abilities, while still staying close to C syntax, just improving it in different areas? None. Vala is more like C++, but macro system is not enhanced compared to C. Genie and Nim are Python-like, though Nim got pretty good macro support, because you actually can change AST. Gambit also greatly enhances macro system, because it is a Scheme.

However, all these languages, except maybe Vala, differ from C dramatically, so this may not be a good choice for everyone. I’m sure that there may be a language that suits this better, but I’m unaware of it as of today.

There’s one more language though, that catches my eye as a C developer, and it is ZZ. It features Rust-like syntax, which is good for me, and procedural macros. Though it’s not compiled down to C, it uses C ABI, so this may actually be a good point of transition of your codebase. The language is quite young, and I did not use it personally, so I can’t really recommend it.

Conclusion

To summarize my thoughts through this rant:

  • C is an old language, its syntax is ugly, and it doesn’t have to be this way.
  • The macro system in C is very lacking, and it doesn’t have to be this way either.
  • There are solutions for the problem, but none of them can fix everything without caveats.
  • New languages, such as Rust and ZZ, are trying to fix the problems of C, but the amount of legacy code written in C will not make adoption easy.

So in my opinion the best solution to all of this is for C developers to provide a new standard with improved syntax and macro system, and keep the name of the language, so new code could be written in it, while still keeping old code working by using compiler flag that specifies standard, obviously. While this is not optimal, and a lot of work, I think the future of C depends on this.


  1. C++ in my opinion has the worst syntax, because it has so many variants of it for a single thing. There are too many different initializer syntaxes, two ways of specifying return types, a lot of implicit things that are syntactic sugar, and so on. When I was trying to get into C++ it was a mess, like where do I use this initialization, and where do I use that one. For professional C++ developers that might be a good thing, but do we really need this kind of thing at all? This is something like a balance between lines of code and the amount of different syntax for the same thing, and I would prefer more lines of code wrapped into a macro that is used for the particular task I’m working on, and not several different syntax variants that are available all the time.

  2. This SO answer describes how to iterate over variadic arguments in a C macro. The general idea is that you define a finite amount of reduction macros, and call one depending on the overall amount of parameters that were passed initially. E.g. you define helper1, helper2, helperN, and when you call your macro with 5 arguments it will select helper5 and call it, helper5 will then call helper4, and so on until it reaches the last one.

  3. void(*f0(void(*f1)(void(*f2(void(*f3)(void)))(void))))(void(*f4(void(*f5)(void)))(void)) {return f1;}