Challenges of writing a compiler (crowlang update year 2)
So it's been a while since I decided to blog on my personal website again, but I felt like showing off my recent experiences doing compiler development. In my previous article about statement/semicolon inference, I covered how I wanted to tackle statement inference for my own programming language, crowlang.
So crowlang started by borrowing a ton of code from another project that I worked on called awx, which was my approach to building a POSIX awk compatible interpreter with extensions. But I became unhappy and bored with the project, as I had to do large refactors to the parsing in order to clean up the project and make it ready for command line use. I was also generally unhappy with some of the design choices that I made for the project.
And whilst I had most of the POSIX functionality implemented, including most standard library functions, I did not feel like refactoring, and decided that I wanted to go all the way with creating my own systems programming language.
So I created a new repo, copied over most of the old awx code and started doing large-scale refactors. I had already implemented lexing, the top-down parser and the Pratt parser, so I could reuse all those components.
I started by writing my own grammar for the language from scratch. I also refactored the lexer, as I was not happy with how I had implemented it in awx. I also had to throw out large parts of the parser, as I was going for a completely different grammar.
And over the last two years I have been working on the project on and off, but fairly consistently. So here are some of my struggles and eureka moments tackling a large project like this. This article is more of a collection of ramblings about compilers than a coherent story.
Initial goals changing
So my first goal when I started programming my own systems language could be roughly summarized in a single sentence: "Create a safer systems programming language that interops with C and C++ without the need for semicolons". So the idea was to basically write a custom frontend for generating C++ code, like cppfront. Now of course there are a million other C++ successor languages, but I really wanted to have a go at it, just because I was curious about learning more about compiler development. I had built a ton of smaller projects using LLVM and interpreters, but none of them were really large/complete.
But as I continued working on the compiler, I eventually ended up with two backends. I created an LLVM one, which I half finished, and a C++ backend for bootstrapping/prototyping features of the language. And somewhere along the line I had this fun idea of adding an interoperability backend to the C++ backend. I did this using pybind11.
You see, I had always been interested in programming language bindings, so I decided to add this as a cool feature. But once I implemented this and got my first Crowlang functions generated to C++ which could automatically interop with Python, I got really inspired to continue this.
So now I have the added goal/motivation of adding as many programming language bindings as I can. I want to have these activated using attributes (these still need to be implemented; for now we just expose everything).
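To give an idea of the mechanism (a hand-written illustration, not the actual generated bindings), a minimal pybind11 sketch for exposing a C++ function to Python looks roughly like this; the add function and the example module name are made up for this sketch:

```cpp
#include <pybind11/pybind11.h>

// Hypothetical function, standing in for generated C++ code.
auto add(const int t_lhs, const int t_rhs) -> int
{
  return t_lhs + t_rhs;
}

// The pybind11 module definition exposing the function to Python.
PYBIND11_MODULE(example, m)
{
  m.doc() = "Illustrative bindings for a generated function.";
  m.def("add", &add, "Add two integers.");
}
```

Once compiled into a shared library, Python can simply import example and call example.add(1, 2) directly.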
Big projects and refactoring
I am a bit of a perfectionist and always have been. One of the biggest issues I had with the project was leaving finished components untouched, as I can be very picky about how I word entities in my code.
And I have this habit of revisiting my old code and thinking to myself "This should be named/implemented differently". As a software engineer you have to constantly ask yourself if your old code still reflects what you want to achieve now.
I had a lot of instances where I named something, later went back, and did not make any functional changes, but just spent time renaming the original module to better reflect the use case/goal that the module is supposed to serve.
Project size
This is of course fine for smaller projects, but after working on this project solo for a while, the size of the project really grew:
```
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Autoconf                1           57           34           13           10
 CMake                  61          796          543          148          105
 C++                   105         9128         6189          908         2031
 C++ Header            158         7637         4480         1465         1692
 Python                  1          242          159           21           62
 Shell                   1           89           59           10           20
 Plain Text              1           22            0           22            0
 TOML                    1            8            7            0            1
-------------------------------------------------------------------------------
 Markdown                9          212            0          175           37
 |- Go                   1            7            6            0            1
 |- Shell                1            4            4            0            0
 (Total)                            223           10          175           38
===============================================================================
 Total                 338        18191        11471         2762         3958
===============================================================================
```
And crowlang now counts roughly 16k lines of C++ code. So here are some changes that I did not have the foresight to see coming.
Compiler, standard library, and project library
So when I first started tackling/writing the project, I just dumped everything in src/, because it was easy and I had no clue that this would not scale well. So I had this consistent issue of where to place utility functions. For example, the following utility code is used for adding type overloads to std::visit:
```cpp
// Overload pattern:
/*!
 * Helper struct for overloading callables when using @ref std::visit.
 */
template<class... Ts>
struct Overload : Ts... {
  using Ts::operator()...;
};

//! Deduction guide for the overload to work.
template<class... Ts>
Overload(Ts...) -> Overload<Ts...>;
```
This allows for writing code like so:
```cpp
std::variant<std::string, bool> var{"true"};

// Lazy implementation I know.
const auto truthy_str{[](const std::string& t_str) -> bool {
  return t_str == "true";
}};
const auto truthy_bool{[](const bool t_b) -> bool {
  return t_b;
}};

std::visit(Overload(truthy_str, truthy_bool), var);
```
This allows you to overload the visitor to call different function overloads depending on the type of the value contained in a std::variant. This was a super useful utility function that I ended up using throughout the codebase.
Initially I had the bright idea of just dumping this piece of code in some header where it was frequently used, with a big TODO above it about how I should relocate it somewhere else.
Eventually, whilst implementing some basic standard library functionality, I started to notice that I needed to split up the compiler from the standard library and general utilities. This gave birth to the following directory structure:
```
crowlang/src/crow
crowlang/src/lib
crowlang/src/stdlibcrow
```
This might seem insignificant and obvious, but performing this step allowed me to start splitting up my generated code into separate CMake targets, which really helped later down the line when defining installation/deployment for the compiler.
```cmake
# Set the different CMake targets.
# Define the possible targets/builds.
set(TARGET_CROW_BIN "crow") # Compiler binary.
set(TARGET_CROW_LIB "crowlib") # Compiler library.
set(TARGET_CROW_STDLIBCROW "stdlibcrow") # Standard library for generated C++.
set(TARGET_CROW_TESTS "crow_tests") # Unit test binary

# Conditionally set the main target:
set(TARGET_CROW "crow")

# Change the main target if we are building tests:
if(DEFINED CROW_BUILD_TESTS)
  set(TARGET_CROW ${TARGET_CROW_TESTS})
endif()
```
When I later want to create other binaries to ship with my compiler, like a formatter, language server or package manager, I can of course expand on this.
Type system centralization
So every programming language has a type system. I initially had to establish my first notion of a type system when I started dealing with semantic evaluation and changing type annotations for variables from strings into an actual type tree.
So my first idea was to have a single type tree system using an std::variant, which would track whether a certain symbol was a native type, a function, a struct or a normal variable. I defined it in my semantic module, crowlang/src/semantic. This semantic module would perform type checking and type inference, and annotate nodes in my AST with the right type information.
This was lovely and all, and I had the bright idea of creating a separate type tree within the AST, for annotating the AST without all the superfluous information needed for semantic evaluation.
This worked fine honestly, but I was not happy with the placement, as I realized when working on implementing the IR that I also needed access to the native_types.hpp header, and that having it be part of crow/ast/node/node_traits/typing/native_types.hpp made zero sense/was illogical.
So I relocated these two distinct type trees to their own toplevel module named types, which contained crow/types/core and crow/types/semantic. This way I had one clear module which would be responsible for type information throughout the project.
Error handling
So when I started working on lexing, parsing and semantic evaluation I also had to start reporting errors to the user of my compiler.
I did this quite lazily and thought, you know what, I will just make a generic exception/ module which contains all my errors. I also used the excellent cpptrace library for automatically adding a stacktrace to the error messages for Error, SyntaxError, and TypeError.
This worked lovely until I started to notice a difference between what a user and a developer would want to see when encountering an error. You see, a user of a compiler does not care about seeing a stack trace. When a user gets a diagnostic error because they defined the same variable twice, they just want to see where the offending line is. But when a user gets a std::exception or a crowlang-specific exception, they would need to report it.
So I had to rework the error modules into the following:
```
crowlang/src/crow/diagnostic
crowlang/src/lib/stdexcept
```
This way I had diagnostic for reporting errors caused by faulty user input, and I had stdexcept for all standard error cases with exceptions. I added printing of stacktraces to every exception in both these modules.
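As a rough illustration of that idea (a minimal sketch, not the actual crowlang exception hierarchy), an exception type that eagerly captures a stack trace with cpptrace could look like this:

```cpp
#include <stdexcept>
#include <string>

#include <cpptrace/cpptrace.hpp>

// Hypothetical exception base that captures a stack trace on construction.
class TracedError : public std::runtime_error {
  private:
  cpptrace::stacktrace m_trace;

  public:
  explicit TracedError(const std::string& t_msg)
    : std::runtime_error{t_msg}, m_trace{cpptrace::generate_trace()}
  {}

  //! Print the captured stack trace alongside the error message.
  auto print_trace() const -> void
  {
    m_trace.print();
  }
};
```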
This way the catch-all clauses in my main.cpp could be changed to the following:
```cpp
auto main(const int t_args, char* t_argv[]) -> int
{
  try {
    // Irrelevant code removed.
  } catch(CLI::ParseError& e) {
    const auto exit_code{cli_params.m_app.exit(e)};
    std::cerr << std::format("CLI11 exit code: {}\n", exit_code);

    return ExitCode::CLI11_EXCEPTION;
  } catch(diagnostic::DiagnosticError& diag_err) {
    report_diagnostic_error(diag_err);

    return ExitCode::DIAGNOSTIC_ERROR;
  } catch(lib::stdexcept::Exception& except) {
    report_stdexcept(except);

    return ExitCode::STDEXCEPT_EXCEPTION;
  } catch(std::exception& except) {
    std::cerr << except.what() << '\n';
    report_exception(except);

    return ExitCode::STL_EXCEPTION;
  } catch(...) {
    report_unhandled_object();

    return ExitCode::UNCAUGHT_OBJECT;
  }

  return ExitCode::OK;
}
```
This also allows me to neatly know if an exception I caught comes from within the STL and my misuse of it, or I get a neat stacktrace showing me where I messed up.
If I catch a lib::stdexcept::Exception or a normal std::exception, I print the following message after the error message and stack trace.
```
Reporting an issue:
If you are seeing this message and are not a crowlang contributor.
Then please open a Github Issue.
As this error is likely a result of unintended behavior.

Open an issue at: https://github.com/soerlemans/crowlang/issues/new/choose
```
As this should never happen in a production release, this neatly lets any users know that they should open a bug report.
Optimizing and getting rid of the visitor-everywhere idiom
So I wanted to make my programming language order agnostic: I wanted toplevel functions defined in the same module to be able to call each other without the need for forward declarations, like in Rust and Golang. I thought this would be a cool feature.
The visitor design pattern
So the visitor design pattern is amazing for compiler development. It allows you to decouple behavior from the AST nodes themselves. Now I am not going to explain the visitor pattern itself in this segment; instead I suggest you read the explanation on refactoring.guru.
But basically, the visitor design pattern is used a lot in compiler development, as it prevents you from having to add methods directly to the AST nodes in order to implement functionality. Instead you can specify the behavior in a separate data structure.
Here are some examples I use in the project:
crow/ast/visitor/node_visitor.hpp - Abstract base class for the visitor pattern.
crow/ast/visitor/ast_archive.hpp - Used to serialize and deserialize the AST (more on this later).
crow/ast/visitor/ast_printer.hpp - Used to pretty print the entirety of the AST.
crow/semantic/semantic_checker_helper.hpp - Helper class inherited from for performing semantic evaluation.
crow/clir/clir_builder/clir_builder.hpp - IR builder class that converts the AST into Crowlang IR.
crow/codegen/cpp_backend/cpp_backend.hpp - C++ backend which transpiles Crowlang code to C++.
crow/codegen/llvm_backend/llvm_backend.hpp - LLVM backend, which does not work, as generating LLVM IR directly from the AST is not feasible.
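To give a rough idea of the shape of the pattern (heavily simplified, not crowlang's actual node hierarchy), a visitor setup boils down to something like this:

```cpp
#include <iostream>

// Forward declarations of the node types.
struct Variable;
struct FunctionCall;

// Abstract visitor: one visit() overload per node type.
struct NodeVisitor {
  virtual auto visit(Variable& t_node) -> void = 0;
  virtual auto visit(FunctionCall& t_node) -> void = 0;
  virtual ~NodeVisitor() = default;
};

// Nodes only know how to accept a visitor; the behavior lives elsewhere.
struct Node {
  virtual auto accept(NodeVisitor& t_visitor) -> void = 0;
  virtual ~Node() = default;
};

struct Variable : Node {
  auto accept(NodeVisitor& t_visitor) -> void override { t_visitor.visit(*this); }
};

struct FunctionCall : Node {
  auto accept(NodeVisitor& t_visitor) -> void override { t_visitor.visit(*this); }
};

// Example visitor: a printer that never touches the node classes themselves.
struct AstPrinter : NodeVisitor {
  auto visit(Variable&) -> void override { std::cout << "Variable\n"; }
  auto visit(FunctionCall&) -> void override { std::cout << "FunctionCall\n"; }
};
```

Adding a new operation, like a pretty printer or a code generator, then means adding one new visitor class instead of touching every node.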
The problem: inefficient lookups
So the visitor design pattern is used to implement algorithms, but not to perform lookups.
So I wrote a naive implementation for the C++ backend, which would use the visitor pattern to create forward declarations in C++ code, so that the generated C++ functions could call each other regardless of the order they were defined in (see crow/codegen/cpp_backend/prototype_generator.hpp).
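For context, this is the kind of C++ the prototype generator has to emit (a hand-written illustration, not actual compiler output): without the two prototypes at the top, is_even() could not call is_odd(), since its definition comes later in the file.

```cpp
#include <iostream>

// Forward declarations emitted up front, so definition order stops mattering.
auto is_even(int t_n) -> bool;
auto is_odd(int t_n) -> bool;

auto is_even(const int t_n) -> bool
{
  // Can call is_odd() even though it is defined below.
  return t_n == 0 ? true : is_odd(t_n - 1);
}

auto is_odd(const int t_n) -> bool
{
  return t_n == 0 ? false : is_even(t_n - 1);
}

auto main() -> int
{
  std::cout << std::boolalpha << is_even(10) << '\n'; // true

  return 0;
}
```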
Now this does work, but as you can imagine it does not scale well. This lookup is of course super inefficient, and not the way to go. Imagine you start implementing multiple modules and you need to check if a variable in another module is defined. Is it then efficient to walk potentially the entire AST for the single variable you want to check the type of? No, of course not.
The solution: the symbol table
So what is the solution? Stuff all defined structs, functions and variables into a key-value map for each scope in the program. This way you can perform lookups way more efficiently for every software entity defined.
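A minimal sketch of such a scoped symbol table, assuming a plain stack of maps (the real crowlang symbol table stores richer symbol data; this one only maps a name to a type name for illustration):

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal scoped symbol table: a stack of name -> type-name maps.
class SymbolTable {
  private:
  using Scope = std::unordered_map<std::string, std::string>;

  std::vector<Scope> m_scopes{Scope{}}; // Start with the global scope.

  public:
  auto push_scope() -> void { m_scopes.emplace_back(); }
  auto pop_scope() -> void { m_scopes.pop_back(); }

  //! Define a symbol in the innermost scope.
  auto define(const std::string& t_name, const std::string& t_type) -> void
  {
    m_scopes.back()[t_name] = t_type;
  }

  //! Look a symbol up, walking from the innermost to the outermost scope.
  auto lookup(const std::string& t_name) const -> std::optional<std::string>
  {
    for(auto it{m_scopes.rbegin()}; it != m_scopes.rend(); ++it) {
      if(const auto found{it->find(t_name)}; found != it->end()) {
        return found->second;
      }
    }

    return std::nullopt;
  }
};
```

A lookup is now a handful of hash map probes instead of a full walk of the AST.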
Mistakes: naive implementation
So I honestly made a mistake whilst implementing the symbol table. You see, I thought it would be smart and efficient to create the symbol table on the go, during semantic analysis. And this works lovely, but it is not the way to go.
As I now run into the same problem: during semantic analysis I am still constructing the symbol table, so performing lookups for toplevel functions is still order dependent. I will need to fix this by pre-inserting toplevel functions, so that the semantic analysis for toplevel entities is in fact order independent.
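The fix boils down to a two-pass approach. A rough sketch of the driver logic, where collect_toplevel_declaration() and check_node() are hypothetical helper names and NodeListPtr just stands in for whatever the AST root is:

```cpp
// Hypothetical two-pass driver; the names are illustrative, not real code.
auto SemanticChecker::check(NodeListPtr t_ast) -> void
{
  // Pass 1: pre-insert every toplevel function/struct into the symbol table,
  // without checking any bodies yet.
  for(const auto& node : *t_ast) {
    collect_toplevel_declaration(node);
  }

  // Pass 2: check the bodies; lookups of other toplevel entities now succeed
  // regardless of the order they were defined in.
  for(const auto& node : *t_ast) {
    check_node(node);
  }
}
```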
Considerations
I also don't use the symbol table yet for generating bindings, which I should definitely do, as it is way more efficient.
Currently the symbol table also only holds a limited amount of metadata about a symbol: only the type, but not what attributes a symbol has or whether it should be exported to another module.
Also, the symbol table might not need to hold all the symbols. Only the toplevel entities might be enough, as those are the only ones that actually matter when it comes down to exporting or importing symbols.
Exceptions everywhere
So something I do in all my projects is throw an exception for any and all cases that I do not expect.
For example, I have a ton of enum-to-string conversion functions, which throw on the default case:
```cpp
// Macros:
#define MATCH(t_key, t_value) \
  case Opcode::t_key:         \
    str = t_value;            \
    break

// Functions:
auto opcode2str(const Opcode t_opcode) -> std::string_view
{
  std::string_view str{};

  switch(t_opcode) {
    // Literals:
    MATCH(CONST_F32, "const_f32");
    MATCH(CONST_F64, "const_f64");
    MATCH(CONST_INT, "const_int");
    MATCH(CONST_I8, "const_i8");
    MATCH(CONST_I16, "const_i16");
    MATCH(CONST_I32, "const_i32");
    MATCH(CONST_I64, "const_i64");
    MATCH(CONST_ISIZE, "const_isize");
    MATCH(CONST_UINT, "const_uint");
    MATCH(CONST_U8, "const_u8");
    MATCH(CONST_U16, "const_u16");
    MATCH(CONST_U32, "const_u32");
    MATCH(CONST_U64, "const_u64");
    MATCH(CONST_USIZE, "const_usize");
    MATCH(CONST_STRING, "const_string");
    MATCH(CONST_BOOL, "const_bool");

    // A gazillion more MATCH statements
    ...

    default:
      // This will throw.
      lib::stdexcept::invalid_argument(
        "Opcode could not be converted to std::string_view.");
      break;
  }

  return str;
}
```
Instead of returning something like unknown as a string, or an empty string, I instantly throw an exception, as having any enumeration value not handled in the switch statement is likely a grave programming error.
I get a ton of mileage and prevention of silly bugs by double checking everything and always throwing if I get some kind of unexpected nil state.
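The same habit applies outside of enum conversions: whenever a pointer or handle is unexpectedly empty, fail loudly right away. A tiny hypothetical sketch of that pattern (not the actual helpers in lib::stdexcept):

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: throw immediately on an unexpected nil state.
template<typename T>
auto expect_nonnull(const std::shared_ptr<T>& t_ptr, const std::string_view t_msg)
  -> const std::shared_ptr<T>&
{
  if(!t_ptr) {
    // Fail loudly instead of silently propagating a nullptr.
    throw std::invalid_argument{std::string{t_msg}};
  }

  return t_ptr;
}
```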
Massive bug prevention oversight
So my AST nodes have an optional trait they can inherit from, which adds the functionality to attach type data to an AST node. This is done like so:
```cpp
// AST node class header:
namespace ast::node::lvalue {
// Aliases:
using container::TextPosition;
using node_traits::Identifier;
using node_traits::NodePosition;
using node_traits::TypeAnnotation;
using node_traits::TypeData;

// Classes:
class Variable : public NodePosition,
                 public Identifier,
                 public TypeAnnotation,
                 public TypeData {
  public:
  Variable(TextPosition t_pos, std::string_view t_identifier);
  Variable(TextPosition t_pos, std::string_view t_identifier,
           std::string_view t_type);

  AST_ARCHIVE_MAKE_TRAITS_ARCHIVEABLE(Variable, NodePosition, Identifier,
                                      TypeAnnotation)
  AST_VISITOR_MAKE_VISITABLE(visitor::NodeVisitor);

  virtual ~Variable() = default;
};
} // namespace ast::node::lvalue

// TypeData header:
namespace ast::node::node_traits {
// Using:
using types::core::TypeVariant;

/*!
 * Class for annotating the AST with type information.
 * This information is then later used during code generation.
 */
class TypeData : virtual public NodeInterface {
  protected:
  TypeVariant m_data;

  public:
  TypeData() = default;

  virtual auto set_type(TypeVariant t_data) -> void;
  virtual auto get_type() const -> const TypeVariant&;

  AST_VISITOR_VISITABLE_PURE_ACCEPT(visitor::NodeVisitor);

  virtual ~TypeData() = default;
};
} // namespace ast::node::node_traits

// Usage in a visitor:
auto SemanticChecker::visit(Variable* t_var) -> Any
{
  const auto id{t_var->identifier()};
  const auto var{get_symbol(id)};

  DBG_INFO("Variable ", std::quoted(id), " of type ", var);

  // Annotate AST.
  t_var->set_type(var.type_variant()); // This is the magic where we set the type data.

  return var;
}
```
Now you might see this and ask yourself: okay, but how is TypeVariant implemented? Well, that is like so:
```cpp
namespace types::core {
// Using Statements:
using types::core::NativeType;

// Aliases:
using Variant = std::variant<StructTypePtr, FnTypePtr, VarTypePtr, NativeType>;

// Classes:
// FIXME: We probably should probably not inherit from Variant.
// Instead we should prefer composition.
/*!
 * Recursive @ref Variant tree structure that denotes the type tree of a symbol.
 * This class unlike @ref SymbolData solely holds the type information.
 * The type tree is also a separate one from @ref SymbolData.
 */
class TypeVariant : public Variant {
  public:
  // Use the constructors of the parent class.
  using Variant::Variant;

  auto struct_() const -> StructTypePtr;
  auto function() const -> FnTypePtr;
  auto var() const -> VarTypePtr;
  auto native_type() const -> core::NativeTypeOpt;

  virtual ~TypeVariant() = default;
};
} // namespace types::core
```
Now the question is: can you spot the subtle logic bug, which is very likely to happen as the project grows?
The subtle bug
So sometimes in the logging I would see the following line pop up frequently:
```
[ERROR][crow/types/semantic/symbol.cpp:23 -> std::ostream& operator<<(std::ostream& t_os, StructTypePtr t_struct)]: (StructTypePtr) nullptr!
```
And I kept wondering: why do I see this message pop up? And then I asked myself: what happens if we never call set_type() on an AST node? Well, then the std::variant will keep its default constructed value, which default constructs the first type in its template argument list. Which in my case is StructTypePtr, which is a std::shared_ptr, which when default constructed holds nullptr. So somewhere I am trying to print a default constructed TypeVariant or SymbolData (both type trees use std::variant in a similar manner).
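A tiny standalone example of that behavior (not crowlang code, just the std::variant rule in isolation):

```cpp
#include <iostream>
#include <memory>
#include <variant>

struct StructType {};
using StructTypePtr = std::shared_ptr<StructType>;

auto main() -> int
{
  // Default constructing the variant default constructs its first
  // alternative: a shared_ptr, which holds nullptr.
  std::variant<StructTypePtr, int> var{};

  std::cout << std::boolalpha
            << (std::get<StructTypePtr>(var) == nullptr) << '\n'; // true

  return 0;
}
```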
So in order to fix this issue that I have been ignoring forever, I need to debug what is causing a default constructed TypeVariant or SymbolData to be printed. And after fixing this, I should use std::monostate as the first template argument, so that I can add a visitor overload that treats std::monostate as an error and throws an exception in the future.
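A minimal sketch of that fix, reusing the Overload helper from earlier (illustrative only, not the actual TypeVariant):

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <variant>

struct StructType {};
using StructTypePtr = std::shared_ptr<StructType>;

// std::monostate first: a default constructed variant is now an explicit
// "unset" state instead of a silent nullptr shared_ptr.
using TypeVariant = std::variant<std::monostate, StructTypePtr, std::string>;

template<class... Ts>
struct Overload : Ts... {
  using Ts::operator()...;
};
template<class... Ts>
Overload(Ts...) -> Overload<Ts...>;

auto type2str(const TypeVariant& t_type) -> std::string
{
  return std::visit(
    Overload{[](std::monostate) -> std::string {
               // An unset type reaching this point is a programming error.
               throw std::logic_error{"TypeVariant was never set!"};
             },
             [](const StructTypePtr&) -> std::string { return "struct"; },
             [](const std::string& t_name) -> std::string { return t_name; }},
    t_type);
}
```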
TODO's everywhere
So whilst working on the project I started dumping a ton of TODOs everywhere, because there were a ton of things that I had to get back to later, or things that I should account for in the future.
But now that the project has grown a sizeable amount, I can't really keep depending on TODOs everywhere. I tried using GitHub issues to keep track of some issues I was facing, but this does not cover design at all.
So I am slowly getting to a point where I need to actually start thinking about project management and about what I should finish. For now I might consider just creating an Excel document to keep track of stuff I need to account for, as it's nice and easy to add to and to quickly see what needs getting done.
Building IR
So at a certain point in time, when I started out with creating the LLVM backend, I ran into some issues with lowering my AST into LLVM IR. Since my source language is wildly different from something low level like LLVM IR, I needed to add an extra compiler phase for generating my own IR. This way I could cut down on the complexity of the LLVM IR code generation by adding an in-between step.
I also considered that this could be useful when dealing with the C++ backend: if there is a piece of code that looks like it won't translate well to C++, I can always fall back to using my IR for codegen (whether this is feasible is yet to be determined).
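To give a flavor of what such an in-between representation can look like, here is a heavily simplified, hypothetical instruction-style sketch (the actual CLIR structures differ; only CONST_I32 matches the opcodes shown earlier, and ADD and RETURN are made up for this example):

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Simplified IR sketch: a flat list of opcode + operand instructions.
enum class Opcode : std::uint16_t {
  CONST_I32,
  ADD,
  RETURN,
};

using Operand = std::variant<std::monostate, std::int64_t, std::string>;

struct Instruction {
  Opcode m_opcode;
  std::vector<Operand> m_operands;
};

// A function lowered to IR is just a name plus its instruction list.
struct IrFunction {
  std::string m_name;
  std::vector<Instruction> m_instructions;
};

// Hand-lowered example of a main function returning 1 + 2.
auto lowered_main() -> IrFunction
{
  return IrFunction{"main",
                    {{Opcode::CONST_I32, {std::int64_t{1}}},
                     {Opcode::CONST_I32, {std::int64_t{2}}},
                     {Opcode::ADD, {}},
                     {Opcode::RETURN, {}}}};
}
```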
Other compilers
Something I learned that really helped development was, whenever I hit a roadblock, thinking: how do other compilers solve this?
So the structure, setup and design of my programming language take a ton of inspiration from various new-age compilers. Specifically, a lot of architecture and design decisions were inspired by Nim, as it is similar in the sense that, as of writing, most bootstrapping is done by generating and compiling C or C++ source code.
More factories
As I started having to create more complex types or extract certain data from the AST, I noticed that I kept having to create more and more factories to construct these complex types. In most smaller projects I rarely need anything besides a factory function or method.
But as this project is scaling, I see an increased need for complete classes in order to walk the AST and create other complex data types. This makes me think about the traditional "OOP is bad" argument, where you spend more time creating logical OOP code than you spend actually implementing features.
I don't really believe in this argument, as these clear entities allow me to reason about the code better and isolate concerns more properly. They are a natural next step in development.
What got done?
So after two years of development I did get a ton done. Here is a table covering the things I have implemented for crowlang:
| Feature | Sub-feature | Finished | Comment |
|---|---|---|---|
| lexing | documentation comments | Yes | They are lexed and tokenized, but this breaks parsing as of writing. |
| lexing | strings | Yes | |
| lexing | numerics | Yes | Integer, hex, float. |
| lexing | identifiers | Yes | |
| lexing | keywords | Yes | |
| lexing | operators | Yes | Single char and multi char operators (e.g. -, !, &&, ==, >=, etc.). |
| lexing | text location annotation | Yes | Adding of the file, line, lineno, columno metadata to tokens when lexing. |
| grammar | full grammar | Partially | Currently we miss OOP, switch statements, and anything related to pointers and memory access. |
| grammar | variables | Yes | Definitions, constants and assignment. |
| grammar | functions | Yes | Defining and calling. |
| grammar | unary operators | Yes | Unary plus and minus. |
| grammar | binary operators | Yes | (e.g. ++, --, +, -, *, /, %, +=, -=, *=, /=, %=). |
| grammar | boolean constants | Yes | True and False exist and are their own types. |
| grammar | logical operators | Yes | (e.g. !, &&, \|\|). |
| grammar | comparison operators | Yes | (e.g. >, >=, ==, !=, <, <=). |
| grammar | control flow statements | Yes | Loops, continue, break, if statements, else, elif, defer. |
| grammar | pointers | No | |
| grammar | structs | No | |
| grammar | enums | No | |
| grammar | switch statements | No | |
| grammar | match statements | No | |
| grammar | member access | No | |
| grammar | array indexing | No | |
| grammar | casting | No | |
| grammar | modules | Yes | Module definitions and imports are defined but never used. |
| parsing | full grammar parsing | Partially | All basic operations work, but structs are not parsed as of writing. |
| parsing | operator precedence | Yes | Using Pratt parsing. |
| types | native types | Yes | We have all native types defined for now. |
| semantic analysis | type checking | Yes | We validate that the types of every variable and the return type of a function do in fact match. |
| semantic analysis | type promotion (integers) | Yes | Smaller integers can be promoted to larger integer types in expressions. |
| semantic analysis | type promotion (conditionals) | Yes | Integers, unsigned integers and booleans can all be implicitly converted to be truthy (someday I must add pointers and optionals). |
| semantic analysis | type inference | Yes | We can deduce the type of an expression and annotate that variable to be of that type. |
| semantic analysis | symbol redefinition | No | I forgot to throw an error for duplicate symbol definitions. |
| semantic analysis | non-existent symbol reference | Yes | We error if a variable or function call was not defined. |
| semantic analysis | using AST annotations | Yes | We read the string which is given as the type of variables and functions and resolve them to (for now only) native types. |
| semantic analysis | annotating AST with resolved TypeData | Yes | We annotate the AST with resolved type annotations. |
| serialization | AST | Yes | Using the cereal C++ library I can serialize the entirety of the AST. |
| IR | structure definition | Partially | The IR is slowly taking shape. |
| IR | building IR from AST | No | Currently I only have if statements and return statements finished. |
| codegen | C++ backend | Yes | We generate valid C++ code for the entire AST (except OOP/structs). |
| codegen interop | Python interop | Yes | We automatically export all functions to Python when this option is enabled. |
| codegen | LLVM backend | No | This is in a severely broken state and won't work till I have the IR properly generated. |
| build phases | translation unit | Yes | We now group a single source file/translation unit in its own class which manages its own compilation pipeline (important for later, when we want to make compilation concurrent). |
| build phases | build unit | Yes | A build unit can be reused by multiple translation units to compile themselves. |
| build phases | module unit | No | I want source files with the same module to merge their ASTs and then be turned into a single module unit. |
| build phases | program unit | No | Overarching unit which manages the compiler pipeline for everything, consisting of module units, translation units, etc. |
| testing | unit testing | Yes | I have Catch2 integrated in the build system and can create tests. |
| testing | tests | No | I have only one test written, for AST serialization. |
| compiler configuration | CLI | Yes | There are various command line arguments you can give, like which backend to run, the debug level, etc. |
| compiler configuration | crow.toml | Yes | You can define a crow.toml which can be used to configure the compiler; the CLI args do take precedence over the toml file. |