Challenges of writing a compiler (crowlang update year 2)
So it's been a while since I decided to blog on my personal website again, but I felt like showing off my recent experiences doing compiler development. In my previous article about statement/semicolon inference, I covered how I wanted to tackle statement inference for my own programming language, crowlang.
So crowlang started by borrowing a ton of code from another project that I worked on called awx, which was my approach to building a POSIX awk compatible interpreter with extensions. But I became unhappy and bored with the project, as I had to do large refactors to the parsing in order to clean up the project and make it ready for command line use. I was also generally unhappy with some of the design choices that I made for the project.
And whilst I had most of the POSIX functionality implemented, including most standard library functions, I did not feel like refactoring, and decided that I wanted to go all the way with creating my own systems programming language.
So I created a new repo, copied over most of the old awx code and started doing large-scale refactors. I had already implemented lexing, the top-down parser and the Pratt parser, so I could reuse all those components.
I started by writing my own grammar for the language from scratch. I also refactored the lexer, as I was not happy with how I had implemented it in awx. I also had to throw out large parts of the parser, as I was going for a completely different grammar.
And over the last two years I have been working on the project on and off, but fairly consistently. So here are some of my struggles and eureka moments tackling a large project like this. This article is more of a collection of ramblings about compilers than a coherent story.
Initial goals changing
So my first goal when I started programming my own systems language could be roughly summarized in a single sentence: "Create a safer systems programming language that interops with C and C++ without the need for semicolons". So the idea was to basically write a custom frontend for generating C++ code, like cppfront. Now of course there are a million other C++ successor languages, but I really wanted to have a go at it, just because I was curious about learning more about compiler development. I had built a ton of smaller projects using LLVM and interpreters, but none of them were really large/complete.
But as I continued working on the compiler, I eventually ended up with two backends. I created an LLVM one, which I half finished, and a C++ backend for bootstrapping/prototyping features of the language. And somewhere along the line I had this fun idea of adding an interoperability backend to the C++ backend. I did this using pybind11.
You see, I had always been interested in programming language bindings, so I decided to add this as a cool feature. But once I implemented this and got my first Crowlang functions generated to C++ which could automatically interop with Python, I got really inspired to continue this.
So now I have the added goal/motivation of adding as many programming language bindings as I can. I want to have these activated using attributes (these still need to be implemented; for now we just expose everything).
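To give an idea of the mechanism (a hand-written illustration, not the actual generated bindings), a minimal pybind11 sketch for exposing a C++ function to Python looks roughly like this; the add function and the example module name are made up for this sketch:

```cpp
#include <pybind11/pybind11.h>

// Hypothetical function, standing in for generated C++ code.
auto add(const int t_lhs, const int t_rhs) -> int
{
  return t_lhs + t_rhs;
}

// The pybind11 module definition exposing the function to Python.
PYBIND11_MODULE(example, m)
{
  m.doc() = "Illustrative bindings for a generated function.";
  m.def("add", &add, "Add two integers.");
}
```

Once compiled into a shared library, Python can simply import example and call example.add(1, 2) directly.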
Big projects and refactoring
I am a bit of a perfectionist and always have been. One of the biggest issues I had with the project was leaving finished components untouched, as I can be very picky about how I word entities in my code.
And I have this habit of revisiting my old code and thinking to myself "This should be named/implemented differently". As a software engineer you have to constantly ask yourself if your old code still reflects what you want to achieve now.
I had a lot of instances where I named something, later went back, and did not make any functional changes, but just spent time renaming the original module to better reflect the use case/goal that the module is supposed to serve.
Project size
This is of course fine for smaller projects, but after working on this project solo for a while, the size of the project really grew:
```
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Autoconf                1           57           34           13           10
 CMake                  61          796          543          148          105
 C++                   105         9128         6189          908         2031
 C++ Header            158         7637         4480         1465         1692
 Python                  1          242          159           21           62
 Shell                   1           89           59           10           20
 Plain Text              1           22            0           22            0
 TOML                    1            8            7            0            1
-------------------------------------------------------------------------------
 Markdown                9          212            0          175           37
 |- Go                   1            7            6            0            1
 |- Shell                1            4            4            0            0
 (Total)                            223           10          175           38
===============================================================================
 Total                 338        18191        11471         2762         3958
===============================================================================
```
And crowlang now counts roughly 16k lines of C++ code. So here are some changes that I did not have the foresight to see coming.
Compiler, standard library, and project library
So when I first started tackling/writing the project, I just dumped everything in src/, because it was easy and I had no clue that this would not scale well. So I had this consistent issue of where to place utility functions. For example, the following utility code is used for adding type overloads to std::visit:
```cpp
// Overload pattern:
/*!
 * Helper struct for overloading callables when using @ref std::visit.
 */
template<class... Ts>
struct Overload : Ts... {
  using Ts::operator()...;
};

//! Deduction guide for the overload to work.
template<class... Ts>
Overload(Ts...) -> Overload<Ts...>;
```
This allows for writing code like so:
```cpp
std::variant<std::string, bool> var{"true"};

// Lazy implementation I know.
const auto truthy_str{[](const std::string& t_str) -> bool {
  return t_str == "true";
}};
const auto truthy_bool{[](const bool t_b) -> bool {
  return t_b;
}};

std::visit(Overload(truthy_str, truthy_bool), var);
```
This allows you to overload the visitor to call different function overloads depending on the type of the value contained in a std::variant. This was a super useful utility function that I ended up using throughout the codebase.
Initially I had the bright idea of just dumping this piece of code in some header where it was frequently used, with a big TODO above it about how I should relocate it somewhere else.
Eventually, whilst implementing some basic standard library functionality, I started to notice that I needed to split up the compiler from the standard library and general utilities. This gave birth to the following directory structure:
```
crowlang/src/crow
crowlang/src/lib
crowlang/src/stdlibcrow
```
This might seem insignificant and obvious, but performing this step allowed me to start splitting up my generated code into separate CMake targets, which really helped later down the line when defining installation/deployment for the compiler.
```cmake
# Set the different CMake targets.
# Define the possible targets/builds.
set(TARGET_CROW_BIN "crow") # Compiler binary.
set(TARGET_CROW_LIB "crowlib") # Compiler library.
set(TARGET_CROW_STDLIBCROW "stdlibcrow") # Standard library for generated C++.
set(TARGET_CROW_TESTS "crow_tests") # Unit test binary

# Conditionally set the main target:
set(TARGET_CROW "crow")

# Change the main target if we are building tests:
if(DEFINED CROW_BUILD_TESTS)
  set(TARGET_CROW ${TARGET_CROW_TESTS})
endif()
```
When I later want to create other binaries to ship with my compiler, like a formatter, language server or package manager, I can of course expand on this.
Type system centralization
So every programming language has a type system. I initially had to establish my first notion of a type system when I started dealing with semantic evaluation and changing type annotations for variables from strings into an actual type tree.
So my first idea was to have a single type tree system using an std::variant, which would track whether a certain symbol was a native type, a function, a struct or a normal variable. I defined it in my semantic module, crowlang/src/semantic. This semantic module would perform type checking and type inference, and annotate nodes in my AST with the right type information.
This was lovely and all, and I had the bright idea of creating a separate type tree within the AST, for annotating the AST without all the superfluous information needed for semantic evaluation.
This worked fine honestly, but I was not happy with the placement, as I realized when working on implementing the IR that I also needed access to the native_types.hpp header, and that having it be part of crow/ast/node/node_traits/typing/native_types.hpp made zero sense/was illogical.
So I relocated these two distinct type trees to their own toplevel module named types, which contained crow/types/core and crow/types/semantic. This way I had one clear module which would be responsible for type information throughout the project.
Error handling
So when I started working on lexing, parsing and semantic evaluation I also had to start reporting errors to the user of my compiler.
I did this quite lazily and thought, you know what, I will just make a generic exception/ module which contains all my errors. I also used the excellent cpptrace library for automatically adding a stacktrace to the error messages for Error, SyntaxError, and TypeError.
This worked lovely until I started to notice a difference between what a user and a developer would want to see when encountering an error. You see, a user of a compiler does not care about seeing a stack trace. When a user gets a diagnostic error because they defined the same variable twice, they just want to see where the offending line is. But when a user gets a std::exception or a crowlang-specific exception, they would need to report it.
So I had to rework the error modules into the following:
```
crowlang/src/crow/diagnostic
crowlang/src/lib/stdexcept
```
This way I had diagnostic for reporting errors caused by faulty user input, and I had stdexcept for all standard error cases with exceptions. I added printing of stacktraces to every exception in both these modules.
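As a rough illustration of that idea (a minimal sketch, not the actual crowlang exception hierarchy), an exception type that eagerly captures a stack trace with cpptrace could look like this:

```cpp
#include <stdexcept>
#include <string>

#include <cpptrace/cpptrace.hpp>

// Hypothetical exception base that captures a stack trace on construction.
class TracedError : public std::runtime_error {
  private:
  cpptrace::stacktrace m_trace;

  public:
  explicit TracedError(const std::string& t_msg)
    : std::runtime_error{t_msg}, m_trace{cpptrace::generate_trace()}
  {}

  //! Print the captured stack trace alongside the error message.
  auto print_trace() const -> void
  {
    m_trace.print();
  }
};
```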
This way the catch-all clauses in my main.cpp could be changed to the following:
```cpp
auto main(const int t_args, char* t_argv[]) -> int
{
  try {
    // Irrelevant code removed.
  } catch(CLI::ParseError& e) {
    const auto exit_code{cli_params.m_app.exit(e)};
    std::cerr << std::format("CLI11 exit code: {}\n", exit_code);

    return ExitCode::CLI11_EXCEPTION;
  } catch(diagnostic::DiagnosticError& diag_err) {
    report_diagnostic_error(diag_err);

    return ExitCode::DIAGNOSTIC_ERROR;
  } catch(lib::stdexcept::Exception& except) {
    report_stdexcept(except);

    return ExitCode::STDEXCEPT_EXCEPTION;
  } catch(std::exception& except) {
    std::cerr << except.what() << '\n';
    report_exception(except);

    return ExitCode::STL_EXCEPTION;
  } catch(...) {
    report_unhandled_object();

    return ExitCode::UNCAUGHT_OBJECT;
  }

  return ExitCode::OK;
}
```
This also allows me to neatly know if an exception I caught comes from within the STL and my misuse of it, or I get a neat stacktrace showing me where I messed up.
If I catch a lib::stdexcept::Exception or a normal std::exception, I print the following message after the error message and stack trace.
```
Reporting an issue:
If you are seeing this message and are not a crowlang contributor.
Then please open a Github Issue.
As this error is likely a result of unintended behavior.

Open an issue at: https://github.com/soerlemans/crowlang/issues/new/choose
```
As this should never happen in a production release, this neatly lets any users know that they should open a bug report.
Optimizing and getting rid of the visitor-everywhere idiom
So I wanted to make my programming language order agnostic: I wanted toplevel functions defined in the same module to be able to call each other without the need for forward declarations, like in Rust and Golang. I thought this would be a cool feature.
The visitor design pattern
So the visitor design pattern is amazing for compiler development. It allows you to decouple behavior from the AST nodes themselves. Now I am not going to explain the visitor pattern itself in this segment; instead I suggest you read the explanation on refactoring.guru.
But basically, the visitor design pattern is used a lot in compiler development, as it prevents you from having to add methods directly to the AST nodes in order to implement functionality. Instead you can specify the behavior in a separate data structure.
Here are some examples I use in the project:
crow/ast/visitor/node_visitor.hpp - Abstract base class for the visitor pattern.
crow/ast/visitor/ast_archive.hpp - Used to serialize and deserialize the AST (more on this later).
crow/ast/visitor/ast_printer.hpp - Used to pretty print the entirety of the AST.
crow/semantic/semantic_checker_helper.hpp - Helper class inherited from for performing semantic evaluation.
crow/clir/clir_builder/clir_builder.hpp - IR builder class that converts the AST into Crowlang IR.
crow/codegen/cpp_backend/cpp_backend.hpp - C++ backend which transpiles Crowlang code to C++.
crow/codegen/llvm_backend/llvm_backend.hpp - LLVM backend, which does not work, as generating LLVM IR directly from the AST is not feasible.
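To give a rough idea of the shape of the pattern (heavily simplified, not crowlang's actual node hierarchy), a visitor setup boils down to something like this:

```cpp
#include <iostream>

// Forward declarations of the node types.
struct Variable;
struct FunctionCall;

// Abstract visitor: one visit() overload per node type.
struct NodeVisitor {
  virtual auto visit(Variable& t_node) -> void = 0;
  virtual auto visit(FunctionCall& t_node) -> void = 0;
  virtual ~NodeVisitor() = default;
};

// Nodes only know how to accept a visitor; the behavior lives elsewhere.
struct Node {
  virtual auto accept(NodeVisitor& t_visitor) -> void = 0;
  virtual ~Node() = default;
};

struct Variable : Node {
  auto accept(NodeVisitor& t_visitor) -> void override { t_visitor.visit(*this); }
};

struct FunctionCall : Node {
  auto accept(NodeVisitor& t_visitor) -> void override { t_visitor.visit(*this); }
};

// Example visitor: a printer that never touches the node classes themselves.
struct AstPrinter : NodeVisitor {
  auto visit(Variable&) -> void override { std::cout << "Variable\n"; }
  auto visit(FunctionCall&) -> void override { std::cout << "FunctionCall\n"; }
};
```

Adding a new operation, like a pretty printer or a code generator, then means adding one new visitor class instead of touching every node.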
The problem: inefficient lookups
So the visitor design pattern is used to implement algorithms, but not to perform lookups.
So I wrote a naive implementation for the C++ backend, which would use the visitor pattern to create forward declarations in C++ code, so that the generated C++ functions could call each other regardless of the order they were defined in (see crow/codegen/cpp_backend/prototype_generator.hpp).
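For context, this is the kind of C++ the prototype generator has to emit (a hand-written illustration, not actual compiler output): without the two prototypes at the top, is_even() could not call is_odd(), since its definition comes later in the file.

```cpp
#include <iostream>

// Forward declarations emitted up front, so definition order stops mattering.
auto is_even(int t_n) -> bool;
auto is_odd(int t_n) -> bool;

auto is_even(const int t_n) -> bool
{
  // Can call is_odd() even though it is defined below.
  return t_n == 0 ? true : is_odd(t_n - 1);
}

auto is_odd(const int t_n) -> bool
{
  return t_n == 0 ? false : is_even(t_n - 1);
}

auto main() -> int
{
  std::cout << std::boolalpha << is_even(10) << '\n'; // true

  return 0;
}
```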
Now this does work, but as you can imagine it does not scale well. This lookup is of course super inefficient, and not the way to go. Imagine you start implementing multiple modules and you need to check if a variable in another module is defined. Is it then efficient to walk potentially the entire AST for the single variable you want to check the type of? No, of course not.
The solution: the symbol table
So what is the solution? Stuff all defined structs, functions and variables into a key-value map for each scope in the program. This way you can perform lookups way more efficiently for every software entity defined.
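A minimal sketch of such a scoped symbol table, assuming a plain stack of maps (the real crowlang symbol table stores richer symbol data; this one only maps a name to a type name for illustration):

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal scoped symbol table: a stack of name -> type-name maps.
class SymbolTable {
  private:
  using Scope = std::unordered_map<std::string, std::string>;

  std::vector<Scope> m_scopes{Scope{}}; // Start with the global scope.

  public:
  auto push_scope() -> void { m_scopes.emplace_back(); }
  auto pop_scope() -> void { m_scopes.pop_back(); }

  //! Define a symbol in the innermost scope.
  auto define(const std::string& t_name, const std::string& t_type) -> void
  {
    m_scopes.back()[t_name] = t_type;
  }

  //! Look a symbol up, walking from the innermost to the outermost scope.
  auto lookup(const std::string& t_name) const -> std::optional<std::string>
  {
    for(auto it{m_scopes.rbegin()}; it != m_scopes.rend(); ++it) {
      if(const auto found{it->find(t_name)}; found != it->end()) {
        return found->second;
      }
    }

    return std::nullopt;
  }
};
```

A lookup is now a handful of hash map probes instead of a full walk of the AST.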
Mistakes: naive implementation
So I honestly made a mistake whilst implementing the symbol table. You see, I thought it would be smart and efficient to create the symbol table on the go, during semantic analysis. And this works lovely, but it is not the way to go.
As I now run into the same problem: during semantic analysis I am still constructing the symbol table, so performing lookups for toplevel functions is still order dependent. I will need to fix this by pre-inserting toplevel functions, so that the semantic analysis for toplevel entities is in fact order independent.
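The fix boils down to a two-pass approach. A rough sketch of the driver logic, where collect_toplevel_declaration() and check_node() are hypothetical helper names and NodeListPtr just stands in for whatever the AST root is:

```cpp
// Hypothetical two-pass driver; the names are illustrative, not real code.
auto SemanticChecker::check(NodeListPtr t_ast) -> void
{
  // Pass 1: pre-insert every toplevel function/struct into the symbol table,
  // without checking any bodies yet.
  for(const auto& node : *t_ast) {
    collect_toplevel_declaration(node);
  }

  // Pass 2: check the bodies; lookups of other toplevel entities now succeed
  // regardless of the order they were defined in.
  for(const auto& node : *t_ast) {
    check_node(node);
  }
}
```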
Considerations
I also don't use the symbol table yet for generating bindings, which I should definitely do, as it is way more efficient.
Currently the symbol table also only holds a limited amount of metadata about a symbol: only the type, but not what attributes a symbol has or whether it should be exported to another module.
Also, the symbol table might not need to hold all the symbols. Only the toplevel entities might be enough, as those are the only ones that actually matter when it comes down to exporting or importing symbols.
Exceptions everywhere
So something I do in all my projects is throw an exception for any and all cases that I do not expect.
For example, I have a ton of enum-to-string conversion functions, which throw on the default case:
```cpp
// Macros:
#define MATCH(t_key, t_value) \
  case Opcode::t_key:         \
    str = t_value;            \
    break

// Functions:
auto opcode2str(const Opcode t_opcode) -> std::string_view
{
  std::string_view str{};

  switch(t_opcode) {
    // Literals:
    MATCH(CONST_F32, "const_f32");
    MATCH(CONST_F64, "const_f64");
    MATCH(CONST_INT, "const_int");
    MATCH(CONST_I8, "const_i8");
    MATCH(CONST_I16, "const_i16");
    MATCH(CONST_I32, "const_i32");
    MATCH(CONST_I64, "const_i64");
    MATCH(CONST_ISIZE, "const_isize");
    MATCH(CONST_UINT, "const_uint");
    MATCH(CONST_U8, "const_u8");
    MATCH(CONST_U16, "const_u16");
    MATCH(CONST_U32, "const_u32");
    MATCH(CONST_U64, "const_u64");
    MATCH(CONST_USIZE, "const_usize");
    MATCH(CONST_STRING, "const_string");
    MATCH(CONST_BOOL, "const_bool");

    // A gazillion more MATCH statements
    ...

    default:
      // This will throw.
      lib::stdexcept::invalid_argument(
        "Opcode could not be converted to std::string_view.");
      break;
  }

  return str;
}
```
Instead of returning something like unknown as a string, or an empty string, I instantly throw an exception, as having any enumeration value not handled in the switch statement is likely a grave programming error.
I get a ton of mileage and prevention of silly bugs by double checking everything and always throwing if I get some kind of unexpected nil state.
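The same habit applies outside of enum conversions: whenever a pointer or handle is unexpectedly empty, fail loudly right away. A tiny hypothetical sketch of that pattern (not the actual helpers in lib::stdexcept):

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: throw immediately on an unexpected nil state.
template<typename T>
auto expect_nonnull(const std::shared_ptr<T>& t_ptr, const std::string_view t_msg)
  -> const std::shared_ptr<T>&
{
  if(!t_ptr) {
    // Fail loudly instead of silently propagating a nullptr.
    throw std::invalid_argument{std::string{t_msg}};
  }

  return t_ptr;
}
```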
Massive bug prevention oversight
So my AST nodes have an optional trait they can inherit from, which adds the functionality to attach type data to an AST node. This is done like so:
```cpp
// AST node class header:
namespace ast::node::lvalue {
// Aliases:
using container::TextPosition;
using node_traits::Identifier;
using node_traits::NodePosition;
using node_traits::TypeAnnotation;
using node_traits::TypeData;

// Classes:
class Variable : public NodePosition,
                 public Identifier,
                 public TypeAnnotation,
                 public TypeData {
  public:
  Variable(TextPosition t_pos, std::string_view t_identifier);
  Variable(TextPosition t_pos, std::string_view t_identifier,
           std::string_view t_type);

  AST_ARCHIVE_MAKE_TRAITS_ARCHIVEABLE(Variable, NodePosition, Identifier,
                                      TypeAnnotation)
  AST_VISITOR_MAKE_VISITABLE(visitor::NodeVisitor);

  virtual ~Variable() = default;
};
} // namespace ast::node::lvalue

// TypeData header:
namespace ast::node::node_traits {
// Using:
using types::core::TypeVariant;

/*!
 * Class for annotating the AST with type information.
 * This information is then later used during code generation.
 */
class TypeData : virtual public NodeInterface {
  protected:
  TypeVariant m_data;

  public:
  TypeData() = default;

  virtual auto set_type(TypeVariant t_data) -> void;
  virtual auto get_type() const -> const TypeVariant&;

  AST_VISITOR_VISITABLE_PURE_ACCEPT(visitor::NodeVisitor);

  virtual ~TypeData() = default;
};
} // namespace ast::node::node_traits

// Usage in a visitor:
auto SemanticChecker::visit(Variable* t_var) -> Any
{
  const auto id{t_var->identifier()};
  const auto var{get_symbol(id)};

  DBG_INFO("Variable ", std::quoted(id), " of type ", var);

  // Annotate AST.
  t_var->set_type(var.type_variant()); // This is the magic where we set the type data.

  return var;
}
```
Now you might see this and ask yourself: okay, but how is TypeVariant implemented? Well, that is like so:
```cpp
namespace types::core {
// Using Statements:
using types::core::NativeType;

// Aliases:
using Variant = std::variant<StructTypePtr, FnTypePtr, VarTypePtr, NativeType>;

// Classes:
// FIXME: We probably should probably not inherit from Variant.
// Instead we should prefer composition.
/*!
 * Recursive @ref Variant tree structure that denotes the type tree of a symbol.
 * This class unlike @ref SymbolData solely holds the type information.
 * The type tree is also a separate one from @ref SymbolData.
 */
class TypeVariant : public Variant {
  public:
  // Use the constructors of the parent class.
  using Variant::Variant;

  auto struct_() const -> StructTypePtr;
  auto function() const -> FnTypePtr;
  auto var() const -> VarTypePtr;
  auto native_type() const -> core::NativeTypeOpt;

  virtual ~TypeVariant() = default;
};
} // namespace types::core
```
Now the question is: can you spot the subtle logic bug, which is very likely to happen as the project grows?
The subtle bug
So sometimes in the logging I would see the following line pop up frequently:
```
[ERROR][crow/types/semantic/symbol.cpp:23 -> std::ostream& operator<<(std::ostream& t_os, StructTypePtr t_struct)]: (StructTypePtr) nullptr!
```
And I kept wondering: why do I see this message pop up? And then I asked myself: what happens if we never call set_type() on an AST node? Well, then the std::variant will keep its default constructed value, which default constructs the first type in its template argument list. Which in my case is StructTypePtr, which is a std::shared_ptr, which when default constructed holds nullptr. So somewhere I am trying to print a default constructed TypeVariant or SymbolData (both type trees use std::variant in a similar manner).
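A tiny standalone example of that behavior (not crowlang code, just the std::variant rule in isolation):

```cpp
#include <iostream>
#include <memory>
#include <variant>

struct StructType {};
using StructTypePtr = std::shared_ptr<StructType>;

auto main() -> int
{
  // Default constructing the variant default constructs its first
  // alternative: a shared_ptr, which holds nullptr.
  std::variant<StructTypePtr, int> var{};

  std::cout << std::boolalpha
            << (std::get<StructTypePtr>(var) == nullptr) << '\n'; // true

  return 0;
}
```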
So in order to fix this issue that I have been ignoring forever, I need to debug what is causing a default constructed TypeVariant or SymbolData to be printed. And after fixing this, I should use std::monostate as the first template argument, so that I can add a visitor overload that treats std::monostate as an error and throws an exception in the future.
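A minimal sketch of that fix, reusing the Overload helper from earlier (illustrative only, not the actual TypeVariant):

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <variant>

struct StructType {};
using StructTypePtr = std::shared_ptr<StructType>;

// std::monostate first: a default constructed variant is now an explicit
// "unset" state instead of a silent nullptr shared_ptr.
using TypeVariant = std::variant<std::monostate, StructTypePtr, std::string>;

template<class... Ts>
struct Overload : Ts... {
  using Ts::operator()...;
};
template<class... Ts>
Overload(Ts...) -> Overload<Ts...>;

auto type2str(const TypeVariant& t_type) -> std::string
{
  return std::visit(
    Overload{[](std::monostate) -> std::string {
               // An unset type reaching this point is a programming error.
               throw std::logic_error{"TypeVariant was never set!"};
             },
             [](const StructTypePtr&) -> std::string { return "struct"; },
             [](const std::string& t_name) -> std::string { return t_name; }},
    t_type);
}
```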
TODO's everywhere
So whilst working on the project I started dumping a ton of TODOs everywhere, because there were a ton of things that I had to get back to later, or things that I should account for in the future.
But now that the project has grown a sizeable amount, I can't really keep depending on TODOs everywhere. I tried using GitHub issues to keep track of some issues I was facing, but this does not cover design at all.
So I am slowly getting to a point where I need to actually start thinking about project management and about what I should finish. For now I might consider just creating an Excel document to keep track of stuff I need to account for, as it's nice and easy to add to and to quickly see what needs getting done.
Building IR
So at a certain point in time, when I started out with creating the LLVM backend, I ran into some issues with lowering my AST into LLVM IR. Since my source language is wildly different from something low level like LLVM IR, I needed to add an extra compiler phase for generating my own IR. This way I could cut down on the complexity of the LLVM IR code generation by adding an in-between step.
I also considered that this could be useful when dealing with the C++ backend: if there is a piece of code that looks like it won't translate well to C++, I can always fall back to using my IR for codegen (whether this is feasible is yet to be determined).
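To give a flavor of what such an in-between representation can look like, here is a heavily simplified, hypothetical instruction-style sketch (the actual CLIR structures differ; only CONST_I32 matches the opcodes shown earlier, and ADD and RETURN are made up for this example):

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

// Simplified IR sketch: a flat list of opcode + operand instructions.
enum class Opcode : std::uint16_t {
  CONST_I32,
  ADD,
  RETURN,
};

using Operand = std::variant<std::monostate, std::int64_t, std::string>;

struct Instruction {
  Opcode m_opcode;
  std::vector<Operand> m_operands;
};

// A function lowered to IR is just a name plus its instruction list.
struct IrFunction {
  std::string m_name;
  std::vector<Instruction> m_instructions;
};

// Hand-lowered example of a main function returning 1 + 2.
auto lowered_main() -> IrFunction
{
  return IrFunction{"main",
                    {{Opcode::CONST_I32, {std::int64_t{1}}},
                     {Opcode::CONST_I32, {std::int64_t{2}}},
                     {Opcode::ADD, {}},
                     {Opcode::RETURN, {}}}};
}
```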
Other compilers
Something I learned that really helped development was, whenever I hit a roadblock, thinking: how do other compilers solve this?
So the structure, setup and design of my programming language take a ton of inspiration from various new-age compilers. Specifically, a lot of architecture and design decisions were inspired by Nim, as it is similar in the sense that, as of writing, most bootstrapping is done by generating and compiling C or C++ source code.
More factories
As I started having to create more complex types or extract certain data from the AST, I noticed that I kept having to create more and more factories to construct these complex types. In most smaller projects I rarely need anything besides a factory function or method.
But as this project is scaling, I see an increased need for complete classes in order to walk the AST and create other complex data types. This makes me think about the traditional "OOP is bad" argument, where you spend more time creating logical OOP code than you spend actually implementing features.
I don't really believe in this argument, as these clear entities allow me to reason about the code better and isolate concerns more properly. They are a natural next step in development.
What got done?
So after two years of development I did get a ton done. Here is a table covering the things I have implemented for crowlang:
| Feature | Sub-feature | Finished | Comment |
|---|---|---|---|
| lexing | documentation comments | Yes | They are lexed and tokenized, but this breaks parsing as of writing. |
| lexing | strings | Yes | |
| lexing | numerics | Yes | Integer, hex, float. |
| lexing | identifiers | Yes | |
| lexing | keywords | Yes | |
| lexing | operators | Yes | Single char and multi char operators (e.g. -, !, &&, ==, >=, etc.). |
| lexing | text location annotation | Yes | Adding of the file, line, lineno, columno metadata to tokens when lexing. |
| grammar | full grammar | Partially | Currently we miss OOP, switch statements, and anything related to pointers and memory access. |
| grammar | variables | Yes | Definitions, constants and assignment. |
| grammar | functions | Yes | Defining and calling. |
| grammar | unary operators | Yes | Unary plus and minus. |
| grammar | binary operators | Yes | (e.g. ++, --, +, -, *, /, %, +=, -=, *=, /=, %=). |
| grammar | boolean constants | Yes | True and False exist and are their own types. |
| grammar | logical operators | Yes | (e.g. !, &&, \|\|). |
| grammar | comparison operators | Yes | (e.g. >, >=, ==, !=, <, <=). |
| grammar | control flow statements | Yes | Loops, continue, break, if statements, else, elif, defer. |
| grammar | pointers | No | |
| grammar | structs | No | |
| grammar | enums | No | |
| grammar | switch statements | No | |
| grammar | match statements | No | |
| grammar | member access | No | |
| grammar | array indexing | No | |
| grammar | casting | No | |
| grammar | modules | Yes | Module definitions and imports are defined but never used. |
| parsing | full grammar parsing | Partially | All basic operations work, but structs are not parsed as of writing. |
| parsing | operator precedence | Yes | Using Pratt parsing. |
| types | native types | Yes | We have all native types defined for now. |
| semantic analysis | type checking | Yes | We validate that the types of every variable and the return type of a function do in fact match. |
| semantic analysis | type promotion (integers) | Yes | Smaller integers can be promoted to larger integer types in expressions. |
| semantic analysis | type promotion (conditionals) | Yes | Integers, unsigned integers and booleans can all be implicitly converted to be truthy (someday I must add pointers and optionals). |
| semantic analysis | type inference | Yes | We can deduce the type of an expression and annotate that variable to be of that type. |
| semantic analysis | symbol redefinition | No | I forgot to throw an error for duplicate symbol definitions. |
| semantic analysis | non-existent symbol reference | Yes | We error if a variable or function call was not defined. |
| semantic analysis | using AST annotations | Yes | We read the string which is given as the type of variables and functions and resolve them to (for now only) native types. |
| semantic analysis | annotating AST with resolved TypeData | Yes | We annotate the AST with resolved type annotations. |
| serialization | AST | Yes | Using the cereal C++ library I can serialize the entirety of the AST. |
| IR | structure definition | Partially | The IR is slowly taking shape. |
| IR | building IR from AST | No | Currently I only have if statements and return statements finished. |
| codegen | C++ backend | Yes | We generate valid C++ code for the entire AST (except OOP/structs). |
| codegen interop | Python interop | Yes | We automatically export all functions to Python when this option is enabled. |
| codegen | LLVM backend | No | This is in a severely broken state and won't work till I have the IR properly generated. |
| build phases | translation unit | Yes | We now group a single source file/translation unit in its own class which manages its own compilation pipeline (important for later, when we want to make compilation concurrent). |
| build phases | build unit | Yes | A build unit can be reused by multiple translation units to compile themselves. |
| build phases | module unit | No | I want source files with the same module to merge their ASTs and then be turned into a single module unit. |
| build phases | program unit | No | Overarching unit which manages the compiler pipeline for everything, consisting of module units, translation units, etc. |
| testing | unit testing | Yes | I have Catch2 integrated in the build system and can create tests. |
| testing | tests | No | I have only one test written, for AST serialization. |
| compiler configuration | CLI | Yes | There are various command line arguments you can give, like which backend to run, the debug level, etc. |
| compiler configuration | crow.toml | Yes | You can define a crow.toml which can be used to configure the compiler; the CLI args do take precedence over the toml file. |