Patrick's Notes

A New Document Database

A while back I came across a company/product called Convex. They've built a tool used by website frontends to seamlessly keep the frontend, backend, and database state in sync. Of course, as this is aimed at the TypeScript crowd, all of it is defined in TypeScript (the schema, db mutations & queries). They also have a very good (and long) blog post about how their custom database is structured. In fact, it goes into so much detail that it's almost a guide for how to build your own database.

Why? (every post needs one of these I swear)

In the spirit of learning how things work by building them myself, I of course wanted to try my hand at building such a thing from scratch (mostly). Being curious about how their architecture worked was only part of the reason, though. Another huge part of my interest stemmed from being a little amazed that SQL and relational databases have remained roughly unchanged to this day1.

Now of course SQL databases aren't the only ones that exist: Gel (formerly EdgeDB)2, MongoDB, SurrealDB, and SpacetimeDB, to name just a few, all exist (and are quite different from each other). However, the de facto default database people choose for almost everything is one that supports SQL or an SQL-adjacent query language. There are probably a multitude of reasons for this, a big one being that a common syntax everybody knows helps in hiring (which isn't a bad thing), but you have to wonder: why do we even use a separate syntax/language3 at all to interact with the database?

I bet you can find entire books on distributed systems and the UNIX philosophy that touch on this issue (and they'll probably have very valid and practical arguments). However, none of those arguments could convince me to set aside my distaste for SQL and the like on small, less-than-10-user personal/side projects. When I start a project that needs a database, I just want to work with the structures already present in my code. I don't want to map those into database-specific types and then, in an entirely different language, write three different functions to create, update, and delete a single entity (and three is the minimum; you'll have to write more for nested objects).

You could call me lazy, or say the extra effort is justified by keeping data well organized for CPU-cache and disk read/write efficiency...and you'd have a valid critique of me. But if everyone stopped innovating because things mostly just worked most of the time and ignored even slight inconveniences, then we'd probably still have the XZ backdoor on our systems, or even worse, SSDs might never have been invented. The point is, even if you build something that already exists, and it sucks, and is by every definition objectively worse than the mainstream tooling, at least now you've contributed the fact that your idea sucks to society, and collectively we can try other things or improve on what you've built. Think of it as trying to avoid the bystander effect, except for innovation.

Anyways, enough of my soap box (and me justifying to myself why I should be spending time on this project other than for fun).

Database Goals

My goals for this project are quite simple:

And some non-goals:

Those two non-goals are definitely deal breakers if I had wanted this to be something used in a serious capacity by real people, but remember, this is for ergonomically and joyfully working on side projects. I will also be actively dog-fooding the database by migrating my Discord bot to it from SQLite.

For the implementation, you won't need to have read the Convex blog post to understand anything I'll be mentioning, but I still recommend it because it's a cool write-up.

The Interface

Now without further ado, here is what the API interface looks like:

#[derive(Storable)]
struct User {
    name: Name,
    contacts: HashSet<Id<Self>>,
}

#[derive(Storable)]
struct Name {
    first: String,
    last: String,
}

fn main() {
    let db = Database::new();
    let accessor = DatabaseAccessor::new(db);

    let new_user_id = accessor.transact(|t| {
        t.insert(User {
            name: Name {
                first: "Patrick".to_string(),
                last: "Cleavelin".to_string(),
            },
            contacts: HashSet::new(),
        })
    });
}

Look at how easy it is to get a project running with the db (I have cleaned up the code for brevity, but it really isn't much different from the real thing). It's a pretty simple interface, right? Each transaction is just a closure that runs any number of operations on the db. Anything done inside this closure can be considered an atomic operation. You can also run multiple transactions concurrently by just cloning the DatabaseAccessor into another thread (however, the actual commit of the transaction is locked behind a mutex). That's just how inserting data looks, though, which is pretty useless on its own, so let's look at how you update data:

let new_user_id = ...;

accessor.transact(|t| {
    t.modify(new_user_id, |mut user| {
        user.name.first = "John".to_string();
        
        user
    });
});

Again, pretty straightforward, right? No messing around with SQL; everything is handled in code, so you get all the compiler niceties. Of course, if you can't fetch any data, no real work can be done:

let new_user_id = ...;

let user: Option<User> = accessor.transact(|t| {
    t.get(new_user_id)
});

However, you'll notice that even a read creates a transaction (it won't stay this way; fixing it is quite a trivial change). You may have also noticed that each transact(...) call allows you to return arbitrary data, in this case the entity that was requested. It's also possible to fetch individual fields from entities in one of two ways:

let new_user_id = ...;

// Fetching the full entity and only returning the wanted field
let user_first_name: Option<String> = accessor.transact(|t| {
    t.get(new_user_id).map(|user| user.name.first)
});

// Directly fetching wanted field via a path
let user_first_name: Option<String> = accessor.transact(|t| {
    let Some(Value::String(first_name)) =
        t.get_field(new_user_id, "name.first")
    else {
        panic!("expected a string")
    };

    Some(first_name)
});

The first way of doing this is 100% type-safe but requires cloning the entire entity to get the field you want, whereas the second directly gives you the field, but requires querying it with a path at runtime (which can fail if the path doesn't map to any field). Which one you'd use depends on whether you'd rather trade memory or CPU cycles. Given I'm building this for personal projects with few users, the first option is probably the only one I'd be using, since I'd prefer having the type safety and don't care too much about cloning and throwing away a single entity.
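To make the second, path-based approach concrete, here is a minimal sketch of how dotted-path lookup over a dynamic value type can work. To be clear, none of this is the database's actual code: the Value enum here only has the variants needed for the example, and resolve_path is a hypothetical stand-in for whatever get_field does internally.

```rust
use std::collections::HashMap;

// Hypothetical dynamic representation of a stored entity.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    String(String),
    Object(HashMap<String, Value>),
}

// Walk a dotted path like "name.first" through nested objects,
// returning None if any segment is missing or isn't an object.
fn resolve_path<'a>(root: &'a Value, path: &str) -> Option<&'a Value> {
    path.split('.').try_fold(root, |current, segment| match current {
        Value::Object(fields) => fields.get(segment),
        _ => None,
    })
}

fn main() {
    let name = Value::Object(HashMap::from([
        ("first".to_string(), Value::String("Patrick".to_string())),
        ("last".to_string(), Value::String("Cleavelin".to_string())),
    ]));
    let user = Value::Object(HashMap::from([("name".to_string(), name)]));

    // A valid path resolves to the leaf value...
    assert_eq!(
        resolve_path(&user, "name.first"),
        Some(&Value::String("Patrick".to_string()))
    );
    // ...and a bad path is the runtime failure mode mentioned above.
    assert!(resolve_path(&user, "name.middle").is_none());
}
```

The runtime failure mode is exactly why this option isn't type-safe: nothing stops you from passing a path that the entity's shape can't satisfy.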

This is all well and useful, but if you need an id to fetch anything from the database, your application isn't going to be able to do anything. Which is why get_all() exists:

let all_users: Vec<User> = accessor.transact(|t| t.get_all());

With this final piece, you have the minimum requirements for a fully usable database. Will it be a good experience? Probably not, because:

  1. There isn't a built-in way to filter entities; you have to load the entire list of entities into memory (twice, technically, because the database is already in memory) and filter them yourself.
  2. Schema migrations are 100% manual and done outside the main application (you'd have to write another application that moves data around accordingly). This is one of the primary focuses for me on this project. I want it to be simple and safe to migrate data arbitrarily.
  3. There is no such concept of "foreign keys" currently. I also want to address this.
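Filtering by hand (point 1 above) is at least just plain iterator code. A sketch of what that looks like, using simplified stand-ins for the earlier structs rather than the real Storable-deriving types and accessor:

```rust
// Simplified copies of the earlier structs, without the db machinery,
// so this sketch stands alone.
#[derive(Clone, Debug, PartialEq)]
struct Name {
    first: String,
    last: String,
}

#[derive(Clone, Debug, PartialEq)]
struct User {
    name: Name,
}

fn main() {
    // In the real API this list would come from:
    //     accessor.transact(|t| t.get_all())
    let all_users = vec![
        User { name: Name { first: "Patrick".into(), last: "Cleavelin".into() } },
        User { name: Name { first: "John".into(), last: "Doe".into() } },
    ];

    // With the full list in hand, filtering is ordinary iterator code.
    let cleavelins: Vec<User> = all_users
        .into_iter()
        .filter(|u| u.name.last == "Cleavelin")
        .collect();

    assert_eq!(cleavelins.len(), 1);
    assert_eq!(cleavelins[0].name.first, "Patrick");
}
```

It works, but it's the load-everything-then-scan pattern, which is precisely what built-in filtering would avoid.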

That's it for the core API interface for the database. I'm still working on some implementation details for stuff like migrations, performant filtering, and a form of materialized views (or aggregate data fetching), but I do have enough implemented of what I showed here to go more in-depth in future posts, such as:

  1. What that #[derive(Storable)] does
  2. How transactions are committed and why they are atomic
  3. How the db is serialized to disk

  1. Obviously they aren't truly unchanged; a lot of work has gone into making them more efficient and ergonomic.

  2. Funnily (or sadly?) enough, during the writing of this post Gel was actually absorbed by Vercel and they don't plan on continuing the database anymore.

  3. You could also bring up ORMs as a complete ready-made solution to my use-case here, but then I wouldn't be building anything new and having fun :)

  4. With a small caveat: arrays aren't fully first-class. Any change to an array currently rewrites the entire array as part of the transaction, so you really shouldn't store an array of other entities; store references to them instead. Obviously this won't scale, and I'm working on a better way to handle it.