One of the initial explorations that started this book was how small and simple a WebAssembly compiler implemented in JavaScript could be.
Here's a Tweet from October 1st 2022 with my first "WebAssembly compiler in a Tweet".
If you follow my own replies, at some point the compiler stopped being the simplest one and became one that could do something useful.
The most natural thing to compile if I wanted to keep the code small was an RPN (Reverse Polish Notation) calculator, and that's what I started playing with on July 7th 2023 with a 269 byte implementation.
The reason to pick RPN is that the order of tokens in the code appear in the same order as WebAssembly encodes its bytecode, because of this the "compiler" is just a map
from
tokens to bytecode.
From the initial calculator together with Patrick Dubroy we started code golfing to make it smaller, every now and then we would come back and shave some bytes, the last improvements were done some weeks ago with the help of orthoplex who provided many tricks that got the final implementation at 192 bytes.
In this post we will "undo" one trick at a time and explain it. We hope that in the process you learn something about WebAssembly, obscure JavaScript features and some new code golfing tricks.
But first let's see it in action:
What it does
If you paste "the compiler" in a JavaScript REPL like node
, deno
, bun repl
or the browser's development console:
let c=(b,l)=>WebAssembly.instantiate(new Int8Array([,97,115,109,1,,,,1,5,1,96,,1,127,3,2,1,,7,4,1,,,,10,l=(b=b.split` `.flatMap(t=>t>-1?[65,t]:107+'-*/'.indexOf(t))).length+4,1,l-2,,...b,11]))
And then evaluate an expression like the following:
(await c('11 11 1 - + 4 * 2 /')).instance.exports['']()
It will return the result of evaluating the provided expression, in this case it will return 42
.
The second line compiles the expression '11 11 1 - + 4 * 2 /'
to a WebAssembly module and calls the exported function (whose name is the empty string) to get the result of the calculation back.
The code for all steps and a test for each step is available on stackblitz here: 192-byte Wasm Compiler in JS @ StackBlitz
Let's see how it works by deobfuscating it one step at a time.
The Code
We start with the original code (line breaks added):
let c=(b,l)=>WebAssembly.instantiate(new Int8Array(
[,97,115,109,1,,,,1,5,1,96,,1,127,3,2,1,,7,4,1,,,,10,
l=(b=b.split` `.flatMap(t=>t>-1?[65,t]:107+'-*/'.indexOf(t)))
.length+4,1,l-2,,...b,11]))
Format
The first thing we can do to make it more readable is to format it:
let c1 = (b, l) =>
WebAssembly.instantiate(
new Int8Array([
, 97, 115, 109, 1, , , , 1, 5, 1, 96, , 1, 127, 3, 2, 1, , 7, 4, 1, , , , 10,
(l = (b = b.split` `.flatMap(
(t) => t > -1 ? [65, t] : 107 + "-*/".indexOf(t)
)).length + 4),
1, l - 2, , ...b, 11
])
);
While still unreadable we can at least identify different parts of the code.
Remove Assignment Expression
The first trick to undo is the assignment expression.
In JavaScript the assignment operator is an expression, this means that it generates a result after evaluating, you can see it in the following examples:
let a, b;
console.log('a', a = 42);
a = b = 43;
console.log('b', b);
The code above will output:
a 42
b 43
This is because a = 42
assigns 42
to a
and the whole assignment expression evaluates to the value being assigned.
In a = b = 43
we assign the result of evaluating b = 43
to a
, this equivalent expression may be easier to understand: a = (b = 43)
.
In our code we use this trick to reuse variables and update their value in places where we can also use the value being assigned.
This trick allows us to have our compiler in a single expression, avoiding the need for curly braces, semicolons and return statements.
To undo it we turn the body of our function into a block and do each assignment in its own line:
let c2 = (b, l) => {
b = b.split` `.flatMap(
(t) => (t > -1 ? [65, t] : 107 + "-*/".indexOf(t))
);
l = b.length + 4;
return WebAssembly.instantiate(
new Int8Array([
, 97, 115, 109, 1, , , , 1, 5, 1, 96, , 1, 127, 3, 2, 1, , 7, 4, 1, , , ,
10, l, 1, l - 2, , ...b, 11
]),
);
};
Undo Variable Tricks
Now the assignments are easier to identify but the meaning of variables and function arguments are still hard to understand, let's fix that by undoing a couple of variable tricks.
The first one is to stop using single letter variables and give them descriptive names.
The next one is to stop reusing variables, in our case b
initially holds the code to compile but once we don't need that any more we reuse it to contain the bytecode instructions.
To undo this we are going to introduce a new instrs
variable and rename b
to code
.
We will also rename the variable formerly known as l
to len
, this variable contains a value that is close to the number of bytecodes.
By declaring l
in the body we can remove it from the function argument's list, we did this
as a trick to avoid the need to declare it with let
or const
, saving some bytes and the need for a function body.
The trick works by adding unused arguments at the end of the function argument list and using them as local variables.
Our compiler function expects a single argument with the code, l
is there for us to use since we don't expect the caller to provide any value for it.
Here's the code without this tricks:
let c3 = (code) => {
const instrs = code.split` `.flatMap(
(t) => (t > -1 ? [65, t] : 107 + "-*/".indexOf(t))
);
const len = instrs.length + 4;
return WebAssembly.instantiate(
new Int8Array([
, 97, 115, 109, 1, , , , 1, 5, 1, 96, , 1, 127, 3, 2, 1, , 7, 4, 1, , , ,
10, len, 1, len - 2, , ...instrs, 11
]),
);
};
Add Missing Zeros
If you looked at the array in our code you may have noticed that there are many commas followed by another comma instead of a value, this syntax allows to define "sparse arrays", let's see an example:
const a1 = [,,];
console.log(a1.length); // Output: 2
console.log(a1); // Output: [ <2 empty items> ]
Which is equivalent to:
const a2 = new Array(2);
console.log(a2.length); // Output: 2
console.log(a2); // Output: [ <2 empty items> ]
We use this syntactic trick to save one byte each time we need to put a 0
in the array of bytes.
This works because Typed Arrays coerce all array items to numbers:
new Int8Array([0, null, undefined,,0])
Produces:
Int8Array(5) [ 0, 0, 0, 0, 0 ]
Let's undo this trick by adding all zeroes back:
let c4 = (code) => {
const instrs = code.split` `.flatMap(
(t) => (t > -1 ? [65, t] : 107 + "-*/".indexOf(t))
);
const len = instrs.length + 4;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, len, 1, len - 2, 0, ...instrs, 11
]),
);
};
Remove + 4
on Length Definition
In our code we have a variable len
that contains a number that is close to the number
of bytecodes in the compiled expression but not exactly:
const len = instrs.length + 4;
This comes from the original code, where we chain assignment expressions to reassign b
to the bytecodes and then chain another assignment with the length of b
plus 4
, we do that
in the place where we use the result of that calculation and keep it in l
to later calculate a similar number l - 2
, which is the same as b.length + 2
.
This is all a trick to avoid having to write b.length
twice.
[..., l=(b=b.split` `.flatMap(t=>t>-1?[65,t]:107+'-*/'.indexOf(t))).length+4,1,l-2, ...]
The two places where we use l
are conceptually instrs.length + 4
and instrs.length + 2
so let's write something closer to that:
let c5 = (code) => {
const instrs = code.split` `.flatMap(
(t) => (t > -1 ? [65, t] : 107 + "-*/".indexOf(t))
);
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
Remove String Template Literal Instead of Function Call
The next trick to undo is code.split` `
, in this case we use the Tagged Template feature of String Template Literals.
Let's see how it works by creating a simple tagged template that turns the string to uppercase:
function upper(s) {
return s[0].toUpperCase();
}
And use it:
upper`Hello, World!`
> "HELLO, WORLD!"
As you can see the first argument to the tagged template function is an array, but luckily for us the first argument of String.prototype.split is handled in the following way:
All values that are not undefined or objects with a
[Symbol.split]()
method are coerced to strings."
And coercing an array with one string in it is the same as the string itself:
["hello"].toString()
> "hello"
[" "].toString()
> " "
Since the function we want to call takes a single string argument we can use it as a tagged template and save the parenthesis in the function call.
Let's write it as a function call instead:
let c6 = (code) => {
const instrs = code.split(' ').flatMap(
(t) => (t > -1 ? [65, t] : 107 + "-*/".indexOf(t))
);
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
Remove the Ternary Operator
Next, let's undo the Ternary Operator and turn it into an if statement.
The ternary operator has expressions on each branch saving us the return
statements.
The code using an if statement:
let c7 = (code) => {
const instrs = code.split(" ").flatMap((t) => {
if (t > -1) {
return [65, t];
} else {
return 107 + "-*/".indexOf(t);
}
});
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
Remove Number Check With Coercion
The next trick to undo is the one present twice in the following code:
if (t > -1) {
return [65, t];
}
First we use coercion in t > -1
to check if the token t
is a string representing a
positive number.
Then we use coercion again in [65, t]
to let JavaScript turn t
into a Number
in
the Int8Array
:
new Int8Array([65, '42'])
Evaluates to:
Int8Array(2) [ 65, 42 ]
Let's write the parsing and checking explicitly:
let c8 = (code) => {
const instrs = code.split(" ").flatMap((t) => {
const num = parseInt(t, 10);
if (Number.isFinite(num)) {
return [65, num];
} else {
return 107 + "-*/".indexOf(t);
}
});
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
The semantics of our compiler change a little bit here, the original version will only accept
positive integers as input, if you want a negative number you have to subtract from zero: 0 - 1
to get -1
.
The new version allows negative numbers since it checks with Number.isFinite(num)
instead of t > -1
.
Remove indexOf -1 Trick
The next trick is in the else brach:
return 107 + "-*/".indexOf(t);
Our calculator compiler only accepts four arithmetic operations: +
, -
, *
and /
but
in the code above you can only see three: -*/
and a magical number 107
, here's how it works, these are the bytecode numbers for arithmetic operations in WebAssembly:
+
:106
-
:107
*
:108
/
:109
We enter this branch if the token t
is not a number, then it can only be
one of the arithmetic operators above, we need to replace the token with the arithmetic symbol with its bytecode, since the first one (+
) is 106
we could do 106 + "+-*/".indexOf(t)
, that is, we find the symbol's index in the string:
+
:0
-
:1
*
:2
/
:3
And add 106
to it to get the bytecode number.
But when t
is not in the string "+-*/"
indexOf
returns -1
. We can use that -1
to mean "plus or any other token":
+
:-1
(any other token will be-1
too)-
:0
*
:1
/
:2
And that's why we add 107
instead of 106
, let's undo the -1
trick:
let c9 = (code) => {
const instrs = code.split(" ").flatMap((t) => {
const num = parseInt(t, 10);
if (Number.isFinite(num)) {
return [65, num];
} else {
return 106 + "+-*/".indexOf(t);
}
});
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
Here again the semantics change a little bit, before if the token t
wasn't found then the expression would evaluate to 107 + -1
which would map to an addition, now it will evaluate to 106 + -1
which will map to bytecode 105
which is the popcnt
instruction.
But don't worry, we will fix it in the next step.
Remove indexOf Trick
After explaining how the indexOf
trick works and removing the -1
part, let's
proceed to remove the trick completely.
To do it we are going to create an object that maps from arithmetic operation token to its bytecode:
const OP_TO_BYTECODE = {
"+": 106,
"-": 107,
"*": 108,
"/": 109,
};
let c10 = (code) => {
const instrs = code.split(" ").flatMap((t) => {
const num = parseInt(t, 10);
if (Number.isFinite(num)) {
return [65, num];
} else {
return OP_TO_BYTECODE[t] ?? 106;
}
});
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 4, 1, 0, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
To keep the initial semantics, if the token is not a valid operation we return the bytecode for +
: in OP_TO_BYTECODE[t] ?? 106
.
Remove the Empty Export Name
From the usage example at the beginning of the post you may have noticed that the exported function's name is the empty string:
(await c('11 11 1 - + 4 * 2 /')).instance.exports['']()
We did this to save us the bytes needed to specify the export name but also to save an extra byte because with the length of the export name being 0
we can use the sparse array trick to leave that spot in the array empty.
To revert this trick we are going to name the exported function as a
, which in ASCII is the byte 97
.
const OP_TO_BYTECODE = {
"+": 106,
"-": 107,
"*": 108,
"/": 109,
};
let c11 = (code) => {
const instrs = code.split(" ").flatMap((t) => {
const num = parseInt(t, 10);
if (Number.isFinite(num)) {
return [65, num];
} else {
return OP_TO_BYTECODE[t] ?? 106;
}
});
const len = instrs.length;
return WebAssembly.instantiate(
new Int8Array([
0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 127, 3, 2, 1, 0, 7, 5, 1, 1, 97, 0, 0,
10, 4 + len, 1, 2 + len, 0, ...instrs, 11
]),
);
};
We can now call it with a nicer name:
(await c11('11 11 1 - + 4 * 2 /')).instance.exports.a()
Implicit Design Decisions
Our initial implementation only supported positive numbers, but it's not the only number restriction in our compiler.
To keep WebAssembly modules as small as possible, numbers are encoded using a variable-length encoding algorithm called LEB128.
You can tell we are not implementing the whole algorithm by looking at the part of the code that encodes numbers: [65,t]
.
This is because we are assuming the number being encoded fits in 7 bits, the shortest possible LEB128 representation.
Let's try the limits of our implementation:
(await c('63')).instance.exports['']();
> 63
(await c('64')).instance.exports['']();
> -64
This means the only numbers that will be parsed correctly are from 0
to 63
.
(await c('127')).instance.exports['']();
> -1
(await c('128')).instance.exports['']();
Fails with:
Uncaught CompileError: WebAssembly.instantiate(): Compiling function #0 failed: function body must end with "end" opcode @+33
In the last one we went over the 7 bits and the module was rejected during validation.
Explaining and implementing LEB128 takes a lot of text and code, if you want to read more about it we have a whole deep dive on LEB128 in our book.
A Trick that Almost Worked
During the code golfing phase I had a literal shower thought but sadly it didn't work.
The idea was to simplify 106 + "+-*/".indexOf(t)
by using the ASCII character code plus an offset like this: 63+t.charCodeAt()
and saving 3 bytes in the process.
The reason for why it didn't work is because the characters +-*/
don't appear in the same order in ASCII and WebAssembly bytecode.
Explaining the Numbers in the Array
The last part to expand/explain is the array of numbers used to build the WebAssembly module.
It takes a big part of a specification to explain every byte in the array but here's a commented version that should give you a high level idea of what each part does:
[
// Wasm module magic number '\0asm'
[0, 97, 115, 109],
// Wasm version 1.0
[1, 0, 0, 0],
// ----- type section -----
1, // Section identifier
5, // Section size in bytes
1, // Number of entries that follow
// type section - entry 0
96, // Type `function`
0, // Number of parameters
1, // Number of return values
127, // return type i32
// ----- function section -----
3, // Section identifier
2, // Section size in bytes
1, // Number of entries that follow
// function section - entry 0
0, // Index of the type section entry
// ----- export section -----
7, // Section identifier
5, // Section size in bytes
1, // Number of entries that follow
// export section - entry 0
1, // Name size in bytes
97, // String as utf-8 bytes for 'a'
0, // Export type `function`
0, // Function Index
// ----- code section -----
10, // Section identifier
4 + len, // Section size in bytes
1, // Number of entries that follow
// code section - entry 0
2 + len, // Entry size in bytes
0, // Number of local variables
...instrs,
11, // `end` instruction
]
Next Steps
We could extend our compiler to support numbers larger than 7 bits, handle errors better, have nicer syntax, conditionals, loops, variables and more.
If you would like to know how you may want to check our book: WebAssembly from the Ground Up.