llama.cpp is orders of magnitude easier. Rather than controlling generation token by token, with an imperative statement for each, we create a static grammar that describes, e.g., a JSON schema.
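For reference, here's a trimmed-down, illustrative sketch of what such a grammar looks like in llama.cpp's GBNF format (the grammars/json.gbnf that ships with the repo is more complete; this version skips string escapes and exponents):

    root   ::= object
    value  ::= object | array | string | number | "true" | "false" | "null"
    object ::= "{" ws ( pair ("," ws pair)* )? "}" ws
    pair   ::= string ":" ws value
    array  ::= "[" ws ( value ("," ws value)* )? "]" ws
    string ::= "\"" [^"]* "\"" ws
    number ::= "-"? [0-9]+ ("." [0-9]+)? ws
    ws     ::= [ \t\n]*

You pass that to the sampler and it only ever emits tokens that keep the output inside the grammar.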
I'm honestly unsure what this offers over that, especially because I'm one of 3 groups with a WASM llama.cpp, and you can take it from me: you don't want to use it. (~3 tokens/sec with a 3B model on an M2 Max/Ultra/whatever they call the top of the line for the MBP. That's about 2% of the performance of Metal, and I'd bet 10% of running on the CPU without WASM. And there's no improvement in sight.)
I don't think the key idea here is to run llama.cpp itself in WASM - it's to run LLMs in native code, but have fast custom-written code from end-users that can help pick the next token. WASM is a neat mechanism for that because many different languages can use it as a compile target, and it comes with a robust sandbox by default.
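As a rough illustration of the controller idea (this is not AICI's actual interface, just a toy stand-in with made-up token ids): inference runs natively, the host hands the controller the logits each step, and the controller decides which tokens are acceptable. Something like the following, compiled to WASM, is all that has to run in the sandbox:

    # Toy controller sketch, not a real API: constrain the next token to a
    # hypothetical set of digit token ids.
    import numpy as np

    DIGIT_TOKENS = {15, 16, 17}  # hypothetical token ids for "0", "1", "2"

    def allowed_token_mask(vocab_size: int) -> np.ndarray:
        """Boolean mask over the vocabulary; True = token may be sampled."""
        mask = np.zeros(vocab_size, dtype=bool)
        mask[list(DIGIT_TOKENS)] = True
        return mask

    def pick_next_token(logits: np.ndarray) -> int:
        """Greedy pick among allowed tokens only."""
        masked = np.where(allowed_token_mask(len(logits)), logits, -np.inf)
        return int(np.argmax(masked))

The point is that this per-step logic is arbitrary user code, not a fixed grammar formalism baked into the inference engine.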
It's only the controller that runs in Wasm, not the inference.
The pyctrl is just a sample controller; you can write a controller that takes any kind of grammar (e.g., a yacc grammar [0] - the Python code in that example is only used for gluing).
Llama.cpp grammars were quite slow in my testing (20ms per token or so, compared to 2ms for the yacc grammar referenced above).
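The usual way to get per-token cost down (and, I assume, roughly what a compiled-grammar controller does, though I haven't checked the yacc example's internals) is to do the grammar work up front: compile the grammar to an automaton and precompute, for each state, a boolean mask over the vocabulary. Each decoding step is then a table lookup plus masking instead of re-checking candidate tokens against a parser. A minimal sketch, with all inputs hypothetical:

    import numpy as np

    class CompiledGrammarSampler:
        def __init__(self, transitions, state_masks):
            # transitions: dict mapping (state, token_id) -> next state
            # state_masks[s]: precomputed bool array over the vocab, True where
            # appending that token keeps the parse valid
            self.transitions = transitions
            self.state_masks = state_masks
            self.state = 0

        def constrain(self, logits):
            # Per step: just an array lookup + mask, no parsing.
            return np.where(self.state_masks[self.state], logits, -np.inf)

        def advance(self, token_id):
            self.state = self.transitions[(self.state, token_id)]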