Each of those is great, and more sophisticated strategies are being developed all the time: incorporating static analysis at inference time to ensure code is more than just syntactically correct [0]; multi-LLM-call agent-based frameworks [1]; decomposing hard tasks into many "easier" LLM tasks [2]; analyzing internal activations to identify potential hallucinations [3]; activation steering to control generations [4]; and "Language Model Arithmetic" to compose biases for style, vocabulary, etc. [5].
But each also either requires tight integration with an LLM inference/serving engine (to access its low-level internals), or adds a lot of overhead (many individual LLM calls) without that integration.
The AI Controller Interface creates an abstraction layer that exposes those low-level primitives, so that all of the above strategies can be implemented without each one needing to dive into the internals of every LLM engine. AICI doesn't yet support all the necessary primitives for all of these (e.g., we don't know what the right way to represent internal activations is), and not everything will end up fitting in a WASM module anyway.
It's a start at thinking about a new layer in the inference stack.
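As a purely illustrative sketch (emphatically not AICI's actual API), here's the flavor of low-level primitive such a layer might expose: a controller that sees the tokens generated so far and vetoes candidate next tokens before the engine samples.

```python
# Hypothetical controller primitive -- NOT AICI's real API, just a sketch
# of the abstraction-layer idea. The controller never touches engine
# internals; it only returns a per-step token mask.
from typing import Protocol, Sequence

class Controller(Protocol):
    def allowed_tokens(self, prefix: Sequence[int]) -> set[int]:
        """Token ids permitted at this step, given the prefix so far."""
        ...

def mask_step(logits: list[float], prefix: list[int], ctrl: Controller) -> list[float]:
    # The engine applies the controller's mask, then samples from the
    # renormalized remainder.
    allowed = ctrl.allowed_tokens(prefix)
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]
```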
Kind of. The usual mechanism in those sorts of libraries is a grammar-informed token filter. You get an output adhering to the grammar (or an error), but the mechanism works by reshaping the intermediate conditional probabilities.
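A minimal sketch of that mechanism (hand-rolled for illustration; real libraries compile the grammar into a token-level automaton):

```python
import math, random

def sample_filtered(logits: dict[str, float], allowed: set[str]) -> str:
    """One step of a grammar-informed token filter: zero out tokens the
    grammar forbids, renormalize what's left, and sample from that."""
    weights = {t: math.exp(x) for t, x in logits.items() if t in allowed}
    if not weights:
        raise ValueError("grammar admits no continuation")
    r = random.uniform(0, sum(weights.values()))
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token  # floating-point edge case
```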
That's problematic because the entire point of an LLM (in most scenarios with non-trivial output sizes) is sampling from a calibrated probability distribution. The conditional structure of generation lets you factor that into per-token work (given the preceding text, what's the probability of the next token being token X?), but none of that holds even approximately once you start tweaking the sampling parameters. It's the same reason playing with "temperature" gives very bad results on a lot of complicated problems: the greedy, per-token procedure we're executing is no longer globally correct.
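To make that concrete, here's a toy two-step "model" (all numbers invented purely for illustration) where the grammar forbids the sequence "bb". Rejection sampling preserves the model's relative preferences among valid outputs; per-token filtering badly distorts them:

```python
# Toy two-step model over tokens {a, b}; the "grammar" forbids "bb".
P_first = {"a": 0.5, "b": 0.5}
P_second = {"a": {"a": 0.9, "b": 0.1},  # P(second | first = "a")
            "b": {"a": 0.1, "b": 0.9}}  # P(second | first = "b")

valid = ["aa", "ab", "ba"]  # everything except "bb"

# Strategy (1): rejection -- the model's own distribution conditioned on validity.
joint = {s: P_first[s[0]] * P_second[s[0]][s[1]] for s in valid}
z = sum(joint.values())
rejection = {s: p / z for s, p in joint.items()}

# Strategy (2): per-token filtering -- after a first "b", token "b" is
# masked and the remaining probability mass renormalized.
filtering = {
    "aa": P_first["a"] * P_second["a"]["a"],
    "ab": P_first["a"] * P_second["a"]["b"],
    "ba": P_first["b"] * 1.0,  # "b" masked, so "a" gets all the mass
}

print(rejection)  # {'aa': 0.8181..., 'ab': 0.0909..., 'ba': 0.0909...}
print(filtering)  # {'aa': 0.45, 'ab': 0.05, 'ba': 0.5}
```

The sequence "ba", where the model has started down the invalid path, jumps from roughly 9% to 50%: once the first "b" is emitted, filtering has no choice but to force compliance.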
For a small, illustrative example that I've seen: suppose you're trying to parse a freeform text input (podcast transcripts or something) into a tagged text input, squishing the result into a json schema or something so that you can actually use it in your end application. For longer text fields, many SOTA LLMs are prone to close the field early and append an ellipsis rather than flesh it out in its entirety. Constrained schema generation will have a high probability of closing the (broken) string and a low probability of continuing past the (incorrect) ellipsis. Contrast that with retrying until the result doesn't crash a parser: the model usually will not emit a closing quote after an ellipsis, so the parser crashes in all those cases where the output would have been invalid. With constrained generation the net result is that you still have an invalid result, but nothing crashed, so now you don't know that fact.
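To see why the crash is a feature, compare the two failure shapes (hypothetical strings, but the structure is the point):

```python
import json

# Unconstrained failure: the model trails off with "..." and never closes
# the string, so parsing crashes and the truncation is *detected*.
unconstrained = '{"transcript": "The guest went on to say...'
try:
    json.loads(unconstrained)
except json.JSONDecodeError:
    print("caught the truncation; retry")

# Constrained failure: the grammar forces the closing quote and brace, so
# the same truncated content now parses cleanly and sails through unnoticed.
constrained = '{"transcript": "The guest went on to say..."}'
print(json.loads(constrained))  # valid json, content still missing
```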
This particular library also suffers from that same flaw (though the flaw is important and common, so I think it's worth repeating myself about it once in a while), but it has two very nice features I'd like to touch on (contrasted with spec/grammar approaches), one of which partially mitigates that flaw:
1. You can interact with partial results via a Turing-complete language (see the sketch after this list). It's really messy to use weaker languages to encode "this one json field is probably broken if it ends in an ellipsis, but other strings probably can and should end that way", and
2. You have your results available directly in your language of choice without an additional parsing/extraction step. Resource-wise I don't know that it matters, but tons of simple problems are much more elegantly described as "I want these things" instead of "I want this composite object -> post-process that composite object".
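For point 1, here's the kind of per-field check that is trivial in a real language but painful in a grammar (the field names and length cutoff are hypothetical):

```python
# Long transcript-like fields ending in "..." are suspect; short fields
# are allowed to end that way. Keys and the threshold are made up.
LONG_FIELDS = {"transcript", "summary"}

def looks_truncated(key: str, value: str) -> bool:
    return key in LONG_FIELDS and len(value) > 80 and value.rstrip().endswith("...")

def suspect_fields(record: dict[str, str]) -> list[str]:
    """Names of fields that should probably trigger a retry."""
    return [k for k, v in record.items() if looks_truncated(k, v)]
```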
That said, using it for json in particular looks terrible, like you said (but you could presumably wrap one of the other json libraries with this and reap some of the benefits of both approaches).
> That's problematic because the entire point of an LLM
Could you detail what the results of that problematic behavior are? Do you suspect structured generation performs worse than unstructured?
> squishing the result into a json schema
The strategies I've seen for dealing with structured generation seem to be ensuring that the structure is represented in the prompt, so the model isn't squishing anything.
> For longer text fields, many SOTA LLMs are prone to close the field early and append an ellipsis rather than flesh it out in its entirety. Constrained schema generation will have a high probability of closing the (broken) string and a low probability of continuing past the (incorrect) ellipsis.
Part of structured generation is specifying exactly the range you want for the field, so I'm not sure how this issue arises. With structured generation you force the LLM to have at least some number of characters, so this seems like an argument in favor of structured generation rather than against. If unstructured LLMs want to "close the field", then structured can force it to remain open.
> Could you detail what the results of that problematic behavior are?
I'll contrast two sampling approaches (ignoring prompting and whatnot since that's orthogonal and can be applied to either): (1) repeat till the answer adheres to a grammar, and (2) filter the set of possible next tokens to those which could adhere to that grammar.
The former preserves the same relative probabilities among valid answers that the base model would have, whereas the latter has an unknown, not intuitively explainable, and wildly different distribution.
Anecdotally, one way that manifests is that when the model inevitably makes a grammar-admissible mistake on some token, it is then more likely to also violate the grammar than it otherwise would have been. Sampling strategy (1) throws that mistake away, whereas sampling strategy (2) silently forces the rest of the answer to comply, yielding an incorrect but "valid" answer. The long-string-ellipsis problem I described falls into that class of problems.
The specific behaviors are hard to qualitatively describe in full generality though because of the huge number of ways in which a "wrong" probability distribution can be wrong.
> Do you suspect structured generation performs worse than unstructured?
Yes, often, though not always. E.g., if your schema is enum{Red,Blue,Green}, you can prove that the two distributions are in fact equal, so structured generation is strictly better because it's cheaper. For specialized problems, structured generation is another meta-parameter, and despite having no intuitive explanation for _why_ it performs better, if it accidentally performs better for your particular problem then that's a huge win. Even if it increases the error rate, it's also cheaper than the alternatives (and most reasonable applications of an LLM assume a nontrivial error rate anyway, so that's not necessarily a huge cost), so it might be "better" for an application despite lower-quality results.
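For the enum case the proof is one line: filtering masks everything outside {Red, Blue, Green} and renormalizes, giving P(t) / (P(Red) + P(Blue) + P(Green)), while rejection sampling gives P(t | valid) = P(t) / P(valid), which is the same quantity. The equality relies on generation being a single step; with multiple steps the per-token renormalization no longer matches the true conditional, which is where the distortion above comes from.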
For complicated schemas though, I'd definitely at least want to measure the difference. Anecdotally, structured generation as a sampling procedure performs worse for me on complicated problems than unstructured generation.
> The strategies I've seen for dealing with structured generation seem to be ensuring that the structure is represented in the prompt, so the model isn't squishing anything.
"Squishing" was a bit of a colloquialism. For the vast majority of problems I've seen, fitting the grammar into the prompt, choosing a very very simple grammar, and filtering non-confirming responses is a pretty good approach. Your success rate is decently high (i.e., not much more expensive than sampling-based approaches), you get something sufficiently machine-readable to fit into your pipeline, and you have a sampling distribution matching the underlying LLM. For sampling-based approaches though, you skew the result distribution in the way described above, which I called "squishing".
> Part of structured generation is specifying exactly the range you want for the field, so I'm not sure how this issue arises. With structured generation you force the LLM to have at least some number of characters, so this seems like an argument in favor of structured generation rather than against. If unstructured LLMs want to "close the field", then structured can force it to remain open.
If you can enumerate all the failure modes (or likely failure modes), absolutely. My biggest counter-arguments are:
1. That's hard to enumerate in general (hence why we have LLMs instead of grammar-rule-machines).
2. Even for the ellipsis example, if you want to use one of those structured json libraries, how hard is it to require json and also require that "certain" strings (over X length, with certain corresponding keys, ...) can't end with `..."`? I haven't done it, but I have enough programming experience to be pretty sure it'll be a pain in the ass unless you want to fork the library for your use-case or re-write most of its json grammar.
3. Suppose you do get that constraint (running example: certain strings can't end via `..."`) into the grammar. What exactly does that do? These algorithms are greedy, and when the model first inserts a mistaken period, it has already made its mistake. Periods are allowed in strings, so the grammar continues. The model, having diverged from the text it's transcribing and inserted a period, inserts another period because it's obviously creating an ellipsis. This is still valid in the grammar, so text generation continues. If your grammar just banned ellipses, you'd get some other nonsense character or an end-of-string quote at this point; if you banned string-ending ellipses, you'd almost certainly get another period, at least one nonsense character, and then an end-of-string quote. Despite having banned the bad behavior, in all the cases where you would have seen it without structured generation you still get broken outputs with structured generation, and you don't know that they're broken. Contrast that with letting the string complete and retrying if it ends as `..."`: every time the model mistakenly adds a period trying to end the string, that is explicitly caught and tracked. It's still "structured", but the result distribution is different and better.
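Code-wise, the check at the end of that retry pipeline is a few lines; the truncation signature is the running `..."` example:

```python
# Post-hoc semantic check on an already-parsed result: output that is
# grammar-valid but shows the truncation signature gets retried anyway.
def has_truncated_strings(obj) -> bool:
    if isinstance(obj, str):
        return obj.rstrip().endswith("...")
    if isinstance(obj, dict):
        return any(has_truncated_strings(v) for v in obj.values())
    if isinstance(obj, list):
        return any(has_truncated_strings(v) for v in obj)
    return False
```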
Isn't this something that Outlines [0], Guidance [1], and others [2] already solved much more elegantly?
0. https://github.com/outlines-dev/outlines
1. https://github.com/guidance-ai/guidance
2. https://github.com/sgl-project/sglang