Do not be fooled by the simplicity: the magic is in the many Q, K and V matrices (each of which is huge), which are learned and depend on the language(s). This is just the form in which those matrices/transformations are applied: making the embedding for the last token of a context "attend to" (hence attention) all the information contained in the context so far, at all layers of meaning (not just syntactic or semantic, but logical, scientific, poetic, discoursal, etc., hence multi-head attention).
Any complex function can be made to look simple in some representation (e.g. its Fourier series or Taylor series, etc.).
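To make "the form of the application" concrete, here is a rough single-head sketch in plain Python of the last token attending to the whole context. Everything in it is an assumption for illustration: the function names (`matvec`, `softmax`, `attend_last_token`), the toy 2-dimensional embeddings, and the identity matrices standing in for learned Q, K and V weights. Real models use huge learned matrices, many heads, and many layers.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_last_token(embeddings, W_q, W_k, W_v):
    """Single-head attention output for the last token of the context."""
    # The last token asks the question (query)...
    query = matvec(W_q, embeddings[-1])
    # ...and every token in the context offers a key and a value.
    keys = [matvec(W_k, e) for e in embeddings]
    values = [matvec(W_v, e) for e in embeddings]
    # Scaled dot product between the query and each key.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy three-token context with made-up 2-d embeddings (hypothetical numbers).
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity matrices stand in for the learned W_q, W_k, W_v (an assumption).
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, output = attend_last_token(context, identity, identity, identity)
```

With these stand-in matrices the last token puts the most weight on itself and equal weight on the other two tokens. The code really is this short; all the interesting behavior would come from the learned matrices, which is the point above.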
In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.
I'm at the point where I'm starting to believe that pedagogy and research are generally mutually exclusive skills.
It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.
I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.
I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.
Not sure if you mean it as a rhetorical question, but I think it's an interesting one. I think there are at least three reasons why most people are confused about Transformers:
1. The standard terminology is "meh" at best. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started on "key" and "value".
2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly possible to write a hugely impactful, transformative paper describing an amazing breakthrough and still not explain it very well. And that's exactly what happened, IMO.
3. These architectures were discovered largely by throwing things at the wall and seeing what stuck. There was no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.
Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.
I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until the equations were reformulated in vector calculus that they became infinitely more digestible.
I think though your third point is the most important; right now people are focused on results.
Pedagogy requires empathy, to know what it's like to not know something. Good teachers will often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.
Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this..."), etc.
2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.