I finally understand this! Why did every other video make it so confusing!

chrishare · on April 15, 2024

It is confusing, 3b1b is just that good.

visarga · on April 15, 2024

At the same time it feels extremely simple

attention(Q, K, V) = softmax(Q K^T / √d_K) @ V

is just half a row; the multi-head, masking and positional stuff are just toppings
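To make the "half a row" point concrete, here is a minimal sketch of the formula above in dependency-free Python. The function names and the `causal` flag are my own choices for illustration, not taken from any library, and a real implementation would use batched tensor operations instead of lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_K)) @ V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # The masking "topping": position i may not look at later positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# With the causal mask, the first token can only attend to itself.
print(attention(Q, K, V, causal=True)[0])  # [1.0, 2.0]
```

Everything outside `attention` is bookkeeping; the core really is one scaled dot product, one softmax, and one weighted average.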

we have many basic algorithms in CS that are more involved, it's amazing we get language understanding from such simple math

diedyesterday · on April 17, 2024

Do not be fooled by the simplicity; the magic itself is in the many Q, K and V matrices (each of which is huge) which are learned and depend on the language(s). This is just the form of the application of those matrices/transformations: making the embedding for the last token of a context "attend to" (hence attention) all information (at all layers of meaning and not just syntactic or semantic meaning but logical, scientific, poetic, discoursal, etc. => multi-head attention) contained in the context so far.
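A rough sketch of where those matrices enter, in dependency-free Python. Everything here is an assumption for illustration: the helper names, the tiny sizes, and above all the random weights, which stand in for values a real model would learn during training. It also computes the output for every token, not only the last one.

```python
import math
import random

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    # Stand-in for a trained weight matrix; real models learn these values.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head has its own (W_q, W_k, W_v), so it can attend to different things.
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the heads' outputs per token, then mix them with W_o.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, n_heads, d_head = 8, 2, 4  # toy sizes, far smaller than any real model
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)
X = random_matrix(5, d_model, rng)  # 5 token embeddings
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # 5 8
```

The code is the same few lines no matter what the weights are; whatever "understanding" there is lives in the numbers inside `heads` and `W_o`, which is the point being made above.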

Any complex function can be made to look simple in some representation (e.g., its Fourier series or Taylor series).

bilsbie · on April 15, 2024

For me I never had too much trouble understanding the algorithm. But this is the first time I can see why it works.

ur-whale · on April 15, 2024

> Why did every other video make it so confusing!

In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

I'm at the point where I'm starting to believe that pedagogy and research generally are mutually exclusive skills.

namaria · on April 15, 2024

It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.

thomasahle · on April 15, 2024

I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

I'm curious what things other videos did worse compared to 3b1b?

bilsbie · on April 15, 2024

I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.

thinkingtoilet · on April 15, 2024

Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.

Al-Khwarizmi · on April 15, 2024

Not sure if you mean it as a rhetorical question but I think it's an interesting question. I think there are at least three reasons why most people are confused about Transformers:

1. The standard terminology is "meh" at best. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly possible to write a hugely impactful, transformative paper describing an amazing breakthrough and just not explain it very well. And that's exactly what happened, IMO.

3. The way in which these architectures were discovered was largely by throwing things at the wall and seeing what stuck. There is no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.

Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.

cmplxconjugate · on April 15, 2024

>This is probably an unpopular opinion.

I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until they were reformulated in vector calculus that they became infinitely more digestible.

I think though your third point is the most important; right now people are focused on results.

maleldil · on April 15, 2024

> This is probably an unpopular opinion

There's a reason The Illustrated Transformer[1] was/is so popular: it made the original paper much more digestible.

[1] https://jalammar.github.io/illustrated-transformer/

Solvency · on April 15, 2024

Because:

1. good communication requires an intelligence that most people sadly lack

2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

most documents are written by authors subconsciously desperate to mentally flex on their peers.

penguin_booze · on April 15, 2024

Pedagogy requires empathy, to know what it's like to not know something. Good teachers will often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to the mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.

Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help, either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this...") etc.

WithinReason · on April 15, 2024

2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.