
I finally understand this! Why did every other video make it so confusing!


It is confusing, 3b1b is just that good.


At the same time it feels extremely simple

attention(Q,K,V) = softmax(Q K^T / √d_K) @ V

is just half a row; the multi-head, masking and positional stuff are just toppings

we have many basic algorithms in CS that are more involved, it's amazing we get language understanding from such simple math
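To make the "half a row" concrete, here is a minimal sketch of that formula in plain Python, assuming Q, K and V are lists of row vectors. The function names `softmax` and `attention` are illustrative choices, not taken from any library, and a real implementation would use a tensor library and batching.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_K)) @ V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_K).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# A query that matches both keys equally averages the two value vectors.
result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

With an all-zero query both scores are zero, so the weights are 0.5 each and the output is the plain average of the two value rows.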


Do not be fooled by the simplicity; the magic itself is in the many Q, K and V matrices (each of which is huge) which are learned and depend on the language(s). This is just the form of the application of those matrices/transformations: making the embedding for the last token of a context "attend to" (hence attention) all information (at all layers of meaning and not just syntactic or semantic meaning but logical, scientific, poetic, discoursal, etc. => multi-head attention) contained in the context so far.

Any complex function can be made to look simple in some representation (e.g. its Fourier series or Taylor series, etc.).
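As a rough illustration of where those matrices enter, here is a small multi-head sketch in plain Python. Everything in it is an assumption made for the example: the sizes are tiny, the weights are random stand-ins for learned parameters, the helper names are made up, and the output projection and causal mask used in real transformers are left out.

```python
import math
import random

def matmul(A, B):
    # Plain row-by-column matrix product on lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    # Stand-in for learned parameters; a trained model would supply these.
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads):
    """X: token embeddings; heads: list of (W_q, W_k, W_v) triples, one per head."""
    per_head = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                for W_q, W_k, W_v in heads]
    # Concatenate each token's per-head outputs side by side.
    return [sum((head[i] for head in per_head), []) for i in range(len(X))]

rng = random.Random(0)
d_model, d_head, n_heads = 8, 4, 2
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
X = random_matrix(5, d_model, rng)  # 5 tokens, each an 8-dimensional embedding
out = multi_head_attention(X, heads)
```

The attention function itself never changes; each head only differs in the three projection matrices it applies to the same embeddings, which is where the learned behavior lives.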


For me I never had too much trouble understanding the algorithm. But this is the first time I can see why it works.


> Why did every other video make it so confusing!

In my experience, with very few notable exceptions (e.g. Feynman), researchers are the worst when it comes to clearly explaining to others what they're doing.

I'm at the point where I'm starting to believe that pedagogy and research generally are mutually exclusive skills.


It's extraordinarily difficult to imagine how it feels not to understand something. Great educators can bridge that gap. I don't think it's correlated with research ability in any way. It's just a very rare skill set, to be able to empathize with people who don't understand what you do.


I'm someone who would love to get better at making educational videos/content. 3b1b is obviously the gold standard here.

I'm curious what things other videos did worse compared to 3b1b?


I think he had a good, intuitive understanding that he wanted to communicate and he made it come through.

I like how he was able to avoid going into the weeds and stay focused on leading you to understanding. I remember another video where I got really hung up on positional encoding and I felt like I couldn't continue until I understood that. Or other videos that overfocus on matrix operations or softmax, etc.


Grant has a gift of explaining complicated things very clearly. There's a good reason his channel is so popular.


Not sure if you mean it as rhetorical question but I think it's an interesting question. I think there are at least three factors why most people are confused about Transformers:

1. The standard terminology is "meh" at most. The word "attention" itself is just barely intuitive, "self-attention" is worse, and don't get me started about "key" and "value".

2. The key papers (Attention is All You Need, the BERT paper, etc.) are badly written. This is probably an unpopular opinion. But note that I'm not diminishing their merits. It's perfectly possible to write a hugely impactful, transformative paper describing an amazing breakthrough and yet not explain it very well. And that's exactly what happened, IMO.

3. The way in which these architectures were discovered was largely by throwing things at the wall and seeing what stuck. There was no reflection process that ended in a prediction that such an architecture would work well, which was then empirically verified. It's empirical all the way through. This means that we don't have a full understanding of why it works so well; all explanations are post hoc rationalizations (in fact, lately there is some work implying that other architectures may work equally well if tweaked enough). It's hard to explain something that you don't even fully understand.

Everyone who is trying to explain transformers has to overcome these three disadvantages... so most explanations are confusing.


>This is probably an unpopular opinion.

I wouldn't say so. Historically it's quite common. Maxwell's EM papers used such convoluted notation that they are quite difficult to read. It wasn't until they were reformulated in vector calculus that they became infinitely more digestible.

I think though your third point is the most important; right now people are focused on results.


> This is probably an unpopular opinion

There's a reason The Illustrated Transformer[1] was/is so popular: it made the original paper much more digestible.

[1] https://jalammar.github.io/illustrated-transformer/


Because:

1. good communication requires an intelligence that most people sadly lack

2. because the type of people who are smart enough to invent transformers have zero incentive to make them easily understandable.

most documents are written by authors subconsciously desperate to mentally flex on their peers.


Pedagogy requires empathy, to know what it's like to not know something. Good teachers often draw on experiences the listener is already familiar with, and then bridge the gap. This skill is orthogonal to the mastery of the subject itself, which I think is the reason most descriptions sound confusing, inadequate, and/or incomprehensible.

Often, the disseminating medium is one-sided, like a video or a blog post, which doesn't help, either. A conversational interaction would help the expert sense why someone outside the domain finds the subject confusing ("ah, I see what you mean"...), discuss common pitfalls ("you might think it's like this... but no, it's more like this...") etc.


2. It's not malice. The longer you have understood something the harder it is to explain it, since you already forgot what it was like to not understand it.



