Looks good, but why not dump it as a flame graph?

thechao · on July 24, 2017

I'm not sure why you're being downvoted without comment: that sucks. I think the fact this project can show per-node costs of the plan is interesting---and certainly a killer feature---but I think it's only part of the story. I think the structure is just as important.

Unfortunately, flame graphs don't show (much) structure.

Now, as a second note, I do a lot of performance work (systems programming in stacks starting at hand-assembled machine code, working all the way up to scripting languages). Personally, I find flame graphs not as useful as a traditional (inverted-)back-trace. I think the structure encoded in a traditional (inverted-)back-trace tells you a lot about poorly designed algorithms and bad systems interactions in a way that flame graphs (mostly) erase.

fnord123 · on July 24, 2017

I'm not sure what you mean when you say flame graphs don't show much structure. They appear to show a lot of structure, so I'm surprised by the criticism. I haven't used them as much as I would have hoped, but I have found them useful. Otherwise I normally see @brendangregg banging on about how good they are, so could you elaborate on your criticisms?

Thanks

cormacrelf · on July 24, 2017

Flame graphs are based on high frequency linear sampling of a (at least conceptually) single thread of execution, where each sample records the stack.
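To make that sampling model concrete, here is a minimal Python sketch. The sample stacks and the names `fold` and `box_widths` are made up for illustration. It collapses sampled stacks into semicolon-joined lines with a count (similar in spirit to the "folded" input that flame graph tools consume) and then computes how many samples each box would cover.

```python
from collections import Counter

# Hypothetical samples: each tuple is one recorded stack, root frame first.
SAMPLES = [
    ("main", "parse", "read"),
    ("main", "parse", "read"),
    ("main", "parse", "tokenize"),
    ("main", "render"),
]


def fold(samples):
    """Collapse identical stacks into 'a;b;c' -> number of samples."""
    return Counter(";".join(stack) for stack in samples)


def box_widths(samples):
    """Count, for every call path, the samples whose stack starts with it.

    That count is what a flame graph draws as the width of the box.
    """
    widths = Counter()
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            widths[stack[:depth]] += 1
    return widths


folded = fold(SAMPLES)
widths = box_widths(SAMPLES)

# Print an indented view: deeper paths are indented further.
for path, count in sorted(widths.items()):
    indent = "  " * (len(path) - 1)
    print(f"{indent}{path[-1]}: {count}/{len(SAMPLES)} samples")
```

The only quantity a box carries is its sample count, and the only relationship it encodes is "was on the stack above".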

SQL query execution does not follow this structure. Query plans are executed in whatever order or degree of parallelism is appropriate. And linearising an SQL query plan is actually wrong. If you've ever watched SQL Server's Live Query Statistics play out, you'll know that the engine does whatever work it can when possible. It would be incorrect to think of the query execution as a stack the CPU can get stuck deep down in, and a flame graph would create an incorrect model.

As an exercise, where would you put self-time in a flame graph representation of an execution plan for a filtering node, or a sorting node? The engine doesn't report that, because filters happen inline (at least in batches) and sorts are optimised to happen progressively and can be scheduled in between IO handlers. Ask an expert about this, I'm not your guy.

Also, the data you want to see isn't represented by one dimension on the flame graph (how wide is this function call?), it is represented by many more: cost, CPU time, number of rows, IO operations, and estimations of all the above. The execution plan needs to be roomier, and include all this information, not merely wall clock time.

cormacrelf · on July 24, 2017

I should add, in my experience optimising queries, a time breakdown is generally the first step of many. Flamegraphs are great at telling you where in 1,000 invocations your time is spent, but your execution plan only has ~20 nodes max, and the percentages do that job well enough. At that point, you want to know why it's slow, and a flamegraph representation of those percentages won't help you with that.

cyphar · on July 24, 2017

I'm not batting for either side, but brendangregg did sort of invent flame graphs, so I'm not surprised that he finds them very good. [No offense to brendangregg, I've met him and he doesn't strike me as an egotistical person, but everyone does love their pet thing.]

felixge · on July 24, 2017

Do you have an example of a "traditional (inverted-)back-trace"?

fnord123 · on July 24, 2017

I think thechao means the stuff you get from perf where it looks a bit like the output of pstack. But I'd like a clarification too.

thechao · on July 24, 2017

The tooling I use records backtraces at some sampling frequency. The obvious sorting is from the root call site (_start), down. However, while that sorting shows the worst offending single backtrace, it can be a bit impractical once the top offenders (in terms of perf) have been fixed. Instead, I like to see an inverted tree: reverse every backtrace then create a tree. This method tends to show systemic issues.
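A rough Python sketch of the two orderings, assuming backtraces are stored root-first. The sample data and the helpers `build_tree` and `dump` are hypothetical, not the tooling mentioned above.

```python
# Hypothetical sampled backtraces, root call site first.
BACKTRACES = [
    ["_start", "main", "parse", "memcpy"],
    ["_start", "main", "render", "memcpy"],
    ["_start", "main", "render", "draw"],
    ["_start", "main", "io", "memcpy"],
]


def build_tree(backtraces, inverted=False):
    """Merge backtraces into a tree of sample counts.

    With inverted=True every backtrace is reversed first, so the tree is
    keyed by the leaf function and fans out towards its callers.
    """
    root = {"count": 0, "children": {}}
    for trace in backtraces:
        frames = reversed(trace) if inverted else trace
        root["count"] += 1
        node = root
        for frame in frames:
            node = node["children"].setdefault(
                frame, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


def dump(node, depth=0):
    """Render the tree as indented lines, heaviest child first."""
    lines = []
    children = sorted(node["children"].items(), key=lambda kv: -kv[1]["count"])
    for name, child in children:
        lines.append(f"{'  ' * depth}{child['count']:>3} {name}")
        lines.extend(dump(child, depth + 1))
    return lines


top_down = build_tree(BACKTRACES)
bottom_up = build_tree(BACKTRACES, inverted=True)
print("\n".join(dump(bottom_up)))
```

In the top-down tree the `memcpy` samples are split across three different branches; in the inverted tree they merge into a single top-level entry with its callers listed underneath, which is the kind of systemic pattern the inverted view is meant to surface.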

felixge · on July 24, 2017

I've looked into creating a flame graph visualization for PostgreSQL EXPLAIN output before, but CTEs that are scanned multiple times complicate things a bit. That's because the first scan on a CTE node will cause the CTE query to be executed, while further scans will simply replay the cached result-set from the first execution.

I'm sure these problems can be overcome, but it's a bit more complicated than I had imagined when I tried to do it myself. I also suspected nested-loop-joins on CTEs could further complicate the flame graph construction.
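A toy Python model of the CTE behaviour described above (first scan executes the query, later scans replay the cached rows). This is only an illustration of the attribution problem, not how PostgreSQL implements CTE scans; `ToyCTE`, `expensive_query`, and the scan names are invented.

```python
class ToyCTE:
    """Toy model: the first scan runs the query, later scans replay the cache."""

    def __init__(self, producer):
        self.producer = producer
        self.cache = None
        self.executions = 0      # how many times the CTE query actually ran
        self.executed_by = None  # which scan happened to trigger it

    def scan(self, scan_name):
        if self.cache is None:
            # Only the first scan pays for running the underlying query.
            self.executions += 1
            self.executed_by = scan_name
            self.cache = list(self.producer())
        # Every scan, including the first, returns rows from the cache.
        return list(self.cache)


def expensive_query():
    # Stand-in for the CTE's query.
    return [n * n for n in range(5)]


cte = ToyCTE(expensive_query)
first = cte.scan("outer side of join")
second = cte.scan("inner side of join")
print(cte.executions, cte.executed_by)
```

In a flame graph, the whole cost of the CTE would land under whichever scan happened to run first, while the other scans would look nearly free, even though all of them depend on the same work.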