
> I think the intent is to calculate medians on longer running streams of data than that. Why would you even use this algorithm for 5 values?

I don't know if you realize, but these kinds of comments are incredibly frustrating to respond to. It's like you didn't even try to understand the comment before replying. I was not saying "it's terrible because it easily fails on 5 elements". I was saying "it's terrible because it easily fails, and to illustrate, here's an example with 5 elements that will help you understand the problem I'm talking about". Does that make sense? Surely it's not rocket science to see that the number 5 wasn't special here? Surely you can see how, say, if your list starts at 2E9 and goes up to 3E9, the exact same problem would occur for a billion-element list, and hence that my gripe was probably not about 5-element lists?



It's an online algorithm. It's meant to be used on essentially infinite streams of data, such as a live dashboard of latencies for a running server. In that context, providing a 5-element list, or any finite list, as an example seems like a non sequitur.

Your examples do relate to problems the algorithm actually has in this application, but they manifest as things like "extremely long warmup time" and "adjusting to a regime change in the distribution taking time proportional to the size of the change". For instance, if your data is [1000, 1001, 1000, 1001, ...] then it takes 1000 steps to converge, which may be longer than the user has patience for.

However, the algorithm does always converge eventually, as long as the stream is not something pathological like an infinite sequence of consecutive numbers (for which the median is undefined anyhow).
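To make the warmup point concrete, here is a minimal sketch of the kind of 1-unit-memory estimator the thread appears to be discussing (Frugal-1U style); the start value of 0 and the unit step size are assumptions, not taken from the thread:

```python
def frugal_median(stream, estimate=0):
    # 1-unit-memory streaming median sketch: nudge the single
    # stored estimate by 1 toward each incoming value.
    for x in stream:
        if x > estimate:
            estimate += 1
        elif x < estimate:
            estimate -= 1
    return estimate

# The warmup example above: an alternating 1000/1001 stream.
# Starting from 0, every value exceeds the estimate, so it climbs
# by 1 per step and needs ~1000 observations before it is anywhere
# near the true median.
print(frugal_median([1000, 1001] * 600))
```

After the warmup, the estimate just oscillates within a unit of the true median, which is the steady-state behavior the convergence claim refers to.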


If you have a monotonically increasing stream of numbers, no algorithm is going to converge on a meaningful median, even if it's "correct" in a technical sense. This algorithm is intended to give a (very) cheap estimate for values with some kind of organic distribution (probably normal).

It doesn't have to be accurate for all possible edge cases because that's not the point of cheap estimation like this.


What I said has absolutely nothing to do with monotonicity. Pick any permutation of [2N, 2N + 1, ..., 3N - 1] for N as large as you want and you'll see the algorithm estimate the median as N, which is below even the minimum.
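A sketch of this failure mode, again assuming the 1-unit-memory estimator with a start value of 0 (both assumptions; the thread doesn't pin down the exact variant):

```python
import random

def frugal_median(stream, estimate=0):
    # 1-unit-memory estimator: step the estimate by 1 toward each value.
    for x in stream:
        if x > estimate:
            estimate += 1
        elif x < estimate:
            estimate -= 1
    return estimate

# A random permutation of [2N, 2N+1, ..., 3N-1]: N values, all >= 2N.
# The estimate can climb at most 1 per step, and every value stays
# above it, so after N steps it reaches exactly N -- below even the
# minimum of the data, regardless of the permutation chosen.
N = 1000
values = list(range(2 * N, 3 * N))
random.shuffle(values)
print(frugal_median(values))  # prints 1000; the true median is ~2.5N
```

Note the order genuinely doesn't matter here: since the estimate never catches up to the data, every permutation yields the same result, which is why the monotonicity objection misses the point.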


> [2N, 2N + 1, ..., 3N - 1]

This is practically the definition of monotonicity. (https://en.wikipedia.org/wiki/Monotonic_function)

You should read the paper, it doesn't have the problems you think it has when used on real world data.


The parent mentioned a permutation of those numbers. Does that affect your response?

Also note that the parent wasn't commenting on the algorithm in this HN submission, but rather the algorithm described in the top-level comment.


Quote from the paper cited in the post describing this algorithm:

> These algorithms do not perform well with adversarial streams, but we have mathematically analyzed the 1 unit memory algorithm and shown fast approach and stability properties for stochastic streams

Your criticism is pretty strident for pointing out something the authors apparently already understand and agree with.



