Curse of dimensionality

dnquark · on May 31, 2012

What finally made the curse of dimensionality "click" for me was recognizing that the volume of an n-sphere goes to zero as n grows large. In 2D, if I want to see which of my data points lie close to some reference point, I can draw a circle around my reference point and count the proportion of the data points inside. In 3D, I could do the same with a sphere. But in 20D, the volume of my hypersphere will be miniscule, and it will probably not even contain any points, so this procedure becomes useless for clustering. Note that I can't just pick the hypersphere of a larger radius: suppose all my points are contained inside a unit n-cube. The volume of the unit n-sphere, the largest one I could inscribe inside the n-cube will _still_ be approximately zero!

This result is pretty cool and somewhat surprising, and there's a neat and short derivation of it; in fact (shameless plug) I have a youtube video where I derive it in 3 minutes: http://www.youtube.com/watch?v=QkMn_5QpsX8 -- feel free to check it out!

mokus · on May 31, 2012

Very nice video. For those less able to follow the calculus, another way to get at least a little intuition about this result is to think about how far the farthest points in a unit cube are from the origin - in 20 dimensions, the point (1,1,...1,1) is sqrt(20) ~= 4.47 units from the origin, so the 19-sphere only "reaches" 1/sqrt(20) ~= 0.22 of the way to the corners of the 20-cube.

chuckbot · on May 31, 2012

Thank you, Sir! You are a scholar and a gentleman.

stiff · on May 31, 2012

Suppose you want to build a system that using some M features of the current weather (humidity, temperature, pressure etc.) will predict whether it will be sunny in 3 hours or not - to this end you manually collect a dataset of N measurements (samples) of those features together with the outcome 3 hours after taking the measurement. This dataset can be used to find a function that will map the current features of weather to a real variable from 0.0 to 1.0 - the likelihood of it being sunny in 3 hours given the samples we collected. Of course, since we cannot manually collect measurements of all possible combinations of features, we want to extend the domain of this function beyond just the combinations of features that occured in our measurements - we want to "generalize". The values of the function for those "generalized" inputs are our guesses about how the likelihood function behaves for measurements other than the ones we have taken. The guess for a given measurement is much more likely to be correct if we at least have _similar_ measurements in our collection.

Now, if there are too few features (M is too low), it might not be possible to do a good prediction. The basic aspect of the "curse of dimensionality" is that as the number M of features grows (linearly), the number of all possible combinations of the features grows exponentially and hence the number N of measurements you need to do to get a good representation of the whole space of possible feature combinations also grows exponentially. So either you have to collect much more measurements, or the function that is inferred will have to "guess" much more often, without having good evidence from the dataset in the form of similar cases.

What I refer to as "all possible combinations of features", can be thought as a high-dimensional space and hence it is customary and sometimes useful to use geometric language in analysing the problem.

arjunnarayan · on May 31, 2012

Here's my best attempt at a simple.wikipedia.org article. In accordance with both simple.wikipedia.org rules and the oft-repeated Einstein/Feynman/Michael Scott quotes about explaining things to 5-year olds being the best sign of understanding something yourself, I've kept the vocabulary simple: Feel free to remix and rework the text to improve it.

"Machine learning" is often the problem of taking a bunch of data and coming up with a simple summary. For example, you might want a machine to look at pictures (which are made up of millions of data points called "pixels") and come up with a single word definition of the subject (say "dog", or "cat", or "human", or "spoon", or "chair"). Now we will pick an even simpler problem to use as an example. We have collected data on two objects. We don't know what the objects are, but we do know that there are two-distinct objects in our data. We want our machine to _cluster_ (this is a specific important machine learning problem called _clustering_) the given input and tell us which ones are object 1 and which ones are object 2. So, for example, we could give it lots of data on dogs and cats, and we want it to separate the dog-data from the cat-data.

However sometimes when we have a lot of data, our algorithms stop working well. This is called the "curse of dimensionality". Consider a more abstract version of the problem than the one above. You see a group of points in 2-dimensions and you want to draw a single line through it that separates the points. For example, see this picture http://en.wikipedia.org/wiki/File:KMeans-density-data.svg I will refer to this picture many times, so keep it open on a tab to the side. The algorithm (the one used in the picture is called "K-Means" but we can ignore the specifics. The point here is that _any_ clustering algorithm will face the same issues, so just focus on the problem itself) has separated the points into two groups (red and blue) and drawn a line between them. The idea is we now have detected that the blues are different from the reds in some pattern of the "features" (which is what we call the 2 axes --- we have 2 "features").

Now we could try and "generalize" this algorithm --- that is, come up with some modification that works when we have lots of features. So lets consider three dimensions. Now first you will notice that a line doesn't "split" three dimensions. You need a "plane". A plane is a generalization of a line into 3-space. Visualize a room. A single pipe doesn't split the room in two. But a very large piece of stiff cardboard will. Similarly, when we are in higher dimensions we will talk of "hyperplanes". Notice that a plane sort of "exists" in a 2-d world of it's own (like the stiff cardboard), even though the cardboard "lives" in 3-d. This is not a coincidence. We can, in principle, use a (n-1) dimensional structure to separate cleanly a n-dimensional structure into two parts. While you cannot visualize this in higher dimensions, the n-1 separation principle is important. [2]

So since we are looking to do just that --- "cluster" our data into two separate parts separated by a single "hyperplane" --- we will try and draw one in 3-dimensions. So maybe it is easy. From the picture above, imagine that the points are in a 3-dimensional room. So we have the two dimensions as in the picture, but now there is a third one, say that is just completely useless --- so the points are randomly distributed on that dimension. To visualize, imagine you laid that figure on the floor in your room and the points were all at different random heights off the floor, suspended in the room. The important thing is that if you viewed the room from directly above, (and far enough away so that you couldn't see the heights at all) you would see the original 2-d picture in the k-means example picture). But from any other angle, you would see a bunch of extra noise clouding the clean separation that is visible in that one specific from-the-ceiling angle.

So already you can see when you have a single useless dimension it makes life difficult. You have to somehow figure out the right angle to see things so that it doesn't cloud your judgment and you can "see" the separation. This intuition is important, because that is what will happen to the algorithm.

Now we will talk about k-means, the specific algorithm used in that example to do clustering. K-means is a powerful algorithm. How k-means works is it compares the distance between points and some "centers" that it thinks "cluster" the data. So it guesses two centers, and then draws a dividing line between the two (at the mid point). You can see this in the example picture. There is a blue center and a red center and a line that divides.

K-means then computers a "average" (it uses a little more fancy metric, but "average" will suffice for our intuition here) of the distances of all the points on it's side of the line to the center. Basically, it is trying to see how far all the points are from the center. If the points are all far away, then it is not really the center. In fact, the center will _minimize_ the distance from all the points --- that's why we like centers. So it repeats this a few times until it finds the right center. Importantly, you can also use this calculation to "point" you in the right direction. I found a nice picture that shows this step by step http://itee.uq.edu.au/~comp4702/lectures/k-means_bis_1.jpg as it _figures out_ where the centers should be, at each step redrawing the lines.

Now, the important thing here is that we use the distances from the points to the center to calculate how "good" the center is. So I will unfortunately have to introduce a little bit of math here to show that.

When we have two points x and y in n dimensions, we can compute (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + … + (x_n - y_n)^2 as the distance between the two points.

Now suppose in our example, we have only 2 dimensions that "matter". So n = 2. So we have our distance metric as (x_1 - y_1)^2 + (x_2 - y_2)^2 Now let's assume that x_1 and x_2 tell us important things about the data. Maybe, for example, x_1 and x_2 are key properties of the objects we want to cluster. (weight and number of teeth). Suppose we are clustering between dogs and cats. Clearly weight and number of teeth will tell us important things that help us separate the dogs from the cats. So we call these "features" _informative_. They are valuable! Running k-means will give us a good separation.

Now suppose, in addition to that, we have some extra dimensions that are useless. Say "color", "date of birth", first name of owner", "number of kids in the house where it lives". Maybe these "features" really are just random noise. They're just getting in the way. So we will assume that the data is "randomly distributed with respect to these features".

Now our distance metric is: (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + … + (x_n - y_n)^2 Since we assumed for k>2 that x_k is random, we have our distance as (x_1 - y_1)^2 + (x_2 - y_2)^2 + (random-random)^2 + … + (random-random)^2

Now we will avoid some math here, but the important point is that the "random - random" starts to shout out over the first two dimensions as we increase the number of randomly distributed features.[3]

Intuitively, unless you know to "see" from the exact right angle (like the top of the room in the example) you will be lost in the sea of points in the space. But how do you decide which is the right angle? That is itself a hard problem. And oftentimes, it is _the_ hard problem. Deciding that "first name of the owner" wasn't as important as "weight" _is_ the hard problem. So k-means doesn't work as well.

This problem is often called the "curse" of dimensionality, because in a very high-dimensional space, "all points are equally far".

[1] A proper simple.wikipedia article should use a different word here. Suggestions?

[2] Side distraction: We can actually, with a little effort visualize a 4-d space. Say the 4 dimensions are our usual 3 + the extra time-dimension. We can have a 3-d object separate us into two separate 4-d worlds. So we're waiting waiting waiting as time passes, and then suddenly, the entire world is made of concrete. It obliterates everything in the universe. Then, in a flash, just as soon as it arrived, the concrete disappears. That concrete barrier is the 3-d hyperplane in the 4-d space.

Now to make it convenient to visualize, we have made our 3-d object "axis-aligned". But it may not be axis aligned. But the problem will be that the answer we want won't always be conveniently axis-aligned. Imagine a 2-d concrete barrier that starts at one end of the universe and "moves" through the universe as we move time. It crushes everything in its path. You can't get past it. It will eventually get to you because it is moving perpendicular to it's surface. Eventually it will destroy everything. So you still have a 3-d (this barrier exists in 2 spacial dimensions and 1 time dimension) concrete barrier separating the 4-d space without being axis aligned.

[3] I'm trying to avoid having to say that while the expected value of (random-random) is 0, the variance is nonzero and positive. And as we have more random features, the expected variance of the sum of the random features grows. I.E. Var(N(mu, sigma) - N(mu, sigma))

pacala · on May 31, 2012

Why not cluster one dimension at a time? K-means will meaningfully cluster for 2 of these dimensions and fail for the other 998. Measure how good the separation is and stick with the best X dimensions.

mokus · on May 31, 2012

The problem is that in many dimensions that often just won't work. The cloud of points can be smeared in directions that don't line up with any axes, in such a way that no matter which 2 axes you choose the clusters will overlap (but choosing a plane that doesn't align with any axis will clearly separate them). Even in 3 dimensions, you can see this problem by considering your populations to lie on 2 parallel planes, x+y+z=1 and x+y+z=2. Looking at this in x-y, x-z, or y-z projections, the point clouds will often be inseparable. And yet if you know where to look, all it takes is a projection to one dimension (f(x,y,z) = x+y+z) to separate them cleanly.

The curse of dimensionality appears here in the fact that the "number" of possible ways of projecting the space increases dramatically with the dimension.

pangram · on May 31, 2012

There are a variety of techniques to reduce the dimensionality of data (i.e., Principal Component Analysis, among other ways), and you can get better results sometimes by running K-means (or whatever) on the dimensionally-reduced data.

arjunnarayan · on May 31, 2012

Yes that will work. But, there are 1000C2 such combinations (that is, O(2^n)), and you have to try them all since you don't know which two dimensions are needed. And you cannot do each of the 100 separately in one-dimension and later hope to "combine" them together. In fact, you might not even know that it is 2 of the dimensions that are special. Maybe it is 7. So you have to try each of the 2^100 combinations separately. The algorithm you propose is not polynomial in the number of features.

tveita · on May 31, 2012

I may be misunderstanding you, but the number of ways to pick 2 out of n dimensions is Θ(n²).

Even if we looked at up to d dimensions, (bruteforceable at Θ(n^d)), that doesn't imply we need to generalize to do all 1000.

The post you replied to actually suggested clustering just one dimension at a time. I think that's a reasonable solution given the situation you posited, where you have some dimensions that do cluster well, even individually, and some who are mostly noise.

I would say that the curse of dimensionality is more fundamental than the problem you describe. The problem doesn't lie just in separating noise from signal - in a truly high-dimensional space there may not be a "right angle" to see it from.

For a simple example, you may take randomly distributed points. Even at uniform random distribution, it is relatively easy to for instance index two dimensional points in such a way that you can later locate points that are close to any particular spot. For 20-dimensional points, this is very hard.

jimbokun · on May 31, 2012

I think that's a great explanation! However, I do have some exposure to the topic, so I would be interested to know if it makes sense to someone totally new to the subject.

colinhevans · on May 31, 2012

The number of parameters to be fit or estimated in a statistical or machine learning model will always be a function of the dimensionality. For example, fitting a Gaussian (a Bell Curve) with a fixed variance of 1.0 to an n-dimensional space requires estimating the mean, which is n parameters. If we want to estimate an nXn covariance matrix for our n-dimensional Gaussian, we need to estimate n+n^2 parameters. For linear or logistic regression, we'll need to estimate n+1 parameters, and so on.

The curse of dimensionality is that if you add dimensions but don't add data, your functions will overfit the data because you don't have sufficient samples to estimate your model parameters. The worst-case estimate is that for n dimensions, you will need on the order of 2^n samples. This comes from the combinatoric increase of relative "distances" as dimensions increase.

This 2^n bound assumes that your data has a high degree of uncorrelated variance across all dimensions. In practice, the curse of dimensionality often isn't a problem. This is because most high-dimensionality data residing in an n-dimensional space actually doesn't have uniform variance across all n dimensions, and can be mapped or otherwise transformed into a k-dimensional subspace where k is much smaller than n with a minimal loss of variance.

Dimensionality reduction approaches include principle component analysis (PCA), minimum message length methods (MML), various feature selection approaches, virtually every clustering algorithm. Anything that removes dimensions while retaining the essential information content will do the trick.

tylerneylon · on May 31, 2012

When it comes to finding approximate neighbors, there is a nice simple trick to help address the curse of dimensionality.

An easy algorithm to find nearby neighbors is to split space into the generalization of squares / cubes, aka hypercubes. Then look at all the surrounding cubes and compile a list of all the neighbors in them. You can do this with a precomputed hash table mapping cubes to lists of points in that cube.

This suffers the curse because a cube in n dimensions has 3^n - 1 neighbors (it is in the middle of a 3x3x..x3 hypercube). Make it a little better with hash table from every cube corner to a list of points in all surrounding cubes, a benefit of 3^n down to 2^n lookups -- still exponential.

Simple idea: Split up space using triangles instead of squares (aka simplices instead of hypercubes). We just went from 2^n lookups down to n+1 (one for each corner) -- pretty good.

The questions are then: (1) how do you tile n-space with n-dim'l triangles/pyramids (simplices), and (2) how do we get highly regular simplices to do that? [Regularity is good because you usually want a sphere-like radius-based lookup.]

It turns out there's a vertex-transitive simplex that tiles any dimension, and as a bonus is easy to work with computationally. ("Vertex-transitive" is a type of regularity where basically every vertex "looks the same" as every other.) So if you care about point neighbors within a certain approximate radius, these tricks are extremely helpful.

Reference: http://www.siam.org/proceedings/soda/2010/SODA10_094_neylont...

ColinWright · on May 31, 2012

More information about this in the discussion of this item: http://news.ycombinator.com/item?id=3995615

There it talks about thinking of high-dimensional spheres as "spikey" to help avoid the common traps your 3D intuition will lead you into.

ntumlin · on May 30, 2012

No article in simple.wikipedia.org, could someone please explain this in a bit simpler terms?

francoisdevlin · on May 31, 2012

As my numerics professor said, "The more dimensions, the more you curse" :)

This has to do with trying to get a finer estimate of a problem. Suppose you have a one dimensional problem you're trying to solve. You take the line segment, break it into 100 pieces. These techniques usually a matrix with about 100 rows in it, and you have to invert it / find and eigenvalue, etc. These are O(n^2) or O(n^3) depending on the problem.

Now, let's get a better estimate, and use 200 points. We've doubled the number of points, so our solution now takes 4 to 8 times longer, depending on the type of problem.

Switch gears to a 2d problem. You now have a 100 X 100 mesh, that creates a matrix with 10,000 elements. If you want to improve the accuracy, you need a 200 X 200 mesh, which creates a 40,000 element matrix. Our algorithms cost O(n^4) and O(n^6) to get a better measurement now.

Now consider 3 dimensions and beyond. Your 100 x 100 x 100 system becomes a 200 x 200 x 200 system. This makes what is technically known a FREAKING HUGE matrix. Your algorithm is effectively O(n^6) or O(n^9), depending on the problem type.

You can see how this gets nasty quickly. HTH

i2pi · on May 31, 2012

There are two things going on here. First, as other commenters have noted, that as you increase the number of dimensions, the search space grows exponentially. The second, and deeper problem, is that our intuitions about 'volume' fail for higher dimensional problems. As you increase the number of dimensions, the outer shell of any hypercube holds much more volume than the inner portion of the cube. This means that if you were to distribute points with a Gaussian distribution in 2D, most of the mass is near the mean point of the distribution. As you increase the number of dimensions, more and more mass is contained within 'outliers'. In high dimensional space, things that seem unlikely in our usual 3D world become far more probable.

aktau · on May 31, 2012

I find this (and the other comment referencing the volumes of hypercubes) to be interesting. Perhaps exactly because it is so unintuitive. Maybe I could reason it up myself, but what exactly do you see as the "volume" of the inner portion and the "volume" of the shell? Are we talking about the same units here? Are there generic formulae?

Brashman · on May 31, 2012

One example is suppose I'm trying to tune some parameters and I'm going to do this by sampling. If I have only one parameter which is between 0 and 1 to tune, then I can find the best choice to within .01 by trying 100 values (each .01 increment).

Now instead, consider what happens when I have 100 parameters (high dimensionality). Now with 100 samples, I can't really learn much. Even if I wanted to try only 2 values in dimension (say 0 and 1), there would be 2^100 (roughly 10^30) different possibilities to try.

I would say the key line from Wikipedia is that, "when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse."

thirdhaf · on May 31, 2012

I'm a big fan of concrete examples so here's one: Let's say you have two points on a line labeled 0 and 1. Let's also suppose there's a special property you're interested in at one of these locations, it minimizes some function, there's a pot of gold there, it's infested with lions, doesn't matter. Since there are only two points it's easy to just check both points and empirically find which one you want.

We can extend this idea easily into 2 dimensions, now we have two axes each consisting of two points along perpendicular lines. Enumerated the points are [0,0], [0,1], [1,0], [1,1] and as you can see there are four of them.

It should also be pretty easy to see that if we extend this further into 3 dimensions that you once again double the number of points in your space to 8. You can see this by starting to enumerate them in the same way: [0,0,0], [0,0,1], [0,1,0], [0,1,1], [1,0,0]...[1,1,1]

From this it should follow that if you extend the problem into n-dimensions you end up with 2^n different points in your search space. If n is a small number, say 30 it's still possible to do an exhaustive search with a computer since 2^30 ~ 10^9 but soon you run out of resources regardless of how much money you're throwing at the problem.

This is the heart of the curse of dimensionality, starting with stupidly easy problems in one dimension quickly gets you in trouble when you want to extend it. This is not to say that we can NEVER deal with high-dimensional problems, two wonderful areas where we are VERY good at solving problems are linear programming and quadratic programming, in the former problems with hundreds of thousands of dimensions are feasible and thousands of dimensions in the case of the latter.

polm23 · on May 31, 2012

The other replies are correct, but why do you care? The trick is what's counter-intuitive about this.

If you work in a widget factory and are trying to figure out why widgets go bad, you might sort your widgets by size and color. If sorting by that doesn't seem to help you find a pattern in the bad widgets, you might add more data - whether the widgets are right or left handed, metric or English units, or something.

The problem with this is that by adding more data you're actually making your problem harder since you've added an extra dimension, which as other commentors described will blow up your search space pretty quickly. The extra data will be the most help if it allows you to ignore the previous data. The more dimensions you add, the less likely it is the next dimension makes things simpler rather than just more complicated.

One more reason to keep things simple.

sageikosa · on May 31, 2012

I seem to remember F-tests or something like that when I did multi-dimensional variant analysis as part of my statistical geography courses back in the day.

Even though you may have data along 7 or 8 dimensions, you may be able to develop a statistcal model that captures 95% predictive value using only 3 or 4 dimensions (but the 3 or 4 dimensions do not co-relate perectly to any of the 7 or 8 dimensions you've collected), instead each predictive dimension is based on part of each of the others.

Or if I summarize: the underlying (statistically useful) dimensionality of a dataset can be smaller than the number of variables collected as many of the variables co-relate with each other.

scottfr · on May 31, 2012

Principle Component Analysis is the main example of this:

http://en.wikipedia.org/wiki/Principal_component_analysis

kylebrown · on May 31, 2012

Another reference to PRML I presume?

To this page, I'd add a direct link to Mahalanobis Distance. I was just harping about MD in some HN comments the other day on a Bayes article. There's no pictoral representation on the wiki page, but it is intuitive and its one of those simple concepts with foundational implications.

MD rests on the assumption of gaussian/normal population. The math is straightforward, starting with the covariance matrix. The original (Rubine 1991) gesture recognizer used the MD of some 16 feature vectors between an input gesture and the collection of trained gestures.

anusinha · on May 31, 2012

Another big problem with studying high-dimensional spaces is that you run into many configurations that simply do not make sense in three dimensions/correspond to completely nonphysical shapes. This makes it difficult to study real (3-D) structures using many parameters because results in higher dimensional spaces do not necessarily correspond to a real shape.