Machine Learning Notes - Linear Regression (vilkeliskis.com)
66 points by tadasv on Aug 15, 2013 | hide | past | favorite | 19 comments


I'm just wrapping up a full-semester course on multiple regression, and reading this gave me a great, different perspective on it.

I definitely appreciate the simple approach in the article. If the OP is like me, perhaps he's posting this to better his understanding and leave artifacts for others to follow as they learn. I have to point out, though, that there's much more happening in regression; to do it well, read further on it.

As a concrete example of why: the author mentions the R^2 value but doesn't warn that adding more variables to your model will artificially inflate it. For this reason, the "Adjusted R^2", which corrects for the number of predictors, is a better measure. There's also testing the validity of your model, building it up from scratch, understanding that you can't predict outside the domain of your independent variables, etc.
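The adjustment is a standard formula; here's a minimal sketch of it (function name and the illustrative numbers are mine):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: penalizes model size (n = observations, p = predictors)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same raw R^2 of 0.80 on 50 observations looks worse as predictors pile up:
print(adjusted_r2(0.80, n=50, p=2))   # ~0.79
print(adjusted_r2(0.80, n=50, p=20))  # ~0.66
```

Unlike raw R^2, this can go down when you add a variable that doesn't pull its weight.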

With that out of the way, I very much enjoyed seeing some of the math behind this. My class was focused entirely on learning to use a statistical package to run regressions. That's perfectly adequate, and it's all I'll use on a day-to-day basis, but understanding what's going on beneath the covers has always made me more effective at the task at hand.

Thanks!


Thanks for your comment. I was trying to keep everything as simple as possible so you could easily bootstrap your Python project with the code examples. You should explore the topic further yourself! It's very easy to keep adding more and more material to a post until nobody will ever read it, or the information becomes incorrect. So yes, you need to drill down on your own.


Not to bag on the write-up, as it was well done, but does linear regression really qualify as machine learning? Almost every stats 101 course covers the topic, and probably 99% of people who use linear regressions in their day-to-day work would not call it machine learning. I know that linear regression is sometimes presented in machine learning courses, but I always thought it was done as a refresher, not as actual course material of any significant weight.


The 99% of people who use linear regression but would not call it machine learning probably aren't very well versed in regularization, cross-validation, non-linear transformations, feature engineering, data snooping, bagging, boosting, generalization, etc.

A finely tuned linear regression is a devastating machine learning algorithm.


Why does it matter? What's the utility of determining whether or not a particular topic qualifies as part of another particular topic?


In this case, it's not a huge deal, but if someone wrote on their resume that they had some experience with machine learning, and it turned out that they had only done linear regression modeling, they might be poorly received.


Why wouldn't it be machine learning? It certainly fits the definition of supervised learning (aka regression).

Admittedly you do usually learn it in unsexy statistics classes rather than sexy machine learning classes...


The lines have always been a little blurry for me. You have statistics, machine learning, data mining, and artificial intelligence, and it all seems to overlap heavily. I tend to consider statistics and data mining to be concerned mostly with classical statistics, and machine learning and artificial intelligence to be concerned more with Bayesian methods and algorithms. That said, logistic regression seems like the quintessential machine learning technique. So who knows; any experts care to comment?


As someone who did graduate work in Bayesian methods for a statistics master's degree, I take offense at the suggestion that machine learning is not a concern of statistics and data mining (but not really)! The hesitance towards Bayesian methods seems more related to the discipline, and places that call what they do "machine learning" tend to be less hostile towards the explicit subjectivity of Bayes (I would highlight the word "explicit" in that sentence--frequentism has its fair share of subjectivity as well).

There was a great post on Stats.SE a few years ago about the difference between statistics and machine learning[1]. Leo Breiman once argued that statistics tends to focus more on model fitting and checking, while machine learning looks at prediction accuracy. The exchange between Andy Gelman and Brendan O'Connor is pretty funny. It has been my personal experience, however, that many people who apply a method they brand as "machine learning" are not as bothered with assumptions as my fellow conservative statisticians.

But statistics and machine learning are quite similar at their foundation. Barring the differences in terminology, as a professional statistician I find I have as little difficulty reading machine learning papers and algorithms as I do reading statistics ones.

[1] http://stats.stackexchange.com/questions/6/the-two-cultures-...


All of these terms are used interchangeably, so the following definitions probably won't be too helpful in the real world, but in my experience:

Artificial Intelligence is an umbrella academic term which encapsulates the study and design of intelligent machines. It's not well defined because AI is evolving so rapidly.

Machine Learning is a branch of AI that is concerned specifically with learning from data; the results of learning are usually used to predict future events. (Think linear regressions, random forests, etc.)

Though not specifically a part of AI, Statistics is the field that formed many of the algorithms used in ML. Stats informs ML research design (e.g., how large of a sample size do I need), generates mathematical solutions from proofs and equations, etc. With the rise of big data, it's slowly merging with ML.

Data Mining is a mix of ML, Stats and Data Engineering. It's more concerned with structuring and extracting patterns from data than necessarily learning from it. It is often a task within an ML project.


Because it was done with pen and paper by Gauss 200 years ago :)


Linear regression can be a quite powerful tool, especially when (perhaps counter-intuitively at first) it is used to fit exponentials [1] or polynomials to data.

[1] http://mathworld.wolfram.com/LeastSquaresFittingExponential....
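The exponential case can be sketched in a few lines of NumPy (synthetic data; the log transform assumes y > 0):

```python
import numpy as np

# Fit y = A * exp(B x) by regressing ln(y) on x, since ln(y) = ln(A) + B x.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = 2.0 * np.exp(0.5 * x) * rng.lognormal(0.0, 0.05, x.size)  # true A=2, B=0.5

B, lnA = np.polyfit(x, np.log(y), 1)  # slope and intercept of the log-linear fit
A = np.exp(lnA)
print(A, B)  # close to 2.0 and 0.5
```

One caveat the Wolfram page discusses: fitting in log space weights the errors differently than a direct nonlinear least-squares fit of the exponential.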


The difference between statistical modelling and machine learning is that the first aims to build models that strike the right balance between accuracy and model complexity on the given data, while the second aims for models that work well on unseen data. Thus if you use AIC to find the best subset of variables for a linear model, it's statistical modelling, but if you use cross-validation for the same purpose, it's machine learning (though, to be honest, both ideas are wrong).
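The contrast can be sketched on a toy model, NumPy only (synthetic data; both criteria scored for the same ordinary-least-squares fit, using the standard Gaussian AIC formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 vars
y = X @ np.array([0.5, 1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=n)

# In-sample criterion: Gaussian AIC (up to an additive constant).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta) ** 2)
aic = n * np.log(rss / n) + 2 * X.shape[1]

# Out-of-sample criterion: 5-fold cross-validated mean squared error.
folds = np.array_split(rng.permutation(n), 5)
mses = []
for test in folds:
    train = np.setdiff1d(np.arange(n), test)
    b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    mses.append(np.mean((y[test] - X[test] @ b) ** 2))
cv_mse = float(np.mean(mses))
print(aic, cv_mse)  # CV MSE should sit near the noise variance, 0.25
```

A subset search would compute either score for each candidate set of columns and keep the best; only the scoring rule differs.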


I was in your camp until the weight of the discussion convinced me otherwise. I learned regression as an advanced statistics topic, and it wasn't touched on in my (beyond awful) AI class.

However... If linear regression is being used to predict the next value in a dataset, and the caliber of the regression improves as more data is gathered, then it counts as machine learning. For all I know, Netflix could be improving their picks of my movies with a grand regression.


Totally bad article. It encourages bad practices like checking validity on the same data the model was trained on. You should do some cross-validation, or at least split the data into two parts, train the model on the first part, and test it on the second.
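A minimal holdout sketch in NumPy (synthetic data, plain least squares; the split ratio is my choice) showing which number to report:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(size=n)  # unit-variance noise

idx = rng.permutation(n)
train, test = idx[:150], idx[150:]          # 75/25 split

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
train_mse = np.mean((y[train] - X[train] @ beta) ** 2)
test_mse = np.mean((y[test] - X[test] @ beta) ** 2)
print(train_mse, test_mse)  # quote the held-out number, not the training one
```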


Nice write-up! I'd caution against just checking for an exactly-zero determinant. Read up on ill-conditioned matrices, and maybe check the condition number (or a determinant below a certain threshold) first. Also, work hard to never, ever have to fully invert a matrix.


Checking that the determinant is below a threshold is not a valid test for conditioning. Take epsilon*Identity, for example.


As lp251 says, the determinant method is not a good idea.

Checking the condition number of the covariance matrix of the LS estimates (X'X)^{-1} is the way R does it.
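Both points are easy to see in NumPy (the eps*Identity counterexample from above, plus a nearly collinear design; the specific numbers are illustrative):

```python
import numpy as np

# eps * I: the determinant is astronomically small, yet conditioning is perfect.
A = 1e-6 * np.eye(4)
print(np.linalg.det(A), np.linalg.cond(A))   # ~1e-24, 1.0

# A nearly collinear design: here the determinant threshold tells you little,
# but the condition number of X'X explodes.
X = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-4]])
print(np.linalg.cond(X.T @ X))               # very large
```

(The condition number of X'X is the same as that of its inverse, so checking either is equivalent.)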


Not bad, but I'd rather have seen statsmodels[1] (which is more intuitive to use and gives you more data, as well as methods for displaying it) than sklearn as the library. I understand the choice given that it's "machine learning", but as the comments demonstrate, the distinction's not actually that clear.

[1] http://statsmodels.sourceforge.net/stable/gettingstarted.htm...



