More on statistics and machine learning

I’m thinking about a client problem right now, where something we need to predict can be modelled as a function of a few other things that we will know.

Initially I was thinking about it from the machine learning perspective, and my thought process went “this can be modelled as a function of X, Y and Z. Once this is modelled, we can use X, Y and Z to predict this going forward”.

And then a minute later I context-switched into the statistical way of thinking. And now my thinking went “I think this can be modelled as a function of X, Y and Z. Let me build a quick model to check the goodness of fit, and whether a signal actually exists”.
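To make that concrete, here is a minimal sketch of that quick check in Python, using statsmodels. The data file and the column names (x, y, z, target) are hypothetical stand-ins for the client data:

```python
# A minimal sketch of the "quick model" sanity check, assuming the data sits
# in a CSV with hypothetical columns x, y, z and target.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("client_data.csv")  # hypothetical file

# Fit a quick linear model and inspect the fit immediately
model = smf.ols("target ~ x + y + z", data=df).fit()
print(model.summary())  # R-squared, coefficient p-values, F-statistic

# A low R-squared and insignificant coefficients would suggest
# there is no signal worth building a bigger model on.
```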

Now this might reflect my own biases, and my own process of learning statistics and machine learning, but one important difference I find is that in statistics you are concerned with the goodness of fit, and with whether there is a “signal” at all.

While in machine learning we also look at predictive ability (area under the ROC curve and all that), there is a delay between the time we build the model and the time we check how well it fits. What this means is that we can sometimes become a bit too certain about the models we want to build, without first asking whether they make sense and whether there is any signal in them at all.

For example, in the machine learning world, nobody bothers with the R-squared of a regression – the only thing that matters is how well you can predict out of sample. So while you’re building the regression (machine learning) model, you don’t have immediate feedback on what to include, what to exclude, and whether there is a signal at all.
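Here is a minimal sketch of that workflow in Python, using scikit-learn. The features and the binary target are simulated stand-ins; the point is that the feedback (cross-validated area under the ROC curve) arrives only after the whole fit-and-predict cycle:

```python
# A minimal sketch of the machine-learning workflow: judge the model purely
# on out-of-sample performance. The data here is simulated, standing in for
# the hypothetical X, Y, Z and a binary target.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # stand-ins for X, Y and Z
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Five-fold cross-validated AUC: the only feedback is out-of-sample,
# and it arrives only after the full fit/predict cycle.
aucs = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(aucs.mean())
```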

I must remind you that machine learning methods are typically used when we are dealing with really high-dimensional data, and where the signal usually exists in the interplay between explanatory variables rather than in any single explanatory variable. Statistics, on the other hand, is used more for low-dimensional problems where each variable has reasonable predictive power by itself.
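A toy illustration of signal living in the interplay between variables: in XOR-style data, neither feature predicts the target on its own, but together they determine it completely. A linear model finds nothing while a tree-based model finds everything. The data here is simulated:

```python
# XOR-style data: a pure interaction effect. Each feature alone carries no
# signal, so a linear model scores at chance while a tree-based model that
# can capture the interplay scores near perfectly.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

print(cross_val_score(LogisticRegression(), X, y, cv=5).mean())      # ~0.5, chance
print(cross_val_score(RandomForestClassifier(), X, y, cv=5).mean())  # ~1.0
```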

It is possibly a quirk of how the two disciplines are practised that statistics people are inherently more sceptical about the existence of signal, while machine learning guys are more certain that their model makes sense.

What do you think?

2 thoughts on “More on statistics and machine learning”

  1. Machine learning vs. statistics is a false dichotomy. A competent math-y practitioner like you adopts whatever tools necessary to solve the problem. So, the analysis should be along a 2-by-2 with competent–incompetent on one axis and pragmatist–dogmatist on the other (one could add a 3rd axis, applied–theoretician, if necessary).

    You, as a math-y guy intending to deliver the best and solve a client’s problems, would obviously be way up in the competent–pragmatist quadrant. Most theoreticians (except the best!) would probably be in the competent–dogmatist quadrant. Plug-and-chug ML practitioners and NN Taleb’s “psycholophasters” would be in the incompetent–dogmatist quadrant. If I had to guess, the incompetent–pragmatist quadrant is pretty sparsely populated.
