
Reading Basic Statistics by Kimio Miyakawa: statistics before machine learning

created 2021-12-13

I started learning machine learning this year. For roughly the first six months, I learned mainly from typical machine learning books: basic supervised learning such as regression and classification, unsupervised learning such as clustering and dimensionality reduction, how to read results, how to process well-formed data, simple neural networks such as perceptrons, fully connected layers, CNNs, and RNNs implemented from scratch, and model building with TensorFlow and Keras. When abstracted libraries existed, I used them while thinking about which model was appropriate for the problem.

In practice, however, before you build a model you first have to look at the data and think. Explanations of exploratory data analysis aimed at beginners often skip the question of what can be understood from data in the first place. They start from ideas such as correlation and distribution, assuming the reader already has that background. I could do something that looked like EDA, but I did not really understand what I was looking at.

The same thing happens when you build a model and validate it with an A/B test. Many explanations say something like "use a chi-square test and check statistical significance", and you end up validating things without understanding them well. What are degrees of freedom? What is a t statistic? Can you ignore degrees of freedom because internet data has a large sample size? Why is variance divided by n - 1? I did not understand even these basic points. And in reality, you rarely run a single A/B test and look at it once. You may want to know whether repeated results are significant, you may have so few trials that degrees of freedom matter, and you have to decide what you will call significant and what result you expected before running the test.
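One of the questions above, why variance is divided by n - 1, can be checked by brute force. The following is a minimal sketch, not the book's derivation: it assumes a hypothetical population (the faces of a fair die) and a sample size of 3, enumerates every possible sample drawn with replacement, and averages two estimators over all of them. The helper `variance` and all variable names are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor_offset):
    """Sum of squared deviations from the mean, divided by (n - divisor_offset)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    squared_deviations = sum((Fraction(v) - mean) ** 2 for v in values)
    return squared_deviations / (n - divisor_offset)


# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
sample_size = 3

# True variance of the population (divide by N, because this is the whole population).
population_variance = variance(population, 0)

# Every equally likely sample of size 3 drawn with replacement (6**3 = 216 samples).
samples = list(product(population, repeat=sample_size))

# Averaging an estimator over all possible samples gives its expected value.
expected_divide_by_n = sum(variance(s, 0) for s in samples) / len(samples)
expected_divide_by_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print("population variance:          ", population_variance)
print("expected value, divide by n:  ", expected_divide_by_n)
print("expected value, divide by n-1:", expected_divide_by_n_minus_1)
```

Under these assumptions, dividing by n underestimates the population variance on average by a factor of (n - 1) / n, while dividing by n - 1 matches it exactly. Exact fractions are used so that the comparison is not blurred by floating-point rounding.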

These are only examples, but I lacked the underlying premises. Because of that, my understanding was shallow and I could not always choose an appropriate method. I often did not understand basic terms that appeared in library documentation. It took me about half a year to realize that this "background knowledge I was missing" was statistics. The foundations needed for machine learning are calculus, linear algebra, and statistics. I had at least a minimal handle on calculus and linear algebra because I studied their basics in high school, and linear algebra also appeared in 3D programming, where I had implemented related code before.

Statistics, on the other hand, was almost absent from my working knowledge. I may have taken a university credit for it, but I had forgotten it completely. I did not even understand basic ideas such as looking at the mean and variance of data, standardizing a value, or knowing that a standardized, normally distributed value falls between -1.96 and 1.96 about 95% of the time.
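The ideas in the previous paragraph fit in a few lines of Python. This is a small sketch using only the standard library's `statistics` module; the `scores` list is hypothetical data invented for illustration, and the population standard deviation (`pstdev`) is used for standardizing as an assumption of this example.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical data, made up for this example.
scores = [52, 61, 58, 70, 49, 66, 73, 55]

# Standardizing: subtract the mean and divide by the standard deviation.
mu = mean(scores)
sigma = pstdev(scores)
z_scores = [(x - mu) / sigma for x in scores]

# After standardizing, the data has mean 0 and standard deviation 1.
print("mean of z-scores: ", round(mean(z_scores), 6))
print("stdev of z-scores:", round(pstdev(z_scores), 6))

# Probability that a standard normal variable lies between -1.96 and 1.96.
standard_normal = NormalDist(mu=0, sigma=1)
coverage = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
print("P(-1.96 < Z < 1.96):", round(coverage, 4))

# The other direction: the point that leaves 2.5% in the upper tail.
z_975 = standard_normal.inv_cdf(0.975)
print("97.5th percentile of Z:", round(z_975, 3))
```

The 95% figure applies to the standardized value, which is why standardizing comes first: for a normal distribution with mean mu and standard deviation sigma, the corresponding range is mu - 1.96 sigma to mu + 1.96 sigma.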

So I decided to learn the basics of statistics properly. At first, though, I did not know where to begin. Looking around bookstores, I found many all-in-one books combined with SQL or Python, but I could not tell which books would let me actually learn statistics. I tried O'Reilly's Practical Statistics for Data Scientists, but because I did not understand the underlying basics of statistics, I could not really get started.

Returning to the basics, I skimmed textbook-style books. Basic Statistics from University of Tokyo Press honestly felt too difficult for me, and I could not imagine finishing it. Around that time, I happened to see a video that recommended Basic Statistics, 4th Edition by Kimio Miyakawa. I tried it without much expectation, but it was extremely clear, and the example problems were excellent. I read it almost every day, worked through the exercises with a pen, notebook, and scientific calculator, and finished it in a little under three months. I almost never finish this kind of textbook, so it must have suited me very well.

The explanations are concise and easy to understand. When a concept reappears after enough pages that you might have forgotten it, the book gives page references and supplementary explanations, so you are not left behind. The exercises are also easy to picture in real-world terms, for example: "If the defect rate of a product is 2%, what is the probability that 2 defective items are included among 200 products?" or "In an experiment, the average time until 10 fuses blew under a 25% overload was 9.2 minutes, with a standard deviation of 2.5 minutes. Estimate the mean time until this type of fuse blows under a 25% overload with a 99% confidence coefficient."
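The two exercises quoted above can be worked through roughly as follows. This is my own sketch, not the book's solution, and it rests on a few assumptions: the first exercise is read as "exactly 2 defective items" with independent defects (a binomial model, with the Poisson approximation shown for comparison), and the second treats 2.5 minutes as the sample standard deviation of normally distributed times, so the interval uses the t distribution with 9 degrees of freedom. The critical value 3.250 is taken from a standard t table, because the Python standard library has no t-distribution quantile function.

```python
from math import comb, exp, factorial, sqrt

# Exercise 1: defect rate 2%, 200 products, probability of exactly 2 defects.
n_products = 200
defect_rate = 0.02
k_defects = 2

# Binomial probability: C(n, k) * p^k * (1 - p)^(n - k).
binomial_prob = (
    comb(n_products, k_defects)
    * defect_rate**k_defects
    * (1 - defect_rate) ** (n_products - k_defects)
)

# Poisson approximation with lambda = n * p, reasonable when n is large and p is small.
lam = n_products * defect_rate
poisson_prob = exp(-lam) * lam**k_defects / factorial(k_defects)

print(f"binomial: {binomial_prob:.4f}")
print(f"poisson:  {poisson_prob:.4f}")

# Exercise 2: 10 fuses, sample mean 9.2 min, standard deviation 2.5 min, 99% interval.
n_fuses = 10
sample_mean = 9.2
sample_sd = 2.5  # assumed to be the sample standard deviation (n - 1 divisor)

# Two-sided 99% critical value of the t distribution with n - 1 = 9 degrees
# of freedom, copied from a standard t table (an assumption of this sketch).
t_critical = 3.250

margin = t_critical * sample_sd / sqrt(n_fuses)
lower = sample_mean - margin
upper = sample_mean + margin

print(f"99% confidence interval: {lower:.2f} to {upper:.2f} minutes")
```

With only 10 observations the degrees of freedom matter: the normal-distribution critical value for 99% is about 2.576, noticeably smaller than 3.250, so using it here would give an interval that is too narrow.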

As you can see from the table of contents, the book covers mean and variance, frequency distributions, regression and correlation analysis, probability, random variables and probability distributions, major probability distributions, sampling distributions, estimation, hypothesis testing, and the statistical logic of regression. It teaches probability and regression, which are important foundations for machine learning algorithms, probability distributions that matter when looking at data, and estimation and testing for checking whether hypotheses hold. As I learned these topics gradually, I think my practical ability to look at data and form hypotheses improved substantially.

Looking back, the most efficient time for me to start statistics would have been about three months after I began machine learning, once I could use tools such as scikit-learn and TensorFlow at a basic level. In short, statistics is one of the foundations of machine learning, and it is better to learn at least the minimum basics early. I recommend Kimio Miyakawa's Basic Statistics as a clear way to learn those foundations. I am grateful to Professor Miyakawa for writing such a good book.