Data analysis is one of the most popular and in-demand fields in both business and science, but it is not taught at university in the same way as, for example, medicine or history. Modern data science specialists are as varied as the pieces of a patchwork quilt, with different educations and different knowledge bases. In this article we will talk about how much mathematics is needed for data science.
Modern data science has already accumulated a rich software toolkit: there are many libraries and built-in functions for processing data, and even for the machine learning methods themselves. To run a first model, a novice data scientist only needs to learn the order of the required steps and which library provides the necessary methods. The model will start and run, and it will even produce a result, quite possibly a decent one, if not a great one. However, such a model is unlikely to reflect reality or to produce forecasts close to the truth: using machine learning methods without understanding how they work, and without proper preprocessing of the data, can lead to overfitting or underfitting. It is therefore very important to understand the theoretical foundations (which are inseparable from mathematical formulas and concepts) of all aspects of data science.
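To make overfitting and underfitting concrete, here is a minimal sketch that fits polynomials of different degrees to a small noisy sample. The synthetic sine data, the noise level, and the chosen degrees are assumptions made purely for illustration. A straight line is usually too simple for such data (underfitting), while a high-degree polynomial tends to follow the noise: its training error goes down, but its error on fresh data typically does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw n noisy samples of a sine curve (synthetic, for illustration only)."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15)    # small training sample
x_test, y_test = make_data(200)     # fresh data the model has never seen

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

Comparing the two columns of the printout is the point: a model should be judged by its error on data it was not fitted to, not by how closely it reproduces the training sample.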
Mathematics, the queen of all sciences, holds a high position here. To become a good data scientist, a young professional needs to master at least the following topics:
Linear algebra:
Norm and scalar (dot) product of vectors.
Definition of a matrix. Matrix operations.
Rank and determinant of a matrix.
Systems of linear equations.
Types of matrices.
Eigenvectors and eigenvalues.
Mathematical analysis:
Functions and their properties.
Limit of a function (basic concepts).
Derivative of a function (+ its geometric and mechanical meaning).
Derivative of a composite function.
Extrema of a function. Expansion of a function.
Partial derivatives and the gradient.
The gradient in optimization problems.
Directional derivative.
Tangent plane and linear approximation.
Optimization methods:
Optimization of smooth functions (+ the problem of local minima).
Simulated annealing.
Genetic algorithms. The differential evolution algorithm.
Mathematical statistics and probability theory:
Definition of probability. Properties of probability.
Conditional probability. Law of total probability. Bayes' formula.
Discrete random variables.
Continuous random variables.
Estimating a distribution from a sample. Statistics.
Characteristics of distributions.
Important statistics (sample mean, median, mode, variance, interquartile range).
Central limit theorem.
Suppose you already have all of this theoretical knowledge (or at least most of it). How do you apply it in practice? Linear algebra is useful both for initial work with data and for understanding complex machine learning methods. In the early stages, knowing about matrices, their properties, and operations on them helps you understand how the methods of the NumPy library work and how important statistical quantities (for example, correlation) are computed for large data sets.
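As an illustration of how a statistic such as correlation reduces to matrix operations, the sketch below computes a Pearson correlation matrix by hand and compares it with NumPy's built-in `np.corrcoef`. The helper name `correlation_matrix` and the synthetic data are assumptions made for this example only.

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlation between the columns of `data` (rows are observations)."""
    centered = data - data.mean(axis=0)                  # subtract each column's mean
    cov = centered.T @ centered / (data.shape[0] - 1)    # sample covariance matrix
    std = np.sqrt(np.diag(cov))                          # per-column standard deviations
    return cov / np.outer(std, std)                      # rescale covariance to correlation

# Synthetic data: the second column depends on the first, the third is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([
    x,
    2.0 * x + rng.normal(scale=0.5, size=200),
    rng.normal(size=200),
])

manual = correlation_matrix(data)
builtin = np.corrcoef(data, rowvar=False)   # columns are treated as variables
print(np.allclose(manual, builtin))
```

The whole computation is one matrix product plus a rescaling, which is why it stays fast even when the table has many rows.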
If you work with neural networks, you have probably already come across the concept of a tensor, which builds on the concept of a multidimensional vector. Every self-respecting data analyst tries to understand the data he or she is going to work with as thoroughly as possible. Knowledge of mathematical analysis and statistics helps here. Plotting distributions and functions of numerical data, and histograms of categorical values, can reveal important patterns or errors in the data that could strongly influence the final prediction.
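A small sketch of how a histogram can expose a data error: the `ages` column below is made-up data in which a placeholder value of 999 was (hypothetically) entered for missing records. Binning the values with `np.histogram` makes the isolated group stand out immediately.

```python
import numpy as np

# Made-up example column; 999 is assumed to be a placeholder for "unknown".
ages = np.array([23, 31, 27, 45, 38, 29, 999, 34, 41, 26, 999, 30])

counts, edges = np.histogram(ages, bins=5)   # 5 equal-width bins over the value range

# Print a simple text histogram: one row per bin.
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:7.1f} to {hi:7.1f}: {'#' * count}")
```

Ten values land in the first bin and two in the last, with nothing in between. A gap like that is a cue to inspect the raw records before any mean or model is computed from the column.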
The concepts of gradient and tangent, together with knowledge of optimization methods, are used when applying and tuning algorithms that minimize the loss function in machine learning tasks; this is a separate, important part of data science. A method such as reinforcement learning will not give good results without a well-thought-out way of optimizing the objective function.
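To show the role of the gradient in reducing a loss function, here is a minimal sketch of plain gradient descent on a mean squared error loss for a linear model. The function names, the learning rate, the number of steps, and the noiseless synthetic data are all assumptions chosen for illustration; real training code usually relies on a library optimizer.

```python
import numpy as np

def loss(w, X, y):
    """Mean squared error of the linear model X @ w."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)

def gradient(w, X, y):
    """Gradient of the mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, lr=0.1, steps=500):
    """Repeatedly step against the gradient, starting from zero weights."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * gradient(w, X, y)
    return w

# Synthetic, noiseless data generated from known weights (intercept 1, slope 3).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1.0, 1.0, size=100)])
true_w = np.array([1.0, 3.0])
y = X @ true_w

w_hat = gradient_descent(X, y)
print(w_hat, loss(w_hat, X, y))
```

Each step moves the weights in the direction in which the loss decreases fastest. If the learning rate is too large the iterations diverge, and if it is too small they converge slowly, which is one reason an understanding of the underlying calculus helps when tuning a model.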
Probability theory and mathematical statistics are inseparable from data science; in fact, data analysis began with statistical studies and attempts to find patterns in them. Sampling, choosing the type of machine learning method suited to specific data, and understanding metrics are all impossible without understanding the main aspects of these areas of mathematics. And how many more subtleties there are in each method!
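One example of why metrics cannot be read without probability theory is Bayes' formula applied to a classifier or a diagnostic test. The sketch below uses assumed numbers (a 1% base rate, 95% sensitivity, and a 5% false positive rate) and a hypothetical helper, `bayes_posterior`, to compute the probability that a positive result is actually correct.

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive result) by Bayes' formula."""
    # Law of total probability: overall chance of seeing a positive result.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence

# Assumed illustrative numbers, not taken from any real study.
posterior = bayes_posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {posterior:.3f}")
```

With these numbers only about 16% of positive results are true positives, even though the test looks accurate on paper. The same reasoning explains why precision can be low for a rare class despite high overall accuracy.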
Of course, you do not need to be an academic mathematician to become a good data scientist, but it is essential to understand the areas above and their basic concepts. And if you want to keep improving, you cannot do without mathematical training.