What is Data Mining?
Data mining refers to discovery and extraction of patterns and knowledge from large data sets of structured and unstructured data. Data mining techniques have been around for many decades, however, recent advances in ML (Machine Learning), computer performance, and numerical computation have made data mining methods easier to implement on the large data sets and in business-centric tasks. Growing popularity of data mining in business analytics and marketing is also due to the proliferation of Big Data and Cloud Computing. Large distributed databases and methods for parallel processing of data such as MapReduce, make huge volumes of data manageable and useful for companies and academia. Similarly, the cost of storing and managing data is reduced by cloud service providers (CSPs) who offer a pay-as-you-go model to access virtualised servers, storage capacities (disc drives), GPUs (Graphic Processing Unit), and distributed databases. As a result, companies can store, process, and analyze more data getting better business insights.
By themselves, state-of-the-art data mining methods are powerful in many classes of tasks. Some of them are anomaly detection, clustering, classification, association rule learning, regression, and summarization. Each of these tasks plays a crucial role in a whatever setting one might think of. For example, anomaly detection techniques help companies protect against network intrusion and data breach. In their turn, regression models are powerful in the prediction of business trends, revenues, and expenses. Clustering techniques have the highest utility in grouping huge volumes of data into cohesive entities that tell about patterns, dependencies both within and among them without the prior knowledge of any laws that govern observations. As these examples illustrate, data mining has the power to put data into the service of businesses and entire communities.
Data Mining Models
There exist numerous ways to organize and analyze data. Which approach to select depends much on our purpose (e.g. prediction, the inference of relationships) and the form of data (structured vs. unstructured). We can end up with a particular configuration of data which might be good for one task, but not so good for another. Thus, to make data usable one should be aware of theoretical models and approaches used in data mining and realize possible trade-offs and pitfalls in each of them.
Parametric and Non-Parametric Models
One way of looking at a data mining model is to determine whether it has parameters or not. In terms of parameters, we have a choice between parametric and non-parametric models. In the first type of models, we select a function that, in our view, is the best fit to the training data. For instance, we may choose a linear function of a form F (X) = q0 + q1 x1 + q2 x2 + . . . + qp xp, in which x’s are features of the input data (e.g house size, floor, a number of rooms) and q’s are unknown parameters of the model. These parameters may be thought of as weights that determine a contribution of different features (e.g house size, floor, number of rooms) to the value of the function Y (e.g house price). The task of a parametric model is then to find parameters Q using some statistical methods, such as linear regression or logistic regression.
The main advantage of parametric models is that they contain intuition about relationships among features in our data. This makes parametric models an excellent heuristic, inference, and prediction tool. At the same time, however, parametric models have several pitfalls. If the function we have selected is too simple, it may fail to properly explain patterns in the complex data. This problem, known as underfitting, is frequent in linear functions used with the non-linear data. On the other hand, if our function is too complex (e.g with polynomials), it may end up in the overfitting, a scenario in which our model responds to the noise in data rather than actual patterns and is not generalizable to new examples.
Figure #1 Examples of normal, underfit, and overfit models.
Non-parametric models are free from these issues because they make no assumptions about the underlying form of the function. Therefore, non-parametric models are good in dealing with unstructured data. On the other hand, since non-parametric models do not reduce the problem to the estimation of a small number of parameters, they require very large datasets in order to obtain a precise estimate of the function.
Restrictive vs. Flexible Methods
Data mining and ML models may also differ in terms of flexibility. Generally speaking, parametric models, such as linear regression, are considered to be highly restrictive, because they need structured data and actual responses (Y) to work. This very feature, however, makes them suitable for inference – finding relationships between features (e.g how the crime rate in the neighborhood affects house prices). Because of this, restrictive models are interpretable and clear. This observation, though, is not true for flexible models (e.g non-parametric models). Because flexible models make no assumptions about a form of the function that controls observation, they are less interpretable. In many settings, however, the lack of interpretability is not a concern. For example, when our only interest is a prediction of stock prices, we should not care about the interpretability of the model at all.
Supervised vs. Unsupervised Learning
Nowadays, we hear a lot about supervised and unsupervised Machine Learning. New neural networks based on these concepts are making progress in image and speech recognition, or autonomous driving on a daily basis. A natural question, though, what is the difference between unsupervised and supervised learning approaches? The main difference is in a form of data used and techniques to analyze it. In a supervised learning setting, we use a labeled data that consists of
features/variables and dependent variables (Y or response). This data is then fed to the learning algorithm that searches for patterns, and a function that controls relationships between independent and dependent variables. The retrieved function may be then applied for the prediction of future observations. In the unsupervised learning, we also observe a vector of features (e.g. house size, floor). The difference with supervised learning, though, we don’t have any associated results (Y). In this case, we cannot apply a linear regression model since there are no response values to predict. Thus, in an unsupervised setting, we are working blind in some sense.
Data Mining Methods
In this section, we are going to describe technical details of several data mining methods. Our choice fell on linear regression, classification, and clustering methods. These methods are one of the most popular in data mining because they solve a wide variety of tasks, including inference and prediction. Also, these methods perfectly illustrate key features of data mining models described above. For example, linear regression and classification (logistic regression) are examples of parametric, supervised, and restrictive methods, whereas clustering (k-means) belongs to a subset of non-parametric unsupervised methods.
Linear Regression for Machine Learning
Linear Regression is a method of finding a linear function that reasonably approximates the relationship between data points and dependent variable. In other words, it finds an optimized function to represent and explain data. Contemporary advances in processing power and computation methods allow using linear regression in combination with ML algorithms to produce quick and efficient function optimization. In this section, we will describe an implementation of the linear regression with gradient descent to produce algorithmic fitting of data to linear function.
Image #1 Linear regression
For this task, let’ s take a case of a house price prediction. Let’s assume we have a training set of 100 house examples (m=100). Each house in this sample may be defined as x1 x2, x3 … xm. Correspondingly, each house has a set of features or properties, such as house size and floor. Features may be thought of as variables that determine a house price. So, for example, the variable
would refer to the size of the first house in the training sample. Finally, our training sample has a list of prices for each house denoted as y1, y2, ...ym..
This data tell much by itself (e.g we may apply some methods of descriptive statistics to interpret it), however, in order to run a linear regression, we should first formulate the initial hypothesis. Our hypothesis may be defined as a simple linear function with three parameters (Q).
hQ(x) = Q0 + Q1x1+ Q2x2
Image#2 Gradient Descent
where x1 and x2 are the features (house size and floor) and Q parameters of the function we want to predict with the linear regression. In essence, this hypothesis says that a house price is determined by house size and floor parametrized by certain parameters Q. Thus, we have a confirmation that linear regression is a parametric model, in which we try to fit right parameters to find a configuration that best explains the data.
However, what method should we apply to determine the right parameters? Intuitively, our task is to fit parameters that ensure our hypothesis h(x) for each house is close to y, which is a real-world price. For that purpose, we have to define a cost function that evaluates the difference between a predicted value and an actual value.
Equation#1 Cost Function for Linear Regression
The right-most part of this equation is a version of a popular least-squares method that calculates a squared difference between the training value y and the value predicted by our hypothesis function hq(x).
Then, our task is to minimize the cost function so that the predicted error is small as possible. One of the most popular solutions to this problem is the gradient descent algorithm based on the mathematic properties of the gradient. Gradient is a vector- valued function that points in the direction of the greatest rate of increase of our function. In the case of a
multi-variate function, its gradient is the vector whose components are partial derivatives of f.
Since gradient is a vector that points in the direction of the function’s growth, it may be used to find parameters that minimize our function. To achieve this, we simply need to move in the backward direction. The technique of the gradual movement down the function to find a local or global minimum is known as a gradient descent and demonstrated in the image above.
To implement gradient descent for our linear regression, we need to start with random parameters Q, and then repeatedly update them in the direction opposite to the gradient vector until convergence with the global minimum (our linear function guarantees that such global minimum actually exists). The gradient descent procedure is defined by the update algorithm
Equation#2 Gradient Descent Algorithm
where a is the learning rate at which we set our algorithm to learn. The learning rate should not be too large, so that we don’t jump over the global minimum, and should not be too small because then the process would take much time.
A partial derivative in the right-most part of the equation is calculated for each parameter to construct a gradient vector. It is derived in the following way:
Equation#3 Partial Derivative of Gradient Descent
Putting partial derivatives and learning rate together produces the final update rule
Equation #4 Gradient Descent Update Rule
that should be repeated until our algorithm finds the global minimum. That would be the point at which our parameters (Q) produce the function that best explains the training data. This learned function now may be used to predict house prices for houses not included in the training sample and employed in the inference of various relationships between features of our model. This functionality makes the linear regression with gradient descent a powerful technique both in data mining and machine learning.
Classification with Logistic Regression
Classification is a process of determining a class/category to which the object belongs. Classification techniques implemented via machine learning algorithms have numerous applications ranging from email spam filtering to medical diagnostics and recommender systems. Similarly to linear regression, in a classification problem, we work with a labeled training set that includes some features. However, observations in the data set map not to the quantitative value as in linear regression, but to a categorical value (e.g class). For example, patients’ medical records may determine two classes of patients: those with benign and malignant cancers. The task of the classification algorithm is then to learn a function that best predicts what type of cancer (malignant vs. benign) a patient has. If there are only two classes, the problem is known as a binary classification. In contrast, multi-class classification may be used when we have more classes of data.
One of the most common classification techniques in data science and ML is the logistic regression. Logistic regression is based on the sigmoid function that has an interesting property: it maps any real number to the (0,1) interval. As a result, it may be effectively used to evaluate the probability (between 0 and 1) than an observation falls within a certain category. For example, if we define benign cancer as 0 and malignant cancer as 1, a logistic value of .6 would mean that there is a 60% chance that a patient’s cancer is malignant. These properties make sigmoid function useful for binary classification, but multi-class classification is also possible.
Image#3 Sigmoid Function
The formula for the sigmoid function is:
Equation#5 Sigmoid Function
where e is the constant e (2,71828) – a base of the natural logarithm with a property that its natural logarithm is equal to 1.
To build a working classification model, we should put our hypothesis into the sigmoid function. Remember that our hypothesis function has a form of hQ(x) = Q0 + Q1x1+ Q2x2 For convenience, it may be written in the vector form z = Qt x, where superscript t refers to the matrix transpose of the vector parameters Q. As a result of this transformation, we get the following function
Equation#6 Logistic regression model
where z refers to the vector representation of our initial hypothesis hq(x).
In order to fit parameters Q to our logistic regression model, we should first redefine it in probabilistic terms. This is also needed to leverage the power of sigmoid function as a classifier.
Equation# 7 Probabilistic Interpretation of Classification
The above definition stems from the basic rule that probability always adds up to 1. So, for example, if the probability of the malignant cancer is 0.7, the probability of benign cancer is automatically 1- 0.7 = 0.3. The equation above formalizes this obvious observation.
Now, as we defined the hypothesis and probabilistic assumptions, it’s time to construct a cost function in the same way we did for linear regression. For that purpose, we need to transform our original sigmoid function, because it’s a complex non-convex function that can produce many local minima. If used with the least-squares method similar to the linear regression model below, a sigmoid function will have a hard time to converge.
Equation #8 Linear Regression Cost Function
Instead, to represent the intuition behind sigmoid function, we may use log-probability applied to the above-mentioned probabilistic definition of the classification problem.
Equation#9 Logistic Regression Cost Function
The graph below illustrates that log function assigns a high cost if our hypothesis is wrong, and no cost if the hypothesis is right. If y=1 and hq(x) = 1, then the cost ? 0. In contrast, if y=1 and hqx is 0, then the cost goes to infinity. The opposite happens if y=0.
Image #4 - log(x) Function
We can make a simplified version of the cost function by merging these two cases together. The final cost function ready for use with logistic regression has the following form:
Equation #10 Logistic Regression Cost Function (Simplified)
Now, when the cost function is formulated, we can easily use the gradient descent identical to the one applied in linear regression.
Equation # 11 Logistic Regression Update Rule
A similar technique may be applied to the multi-class classification problem with more than two classes. In general, multi-class problems use a one-vs-all approach in which we choose one class and then lump all others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
Image #5 One-vs-all or Multiclass Classification
As we have seen, clustering is an unsupervised method that is useful when the data is not labeled or if there are no response values (y). Clustering observations of a data set involves partitioning them into distinct groups so that observations within each group are quite similar to each other, while observations in the different groups have less in common. To illustrate this method, let’s take an example from marketing. Assume that we have a big volume of data about consumers. This data may involve median household income, occupation, distance from the nearest urban area, and so forth. This information may be then used for market segmentation. Our task is to identify various groups of customers without the prior knowledge of commonalities that may exist among them. Such segmentation may be then used for tailoring marketing campaigns that target specific clusters of consumers.
There are many different clustering techniques to do this, but the most popular are k-mean clustering algorithm and hierarchical clustering. In this section, we are going to describe k-means method, a very efficient algorithm that covers a wide range of use cases. In the k-means clustering, we want to partition observations into a pre-specified number of clusters. Although setting a number of clusters before clustering is considered to be a limitation of the k-means algorithm, it is still a very powerful technique. In our clustering problem, we are given a training set of xi,...,x(m) consumers with individual features xj.. Features are vectors of variables that describe various properties of consumers, such as median income, age, gender, and so forth. The rule is that each observation (consumer) should belong to exactly one cluster and no observations should belong to more than one cluster.
Image #6 Illustration of the Clustering Process
The idea behind the k-means clustering is that a good cluster is the one for which the within-cluster variation or within-cluster sum of squares (WCSS) (variance) is minimal. In other words, consumers in the same cluster should have more in common with each other than with consumers from other clusters. To achieve this configuration, our task is to algorithmically minimize WCSS for all pre-specified clusters. This task is implemented in the following equation:
Equation #12 Within-cluster Sum of Squares (WCSS)
where |Ck | denotes the number of observations in the kth cluster
In words, the equation above says that we want to partition the observations into K clusters so that the total within-cluster variation, summed over all K clusters, is as small as possible. The within- cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in this cluster, divided by the total number of observations in the kth cluster. As in the case with linear regression, to minimize this function we should start from some initial guess. Our task is to find cluster centroids (average position of all points in the space) or, means for each cluster.
This may be achieved via three-step algorithm in which we:
1.Randomly initialize K cluster centroids – m1, m2 … mk
2.We assign each observation in the data set to the cluster that yields the least WCSS. Intuitively, the least WCSS is the ‘nearest’ mean (centroid). To find the ‘nearest’ centroid we should calculate the Euclidean distances between observations and centroids and select the one with the smallest distance.
Equation #13 Assigning observation to the closest centroid
3.The next step is moving to a new centroid m by computing the centroid for each K cluster. The centroid of the kth cluster may be defined as a vector of the p feature means for the observations in the kth cluster. So, for example, if we have x1, x2, x3 and they belong to the same cluster (c2), then the centroid m2 is defined by the average:
Equation #14 Centroid Calculation Example
Since arithmetic mean is a good least-squares estimator, this step also minimizes the within-cluster sum of squares. This means that as the algorithm runs, the clustering obtained will continually improve until the result no longer changes. When this happens, the local optimum has been reached and clusters become stable.
Data mining models and methods described in this paper, allow data scientists to perform a wide array of tasks, including inference, prediction, and analysis. Linear regression is powerful in the
prediction of trends and inference of relationships between features. In its turn, logistic regression may be used in the automatic classification of behaviors, processes, and objects, which makes it useful in business analytics and anomaly detection. Finally, clustering allows to make insights about unlabeled data and infer hidden relationships that may drive effective business decisions and strategic choices.