#### Data Mining: Models and Methods

##### What is Data Mining?

Data mining refers to the discovery and extraction of patterns and knowledge from large sets of structured and unstructured data. Data mining techniques have been around for many decades; however, recent advances in Machine Learning (ML), computer performance, and numerical computation have made data mining methods easier to apply to large data sets and to business-centric tasks.

The growing popularity of data mining in business analytics and marketing is also due to the proliferation of Big Data and Cloud Computing. Large distributed databases and methods for parallel data processing, such as MapReduce, make huge volumes of data manageable and useful for companies and academia. Similarly, the cost of storing and managing data is reduced by cloud service providers (CSPs), who offer a pay-as-you-go model for access to virtualised servers, storage capacity (disc drives), GPUs (Graphics Processing Units), and distributed databases. As a result, companies can store, process, and analyze more data and gain better business insights.

State-of-the-art data mining methods are powerful in many classes of tasks, including anomaly detection, clustering, classification, association rule learning, regression, and summarization. Each of these tasks plays an important role in a wide range of settings. For example, anomaly detection techniques help companies protect against network intrusions and data breaches. Regression models, in turn, are useful for predicting business trends, revenues, and expenses. Clustering techniques are most useful for grouping huge volumes of data into cohesive groups that reveal patterns and dependencies both within and among them, without prior knowledge of any laws that govern the observations. As these examples illustrate, data mining has the power to put data into the service of businesses and entire communities.
##### Data Mining Models

There are numerous ways to organize and analyze data. Which approach to select depends largely on our purpose (e.g. prediction or the inference of relationships) and on the form of the data (structured vs. unstructured). A particular configuration of data might be good for one task but not so good for another. Thus, to make data usable, one should be aware of the theoretical models and approaches used in data mining and understand the possible trade-offs and pitfalls of each.

###### Parametric and Non-Parametric Models

One way of looking at a data mining model is to ask whether it has parameters. In these terms, we have a choice between parametric and non-parametric models. In a parametric model, we select a function that, in our view, best fits the training data. For instance, we may choose a linear function of the form $F(X) = q_0 + q_1 x_1 + q_2 x_2 + \dots + q_p x_p$, in which the $x$'s are features of the input data (e.g. house size, floor, number of rooms) and the $q$'s are unknown parameters of the model. These parameters may be thought of as weights that determine the contribution of each feature to the value of the function $Y$ (e.g. house price). The task of a parametric model is then to estimate the parameters $q$ using a statistical method such as linear regression or logistic regression.

The main advantage of parametric models is that they encode intuition about relationships among the features in our data. This makes parametric models an excellent heuristic, inference, and prediction tool. At the same time, parametric models have several pitfalls. If the function we have selected is too simple, it may fail to explain patterns in complex data. This problem, known as underfitting, is frequent when linear functions are used with non-linear data.
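To make the underfitting problem concrete, here is a minimal sketch, assuming synthetic data generated from a quadratic rule. The data, the noise level, and the helper name `fit_and_score` are illustrative and not taken from the article. The sketch fits a straight line and a quadratic to the same points with NumPy's `polyfit` and compares the mean squared error of the two fits.

```python
import numpy as np

def fit_and_score(x, y, degree):
    """Fit a polynomial of the given degree by least squares; return (coeffs, mse)."""
    coeffs = np.polyfit(x, y, degree)
    predictions = np.polyval(coeffs, x)
    mse = float(np.mean((predictions - y) ** 2))
    return coeffs, mse

# Hypothetical non-linear data: y depends on x squared, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

_, linear_mse = fit_and_score(x, y, degree=1)      # too simple for this data
_, quadratic_mse = fit_and_score(x, y, degree=2)   # matches the generating rule

print(f"linear fit MSE:    {linear_mse:.3f}")
print(f"quadratic fit MSE: {quadratic_mse:.3f}")
```

The straight line cannot follow the curvature, so its error stays large no matter how its two parameters are chosen, while the quadratic's error is close to the noise level.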
On the other hand, if our function is too complex (e.g. with high-order polynomial terms), it may overfit: the model responds to the noise in the data rather than to the actual patterns and does not generalize to new examples.

Figure #1: Examples of normal, underfit, and overfit models.

Non-parametric models avoid the risk of choosing the wrong functional form because they make no assumptions about the underlying form of the function. Therefore, non-parametric models are good at dealing with unstructured data. On the other hand, since non-parametric models do not reduce the problem to the estimation of a small number of parameters, they require very large data sets to obtain a precise estimate of the function.

###### Restrictive vs. Flexible Methods

Data mining and ML models may also differ in terms of flexibility. Generally speaking, parametric models such as linear regression are considered highly restrictive, because they need structured data and actual responses (Y) to work. This very feature, however, makes them suitable for inference, that is, for finding relationships between features (e.g. how the crime rate in a neighborhood affects house prices). Because of this, restrictive models are interpretable and clear. The same is not true of flexible models (e.g. non-parametric models): because they make no assumptions about the form of the function that governs the observations, they are less interpretable. In many settings, however, the lack of interpretability is not a concern. For example, when our only interest is predicting stock prices, the interpretability of the model matters little.

###### Supervised vs. Unsupervised Learning

Nowadays, we hear a lot about supervised and unsupervised Machine Learning. Neural networks based on these concepts are making daily progress in image recognition, speech recognition, and autonomous driving.
A natural question, though, is: what is the difference between the unsupervised and supervised learning approaches? The main difference lies in the form of the data used and the techniques used to analyze it. In a supervised learning setting, we use labeled data that consists of features (independent variables) and a dependent variable (Y, or the response). This data is fed to a learning algorithm that searches for patterns and for a function that describes the relationship between the independent and dependent variables. The learned function may then be applied to predict future observations.

In unsupervised learning, we also observe a vector of features (e.g. house size, floor). The difference from supervised learning is that we don't have any associated responses (Y). In this case, we cannot apply a linear regression model, since there are no response values to predict. Thus, in an unsupervised setting, we are in some sense working blind.

##### Data Mining Methods

In this section, we describe the technical details of several data mining methods: linear regression, classification, and clustering. These methods are among the most popular in data mining because they solve a wide variety of tasks, including inference and prediction. They also illustrate the key features of the data mining models described above. For example, linear regression and classification (logistic regression) are parametric, supervised, and restrictive methods, whereas clustering (k-means) belongs to the non-parametric, unsupervised methods.

###### Linear Regression for Machine Learning

Linear regression is a method of finding a linear function that reasonably approximates the relationship between the features of the data points and a dependent variable. In other words, it finds an optimized function to represent and explain the data.
Contemporary advances in processing power and computational methods make it possible to combine linear regression with ML algorithms for quick and efficient function optimization. In this section, we describe an implementation of linear regression with gradient descent that fits data to a linear function algorithmically.

Image #1: Linear regression

For this task, let's take the case of house price prediction. Assume we have a training set of 100 houses ($m = 100$). The houses in this sample are denoted $x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(m)}$. Each house has a set of features or properties, such as house size and floor. Features may be thought of as variables that determine a house price. So, for example, the variable $x_1^{(1)}$ would refer to the size of the first house in the training sample. Finally, our training sample has a list of prices for each house, denoted $y^{(1)}, y^{(2)}, \dots, y^{(m)}$. This data tells us much by itself (e.g. we may apply methods of descriptive statistics to interpret it); however, in order to run a linear regression, we should first formulate the initial hypothesis. Our hypothesis may be defined as a simple linear function with three parameters ($Q$):

$$h_Q(x) = Q_0 + Q_1 x_1 + Q_2 x_2$$

where $x_1$ and $x_2$ are the features (house size and floor) and the $Q$'s are the parameters of the function that we want to estimate with linear regression. In essence, this hypothesis says that a house price is determined by house size and floor, weighted by the parameters $Q$. This confirms that linear regression is a parametric model, in which we try to find the parameter values that best explain the data.

Image #2: Gradient descent

However, what method should we apply to determine the right parameters? Intuitively, our task is to choose parameters such that our hypothesis $h_Q(x)$ for each house is close to $y$, the real-world price.
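As a small illustration of what "close to $y$" means, the sketch below evaluates the hypothesis for a guessed set of parameters and prints the gap between predicted and actual prices. The three houses, their prices, and the parameter guess are all hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical training data: each row is one house as (size in m^2, floor).
features = np.array([[50.0, 1.0],
                     [80.0, 3.0],
                     [120.0, 2.0]])
prices = np.array([150.0, 240.0, 350.0])  # hypothetical prices, in thousands

def hypothesis(q, x):
    """h_Q(x) = Q0 + Q1*x1 + Q2*x2, evaluated for every row of x."""
    return q[0] + x @ q[1:]

# An arbitrary guess for (Q0, Q1, Q2); these are not fitted values.
q_guess = np.array([10.0, 2.8, 1.0])

predicted = hypothesis(q_guess, features)
residuals = predicted - prices   # how far each prediction is from the real price

print(predicted)
print(residuals)
```

Fitting the model amounts to adjusting `q_guess` until these residuals are, on the whole, as small as possible.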
For that purpose, we have to define a cost function that evaluates the difference between a predicted value and an actual value.

Equation #1: Cost function for linear regression

The right-most part of this equation is a version of the popular least-squares method: it calculates the squared difference between the training value $y$ and the value predicted by our hypothesis function $h_Q(x)$. Our task is then to minimize the cost function so that the prediction error is as small as possible. One of the most popular solutions to this problem is the gradient descent algorithm, which is based on the mathematical properties of the gradient. The gradient is a vector-valued function that points in the direction of the greatest rate of increase of a function. For a multivariate function $f$, the gradient is the vector whose components are the partial derivatives of $f$. Since the gradient points in the direction of the function's growth, it may be used to find parameters that minimize the function: we simply need to move in the opposite direction. The technique of gradually moving down the function to find a local or global minimum is known as gradient descent and is shown in the image above.

To implement gradient descent for our linear regression, we start with random parameters $Q$ and then repeatedly update them in the direction opposite to the gradient vector until they converge to the global minimum (the least-squares cost of a linear hypothesis is convex, so any minimum we reach is a global one). The gradient descent procedure is defined by the update algorithm

Equation #2: Gradient descent algorithm

where $\alpha$ is the learning rate. The learning rate should not be too large, so that we don't jump over the global minimum, and should not be too small, because then the process would take too long. The partial derivative in the right-most part of the equation is calculated for each parameter to construct the gradient vector.
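The procedure described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the article's own implementation. It assumes the commonly used cost $J(Q) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_Q(x^{(i)}) - y^{(i)}\big)^2$, whose gradient is $\frac{1}{m}X^T(XQ - y)$. The synthetic data, the learning rate, the iteration count, and the feature standardization step are also choices made for the example.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=2000, seed=0):
    """Batch gradient descent for linear regression.

    Assumes the cost J(Q) = 1/(2m) * sum((X @ Q - y)**2), whose gradient
    is (1/m) * X.T @ (X @ Q - y). X must already contain a bias column of ones.
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    q = rng.normal(size=n)                 # random starting parameters
    for _ in range(n_iterations):
        errors = X @ q - y                 # h_Q(x) - y for every example
        gradient = (X.T @ errors) / m      # vector of partial derivatives
        q = q - learning_rate * gradient   # step against the gradient
    return q

def cost(X, y, q):
    """Least-squares cost under the same 1/(2m) convention."""
    return float(np.sum((X @ q - y) ** 2) / (2 * len(y)))

# Hypothetical data generated from known parameters so the result can be checked.
rng = np.random.default_rng(1)
m = 100
size = rng.uniform(40.0, 200.0, size=m)
floor = rng.integers(1, 10, size=m).astype(float)
raw = np.column_stack([size, floor])

# Standardize features so that one learning rate suits both of them.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.column_stack([np.ones(m), scaled])   # bias column x0 = 1

true_q = np.array([300.0, 80.0, 5.0])
y = X @ true_q                               # noise-free prices

q_fit = gradient_descent(X, y)
print(np.round(q_fit, 3))
print(cost(X, y, q_fit))
```

Because the example prices are generated without noise, the fitted parameters should recover `true_q`. Without the standardization step, a feature measured in large units (such as house size) would force a much smaller learning rate.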
The partial derivative is derived in the following way:

Equation #3: Partial derivative of gradient descent

Putting the partial derivatives and the learning rate together produces the final update rule

Equation #4: Gradient descent update rule

which should be repeated until our algorithm finds the global minimum. That is the point at which our parameters ($Q$) produce the function that best explains the training data. The learned function may now be used to predict prices for houses not included in the training sample and to infer relationships between the features of our model. This makes linear regression with gradient descent a powerful technique in both data mining and machine learning.

###### Classification with Logistic Regression

Classification is the process of determining the class or category to which an object belongs. Classification techniques implemented via machine learning algorithms have numerous applications, ranging from email spam filtering to medical diagnostics and recommender systems. As in linear regression, in a classification problem we work with a labeled training set that includes some features. However, observations in the data set map not to a quantitative value, as in linear regression, but to a categorical value (e.g. a class). For example, patients' medical records may determine two classes of patients: those with benign tumors and those with malignant tumors. The task of the classification algorithm is then to learn a function that best predicts which type (malignant vs. benign) a patient has. If there are only two classes, the problem is known as binary classification. In contrast, multi-class classification is used when we have more than two classes.

One of the most common classification techniques in data science and ML is logistic regression. Logistic regression is based on the sigmoid function, which has an interesting property: it maps any real number to the (0,1) interval.
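The mapping to the (0,1) interval is easy to check numerically. The sketch below uses the standard definition of the sigmoid, $g(z) = 1/(1 + e^{-z})$; the helper name `sigmoid` and the sample inputs are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# A few sample inputs spanning negative, zero, and positive values.
values = sigmoid([-10.0, -1.0, 0.0, 1.0, 10.0])
print(values)
```

Every output lies strictly between 0 and 1, the outputs increase with the input, and an input of 0 maps to exactly 0.5.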
As a result, the sigmoid function may be used to evaluate the probability (between 0 and 1) that an observation falls within a certain category. For example, if we define a benign tumor as 0 and a malignant tumor as 1, a logistic value of 0.6 would mean that there is a 60% chance that a patient's tumor is malignant. These properties make the sigmoid function useful for binary classification, but multi-class classification is also possible.

Image #3: Sigmoid function

The formula for the sigmoid function is:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Equation #5: Sigmoid function

where $e \approx 2.71828$ is the base of the natural logarithm (so that $\ln e = 1$). To build a working classification model, we should put our hypothesis into the sigmoid function. Remember that our hypothesis function has the form $h_Q(x) = Q_0 + Q_1 x_1 + Q_2 x_2$. For convenience, it may be written in the vector form $z = Q^T x$, where the superscript $T$ denotes the transpose of the parameter vector $Q$. As a result of this transformation, we get the following function:

$$h_Q(x) = g(Q^T x) = \frac{1}{1 + e^{-Q^T x}}$$

Equation #6: Logistic regression model

where $z = Q^T x$ is the vector representation of our initial hypothesis. In order to fit the parameters $Q$ of our logistic regression model, we should first restate it in probabilistic terms. This is also needed to use the sigmoid function as a classifier.

$$P(y = 1 \mid x; Q) = h_Q(x), \qquad P(y = 0 \mid x; Q) = 1 - h_Q(x)$$

Equation #7: Probabilistic interpretation of classification

The above definition stems from the basic rule that probabilities add up to 1. So, for example, if the probability of a malignant tumor is 0.7, the probability of a benign tumor is automatically 1 - 0.7 = 0.3. The equation above formalizes this observation.

Now that we have defined the hypothesis and the probabilistic assumptions, it's time to construct a cost function in the same way as we did for linear regression. We cannot simply reuse the earlier cost function, because combining the sigmoid hypothesis with a least-squares cost yields a complex non-convex function that can have many local minima.
If the sigmoid hypothesis is used with a least-squares cost like the one from the linear regression model (below), gradient descent will have a hard time converging.

Equation #8: Linear regression cost function

Instead, to capture the intuition behind the sigmoid function, we may use the log-probability applied to the probabilistic definition of the classification problem given above.

$$\mathrm{Cost}\big(h_Q(x), y\big) = \begin{cases} -\log\big(h_Q(x)\big) & \text{if } y = 1 \\ -\log\big(1 - h_Q(x)\big) & \text{if } y = 0 \end{cases}$$

Equation #9: Logistic regression cost function

The graph below illustrates that the log function assigns a high cost if our hypothesis is wrong and no cost if it is right. If $y = 1$ and $h_Q(x) = 1$, then the cost is 0. In contrast, if $y = 1$ and $h_Q(x) \to 0$, then the cost goes to infinity. The opposite happens if $y = 0$.

Image #4: The -log(x) function

We can make a simplified version of the cost function by merging these two cases. The final cost function, ready for use with logistic regression, has the following form:

Equation #10: Logistic regression cost function (simplified)

Now that the cost function is formulated, we can use gradient descent in the same way as in linear regression.

Equation #11: Logistic regression update rule

A similar technique may be applied to the multi-class classification problem with more than two classes. In general, multi-class problems use a one-vs-all approach, in which we choose one class and lump all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

Image #5: One-vs-all (multi-class) classification

###### Clustering Methods

As we have seen, clustering is an unsupervised method that is useful when the data is not labeled, that is, when there are no response values ($y$). Clustering the observations of a data set means partitioning them into distinct groups so that observations within each group are quite similar to each other, while observations in different groups have less in common.
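Before going further into clustering, here is a minimal Python sketch of the logistic regression training described above. It assumes the averaged log-loss $J(Q) = -\frac{1}{m}\sum_i \big[y^{(i)}\log h_Q(x^{(i)}) + (1 - y^{(i)})\log(1 - h_Q(x^{(i)}))\big]$ and the matching update $Q := Q - \alpha\,\frac{1}{m}X^T\big(g(XQ) - y\big)$. The $1/m$ scaling, the toy data, the learning rate, the iteration count, and the clipping constant `eps` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, q, eps=1e-12):
    """Average log-loss: -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = np.clip(sigmoid(X @ q), eps, 1.0 - eps)   # keep log() away from 0
    return float(-np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h)))

def fit_logistic(X, y, learning_rate=0.1, n_iterations=5000):
    """Gradient descent on the log-loss, starting from all-zero parameters."""
    q = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(n_iterations):
        h = sigmoid(X @ q)                        # predicted P(y = 1 | x)
        q = q - learning_rate * (X.T @ (h - y)) / m
    return q

# Hypothetical one-feature data: label 1 tends to occur for larger x.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])         # bias column plus feature

q_fit = fit_logistic(X, y)
cost_start = logistic_cost(X, y, np.zeros(2))     # cost before training
cost_fit = logistic_cost(X, y, q_fit)             # cost after training
predictions = (sigmoid(X @ q_fit) >= 0.5).astype(float)

print(cost_start, cost_fit)
print(predictions)
```

With all-zero parameters every prediction is 0.5, so the starting cost equals $\log 2$; training lowers it. For a one-vs-all classifier, the same `fit_logistic` routine would be run once per class with that class relabeled as 1 and all others as 0.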
To illustrate clustering, let's take an example from marketing. Assume that we have a large volume of data about consumers. This data may include median household income, occupation, distance from the nearest urban area, and so forth. This information may then be used for market segmentation. Our task is to identify various groups of customers without prior knowledge of the commonalities that may exist among them. Such segmentation may then be used to tailor marketing campaigns that target specific clusters of consumers.

There are many different clustering techniques, but the most popular are the k-means clustering algorithm and hierarchical clustering. In this section, we describe the k-means method, a very efficient algorithm that covers a wide range of use cases. In k-means clustering, we want to partition the observations into a pre-specified number of clusters. Although having to set the number of clusters in advance is considered a limitation of the k-means algorithm, it is still a very powerful technique.

In our clustering problem, we are given a training set of consumers $x^{(1)}, \dots, x^{(m)}$ with individual features $x_j$. Each observation is a vector of variables that describe various properties of a consumer, such as median income, age, gender, and so forth. The rule is that each observation (consumer) should belong to exactly one cluster.

Image #6: Illustration of the clustering process

The idea behind k-means clustering is that a good cluster is one for which the within-cluster variation, or within-cluster sum of squares (WCSS), is minimal. In other words, consumers in the same cluster should have more in common with each other than with consumers from other clusters. To achieve this configuration, our task is to algorithmically minimize the WCSS over all pre-specified clusters.
This task is expressed in the following equation:

$$\min_{C_1, \dots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \left( x_{ij} - x_{i'j} \right)^2 \right\}$$

Equation #12: Within-cluster sum of squares (WCSS)

where $|C_k|$ denotes the number of observations in the $k$th cluster, $x_{ij}$ is the value of feature $j$ for observation $i$, and $p$ is the number of features. In words, the equation above says that we want to partition the observations into $K$ clusters so that the total within-cluster variation, summed over all $K$ clusters, is as small as possible. The within-cluster variation for the $k$th cluster is the sum of all pairwise squared Euclidean distances between the observations in this cluster, divided by the total number of observations in the $k$th cluster.

As in the case of linear regression, to minimize this function we should start from some initial guess. Our task is to find the cluster centroids (the average position of all points in the cluster), or means, for each cluster. This may be achieved via a three-step algorithm:

1. Randomly initialize $K$ cluster centroids $m_1, m_2, \dots, m_K$.
2. Assign each observation in the data set to the cluster that yields the least WCSS. Intuitively, the least WCSS corresponds to the 'nearest' mean (centroid). To find the nearest centroid, we calculate the Euclidean distances between the observation and each centroid and select the centroid with the smallest distance.

   $$c^{(i)} = \arg\min_{k} \left\lVert x^{(i)} - m_k \right\rVert^2$$

   Equation #13: Assigning an observation to the closest centroid
3. Move each centroid by recomputing it for each of the $K$ clusters. The centroid of the $k$th cluster is the vector of the $p$ feature means for the observations in the $k$th cluster. So, for example, if $x^{(1)}$, $x^{(2)}$, and $x^{(3)}$ belong to the same cluster ($c_2$), then the centroid $m_2$ is their average:

   $$m_2 = \frac{1}{3}\left( x^{(1)} + x^{(2)} + x^{(3)} \right)$$

   Equation #14: Centroid calculation example

Steps 2 and 3 are repeated until the assignments stop changing. Since the arithmetic mean is a least-squares estimator, step 3 also minimizes the within-cluster sum of squares. This means that as the algorithm runs, the clustering obtained will continually improve until the result no longer changes. When this happens, a local optimum has been reached and the clusters are stable.
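The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: the two synthetic consumer groups, the function names `k_means` and `within_cluster_ss`, the iteration cap, and the choice to initialize centroids from randomly selected observations are all assumptions made for the example. The helper reports the sum of squared distances to each centroid, which is exactly half of the pairwise quantity defined in the text and therefore has the same minimizer.

```python
import numpy as np

def k_means(points, k, n_iterations=100, seed=0):
    """Plain k-means: random initial centroids, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Step 2: assign each observation to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned observations
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                      # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each observation to its own centroid."""
    return float(np.sum((points - centroids[labels]) ** 2))

# Hypothetical consumers described by two features, drawn from two distinct groups.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(30, 2))
consumers = np.vstack([group_a, group_b])

labels, centroids = k_means(consumers, k=2)
wcss = within_cluster_ss(consumers, labels, centroids)
print(centroids)
print(wcss)
```

Because k-means only reaches a local optimum, it is common practice to run it from several random initializations and keep the result with the lowest WCSS.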
##### Conclusion

The data mining models and methods described in this paper allow data scientists to perform a wide array of tasks, including inference, prediction, and analysis. Linear regression is powerful for predicting trends and inferring relationships between features. Logistic regression, in turn, may be used for the automatic classification of behaviors, processes, and objects, which makes it useful in business analytics and anomaly detection. Finally, clustering makes it possible to gain insights from unlabeled data and to infer hidden relationships that may drive effective business decisions and strategic choices.

Hi, did you have a look at this paper to integrate some models/techniques? http://www.sciencedirect.com/science/article/pii/S0957417412003077

#### PUBLIC SECTOR CLOUD CONFERENCE 6th December 2017

Location: University of Salford

Funded places are being offered to members of the Data Science Foundation; contact chris@datascience.foundation for further information on funded places.

The Public Sector Cloud conference will take place at the University of Salford on 6th December. The event will bring together leading experts in IoT and digital infrastructure. Speakers will share their views on the development of cloud-based systems within government and education. The constant pressure on councils to move to a 'Cloud First' initiative is a broad concern; however, because integrated systems and legacy operations have been in place for so long, the transition will take time and money to achieve. With the added cost, cloud transition seems some way off, as budget pressures are an overwhelming concern, made worse by constant media scrutiny.

On the day, we will present case studies of success stories from local councils and public institutions. The day as a whole is designed to make delegates aware of the obstacles and challenges ahead and to help them see a clear path to updating their systems gradually and progressively.

A sample of speakers and themes:

- Liam Maxwell, UK National Technology Adviser, HM Government
  - Promoting and supporting digital industry in the UK and internationally
  - The future of public sector services and migrating to cloud-based services
- James Stewart, former Director of Technical Architecture, UK Government Digital Service
  - Why cloud remains a key part of any "Digital by Default" agenda
  - New cloud developments: 4 years on
- Christopher Wroath, Director of Digital, NHS Education for Scotland
  - Cloud as part of the new Health & Social Care Delivery plan
  - Cloud first: NES Journey

Event page: http://www.salford.ac.uk/onecpd/courses/public-sector-cloud

Funded places are limited; if you are interested in attending, please get in contact as soon as possible.

#### Introduction to Artificial Neural Networks (ANNs)

White Paper, 5 September 2017

##### Introduction

Machine Learning (ML) is a subfield of computer science that stands behind the rapid development of Artificial Intelligence (AI) over the past decade. Machine Learning studies algorithms that allow machines to recognize patterns, construct prediction models, or generate images or videos through learning. ML algorithms can be implemented using a wide variety of methods, such as clustering, linear regression, decision trees, and more.

In this paper, we discuss the design of Artificial Neural Networks (ANNs), an ML architecture that has gathered powerful momentum in recent years as one of the most efficient and fast learning methods for solving complex problems in computer vision, speech recognition, NLP (Natural Language Processing), and image, audio, and video generation. Thanks to their efficient multilayer design, which models the biological structure of the human brain, ANNs have firmly established themselves as the state-of-the-art technology that drives the AI revolution. In what follows, we describe the architecture of a simple ANN and offer a useful intuition of how it may be used to solve complex nonlinear problems efficiently.

##### What is an Artificial Neural Network?

An Artificial Neural Network is an ML algorithm inspired by biological computational models of the brain and by biological neural networks. In a nutshell, an ANN is a computational representation of the human neural network that regulates human intelligence, reasoning, and memory. However, why should we necessarily emulate the human brain to develop efficient ML algorithms? The main rationale behind using ANNs is that neural networks are efficient at complex computations and at the hierarchical representation of knowledge.
Neurons connected by axons and dendrites into complex neural networks can pass and exchange information, store intermediate computation results, produce abstractions, and divide the learning process into multiple steps. A computational model of such a system can thus produce very efficient learning processes similar to the biological ones.

The perceptron algorithm, invented in 1957 by Frank Rosenblatt, was the first attempt to create a computational model of a biological neural network. However, complex neural networks with multiple layers, nodes, and neurons became possible only recently, thanks to the dramatic increase in computing power (Moore's Law), more efficient GPUs (Graphics Processing Units), and the proliferation of Big Data used for training ML models. In the 2000s and 2010s, these developments gave rise to Deep Learning (DL), a modern approach to the design of ANNs based on a deep cascade of multiple layers that extract features from data, transform them, and build hierarchical representations of knowledge.

Thanks to their ability to simulate complex nonlinear processes and to create hierarchical and abstract representations of data, ANNs stand behind recent breakthroughs in image recognition and computer vision, NLP (Natural Language Processing), generative models, and various other ML applications that seek to retrieve complex patterns from data. Neural networks are especially useful for learning nonlinear hypotheses with many features (e.g. n = 100). Constructing an accurate hypothesis for such a large feature space would require many high-order polynomial terms, which would inevitably lead to overfitting, a scenario in which the model describes the random noise in the data rather than the underlying relationships and patterns.

Image #1: Overfitting problem

The problem of overfitting is especially tangible in image recognition problems, where each pixel may represent a feature.
For example, when working with 50 × 50 pixel images, we have 2,500 pixels and therefore 2,500 features, which would make manual construction of the hypothesis almost impossible.

##### A Simple Neural Network with a Single Neuron

The simplest possible neural network consists of a single "neuron" (see the diagram below). Using a biological analogy, this neuron is a computational unit that takes inputs via dendrites as electrical signals (let's say "spikes") and transmits them via axons to the next layer or to the network's output.

Image #2: A neural network with a single neuron

In the simple neural network depicted above, the dendrites are the input features ($x_1, x_2, \dots$) and the outputs (axons) represent the results of our hypothesis $h_{w,b}(x)$. Besides the input features, the input layer of a neural network normally has a 'bias unit', which is equal to 1. A bias unit is needed in order to use a constant term in the hypothesis function. In Machine Learning terms, the network depicted above has one input layer, one hidden layer (that consists of a single neuron), and one output layer.

The learning process of this network is implemented in the following way. The input layer takes the input features (e.g. pixels) for each training sample and feeds them to the activation function that computes the hypothesis in the hidden layer. The activation function is normally the logistic (sigmoid) function used in logistic regression for classification; however, other alternatives are also possible. In the case described above, our single neuron corresponds exactly to the input-output mapping that was defined by logistic regression.

Image #3: Logistic regression

As in the case of simple binary classification, our logistic regression has parameters. In ANN models they are often called "weights".

##### Multi-Layered Neural Network

To understand how neural networks work, we need to formalize the model and describe it in a real-world scenario. In the image below, we can see a multilayer network that consists of three layers and has several neurons.
Here, as in a single-neuron network, we have one input layer with three inputs ($x_1, x_2, x_3$) and an added bias unit (+1). The second layer of the network is a hidden layer consisting of three units (neurons) represented by activation functions. We call it a hidden layer because we don't observe the values computed in it. In fact, a neural network can contain multiple hidden layers that pass complex functions and computations from the "surface" layers to the "bottom" of the neural network. The design of a neural network with many hidden layers is frequently used in Deep Learning (DL), a popular approach in ML research that has gained powerful momentum in recent years.

Image #4: Multilayer perceptron

The hidden layer (Layer 2) above has three neurons ($a_1^{(2)}, a_2^{(2)}, a_3^{(2)}$). In abstract terms, $a_i^{(j)}$ denotes the activation of unit $i$ in layer $j$. In our case, $a_1^{(2)}$ is the activation of the first neuron of the second (hidden) layer. By activation, we mean the value that is computed by the activation function (e.g. the logistic function) in this layer and output by that node to the next layer. Finally, Layer 3 is an output layer that gets the results from the hidden layer and applies its own activation function to them. This layer computes the final value of our hypothesis. Afterwards, the cycle continues until the neural network comes up with the model and weights that best predict the values of the training data.

So far, we haven't defined how the weights work in the activation functions. For that reason, let's define $Q^{(j)}$ as the matrix of parameters (weights) that controls the function mapping from layer $j$ to layer $j + 1$. For example, $Q^{(1)}$ controls the mapping from the input layer to the hidden layer, whereas $Q^{(2)}$ controls the mapping from the hidden layer to the output layer. The dimensionality of a $Q$ matrix is defined by the following rule.
If our network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $Q^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$. The $+1$ comes from the bias unit $x_0$ and its weight $Q_0^{(j)}$, which are added on the input side. In other words, the output nodes do not include the bias unit, while the input nodes do. To illustrate how the dimensionality of the $Q$ matrix works, let's assume that we have two layers with 101 and 21 units respectively. Then, using our rule, $Q^{(j)}$ would be a $21 \times 102$ matrix with 21 rows and 102 columns.

Image #5: A neural network model

Let's put it all together. In the image above, we see our neural network with three layers again. What we need to do is to calculate the activation functions based on the input values, and then our main hypothesis function based on the set of calculations from the previous layer (the hidden layer). In this case, our neural network works as a cascade of calculations in which each layer supplies values to the activation functions of the next one. To calculate the activations, we first have to define the dimensionality of our $Q$ matrices. In this example, we have 3 input units and 3 hidden units, so $Q^{(1)}$, which maps from the input layer to the hidden layer, has dimension $3 \times 4$ because the bias unit is included. The activation of each hidden neuron (e.g. $a_1^{(2)}$) is equal to our sigmoid function applied to the linear combination of the inputs with weights taken from the weight matrix $Q^{(1)}$. In the diagram above, you can see that each activation unit is computed by the function $g$, which is our logistic function.

In turn, $Q^{(2)}$ is the matrix of weights that maps from the hidden layer to the output layer. These weights may be randomly assigned to the matrix before the neural network runs or be a product of previous computations. In our case, $Q^{(2)}$ is a $1 \times 4$ matrix (i.e. a row vector). To calculate the output, we apply our hypothesis function (the sigmoid function) to the results calculated by the activation functions in the hidden layer.
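The cascade described above can be sketched as a forward pass in NumPy. This is an illustrative sketch only: the weights are random placeholders rather than trained values, the input vector is made up, and the names `forward`, `q1`, and `q2` are chosen for the example. The array shapes follow the dimensions given in the text ($3 \times 4$ and $1 \times 4$).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, q1, q2):
    """Forward pass through a network with 3 inputs, 3 hidden units, 1 output."""
    a1 = np.concatenate(([1.0], x))              # input layer plus bias unit
    a2 = sigmoid(q1 @ a1)                        # hidden activations, shape (3,)
    a2_with_bias = np.concatenate(([1.0], a2))   # add bias unit to hidden layer
    output = sigmoid(q2 @ a2_with_bias)          # hypothesis value, shape (1,)
    return a2, output

rng = np.random.default_rng(0)
q1 = rng.normal(size=(3, 4))     # maps 3 inputs (+ bias) to 3 hidden units
q2 = rng.normal(size=(1, 4))     # maps 3 hidden units (+ bias) to 1 output
x = np.array([0.5, -1.2, 3.0])   # hypothetical input features

hidden, output = forward(x, q1, q2)
print(hidden.shape, output.shape)
print(output)
```

Each layer repeats the same two operations, a matrix product with the bias-augmented activations followed by the sigmoid, which is why adding more hidden layers only adds more of the same step.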
If we had several hidden layers, the results of the previous activation functions would be passed to the next hidden layer and then to the output layer. This sequential mechanism makes neural networks very powerful in computing nonlinear hypotheses and complex functions. Instead of trying to fit inputs to manually designed polynomial functions, we can create a neural network with numerous activation functions that exchange intermediary results and update weights. This automatic setup allows us to create nonlinear models that are more accurate in prediction and classification of our data.

Neural Networks in Action

The power of neural networks to compute complex nonlinear functions may be illustrated using the following binary classification example taken from the Coursera Machine Learning course by Professor Andrew Ng. Consider the case when x1 and x2 can each take one of two binary values (0 or 1). To put this binary classification problem in Boolean terms, our task is to compute y = x1 XNOR x2. XNOR is a logic gate that may be interpreted as NOT (x1 XOR x2). This is the same as saying that the function is true if x1 and x2 are both 0 or both 1. To make our network calculate XNOR, we first have to describe simple logical functions to be used as intermediary activations in the hidden layer. The first function we want to compute is the logical AND function: y = x1 AND x2.

Image #6 Logical AND function

As in the first example above, our AND function is a simple single-neuron network with inputs x1 and x2 and a bias unit (+1). The first thing we need to do is assign weights to the activation function and then compute it for the input values specified in the truth table below. These input values are all possible binary combinations that x1 and x2 can take. By plugging 0s and 1s into the function (i.e. the logistic function) we can compute our hypothesis: hq(x) = g(-30 + 20x1 + 20x2).
To understand how the values in the third column of the truth table are found, remember that the sigmoid function is approximately 0.01 at z = -4.6 and approximately 0.99 at z = 4.6, so for inputs such as -10 or 10 it is practically 0 or 1. As a result, we have:

| x1 | x2 | hq(x) |
| --- | --- | --- |
| 0 | 0 | g(-30) ≈ 0 |
| 0 | 1 | g(-10) ≈ 0 |
| 1 | 0 | g(-10) ≈ 0 |
| 1 | 1 | g(10) ≈ 1 |

As we can see now, the rightmost column matches the definition of the logical AND function, which is true only if both x1 and x2 are true. The second function we need for our neural network to work is the logical OR function. In the logical OR, y is true (1) if either x1 or x2, or both of them, are 1 (true).

Image #7 Logical OR function

As in the previous case with the logical AND, we assign weights that fit the definition of the logical OR function. Putting these weights into our logistic function g(-10 + 20x1 + 20x2), we get the following truth table:

| x1 | x2 | hq(x) |
| --- | --- | --- |
| 0 | 0 | g(-10) ≈ 0 |
| 0 | 1 | g(10) ≈ 1 |
| 1 | 0 | g(10) ≈ 1 |
| 1 | 1 | g(30) ≈ 1 |

As you see, our function is false (0) only if both x1 and x2 are false. In all other cases, it is true. This corresponds to the logical OR function. The last function we need to compute before running a network for finding x1 XNOR x2 is (NOT x1) AND (NOT x2). In essence, this function consists of two logical negations (NOT) combined with AND. A single negation NOT x1 may be presented in the following diagram. It says that y is true only if x1 is false. Therefore, the logical NOT has only one input unit (x1).

Image #8 Logical NOT

After putting the input with its weights into g(10 - 20x1), we end up with the following truth table.

| x1 | hq(x) |
| --- | --- |
| 0 | g(10) ≈ 1 |
| 1 | g(-10) ≈ 0 |

The output values of this table confirm that the NOT function outputs true only if x1 is false. Now, we can find the values of the logical (NOT x1) AND (NOT x2) function.

Image #9 Logical (NOT x1) AND (NOT x2)

Putting binary values of x1 and x2 into the function g(10 - 20x1 - 20x2), we end up with the following truth table.
| x1 | x2 | hq(x) |
| --- | --- | --- |
| 0 | 0 | g(10) ≈ 1 |
| 0 | 1 | g(-10) ≈ 0 |
| 1 | 0 | g(-10) ≈ 0 |
| 1 | 1 | g(-30) ≈ 0 |

This table demonstrates that the logical (NOT x1) AND (NOT x2) function is true only if both x1 and x2 are false. These three simple functions (logical AND, logical OR, and (NOT x1) AND (NOT x2)) may now be used as the activation functions in our three-layer neural network to compute the nonlinear function defined at the beginning: x1 XNOR x2. To do this, we need to put these three simple functions together into a single network.

Images: Logical AND, Logical (NOT x1) AND (NOT x2), Logical OR

This network uses the three logical functions calculated above as the activation functions.

Image #10 A Neural Network to Compute XNOR Function

As you see, the first layer of this network consists of two inputs (x1 and x2) plus a bias unit +1. The first unit of the hidden layer, a(2)1, is the logical AND activation function that takes the weights specified above (-30, 20, 20). The second unit, a(2)2, is the (NOT x1) AND (NOT x2) function that takes the parameters 10, -20, -20. Doing our usual calculations, we get the values 0, 0, 0, 1 for a(2)1 and the values 1, 0, 0, 0 for a(2)2. The final step is to use the second set of parameters, from the logical OR function that sits in the output layer. What we do here is simply take the values produced by the two units in the hidden layer (logical AND and (NOT x1) AND (NOT x2)) and feed them to the OR function with its parameters (-10, 20, 20). The results of this computation make up our hypothesis function (1, 0, 0, 1), which is our desired XNOR function.

| x1 | x2 | a(2)1 | a(2)2 | hq(x) |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 |

That’s it! As this example illustrates, neural networks are powerful in computing complex nonlinear hypotheses by using a cascade of functions. In fact, neural networks can use the output values of one function as the inputs of other functions.
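The XNOR network above can be reproduced with a short script. The weights are the ones given in the text for the AND, (NOT x1) AND (NOT x2) and OR units; the function and constant names (`g`, `neuron`, `xnor`, `AND_W`, `NOT_BOTH_W`, `OR_W`) are hypothetical and chosen only for this sketch.

```python
import math


def g(z):
    """Sigmoid (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))


# Weights from the text, in the order (bias, weight for first input, weight for second input).
AND_W = (-30.0, 20.0, 20.0)        # x1 AND x2
NOT_BOTH_W = (10.0, -20.0, -20.0)  # (NOT x1) AND (NOT x2)
OR_W = (-10.0, 20.0, 20.0)         # a1 OR a2


def neuron(weights, a, b):
    """Single sigmoid unit with a bias and two inputs."""
    bias, w1, w2 = weights
    return g(bias + w1 * a + w2 * b)


def xnor(x1, x2):
    """Hidden layer: AND and (NOT x1) AND (NOT x2). Output layer: OR."""
    a1 = neuron(AND_W, x1, x2)       # a(2)1
    a2 = neuron(NOT_BOTH_W, x1, x2)  # a(2)2
    return neuron(OR_W, a1, a2)      # hypothesis hq(x)


# Print the truth table, rounding each sigmoid output to 0 or 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

The raw outputs are not exactly 0 or 1, only very close to them, which is why the table is printed after rounding.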
Leveraging this functionality, we can design complex multi-layered networks that extract complex features and patterns from images, videos, and other data.

Conclusion

Artificial Neural Networks (ANNs) are the main drivers of the contemporary AI revolution. Inspired by the biological structure of the human brain, ANNs are powerful in modeling functions and hypotheses that would be hard to derive intuitively or logically. Instead of inventing your own function with high-order polynomials, which may lead to overfitting, you can design an efficient ANN architecture that automatically fits complex nonlinear hypotheses to data. This advantage of ANNs has been leveraged in algorithmic feature extraction in computer vision and image recognition. For example, instead of manually specifying a finite list of image features to choose from, we can design a Convolutional Neural Network (CNN) that uses the same principle as the animal visual cortex to extract features. As in the human eye, the layers of the CNN respond to stimuli only in a restricted region of the visual field. This allows the network to recognize low-level features such as points, edges, or corners and gradually merge them into high-level geometric figures and objects. This example illustrates how good ANNs are at automatically deriving hypotheses and models from complex data that includes numerous associations and relationships.

Good overview, Kirill. It might make sense to also refer to the following picture to give a broad and quick snapshot of the different neural nets scientists can use: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png

#### Getting a new periodic table of elements using AI

"Elementary particles are the building blocks of all matter everywhere in the universe. Their properties are connected with the fundamental forces of nature" (Murray Gell-Mann)

Abstract

Objective: To obtain an atomic classification based on clustering techniques using unsupervised learning algorithms.

Design: The sample of atoms used in the experiments is the set of atomic elements for which every selected property has a known, non-null value. Different clustering algorithms are used to establish relationships between the elements, yielding clusters of atoms related to each other by the numerical values of some of their structural properties.

Results: Sets of elements related to the atom that represents each cluster.

Keywords: Clustering, atoms, periodic table of elements, unsupervised algorithms, Random Forest, K-Means, K-Nearest Neighbour, Weka, Bayesian Classifier.

Introduction

The periodic table of elements is an atomic organisation based on two axes. The horizontal axis establishes an increasing order based on the atomic number (number of protons) of each element. The vertical arrangement is governed by the electronic configuration and yields a taxonomic structure determined by the electrons in the outermost shell of each atom. Furthermore, four main blocks arrange the atoms by similar properties (gases, metals, nonmetals, metalloids). In addition to the number of protons and the electronic configuration, atoms are characterised by other attributes that are neither ascending nor cyclic across the periodic table of elements. The values of these properties constitute a sample of numbers representing different atomic magnitudes that in some way distinguish the chemical elements. In this experiment, some of these chemical and physical dimensions have been used to train a set of machine learning algorithms in order to obtain representative clusters for each element.
Research problem

The hypothesis of this experiment is that some variants of unsupervised learning models can discover relationships between atomic elements based on a few chemical and physical attributes of matter. Moreover, these techniques calculate clusters of categories based on their numerical attributes. The research problem also leads to an element clustering that could offer a new atomic distribution based on the functions inferred by the machine learning processes. The goal is to present an organisation of elements based on a clustering calculation applied to a specific set of atomic properties.

Units of analysis

The following atomic properties have been used to train and evaluate the unsupervised algorithms: melting point [K], boiling point [K], atomic radius [pm], covalent radius [pm], molar volume [cm3/mol], specific heat [J/(kg K)], thermal conductivity [W/(m K)], Pauling electronegativity [Pauling scale], first ionisation energy [kJ/mol] and lattice constant [pm]. Only atoms with non-null values for every magnitude have been selected for the sample. Note that some of these properties have not yet been measured or calculated for certain atoms, which therefore do not appear in the sample. The raw data can be downloaded from this link. The following graphical representation shows how some of these properties are distributed across the spectrum of elements sorted by ascending number of protons:

Graphic 1. Distribution of the melting point, boiling point, lattice constant and the atomic radius versus the atomic number.

At first glance, this graphic shows no apparent correlation or pattern between the displayed attribute values and the elements sorted by increasing atomic number.

Methods

Unsupervised machine learning algorithms make it possible to infer models that identify hidden structure in "untagged" data.
Thus, no categories are included in the observations, and the data used for learning cannot be used to evaluate the accuracy of the results. Using the machine learning library Java-ML and the non-null values for the magnitudes specified above, two exercises were performed:

1 - Clustering of elements

The scope of this exercise is to create clusters of atomic elements using three different machine learning techniques provided by the Java-ML library. The result was three atomic configurations based on the following algorithms:

K-Means clustering with 10 clusters. This algorithm divides the selected atomic elements into k clusters, where each individual is assigned to the cluster with the nearest mean.

Iterative Multi K-Means, which implements an extension of K-Means. This algorithm performs iterations with different values of k, starting from kMin and increasing to kMax, with several iterations for each k. Each clustering result is evaluated with an evaluation score, and the result is the clustering with the best score. The evaluation applied in this exercise was the sum of squared errors.

K-Means clustering wrapped into Weka algorithms. Classification algorithms from Weka are accessible from within the Java-ML library. An experiment with 3 clusters was run just to compare with the first exercise (K-Means with 10 clusters).

The results were presented using the TreeMap provided by the d3 TreeMap graphic library.

Graphic 2. Applying K-Means clustering to the sample.

2 - Atomic element classifications and the relationships among the elements

This exercise was intended to evaluate the degree of relationship among the atoms contained in the sample. Three algorithms were applied:

Random Forest with 30 trees to grow and 10 variables randomly sampled as candidates at each split (one for each atomic magnitude). This technique works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees.
Bayesian Classifier. The Naive Bayes classification algorithm has been used to classify the set of elements into different categories.

K nearest neighbour (KNN) classification algorithm with KDtree support. The number of neighbours was fixed to 8, considering that this number of potential elements could establish the boundaries for each element positioned in the centre of a square (sides and corners are not handled in the current hypothesis).

Graphic 3. Schema of 8 neighbours surrounding the target element

Each algorithm worked as a classifier and produced a membership distribution with an associated degree evaluation. The classes with a membership evaluation equal to zero were not considered. In this experiment, the physical and chemical attribute values were clustered and, afterwards, each atom belonging to the same sample was classified into the set of calculated clusters. Therefore, each element is identified with a specific group, where the only requirement is that the atom being classified must be the representative of the selected category. The calculated clusters have been distributed in pairs of atoms with their corresponding degree evaluation, following this structure: [Xi, Yj, Ej], where Xi is each atom in the sample, Yj is each element in the category Y and Ej is the degree evaluation related to the pair. The relationships between the individuals and their categories are shown through a chord graphic representation based on the Chord Viz component provided by d3.

Graphic 4. Nitrogen relationships considering the evaluation of different classifiers

Results

The three tree maps (one per clustering algorithm) in which the chemical elements have been organised show interesting groups of components. For instance, all of them place S and Se in the same group.
Other atoms (all of them gases) such as Ne, Ar, Kr and Xe are also placed in the same group by all the algorithms (remember that neither the atomic number nor the electronic configuration was included in the models). It is interesting to note that the configurations generated by the two K-Means algorithms place H and Li in separate, single-element clusters. Regarding the weighted relationships between the elements, a chord graphic has been created for each machine learning algorithm. This data representation shows how the atomic elements can be related to each other through unsupervised machine learning techniques, taking some of their chemical and physical properties and assigning a relational degree to them. There are some interesting behaviours, such as the set of relationships found for Nitrogen. The Random Forest algorithm determined that O, Ne and Ar are highly related to it, the Bayesian Classifier calculated that only Oxygen was related, and the K-NN method found that O, Ne, Cl, Ar, Br, Kr and I are related when the number of neighbours was fixed to 8. Some familiar associations can be found in the calculated relationships when comparing the components in the clusters with their distribution in the periodic table of elements. Nevertheless, other non-evident atomic relations have been established by these methods. Additionally, the lack of commutativity is a remarkable characteristic. For instance, in the results calculated using the Random Forest algorithm, Nitrogen is not related to Hydrogen in the reverse direction.

Conclusions

Although the atomic organisations calculated through the machine learning algorithms do not follow any physical or chemical rule, some associations arise, creating groups of components that follow configurations similar to those provided by the periodic table of elements.
Beyond the calculated results, the applied library (Java-ML) and the algorithms used, the exercise is interesting in itself. The proof that chemical or physical relationships can be established among the elementary components based on the similarity of their properties using machine learning can lead to new lines of research.

Acknowledgments

I want to thank Montse Torra for gathering the physical and chemical properties of each atom used in the sample.

References

Bostjan Kaluza. "Machine Learning in Java". Packt Publishing Ltd, Apr 29, 2016.

Eibe Frank, Mark A. Hall, and Ian H. Witten. "The WEKA Workbench". Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann, Fourth Edition, 2016.

Physical and chemical atomic properties extracted from WebElements and PeriodicTable.

Very original application, Toni - thanks for sharing

#### What Is The Difference Between Artificial Intelligence And Machine Learning?

Artificial Intelligence (AI) and Machine Learning (ML) are two very hot buzzwords right now, and often seem to be used interchangeably. They are not quite the same thing, but the perception that they are can sometimes lead to some confusion. So I thought it would be worth writing a piece to explain the difference. https://www.google.co.uk/amp/s/www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/amp/

#### Machine Learning Stanford University

About this course: Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI. https://www.coursera.org/learn/machine-learning

#### How to Spot a Machine Learning Opportunity, Even If You Aren’t a Data Scientist

Artificial intelligence is no longer just a niche subfield of computer science. Tech giants have been using AI for years: Machine learning algorithms power Amazon product recommendations, Google Maps, and the content that Facebook, Instagram, and Twitter display in social media feeds. But William Gibson’s adage applies well to AI adoption: The future is already here, it’s just not evenly distributed. https://hbr.org/2017/10/how-to-spot-a-machine-learning-opportunity-even-if-you-arent-a-data-scientist

#### Developing a Code of Conduct for the Data Science and Analytics Sector

The Data Science Foundation is reviewing its Code of Conduct and the services it delivers to members. A six-month consultation period will commence in January 2018 with both internal and external stakeholders from industry, education and government. In advance of this, we are inviting members to participate in a fact-finding exercise that will help to shape the debate. Below this introduction you will see five questions; we would be grateful if you would send your responses by email.

Looking back, the Foundation has had a tremendous 2017: increasing membership numbers, developing contacts, and offering members more online functionality, including the development of Personal Profile Pages, Published By Pages, Messaging Facilities and the Discussion Forum. We will be launching a major new initiative with a partner at Big Data LND on 15th November and will follow this up with the launch of the Data Science Writer of the Year Awards 2018.

It is now time to look forward. If you are a member of the Data Science Foundation, please participate in the debate to develop a Code of Conduct for the Data Science and Analytics Sector, to help form the services we deliver and to have your say in how the sector is represented. If you are not yet a member but have a professional interest in data and advanced analytics, become a member and join the debate. Membership is free for individuals: https://datascience.foundation/joinus

Code of Conduct Initial Questions

Please email your answers to debate@datascience.foundation and use the subject line "Code of Conduct". Please answer the following questions:

1. Would you support a Code of Conduct for the Data Science and Analytics sector?
2. Should the code focus on professional, ethical or moral standards, or all three?
3. What should be included in the Code of Conduct? Please provide a short description.
4. What services would you like the Data Science Foundation to provide?
5. Would you participate in the debate? The debate will initially be conducted by email questionnaire and then by online discussion.

Background Information

About the Data Science Foundation

The Data Science Foundation is a professional body representing the interests of people working in the data science and advanced analytics sector. Our membership consists of both users and suppliers of data services as well as universities offering data science courses and their students. The foundation aims to create an active community of data scientists, to provide a platform to share ideas and to support professional development. All members of the foundation are provided with an online Profile page and a Published By page, which showcases all articles and papers published for peer review.

Aims

The primary aims of the Data Science Foundation are to:

- Create a community of qualified and highly skilled data science professionals
- Provide the data science community with a forum to share ideas and support professional development
- Develop approved professional standards that differentiate members from others working in the data science and advanced analytics sector

The Data Science Foundation is working to:

- Raise the profile of data science in the UK, to educate business people about the benefits of knowledge-based decision making and to encourage firms to make optimal use of their data.
- Launch an education programme which includes seminars covering topics such as ‘Helping business people understand big data’ and ‘Helping data scientists communicate with business people’. An ‘Introduction to data science’ talk will be offered to schools.
- Improve the way organizations use their data by helping organizations form partnerships with universities and consultancies.

The website

The foundation’s website has been built as a communications platform, a publishing tool and as a means for members to procure expertise or obtain employment.
The site is a source of information for the media and for those interested in learning about data science. We will ensure that the website:

- Contains extensive and accurate information about big data and data science education
- Displays accurate records of corporate, supplier, individual and associate members
- Is the ideal place to find a data science provider and to start a data science project
- Creates employment for members via the CV board
- Publishes the most sought-after job opportunities from leading firms
- Becomes a platform that allows individuals to gain recognition for their expertise and become leading figures within the industry

Code of Conduct

The Data Science Foundation Code of Conduct applies to all members.

Promotion of Good Practices

We hold that data scientists should follow best practices at all times, and never encourage or suggest a business or client take actions that are or could be construed as criminal or unethical. All members will strive to provide competent services at all times.

Integrity and Honesty

We believe in the value and importance of integrity and honesty in all interactions and transactions, whether providing informal advice, a formal project plan, or visualised data for information dissemination.

Capability and Expertise

The Foundation advocates for more formal guidelines in terms of professional capability and expertise, and works with leading education centres and universities to develop degree courses to this end.

Transparency

The Data Science Foundation advocates for transparency at all levels, both within the Foundation and within partner organisations, educational institutions and government agencies. We believe in being forthright and direct, with no hidden agenda.

Confidentiality

Confidentiality is an essential consideration, particularly with sensitive business data. We adhere to the strictest confidentiality stipulations to safeguard these vital business and organisation assets. Information created, developed, used or learned in the course of employment with a particular client, business or organisation is considered completely confidential.

Security

All members must ensure that data is secure at all times, safeguarded from all threats, including but not limited to viruses, malware, internal and external hacking attempts, theft, and accident. All members must utilise industry-standard software/hardware to ensure data security at all times.

Professional Standards

We require all members of the Foundation to adhere to strict professional standards regarding integrity, honesty, quality, confidentiality and more. We also enforce professional standards in terms of working practice, data quality, and standards of evidence. Misuse or misrepresentation of data is not permissible. The Foundation practices strict enforcement of professional standards, and repercussions can include expulsion from the Foundation.

General Membership Policy

Our general membership policy applies to all members, including corporate members. Members are granted several key benefits, including access to Foundation publications, the ability to join discussions and forums, to network with others in the data science industry and more. The Data Science Foundation offers different membership options, including corporate membership and associate membership, each of which delivers unique benefits and advantages.

Hey Chris, interesting piece. I did some similar work myself some time ago, but I have always wondered whether we should simply indicate how to build a code of conduct (and write it collectively afterwards) or rather draft something that we hope people will adopt in their daily jobs. F

#### The new CxO gang: data, AI, and robotics

Hiring new figures to lead the data revolution

It has been said that this new wave of exponential technologies will threaten a lot of jobs, both blue and white-collar ones. But if on the one hand many roles will disappear, on the other hand in the very short term we are observing new people coming out from the crowd to lead this revolution and set the pace. These are the people who really understand the technicalities of the problems, have a clear view of the business implications of the new technologies, and can easily plan how to embed those new capabilities in enterprise contexts. Hence, I am going to briefly present three of them, i.e., the Chief Data Officer (CDO), the Chief Artificial Intelligence Officer (CAIO) and the Chief Robotics Officer (CRO). Sad to say, I have never heard of a ‘Chief of Data Science’; for some strange reason, the role is usually called either ‘Head of Data Science’ or ‘Chief Analytics Officer’ (as if data scientists did not deserve someone at C-level to lead their efforts). Let’s see then who they are and what they would be useful for.

The Chief Data Officer (CDO)

Image: A slide taken from one of the speakers at the CDO Summit in London, illustrating business drivers and capabilities and how they relate to the CDO job.

Apparently, it is a new role born in a lighter form straight after the financial crisis, springing from the need to have a central figure to deal with technology, regulation and reporting. Therefore, the CDO is basically the guy who acts as a liaison between the CTO (tech guy) and the CAO/Head of Data Science (data guy) and takes care of data quality and data management. Actually, his final goal is to guarantee that everyone can get access to the right data in virtually no time. In that sense, a CDO is the guy in charge of ‘democratizing data’ within the company.
It is not a static role, and it evolved from simply being a facilitator to being a data governor, with the tasks of defining data management policies and business priorities, shaping not only the data strategy, but also the frameworks, procedures, and tools. In other words, he is a kind of ‘Chief of Data Engineers’ (if we agree on the distinctions between data scientists, who actually deal with modeling, and data engineers, who deal with data preparation and data flow).

“The difference between a CIO and CDO (apart from the words data and information…) is best described using the bucket and water analogy. The CIO is responsible for the bucket, ensuring that it is complete without any holes in it, the bucket is the right size with just a little bit of spare room but not too much and it’s all in a safe place. The CDO is responsible for the liquid you put in the bucket, ensuring that it is the right liquid, the right amount and that it’s not contaminated. The CDO is also responsible for what happens to the liquid, and making sure the clean vital liquid is available for the business to slake its thirst.” (Caroline Carruthers, Chief Data Officer Network Rail, and Peter Jackson, Head of Data Southern Water)

Interestingly enough, the role of the CDO as we described it is both vertical and horizontal. It spans indeed across the entire organization even though the CDO still needs to report to someone else in the organizational chart. Who the CDO reports to will be largely determined by the organization he is operating in. Furthermore, it is also relevant to highlight that a CDO is more likely to be found in larger organizations than in small startups. The latter type is indeed usually set up to be data-driven (with a forward-looking approach) and therefore the CDO function is already embedded in the role of whoever designs the technological infrastructure/data pipeline. It is also true that not every company has a CDO, so how do you decide to eventually get one?
Well, simply out of internal necessity, strict incoming regulation, and because all your business intelligence projects are failing because of data issues. If you have any of these problems, you might need someone who pushes the “fail-fast” principle as the data approach to be adopted throughout the entire organization, who considers data a company asset and wants to set the fundamentals to allow fast trial-and-error experimentation. And above all, someone who is centrally liable and accountable for anything about data. A CDO is then responsible for the end-to-end data workflow and oversees the entire data value chain.

Finally, if the CDO does his job properly, you’ll be able to see two different outcomes: first of all, the board will stop asking for quality data and will have a clear picture of what every team is doing. Second, and most important, a good CDO aims to create an organization where a CDO has no reason to exist. It is counterintuitive, but basically, a CDO will have done a great job when the company no longer needs a CDO because every line of business is responsible and liable for its own data. In order to reach his final goal, he needs to prove from the beginning that not investing in higher data quality and frictionless data transfer might be a source of inefficiency in business operations, resulting in non-optimized IT operations and making compliance as well as analytics much less effective.

The Chief Artificial Intelligence Officer (CAIO)

If the CDO is somehow an already consolidated role, the CAIO is nothing more than a mere industry hypothesis (I am not sure I have seen one yet, despite the strong ongoing discussions between AI experts and sector players; see here and here for two opposite views on the topic).
Moreover, the creation of this new role highlights the emergence of two different schools of thought in enterprise AI, i.e., centralized vs decentralized AI implementation, and a clear cost-benefit analysis to understand which approach will work better is still missing. My two cents are that elevating AI to be represented at the board level means really becoming an AI-driven company and embedding AI into every product and process within your organization, and I bet not everyone is ready for that. So, let’s try to sketch at a glance the most common themes to consider when talking about a CAIO:

- Responsibilities (what he does): a CAIO is someone who should be able to connect the dots and apply AI across data and functional silos (this is Andrew Ng’s view, by the way). If you also want to have a deeper look at what a CAIO job description would look like, check out here the article by Tarun Gangwani;
- Relevance (should you hire a CAIO?): you only need to do it if you understand that AI is no longer just a competitive advantage to your business but rather a part of your core product and business processes;
- Skills (how do you pick the right guy?): first and most important, a CAIO has to be a ‘guiding light’ within the AI community because he will be one of your decisive assets to win the AI talent war. This means that he needs to be highly respected and trusted, which is something that comes only with a strong understanding of foundational technologies and data infrastructure. Finally, since this is a cross-functional activity, he needs to strike the right balance between a willingness to take risks and experiment in order to foster innovation and attention to product and company needs (he needs to support different lines of business);
- Risks (is hiring a CAIO a smart move?): there are two main risks, which are i) the misalignment between technology and business focus (you tend to pay more attention to technology than to business needs), and ii) every problem will be tackled with AI tools, which might not be that efficient (these guys are highly trained and will be highly paid, so it is natural they will try to apply AI to everything).

Where do I stand on that? Well, my view is that a CAIO is something which makes sense, even though only temporarily. It is an essential position to allow a smooth transition for companies that strive to become AI-driven firms, but I don’t see the role as any different from what a smart tech CEO of the future should do (of course, supported by the right lower management team). However, for the next decade, having a centralized function with the task of using AI to support the business lines (50% of the time) and foster innovation internally (50% of the time) sounds extremely appealing to me. In spite of all the predictions I can make, the reality is that the relevance of a CAIO will be determined by how we end up approaching AI, i.e., whether it will eventually be considered a mere instrument (AI-as-a-tool) or rather a proper business unit (AI-as-a-function).

The Chief Robotics Officer (CRO)

We moved from the CDO role, which has been around for a few years now, to the CAIO one, which is close to being embedded in organizational charts.
But the Chief Robotics Officer is a completely different story. Even if some people are talking about its importance (check out this report if you like), it is really not clear what his tasks would be or what kind of benefits he would bring to a company, and envisaging this role requires a huge leap of imagination and optimism about the future of work (and business). In a few words, what a CRO is supposed to take care of is managing the automated workforce of the company. To use Gartner’s words, ‘he will oversee the blending of human and robotic workers’. He will be responsible for the overall automation of workflows and for integrating them smoothly into the normal design process and daily activities. I am not sure I get the importance of this holistic approach to enterprise automation, although I recognize the relevance of having a central figure who actively keeps track of, and communicates to employees, all the changes made in transforming a manual activity/process into an automated one.

Another interesting point is who the CRO will report to, which is of course shaped by his real functions and goals. If robotics is deeply rooted in the company and allows it to create or access new markets, a CRO might report directly to the CEO. If his goal is instead to automate internal processes to achieve higher efficiency, he will likely report to the COO or to a strategic CxO (varying by industry and vertical). My hypothesis is that this is going to be a strategic role (and not a technical one, as you might infer from the name) which, like the CAIO, might have a positive impact in the short term (especially in managing the costs of adopting early robotics technologies) but no reason to exist in the longer term.
It is easier to think about it in physical product industries rather than digital products or services companies, but automation will likely happen in a faster way in the latter, so we will end up having a Chief of Physical Robotics Officer (to manage the supply chain workflow) as well as a Chief of Digital Robotics Officer (to manage instead the automation of processes and activities).


#### Data Science Foundation will be at Big Data LDN

The Data Science Foundation and Big Data LDN, 15-16 November 2017, Stand 327, Olympia London

Meet us at stand 327 at Big Data LDN on 15-16 November 2017! Find out more about the work of the Data Science Foundation and see how becoming a member would help you make more Data Science Connections.

We are launching the Data Science Writer of the Year Awards 2018. The awards recognise the contribution made by individuals who create and share data science knowledge and understanding. All members of the Data Science Foundation are eligible to participate in the awards. Individual membership is free of charge.

Big Data LDN is a free-to-attend conference and exhibition open to all, and will host leading global data and analytics experts, ready to arm you with the tools you need to deliver the most effective data-driven strategy. With content divided into comprehensive sections, you’ll have the opportunity to ask the big questions, share ideas with forward-thinking, like-minded peers, and learn from leading members of the data community.

Big Data LDN is back for a second year and is set to be larger than ever in 2017. The two-day event is essential for businesses wanting to deliver a data-driven strategy. Get the latest updates on fast/real-time data, artificial intelligence, machine learning, GDPR, deep learning, self-service analytics and much more. Be in the vanguard of the data revolution: sign up to Big Data LDN and learn how to build a bright data-driven future for your business.

Register free here and visit our stand at the event: https://bigdataldn.com

The Data Science Foundation will be on stand 327 at Big Data LDN. It would be great to meet.