Datascience Foundation in Big Data London November 2017
It was great to meet a lot of prospective members. The curiosity factor was very high. Some data scientists called DSF the LinkedIn of data science. This is a good way to look at it.
Qlik. More power than you can imagine.
Search and explore vast amounts of data – all your data. With Qlik, you’re not constrained by preconceived notions of how data should be related, but can finally understand how it truly is related. Analyse, reveal, collaborate and act. https://www.qlik.com/en-gb
Top Trend in Analytics
The pace and evolution of business intelligence solutions mean what’s working now may need refining tomorrow. From natural language processing to the rise in data insurance, we interviewed customers and Tableau staff to identify the 10 impactful trends you will be talking about in 2018. Whether you’re a data rockstar or an IT hero or an executive building your BI empire, these trends emphasize strategic priorities that could help take your organization to the next level. Read more at https://www.tableau.com/reports/business-intelligence-trends
An Introduction to Data Science: Making Big Data Usable
Our world is increasingly fuelled by data – an unparalleled amount of data, to be precise. More information is available today than at any other point in human history, and all of that data has value, particularly to businesses, organizations and government agencies. Data makes the world go round today. However, getting at that data and actually putting it to use can be difficult. Moreover, not all information has value, depending on the needs of the user. A means to identify, catalogue, collate and extract useful data from the sea of other information was necessary – data science. What Is Data Science? Depending on where you look, you’ll find a number of different definitions for data science. Some believe that it’s merely an offshoot of statistics. Others believe that it’s a combination of a few different knowledge extraction models. Yet others believe that it’s something else entirely. It’s important to note that “data science” isn’t true science in the technical definition of the term. Data scientists aren’t trying to prove or disprove a hypothesis. So, what is data science, then? It’s actually difficult to create a single, cohesive definition for the term simply because there are so many potential applications of this discipline, from statistics to data modelling to analytics and more. Perhaps the closest we have come to a unified definition comes from Quora user Drew Conway, a PhD student at NYU, who said, “Data science most often refers to the tools and methods used to analyse large amounts of data. As such, the discipline is an amalgamation of many bits from other areas of research. For tools, the influence primarily comes from computer science, where issues of algorithmic efficiency and storage scalability form the main focus. For analysis, however, the influences are much more varied. 
Modern methods are borrowed from both the so-called hard sciences (physics, statistics, graph theory) and the social sciences (economics, sociology, political sciences, etc.). Specific classes of techniques that are naturally interdisciplinary are also very popular, such as machine learning.”
Michael Driscoll, a specialist in data, analytics and visualization, has another definition. “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.” The University of California at Berkeley also sums up data science rather well by saying, “The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design.”
However, data science is not just “using data”. That’s the end goal, yes, but the discipline focuses on how to first organize and access available data, and then to put it to use effectively. That last word is the key here – “effectively”. Data science enables the effective use of vast quantities of information for other purposes, while also enabling the creation of additional data products (which are themselves data, further feeding the cycle).
In the end, data science truly isn’t “science”. Data scientists don’t work in the world of academia. Their employment is solely contingent on the needs of businesses and industries, just as much as a sales clerk’s position or a particular supplier’s line of products. So, what do data scientists actually do?
Here’s a rough outline:
- Ask questions in order to answer or solve known problems, or find new solutions to problems with which businesses must deal
- Define data necessary for a particular need
- Work with existing data to collect, store and explore data
- Determine the needed analysis type for a particular situation or type of data
- Use algorithms and other tools to parse, clean, check quality and utilize data
- Transform insights learned into formats usable by non-data scientists (for use by others within the business), including the creation of graphs, charts, infographics and more
- Create software to automate data science tasks based on specific business requirements and data sources
So, data scientists must successfully combine several professional fields, including statistics, software programming, mathematics, research and subject expertise.
A Few Basic Examples of Use Cases
Before we dive too deep down the rabbit hole, let’s consider a few basic real-world examples of data science in action.
Microsoft Word: Sure, MS Word isn’t the most advanced piece of software on the planet, but the word processing program does show us some very good applications of data science. Consider the fact that the program essentially learns more about its users through interaction – it stores and parses data on its own. Microsoft has also done a great deal of work in building the program’s spellchecking and grammar capabilities. Does it match what you’d get from a professional editor? No. However, it does an excellent job for a computer program (that’s not to say there aren’t better programs available, but Word is the industry standard, and Microsoft’s achievements in data usage are noteworthy).
Google PageRank: Google might be the world’s uncontested master when it comes to data everything. However, the search giant’s PageRank function is an ideal example of data science in action.
This function essentially uses the number of links pointing at a particular domain (data outside the page itself) to help rank the authority and relevance of different websites. Facebook: Sure, the big blue social network has access to tons of information, but they put it to use in a number of innovative ways. For instance, Facebook uses the vast amounts of information it has about various users to help make suggestions for new friends and connections. These suggestions are based on patterns of friendships, including among people that you might not even be connected with on the site, and they can be extremely accurate (sometimes frighteningly so). These are just a few relatively basic examples of data science in action in the world around us, with programs and websites used every day. Why Is Data Science Necessary? Information is vital to every aspect of human existence, and it has been since the dawn of our species. A hunter-gatherer had to rely on his or her knowledge (information) to make decisions about various plant species (would it be deadly to eat?) as well as the animals hunted for food and pelts. Early agriculturalists had to rely on information to make informed decisions about planting, harvesting and preparation. Putting data to use has been with us since time out of mind – that hasn’t changed. What has changed, though, is the volume of data available. O’Reilly makes an excellent point here. “The question facing every company today, every start-up, every non-profit, every project site that wants to attract a community, is how to use data effectively – not just their own data, but all the data that’s available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well defined kinds of analytics. What differentiates data science from statistics is that data science is a holistic approach. 
We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into tractable form, making it tell its story, and presenting that story to others.”
Given the fact that humans have been parsing, storing, organizing and using data for millennia, it’s natural to wonder why a new discipline is even necessary. Isn’t statistics enough? Isn’t the use of current database software sufficient? To put it simply – no. According to a story in VentureBeat, “Today’s modern business needs to manage far more data than ever before, and few have the talent on staff for the job. Projections indicate that the (data scientist) market will experience meteoric growth in the next several years.” GigaOM backs that up. “Every organization will need someone wearing the data scientist hat just like every organization has people responsible for product, sales, marketing and support.” The rise of Big Data and its increasing importance to businesses of all sizes, in every industry, and even the government sector, has made data science not only vital, but one of the fastest growing fields in the world.
What Is Big Data, Really?
You hear the term “Big Data” thrown around a lot today, but what does it mean, really? Organizations have had access to massive amounts of information for a very long time. Oil companies are prime examples of this, but there are many others, including massive retail chains like Wal-Mart. What makes today different in terms of the amount of information available? What makes it warrant a specific designation of “Big Data”? According to Oxford University Press’ Oxford Dictionary, Big Data is, “data sets that are too large and complex to manipulate or interrogate with standard methods or tools.” That tells us a little bit, but it’s not the full story. Forbes magazine weighs in with a slightly different take on the question.
In a story written by Lisa Arthur for Forbes, the author defines Big Data as, “a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.” That’s a bit more illuminating. Arthur goes on to state that, “In defining big data, it’s also important to understand the mix of unstructured and multi-structured data that comprises the volume of information. Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s very text-heavy. Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.”
So, unstructured information could be information that’s entered into a form field (either online or off). Multi-structured data could be derived from scraping a website. Several conditions go into making Big Data, commonly described as the Five Vs of Big Data:
- Volume (the sheer amount)
- Velocity (the speed at which information is generated and aggregated)
- Variety (types of data)
- Veracity (authenticity of data)
- Value (what the data is actually worth to the company)
However, it’s very important to understand that the meaning of Big Data will vary from business to business and organization to organization. In each instance, this information will do something different, offer something different, and mean something different.
The Explosion of Data Available Today
As mentioned, there’s more information available today than at any point in history, and that amount is growing exponentially. In a sense, it feeds itself. As more data is explored, collated, categorized and packaged, more information is created.
To truly understand not only what data science is but why it’s become such an essential skillset and discipline to the modern world, it’s important to understand the lifecycle of data – where it comes from, how that information is used, and more. We’ll begin by exploring where data originates. It comes from everywhere. Every single search query through Google or Bing is data. Every picture uploaded is data. Every Vine video created is data. You get the idea. However, it’s not limited to data made available online. Data is literally everywhere. O’Reilly states that, “Data is everywhere: your government, your web server, your business partners, even your body. While we aren’t drowning in a sea of data, we’re finding that almost everything can (or has) been instrumented.” For a good example of how much data is being generated today, consider a simple Amazon recommendation, called an “also bought”. Let’s say you go to Amazon searching for a new washing machine. You’re looking for the right deal, the right features, and the right size. You can sort your options, compare different models and more. You arrive at what seems to be the perfect solution and there near the bottom of the page is a list of items that other customers “also bought” when viewing or purchasing the same model of washing machine. Amazon had to first identify that information, then collate and store it, and then serve it up to you based on nothing more than your online search query through their website. You leave data behind you with every single action you take online. Using Google to search for a dog training program? That query doesn’t disappear when you close the browser. Google stores your information and uses it. And it’s more than just your query terms. Google is also storing geo-location information and a great deal more. 
Now apply that to mobile apps, which leave an even richer trail of data behind when you’re done (consisting of both electronic data as well as possibly audio and video information, exact geo-location data, and a great deal more). To go beyond the online and device scenario, consider your frequent shopper card. You use it to get access to important discounts on your groceries or fuel, but every single swipe generates an immense amount of information about your shopping habits, your preferences, the specific retail store locations where you prefer to spend your time… You get the picture. Everything we do today generates data. However, all that information would be worthless if there wasn’t a way to store it. This is the very beginnings of data science. Storage for data must be more than just a data dump into a database somewhere online, though. Our storage solutions have become ever more sophisticated over time. We’ve moved from paper ledgers to electronic spreadsheets to incredibly intricate digital storage systems capable of holding immense amounts of information. Storage is not the end of it, though. Moore’s Law (concerning the advance of computer technology) can be applied equally well to the growth of data. As stated by O’Reilly, “The importance of Moore’s Law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put in it.” It’s the same concept behind our desire for larger homes. We want larger homes to have more space, but that space inevitably becomes filled with possessions – we buy new furniture, new dishes, new computers, new televisions. This brings us to the next hurdle – the more data you store, the more sophisticated your data analysis solutions must be. Things were simple enough when businesses ran on physical ledgers and the owner’s knowledge of customers, but today things are much more complicated. 
This is the very foundation of data science – increasing the ability to analyse and use the increasing volume of data available to us. Making Data Useful Data without meaning is useless. Data without context is pointless. Data without structure is unusable. Today’s organizations must do more than just warehouse information, and accessing, analysing, and using that information requires more than just basic software. In order to make data useful, it must first be analysed, or “conditioned”. Really, this is nothing more than separating the wheat from the chaff, so to speak. You need to analyse the data and then determine what is useful and what is not. For a company selling athletic shoes, your purchase of gardening tools is probably irrelevant. However, your purchase of a gym membership would be valuable information. Your search on Google for running tracks near your home would also be important data. The problem here is that while a great many new ways of delivering machine-consumable data have been created (atom feeds and microformats, for instance), much of the data found in the wild is very messy, very chaotic. This type of information cannot simply be inserted into an XML file. There’s simply too much garbage inserted into the data to make it usable by software. So, data conditioning also includes clean-up – for instance, removing the HTML code from data scraped from a website so that only the pertinent information remains, and none of the underlying code of the page. Once the data has been conditioned and cleaned, it’s time to move to the next step, which is basically quality control. For instance, let’s say you were interested in gathering physical mailing addresses for potential customers who may be interested in your new product. You could gather much of that information online, but it may not be complete. 
You might have the first and last name, as well as the street address of a particular person, but be missing their postcode (this is just a very basic example). You may also have incongruous information – that is, data that doesn’t seem to relate to your goal. According to scientists, the depletion of the ozone layer was discovered because someone decided to take a look at the incongruous data that was gathered, rather than discarding it (deciding whether data is actually incongruous through gathering errors, or whether there’s another story underlying that incongruity, is part of the job of a data scientist).
Now you have to add in the problem of human language. This can be considerable, even when you’re dealing with just one language. For instance, parsing data in English requires a significant understanding of elements that relate directly to the particular task. O’Reilly uses the following example, which is an excellent illustration of the complexities involved with quality assurance in data science. Roger Magoulas, head of O’Reilly’s data analysis group, recently had to search for Apple job listings that required candidates to have geolocation skills. The problem was that not only did Magoulas need an understanding of job listing formats, but also the ability to parse English and to separate Apple-specific listings from the wide range of other job postings out there (as well as other information, related to geolocation, but not pertinent to Apple or employment with the company).
This is a perfect example of the difficulty in parsing and quality checking data, and it extends well beyond the search for employment with Apple. Consider running a query for information on Python, the programming language. Google returns an immense number of hits for the snake, instead, leaving you to sort out the right answers for yourself.
And this is just for English – add in the hundreds of other languages spoken around the world, and you begin to get a sense of just how daunting and difficult it can be. Software is helpful here, but it’s not always the best solution. Often, data scientists are required to lean on their own understanding (human intelligence versus machine intelligence). Of course, this brings in the question of additional manpower and costs. A single data scientist simply doesn’t have the ability to sort through 10,000 potential listings to determine relevance, not in any realistic way. That means hiring help (which admittedly can be done relatively easily through sources like Amazon’s Mechanical Turk program or through sites like Fiverr.com or the like). Using the Data Once the data has been gathered (or harvested, if you prefer) and gone through the cleaning process and been analysed, it must be put to use. This is done in many different ways, and will depend largely on the business in question – their needs and initiatives will inform the ways that data is used. The first step here is visualization, which can be done in any number of ways, including Venn diagrams, charts, graphs, tables and more. Again, the visualization format will need to fit the business, as well as the initiative for which the data is being analysed. The visualization format for a field map of a company’s competition would look very different from the format for data depicting potential customers within a 10-mile radius of a company’s brick and mortar shop, but both are examples of what’s possible with data science. With this being said, perhaps the most common visualization format is a graph. Analysis generally leads to the production of information in numeric format. Obviously, that’s understandable by machines, but it’s not so much use for human beings. Those numbers need to be plotted out in a visual format, giving meaning to the information highlighted. 
For example, declining sales numbers might be interesting in a purely numeric format, but they become much more compelling when given life as a graphic, depicting the rapid decline of a company’s profits. In fact, visualization is so important to the data scientist that many employ it at each stage of the process. For instance, one might use scatter plots to get a sense of what’s interesting in the information gleaned before beginning an analysis. Another might plot their data to get a sense of how skewed it is, or how much false information is included before conditioning. Even animation can be included here to get a sense of how information (and corresponding real world trends) changes over time. Some of the programs used to create visualizations include R, Processing, Many Eyes (IBM) and GnuPlot.
After data visualization comes data implementation – actually putting the information gleaned to use. This is generally not the role of the data scientist. After creating a visualization format for the information, it is handed off to others within the organization and the data scientist moves to the next project. Other teams or professionals will take the information and use it to move the company forward towards its objective.
In Conclusion: The Future for Data Scientists
What do the rise of Big Data and the increasing use of data science mean for the world at large? It has a number of impacts, but the most salient is this: every business will need someone to fill the position of data scientist, even if it doesn’t go by that name. Every business, every corporation, every organization and even government agencies must have qualified data scientists capable of transforming raw information into something that can be used to move the organization forward in multiple ways. According to GlassDoor.com, data science jobs have reached “critical mass” in many ways.
It is currently the 15th highest paying job in demand, with almost 3,500 openings and an average salary base of over $100,000 annually (in the US – other nations vary). It is also currently ranked as the 9th best job in the US. This is not about to change, either. A study by McKinsey Global Institute stated that, “a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge.” It goes on to estimate that up to 5 million jobs in the United States alone will require skills in data science by the year 2018. These include positions for data analysts, data engineers, statisticians and data scientists.
Data science professionals have unique abilities and skills, as well. They combine the spirit of entrepreneurship with patience, mathematical skills with the ability to explore and make connections, computer science skills with an understanding of human behaviour. UC Berkeley states, “Virtually every sector of the economy now has access to more data than would have been imaginable even a decade ago. Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it. The question facing every organization that wants to attract a community is how to use data effectively – not just their own data, but all the data that’s available and relevant.”
Even the New York Times weighed in on the burgeoning field of data science, saying, “This hot new field promises to revolutionize industries from business to government, health care to academia.” Obviously, Big Data is here to stay, and the need for professionals capable of transforming that raw information into something that can be used to further organization goals, create community, foster customer engagement and more is immense.
Source:
https://beta.oreilly.com/ideas/what-is-data-science
http://datascience.berkeley.edu/about/what-is-data-science/
http://www.revelytix.com/?q=content/what-data-science-0
http://www.quora.com/What-is-data-science
http://venturebeat.com/2013/11/11/data-scientists-needed/
https://gigaom.com/2013/01/06/why-data-scientists-matter-data-science-is-the-future-of-everything/
http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/
Good piece, Chris, and well said: there is a huge difference between big (useless) data and usable (even smaller) data. If you have seen the recent data science survey from Kaggle, this is indeed the greatest problem for any data scientist, i.e., data cleansing and preparation to make the data usable.
Data Mining: Models and Methods
What is Data Mining?
Data mining refers to the discovery and extraction of patterns and knowledge from large sets of structured and unstructured data. Data mining techniques have been around for many decades; however, recent advances in machine learning (ML), computer performance, and numerical computation have made data mining methods easier to apply to large data sets and to business-centric tasks. The growing popularity of data mining in business analytics and marketing is also due to the proliferation of Big Data and cloud computing. Large distributed databases and methods for parallel processing of data, such as MapReduce, make huge volumes of data manageable and useful for companies and academia. Similarly, the cost of storing and managing data is reduced by cloud service providers (CSPs), who offer a pay-as-you-go model for access to virtualised servers, storage capacity (disc drives), GPUs (graphics processing units), and distributed databases. As a result, companies can store, process, and analyze more data and gain better business insights.
By themselves, state-of-the-art data mining methods are powerful in many classes of tasks, among them anomaly detection, clustering, classification, association rule learning, regression, and summarization. Each of these tasks plays a crucial role in almost any setting one might think of. For example, anomaly detection techniques help companies protect against network intrusions and data breaches. Regression models, in turn, are powerful in predicting business trends, revenues, and expenses. Clustering techniques are most useful for grouping huge volumes of data into cohesive entities that reveal patterns and dependencies both within and among them, without prior knowledge of any laws that govern the observations. As these examples illustrate, data mining has the power to put data into the service of businesses and entire communities.
Data Mining Models
There exist numerous ways to organize and analyze data. Which approach to select depends largely on our purpose (e.g. prediction or the inference of relationships) and the form of the data (structured vs. unstructured). We can end up with a particular configuration of data that is good for one task but not so good for another. Thus, to make data usable, one should be aware of the theoretical models and approaches used in data mining and understand the possible trade-offs and pitfalls of each.
Parametric and Non-Parametric Models
One way of looking at a data mining model is to determine whether it has parameters or not. In terms of parameters, we have a choice between parametric and non-parametric models. In the first type of model, we select a function that, in our view, best fits the training data. For instance, we may choose a linear function of the form F(X) = q0 + q1 x1 + q2 x2 + ... + qp xp, in which the x’s are features of the input data (e.g. house size, floor, number of rooms) and the q’s are unknown parameters of the model. These parameters may be thought of as weights that determine the contribution of the different features to the value of the function, which approximates the response Y (e.g. house price). The task of a parametric model is then to find the parameters q using some statistical method, such as linear regression or logistic regression.
The main advantage of parametric models is that they encode our intuition about the relationships among features in the data. This makes parametric models an excellent heuristic, inference, and prediction tool. At the same time, however, parametric models have several pitfalls. If the function we have selected is too simple, it may fail to properly explain patterns in complex data. This problem, known as underfitting, is frequent when linear functions are used with non-linear data.
On the other hand, if our function is too complex (e.g. with high-degree polynomial terms), it may lead to overfitting, a scenario in which our model responds to the noise in the data rather than to actual patterns and does not generalize to new examples.
Figure #1: Examples of normal, underfit, and overfit models.
Non-parametric models avoid the problem of choosing the wrong functional form because they make no assumptions about the underlying form of the function. Therefore, non-parametric models are well suited to dealing with unstructured data. On the other hand, since non-parametric models do not reduce the problem to the estimation of a small number of parameters, they require very large datasets in order to obtain a precise estimate of the function.
Restrictive vs. Flexible Methods
Data mining and ML models may also differ in terms of flexibility. Generally speaking, parametric models, such as linear regression, are considered to be highly restrictive, because they need structured data and actual responses (Y) to work. This very feature, however, makes them suitable for inference: finding relationships between features (e.g. how the crime rate in a neighborhood affects house prices). Because of this, restrictive models are interpretable and clear. The same is not true of flexible models (e.g. non-parametric models). Because flexible models make no assumptions about the form of the function that governs the observations, they are less interpretable. In many settings, however, the lack of interpretability is not a concern. For example, when our only interest is the prediction of stock prices, we need not care about the interpretability of the model at all.
Supervised vs. Unsupervised Learning
Nowadays, we hear a lot about supervised and unsupervised machine learning. New neural networks based on these concepts are making progress daily in image and speech recognition and in autonomous driving.
A natural question, though, is: what is the difference between the unsupervised and supervised learning approaches? The main difference lies in the form of the data used and in the techniques applied to analyze it. In a supervised learning setting, we use labeled data that consists of features (independent variables) and a dependent variable (Y, or the response). This data is fed to a learning algorithm that searches for patterns and for a function that describes the relationship between the independent and dependent variables. The learned function may then be applied to predict future observations.

In unsupervised learning, we also observe a vector of features (e.g. house size, floor). The difference from supervised learning is that we do not have any associated results (Y). In this case, we cannot apply a linear regression model since there are no response values to predict. Thus, in an unsupervised setting, we are in some sense working blind.

Data Mining Methods

In this section, we describe the technical details of several data mining methods: linear regression, classification, and clustering. These methods are among the most popular in data mining because they solve a wide variety of tasks, including inference and prediction. They also illustrate the key features of the data mining models described above. For example, linear regression and classification (logistic regression) are examples of parametric, supervised, and restrictive methods, whereas clustering (k-means) belongs to the non-parametric, unsupervised methods.

Linear Regression for Machine Learning

Linear regression is a method of finding a linear function that reasonably approximates the relationship between the features of the data points and a dependent variable. In other words, it finds an optimized function to represent and explain the data.
Contemporary advances in processing power and computation methods make it possible to combine linear regression with ML algorithms for quick and efficient function optimization. In this section, we describe an implementation of linear regression with gradient descent that fits data to a linear function algorithmically.

Image #1: Linear regression

For this task, let's take the case of house price prediction. Let's assume we have a training set of 100 house examples (m = 100). The houses in this sample may be denoted x1, x2, x3, ..., xm. Each house has a set of features or properties, such as house size and floor. Features may be thought of as variables that determine a house price. So, for example, the first feature of x1 would refer to the size of the first house in the training sample. Finally, our training sample has a list of prices for each house, denoted y1, y2, ..., ym. This data tells us much by itself (e.g. we may apply methods of descriptive statistics to interpret it); however, in order to run a linear regression, we should first formulate an initial hypothesis. Our hypothesis may be defined as a simple linear function with three parameters (Q):

hQ(x) = Q0 + Q1 x1 + Q2 x2

where, for a single house, x1 and x2 are its two features (house size and floor) and Q0, Q1, Q2 are the parameters of the function that we want to estimate with linear regression. In essence, this hypothesis says that a house price is determined by house size and floor, weighted by the parameters Q. This confirms that linear regression is a parametric model, in which we try to find the parameters that best explain the data. However, what method should we apply to determine the right parameters? Intuitively, our task is to fit parameters that ensure our hypothesis h(x) for each house is close to y, the real-world price.

Image #2: Gradient Descent
For that purpose, we have to define a cost function that evaluates the difference between a predicted value and an actual value.

Equation #1: Cost Function for Linear Regression

The right-most part of this equation is a version of the popular least-squares method, which calculates the squared difference between the training value y and the value predicted by our hypothesis function hQ(x). Our task is then to minimize the cost function so that the prediction error is as small as possible. One of the most popular solutions to this problem is the gradient descent algorithm, which is based on the mathematical properties of the gradient. The gradient of a multivariate function f is the vector whose components are the partial derivatives of f; it points in the direction of the greatest rate of increase of the function. Since the gradient points in the direction of the function's growth, it may be used to find parameters that minimize our function: we simply need to move in the opposite direction. The technique of gradually moving down the function to find a local or global minimum is known as gradient descent and is demonstrated in the image above.

To implement gradient descent for our linear regression, we start with random parameters Q and then repeatedly update them in the direction opposite to the gradient vector until they converge to the global minimum (the least-squares cost of a linear hypothesis is convex, which guarantees that a global minimum exists and that there are no other local minima). The gradient descent procedure is defined by the update algorithm

Equation #2: Gradient Descent Algorithm

where a is the learning rate, which controls the size of each update step. The learning rate should not be too large, so that we don't jump over the global minimum, and should not be too small, because then the process would take a long time. The partial derivative in the right-most part of the equation is calculated for each parameter to construct the gradient vector.
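The update procedure described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes the cost averages the squared errors and halves them (the 1/(2m) scaling is an assumption, since the equations are not reproduced in this text), it starts from zeros rather than random values so that the run is reproducible, and the toy data, the function names, the learning rate, and the iteration count are all hypothetical.

```python
def predict(q, x):
    """Hypothesis h(x) = q[0] + q[1]*x1 + q[2]*x2 (q[0] is the constant term)."""
    return q[0] + sum(qj * xj for qj, xj in zip(q[1:], x))


def cost(q, X, y):
    """Half of the mean squared error (the 1/(2m) scaling is an assumption)."""
    m = len(X)
    return sum((predict(q, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent: every step uses all m training examples."""
    m, n = len(X), len(X[0])
    q = [0.0] * (n + 1)  # assumption: start from zeros instead of random values
    for _ in range(iterations):
        errors = [predict(q, xi) - yi for xi, yi in zip(X, y)]
        # Partial derivatives of the cost with respect to each parameter.
        grad = [sum(errors) / m]
        for j in range(n):
            grad.append(sum(e * xi[j] for e, xi in zip(errors, X)) / m)
        # Move against the gradient, scaled by the learning rate alpha.
        q = [qj - alpha * gj for qj, gj in zip(q, grad)]
    return q


# Hypothetical toy data: price = 50 + 30*size + 10*floor, with no noise.
X = [[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [3.0, 2.0], [2.5, 4.0]]
y = [50 + 30 * size + 10 * floor for size, floor in X]

q = gradient_descent(X, y)
print([round(qj, 3) for qj in q])
```

On this noise-free toy data the loop should recover parameters close to 50, 30 and 10. With real data it is usually worth rescaling the features first, so that a single learning rate suits all of them.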
The partial derivative is derived in the following way:

Equation #3: Partial Derivative of Gradient Descent

Putting the partial derivatives and the learning rate together produces the final update rule

Equation #4: Gradient Descent Update Rule

which should be repeated until our algorithm finds the global minimum. That is the point at which our parameters (Q) produce the function that best explains the training data. This learned function may now be used to predict prices for houses not included in the training sample and to infer relationships between the features of our model. This makes linear regression with gradient descent a powerful technique in both data mining and machine learning.

Classification with Logistic Regression

Classification is the process of determining the class or category to which an object belongs. Classification techniques implemented via machine learning algorithms have numerous applications, ranging from email spam filtering to medical diagnostics and recommender systems. As in linear regression, in a classification problem we work with a labeled training set that includes some features. However, observations in the data set map not to a quantitative value, as in linear regression, but to a categorical value (i.e. a class). For example, patients' medical records may distinguish two classes of patients: those with benign and those with malignant tumors. The task of the classification algorithm is then to learn a function that best predicts which type of tumor (malignant vs. benign) a patient has. If there are only two classes, the problem is known as binary classification. In contrast, multi-class classification is used when we have more than two classes.

One of the most common classification techniques in data science and ML is logistic regression. Logistic regression is based on the sigmoid function, which has an interesting property: it maps any real number to the (0,1) interval.
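As a quick illustration of that property, the sigmoid function can be written directly from its standard definition g(z) = 1 / (1 + e^(-z)). The sketch below is an illustrative example rather than part of the original paper; the function name `sigmoid` and the branch on the sign of z (used only to avoid overflow in `math.exp` for large negative inputs) are choices made here.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form for negative z that never exponentiates a large number.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Every input, however large or small, lands between 0 and 1.
for z in (-20, -2, 0, 2, 20):
    print(z, sigmoid(z))
```

Mathematically the output never reaches 0 or 1; in floating-point arithmetic, inputs of very large magnitude round to exactly 0.0 or 1.0.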
As a result, the sigmoid function may be used to estimate the probability (between 0 and 1) that an observation falls within a certain category. For example, if we encode a benign tumor as 0 and a malignant tumor as 1, a logistic value of 0.6 would mean that there is a 60% chance that a patient's tumor is malignant. These properties make the sigmoid function useful for binary classification, but multi-class classification is also possible.

Image #3: Sigmoid Function

The formula for the sigmoid function is:

Equation #5: Sigmoid Function

where e is Euler's number (approximately 2.71828), the base of the natural logarithm. To build a working classification model, we should plug our hypothesis into the sigmoid function. Remember that our hypothesis function has the form hQ(x) = Q0 + Q1 x1 + Q2 x2. For convenience, it may be written in the vector form z = Qt x, where the superscript t denotes the transpose of the parameter vector Q and x is extended with a constant component x0 = 1 so that Q0 is included. As a result of this transformation, we get the following function

Equation #6: Logistic regression model

where z refers to the vector representation of our initial hypothesis hQ(x). In order to fit the parameters Q of our logistic regression model, we should first redefine it in probabilistic terms. This is also needed to leverage the power of the sigmoid function as a classifier.

Equation #7: Probabilistic Interpretation of Classification

The above definition stems from the basic rule that the probabilities of all possible outcomes add up to 1. So, for example, if the probability of a malignant tumor is 0.7, the probability of a benign tumor is 1 - 0.7 = 0.3. The equation above formalizes this observation.

Now that we have defined the hypothesis and the probabilistic assumptions, it's time to construct a cost function in the same way we did for linear regression. For that purpose, we cannot simply reuse the squared-error cost: once the sigmoid function is plugged into it, the cost becomes a complex non-convex function that can have many local minima.
If used with the least-squares cost from the linear regression model, repeated below, gradient descent on a sigmoid hypothesis will have a hard time converging to the global minimum.

Equation #8: Linear Regression Cost Function

Instead, to capture the intuition behind the sigmoid function, we may use the log-probability applied to the probabilistic definition of the classification problem given above.

Equation #9: Logistic Regression Cost Function

The graph below illustrates that the log function assigns a high cost if our hypothesis is wrong and no cost if the hypothesis is right. If y = 1 and hQ(x) approaches 1, then the cost approaches 0. In contrast, if y = 1 and hQ(x) approaches 0, then the cost goes to infinity. The opposite happens if y = 0.

Image #4: -log(x) Function

We can make a simplified version of the cost function by merging these two cases into a single expression. The final cost function ready for use with logistic regression has the following form:

Equation #10: Logistic Regression Cost Function (Simplified)

Now that the cost function is formulated, we can use gradient descent with an update rule of the same form as the one applied in linear regression.

Equation #11: Logistic Regression Update Rule

A similar technique may be applied to the multi-class classification problem with more than two classes. In general, multi-class problems use a one-vs-all approach in which we choose one class and lump all the others into a single second class. We do this for each class in turn, applying binary logistic regression to each case, and then use the class whose hypothesis returned the highest value as our prediction.

Image #5: One-vs-all or Multiclass Classification

Clustering Methods

As we have seen, clustering is an unsupervised method that is useful when the data is not labeled, that is, when there are no response values (y). Clustering the observations of a data set means partitioning them into distinct groups so that observations within each group are quite similar to each other, while observations in different groups have less in common.
To illustrate this method, let's take an example from marketing. Assume that we have a large volume of data about consumers. This data may include median household income, occupation, distance from the nearest urban area, and so forth. This information may then be used for market segmentation. Our task is to identify groups of customers without prior knowledge of the commonalities that may exist among them. Such segmentation may then be used to tailor marketing campaigns to specific clusters of consumers.

There are many clustering techniques, but the most popular are the k-means clustering algorithm and hierarchical clustering. In this section, we describe the k-means method, an efficient algorithm that covers a wide range of use cases. In k-means clustering, we want to partition the observations into a pre-specified number of clusters. Although having to set the number of clusters in advance is considered a limitation of the k-means algorithm, it is still a very powerful technique. In our clustering problem, we are given a training set of consumers x(1), ..., x(m), each described by individual features xj. The features of each consumer form a vector of variables that describe properties such as median income, age, gender, and so forth. The rule is that each observation (consumer) should belong to exactly one cluster.

Image #6: Illustration of the Clustering Process

The idea behind k-means clustering is that a good cluster is one for which the within-cluster variation, or within-cluster sum of squares (WCSS), is minimal. In other words, consumers in the same cluster should have more in common with each other than with consumers from other clusters. To achieve this configuration, our task is to algorithmically minimize the WCSS over all pre-specified clusters.
This task is expressed in the following equation:

Equation #12: Within-cluster Sum of Squares (WCSS)

where |Ck| denotes the number of observations in the kth cluster. In words, the equation above says that we want to partition the observations into K clusters so that the total within-cluster variation, summed over all K clusters, is as small as possible. The within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in this cluster, divided by the total number of observations in the kth cluster.

As in the case of linear regression, to minimize this function we should start from some initial guess. Our task is to find the cluster centroids (the average position of all points in a cluster), or means, for each cluster. This may be achieved via a three-step algorithm in which we:

1. Randomly initialize K cluster centroids m1, m2, ..., mK.

2. Assign each observation in the data set to the cluster that yields the least WCSS. Intuitively, this is the cluster with the 'nearest' mean (centroid). To find the nearest centroid, we calculate the Euclidean distances between the observation and the centroids and select the centroid with the smallest distance.

Equation #13: Assigning observation to the closest centroid

3. Move each centroid m by recomputing it for each of the K clusters. The centroid of the kth cluster is the vector of the p feature means for the observations in the kth cluster. So, for example, if x1, x2, x3 belong to the same cluster (c2), then the centroid m2 is defined by their average:

Equation #14: Centroid Calculation Example

Steps 2 and 3 are then repeated. Since the arithmetic mean is the least-squares estimator, step 3 also minimizes the within-cluster sum of squares. This means that as the algorithm runs, the clustering obtained will continually improve until the result no longer changes. When this happens, a local optimum has been reached and the clusters become stable.
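The three steps above can be sketched in plain Python as follows. This is a minimal sketch under assumptions, not a reference implementation: the text only says that the centroids are initialized randomly, so drawing them from the data points is a choice made here, and the function names, the fixed seed, the iteration cap, and the two-feature toy data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean_point(points):
    """Centroid: the vector of per-feature means of the given points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize K centroids (assumption: K distinct data points).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every observation to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clusters are stable
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments


# Hypothetical consumers described by two features, forming two groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [8.0, 8.0], [8.2, 7.9], [7.9, 8.3], [8.1, 8.2]]
centroids, assignments = k_means(points, k=2)
print(centroids, assignments)
```

Because the result is only a local optimum, a common practice is to run the algorithm from several different initializations and keep the clustering with the smallest WCSS.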
Conclusion

The data mining models and methods described in this paper allow data scientists to perform a wide array of tasks, including inference, prediction, and analysis. Linear regression is powerful for predicting trends and inferring relationships between features. Logistic regression, in turn, may be used for the automatic classification of behaviors, processes, and objects, which makes it useful in business analytics and anomaly detection. Finally, clustering makes it possible to gain insights about unlabeled data and to infer hidden relationships that may drive effective business decisions and strategic choices.
Highly informative & appreciable.......
PUBLIC SECTOR CLOUD CONFERENCE 6th December 2017
Location: University of Salford

Funded places are being offered to members of the Data Science Foundation; contact firstname.lastname@example.org for further information on funded places.

The Public Sector Cloud conference will take place at the University of Salford on the 6th December. The event will bring together leading experts in IoT and digital infrastructure. Speakers will share their views on the development of cloud-based systems within government and education. The constant pressure on councils to adopt a 'Cloud First' approach is a broad concern; however, because integrated systems and legacy operations have been in place for so long, the transition will certainly take time and money to achieve. With the added cost, cloud transition seems some way off, as budget pressures are an overwhelming concern, made worse by the constant scrutiny of the media spotlight.

We will be offering a case study on the day covering success stories experienced by local councils and public institutions. The day as a whole is designed to make our delegates aware of the obstacles and challenges ahead and to give them a discernible path to updating their systems gradually and progressively.

An example of speakers and themes:

Liam Maxwell, UK National Technology Adviser, HM Government
Promoting and supporting digital industry in the UK and internationally; the future of public sector services and migrating to cloud-based services.

James Stewart, Former Director of Technical Architecture of UK Government Digital Service
Why cloud remains a key part of any "Digital by Default" agenda; new cloud developments, 4 years on.

Christopher Wroath, Director of Digital, NHS Education for Scotland
Cloud as part of the new Health & Social Care Delivery plan; Cloud first, the NES journey.

Event Page: http://www.salford.ac.uk/onecpd/courses/public-sector-cloud
Funded places are being offered to members of the Data Science Foundation; contact email@example.com for further information on funded places. The funded places are limited, so if you are interested in attending, please get in contact as soon as possible.
Introduction to Artificial Neural Networks (ANNs)
White Paper, 5 September 2017

Introduction

Machine Learning (ML) is a subfield of computer science that stands behind the rapid development of Artificial Intelligence (AI) over the past decade. Machine Learning studies algorithms that allow machines to recognize patterns, construct prediction models, or generate images or videos through learning. ML algorithms can be implemented using a wide variety of methods such as clustering, linear regression, decision trees, and more.

In this paper, we discuss the design of Artificial Neural Networks (ANNs), an ML architecture that has gathered powerful momentum in recent years as one of the most effective learning methods for solving complex problems in computer vision, speech recognition, NLP (Natural Language Processing), and image, audio, and video generation. Thanks to their multilayer design, which is modeled on the biological structure of the human brain, ANNs have firmly established themselves as the state-of-the-art technology driving the AI revolution. In what follows, we describe the architecture of a simple ANN and offer an intuition of how it may be used to solve complex nonlinear problems efficiently.

What is an Artificial Neural Network?

An Artificial Neural Network is an ML algorithm inspired by biological neural networks and by computational models of the brain. In a nutshell, an ANN is a computational representation of the human neural network that underlies human intelligence, reasoning, and memory. However, why should we necessarily emulate the human brain to develop efficient ML algorithms?

The main rationale behind using ANNs is that neural networks are efficient at complex computations and at the hierarchical representation of knowledge.
Neurons connected by axons and dendrites into complex neural networks can pass and exchange information, store intermediate computation results, produce abstractions, and divide the learning process into multiple steps. A computational model of such a system can thus produce very efficient learning processes similar to the biological ones.

The perceptron algorithm, invented in 1957 by Frank Rosenblatt, was one of the first attempts to create a computational model of a biological neural network. However, complex neural networks with multiple layers and many neurons became practical only recently, thanks to the dramatic increase in computing power (Moore's Law), more efficient GPUs (Graphics Processing Units), and the proliferation of Big Data used for training ML models. In the 2000s and 2010s, these developments gave rise to Deep Learning (DL), a modern approach to the design of ANNs based on a deep cascade of layers that extract features from data and build transformations and hierarchical representations of knowledge.

Thanks to their ability to simulate complex nonlinear processes and create hierarchical and abstract representations of data, ANNs stand behind recent breakthroughs in image recognition and computer vision, NLP, generative models, and various other ML applications that seek to retrieve complex patterns from data. Neural networks are especially useful for studying nonlinear hypotheses with many features (e.g. n = 100). Constructing an accurate hypothesis for such a large feature space would require many high-order polynomial terms, which would inevitably lead to overfitting, a scenario in which the model describes the random noise in the data rather than the underlying relationships and patterns.

Image #1: Overfitting problem

The problem of overfitting is especially tangible in image recognition problems, where each pixel may represent a feature.
For example, when working with 50 x 50 pixel images, we have 2,500 features (one per pixel), which would make manual construction of the hypothesis almost impossible.

A Simple Neural Network with a Single Neuron

The simplest possible neural network consists of a single "neuron" (see the diagram below). Using a biological analogy, this neuron is a computational unit that takes inputs via dendrites as electrical signals (let's say "spikes") and transmits the result via its axon to the next layer or to the network's output.

Image #2: A neural network with a single neuron

In the simple neural network depicted above, the dendrites are the input features (x1, x2, ...) and the output (axon) represents the result of our hypothesis (hw,b(x)). Besides the input features, the input layer of a neural network normally has a 'bias unit', which is equal to 1. The bias unit is needed to include a constant term in the hypothesis function.

In Machine Learning terms, the network depicted above has one input layer, one hidden layer (consisting of a single neuron), and one output layer. The learning process of this network works in the following way. The input layer takes the input features (e.g. pixels) for each training sample and feeds them to the activation function that computes the hypothesis in the hidden layer. The activation function is normally the logistic (sigmoid) function used in logistic regression for classification; however, other alternatives are also possible. In the case described above, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.

Image #3: Logistic Regression

As in simple binary classification, our logistic regression has parameters. In ANN models, they are often called "weights".

Multi-Layered Neural Network

To understand how neural networks work, we need to formalize the model and describe it in a real-world scenario. In the image below, we can see a multilayer network that consists of three layers and has several neurons.
Here, as in the single-neuron network, we have one input layer with three inputs (x1, x2, x3) and an added bias unit (+1). The second layer of the network is a hidden layer consisting of three units (neurons), each represented by an activation function. We call it a hidden layer because we don't observe the values computed in it. In fact, a neural network can contain multiple hidden layers that pass the results of their computations from the "surface" layers to the "bottom" of the neural network. The design of a neural network with many hidden layers is frequently used in Deep Learning (DL), a popular approach in ML research that has gained powerful momentum in recent years.

Image #4: Multilayer Perceptron

The hidden layer (Layer 2) above has three neurons (a12, a22, a32). In abstract terms, aij denotes the activation of unit i in layer j. In our case, a12 is the activation of the first neuron of the second (hidden) layer. By activation, we mean the value that is computed by the activation function (e.g. the logistic function) in this layer and output by that node to the next layer. Finally, Layer 3 is an output layer that takes the results from the hidden layer and applies its own activation function to them. This layer computes the final value of our hypothesis. During training, this cycle is repeated until the neural network arrives at the weights that best predict the values of the training data.

So far, we haven't defined how the 'weights' work in the activation functions. For that reason, let's define Q(j) as the matrix of parameters (weights) that controls the function mapping from layer j to layer j + 1. For example, Q1 controls the mapping from the input layer to the hidden layer, whereas Q2 controls the mapping from the hidden layer to the output layer. The dimensionality of the Q matrix is defined by the following rule.
If our network has sj units in layer j and sj+1 units in layer j+1, then Qj will have dimension sj+1 x (sj + 1). The extra column comes from the bias unit x0 in layer j and its weight Q0(j). In other words, the output nodes do not include the bias unit while the input nodes do. To illustrate how the dimensionality of the Q matrix works, let's assume that we have two consecutive layers with 101 and 21 units respectively. Then, using our rule, Qj would be a 21 x 102 matrix with 21 rows and 102 columns.

Image #5: A Neural Network Model

Let's put it all together. In the image above, we see our neural network with three layers again. What we need to do is calculate the activations of the hidden layer from the input values, and then our main hypothesis function from the values calculated in the previous (hidden) layer. In this way, our neural network works as a cascade of calculations in which each layer supplies values to the activation functions of the next one.

To calculate the activations, we first have to define the dimensionality of our Q matrices. In this example, we have 3 input and 3 hidden units, so Q1, which maps from the input to the hidden layer, is of dimension 3 x 4 because the bias unit is included. The activation of each hidden neuron (e.g. a12) is equal to our sigmoid function applied to the linear combination of the inputs with the weights taken from the weight matrix Q1. In the diagram above, you can see that each activation unit is computed by the function g, which is our logistic (sigmoid) function.

Q2, in turn, is the matrix of weights that maps from the hidden layer to the output layer. These weights may be randomly assigned before the neural network runs or be the product of previous computations. In our case, Q2 is a 1 x 4 matrix (i.e. a row vector). To calculate the output, we apply our hypothesis function (the sigmoid function) to the results calculated by the activation functions in the hidden layer.
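The cascade described above (inputs to hidden activations to output) can be sketched in plain Python. This is an illustrative sketch only: the weight values in `Q1` and `Q2` are arbitrary numbers chosen here, and the helper names `sigmoid` and `layer_forward` are hypothetical. Only the shapes (3 x 4 and 1 x 4) follow from the text.

```python
import math


def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(weights, inputs):
    """Compute the activations of the next layer.

    `weights` has one row per unit of the next layer and one column per
    unit of the current layer plus one; column 0 multiplies the bias unit.
    """
    with_bias = [1.0] + list(inputs)  # prepend the bias unit (+1)
    return [
        sigmoid(sum(w * a for w, a in zip(row, with_bias)))
        for row in weights
    ]


# Arbitrary example weights (assumption): 3 hidden units, 3 inputs + bias.
Q1 = [[0.1, 0.2, -0.3, 0.4],
      [-0.5, 0.6, 0.7, -0.8],
      [0.9, -1.0, 1.1, 1.2]]    # 3 x 4
# One output unit, 3 hidden units + bias.
Q2 = [[0.3, -0.6, 0.9, -1.2]]   # 1 x 4

x = [1.0, 0.5, -1.5]            # input features x1, x2, x3
hidden = layer_forward(Q1, x)   # activations of Layer 2
output = layer_forward(Q2, hidden)  # value of the hypothesis
print(hidden, output)
```

Adding another hidden layer in this sketch would only mean one more weight matrix and one more call to `layer_forward`, with its row length matching the previous layer's size plus one.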
If we had several hidden layers, the results of each layer's activation functions would be passed to the next hidden layer and finally to the output layer. This sequential mechanism makes neural networks very powerful for computing nonlinear hypotheses and complex functions. Instead of trying to fit inputs to manually designed polynomial functions, we can create a neural network with numerous activation functions that exchange intermediate results and update weights. This automatic setup makes it possible to create nonlinear models that are more accurate in the prediction and classification of our data.

Neural Networks in Action

The power of neural networks to compute complex nonlinear functions may be illustrated using the following binary classification example, taken from the Coursera Machine Learning course by Professor Andrew Ng. Consider the case when x1 and x2 can each take one of two binary values (0 or 1). To put this binary classification problem in Boolean terms, our task is to compute y = x1 XNOR x2, a logic gate that may be interpreted as NOT (x1 XOR x2). In other words, the function is true if x1 and x2 are both 0 or both 1. To make our network calculate XNOR, we first have to describe the simple logical functions to be used as intermediate activations in the hidden layer. The first function we want to compute is the logical AND function: y = x1 AND x2.

Image #6: Logical AND function

As in the first example above, our AND function is a simple single-neuron network with inputs x1 and x2 and a bias unit (+1). The first thing we need to do is assign weights to the activation function and then compute it for the input values specified in the truth table below. These input values are all the possible binary values that x1 and x2 can take. By plugging 0s and 1s into the function (i.e. logistic regression), we can compute our hypothesis:

hQ(x) = g(-30 + 20x1 + 20x2).
To understand how the values of the third column of the truth table are found, remember that the sigmoid function is approximately 0.01 at z = -4.6 and approximately 0.99 at z = 4.6, so it is very close to 0 for large negative inputs and very close to 1 for large positive inputs. As a result, we have:

x1  x2  hQ(x)
0   0   g(-30) ≈ 0
0   1   g(-10) ≈ 0
1   0   g(-10) ≈ 0
1   1   g(10) ≈ 1

As we can see, the rightmost column matches the definition of the logical AND function, which is true only if both x1 and x2 are true.

The second function we need for our neural network is the logical OR function. In the logical OR, y is true (1) if either x1 or x2, or both of them, are 1 (true).

Image #7: Logical OR function

As in the previous case with the logical AND, we assign weights that fit the definition of the logical OR function. Plugging these weights into our logistic function, g(-10 + 20x1 + 20x2), we get the following truth table:

x1  x2  hQ(x)
0   0   g(-10) ≈ 0
0   1   g(10) ≈ 1
1   0   g(10) ≈ 1
1   1   g(30) ≈ 1

As you can see, our function is false (0) only if both x1 and x2 are false. In all other cases, it is true. This corresponds to the logical OR function.

The last function we need to compute before running a network for finding x1 XNOR x2 is (NOT x1) AND (NOT x2). In essence, this function consists of two logical negations (NOT). A single negation, NOT x1, may be represented by the following diagram. It says that y is true only if x1 is false. Therefore, the logical NOT has only one input unit (x1).

Image #8: Logical NOT

After plugging the input and the weights into g(10 - 20x1), we end up with the following truth table.

x1  hQ(x)
0   g(10) ≈ 1
1   g(-10) ≈ 0

The output values of this table confirm that the NOT function outputs true only if x1 is false. Now we can find the values of the logical (NOT x1) AND (NOT x2) function.

Image #9: Logical (NOT x1) AND (NOT x2)

Plugging the binary values of x1 and x2 into the function g(10 - 20x1 - 20x2), we end up with the following truth table.
x1  x2  hQ(x)
0   0   g(10) ≈ 1
0   1   g(-10) ≈ 0
1   0   g(-10) ≈ 0
1   1   g(-30) ≈ 0

This table demonstrates that the logical (NOT x1) AND (NOT x2) function is true only if both x1 and x2 are false. These three simple functions (logical AND, logical OR, and the double-negation AND function) may now be used as the activation functions in our three-layer neural network to compute the nonlinear function defined at the beginning: x1 XNOR x2. To do this, we need to put these three simple functions together into a single network. This network uses the three logical functions calculated above (logical AND, logical (NOT x1) AND (NOT x2), and logical OR) as its activation functions.

Image #10: A Neural Network to Compute XNOR Function

As you can see, the first layer of this network consists of two inputs (x1 and x2) plus a bias unit (+1). The first unit of the hidden layer, a(2)1, is the logical AND activation function with the weights specified above (-30, 20, 20). The second unit, a(2)2, is the (NOT x1) AND (NOT x2) function with parameters 10, -20, -20. Doing our usual calculations, we get the values 0, 0, 0, 1 for a(2)1 and the values 1, 0, 0, 0 for a(2)2. The final step is to use the second set of parameters, from the logical OR function that sits in the output layer. Here we simply take the values produced by the two units in the hidden layer and apply the OR function with its parameters (-10, 20, 20) to them. The results of this computation make up our hypothesis function (1, 0, 0, 1), which is our desired XNOR function.

x1  x2  a(2)1  a(2)2  hQ(x)
0   0   0      1      1
0   1   0      0      0
1   0   0      0      0
1   1   1      0      1

That's it! As this example illustrates, neural networks are powerful in computing complex nonlinear hypotheses by using a cascade of functions. In fact, neural networks can use the output values of one function as the inputs of other functions.
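The XNOR network above is small enough to check by hand, and also in code. The following is a minimal sketch using the weights quoted in the text: (-30, 20, 20) for AND, (10, -20, -20) for (NOT x1) AND (NOT x2), and (-10, 20, 20) for OR. The helper names `g`, `neuron`, and `xnor` are hypothetical and chosen here for illustration.

```python
import math


def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """One unit: sigmoid of bias weight plus the weighted sum of the inputs."""
    bias, *w = weights
    return g(bias + sum(wi * xi for wi, xi in zip(w, inputs)))


# Weights taken from the text, in the order (bias, x1, x2).
AND = (-30, 20, 20)
NOT_X1_AND_NOT_X2 = (10, -20, -20)
OR = (-10, 20, 20)


def xnor(x1, x2):
    # Hidden layer: two units computed from the same inputs.
    a1 = neuron(AND, (x1, x2))
    a2 = neuron(NOT_X1_AND_NOT_X2, (x1, x2))
    # Output layer: OR applied to the hidden activations.
    return neuron(OR, (a1, a2))


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```

The raw outputs are values very close to 0 or 1 rather than exact bits, which is why the sketch rounds them before printing.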
Leveraging this functionality, we can design complex multi-layered networks that extract complex features and patterns from images, videos, and other data.
Conclusion
Artificial Neural Networks (ANNs) are the main drivers of the contemporary AI revolution. Inspired by the biological structure of the human brain, ANNs are powerful at modeling functions and hypotheses that would be hard to derive intuitively or logically. Instead of inventing your own function with high-order polynomials, which may lead to overfitting, you can design an efficient ANN architecture that automatically fits complex nonlinear hypotheses to data. This advantage of ANNs has been leveraged for algorithmic feature extraction in computer vision and image recognition. For example, instead of manually specifying a finite list of image features to choose from, we can design a Convolutional Neural Network (CNN) that uses the same principle as the animal visual cortex to extract features. Like the human eye, the layers of a CNN respond to stimuli only in a restricted region of the visual field. This allows the network to recognize low-level features such as points, edges, or corners and gradually merge them into high-level geometric figures and objects. This example illustrates how well ANNs derive hypotheses and models automatically from complex data that includes numerous associations and relationships.
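The XNOR network from the worked example above can be checked numerically with a few lines of Python. This is only a minimal sketch: the weights (-30, 20, 20), (10, -20, -20) and (-10, 20, 20) come from the text, while the function names (g, unit, xnor, truth_table) are illustrative choices and not part of the original article.

```python
import math


def g(z):
    """Sigmoid (logistic) function used by every unit in the network."""
    return 1.0 / (1.0 + math.exp(-z))


# (bias, w1, w2) for each unit, taken from the article text.
AND_W = (-30, 20, 20)          # hidden unit a(2)1: x1 AND x2
NOT_AND_NOT_W = (10, -20, -20)  # hidden unit a(2)2: (NOT x1) AND (NOT x2)
OR_W = (-10, 20, 20)           # output unit: a(2)1 OR a(2)2


def unit(weights, in1, in2):
    """One sigmoid unit: g(bias + w1*in1 + w2*in2)."""
    bias, w1, w2 = weights
    return g(bias + w1 * in1 + w2 * in2)


def xnor(x1, x2):
    """Forward pass through the three-layer network; returns a value in (0, 1)."""
    a1 = unit(AND_W, x1, x2)
    a2 = unit(NOT_AND_NOT_W, x1, x2)
    return unit(OR_W, a1, a2)


def truth_table():
    """Rows of (x1, x2, rounded network output) for all binary inputs."""
    return [(x1, x2, round(xnor(x1, x2))) for x1 in (0, 1) for x2 in (0, 1)]


if __name__ == "__main__":
    for row in truth_table():
        print(*row)
```

Because the hidden activations are only approximately 0 or 1, the output is rounded before it is compared with the truth table. The same unit function with OR_W applied directly to x1 and x2 reproduces the OR table.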
Good overview, Kirill. It might make sense to also refer to the following picture, which gives a broad and quick snapshot of the different neural nets scientists can use: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png
Getting a new periodic table of elements using AI
"Elementary particles are the building blocks of all matter everywhere in the universe. Their properties are connected with the fundamental forces of nature" Murray Gell-Mann
Abstract
Objective: To obtain an atomic classification based on clustering techniques using unsupervised learning algorithms.
Design: The sample of atoms used in the experiments is a set of atomic elements with known properties that are not null for any individual in the sample. Different clustering algorithms are used to establish relationships between the elements, resulting in clusters of atoms related to each other by the numerical values of some of their structural properties.
Results: Sets of elements related to the atom that represents each cluster.
Keywords: Clustering, atoms, periodic table of elements, unsupervised algorithms, Random Forest, K-Means, K-Nearest Neighbour, Weka, Bayesian Classifier.
Introduction
The periodic table of elements is an atomic organisation based on two axes. The horizontal axis establishes an increasing order based on the atomic number (number of protons) of each element. The vertical arrangement is governed by the electronic configuration and presents a taxonomic structure determined by the electrons of the outermost shell. Furthermore, four main blocks arrange the atoms by similar properties (gases, metals, nonmetals, metalloids). In addition to the number of protons and the electronic configuration, atoms are characterised by other attributes that are neither ascending nor cyclic across the periodic table. The values of these properties form a sample of numbers representing different atomic magnitudes that distinguish the chemical elements in some way. In this experiment, some of these chemical and physical dimensions were used to train a set of machine learning algorithms to obtain representative clusters for each element.
Research problem
The hypothesis of this experiment is that some variants of unsupervised learning models can discover relationships between atomic elements based on a few chemical and physical attributes of matter. These techniques compute clusters of categories from the numerical attributes. The research problem also leads to an element clustering that could offer a new atomic distribution based on the functions inferred by the machine learning processes. The goal is to present an organisation of elements based on clustering applied to a specific set of atomic properties.
Units of analysis
The following atomic properties were used to train and evaluate the unsupervised algorithms: melting point [K], boiling point [K], atomic radius [pm], covalent radius [pm], molar volume [cm3/mol], specific heat [J/(kg K)], thermal conductivity [W/(m K)], Pauling electronegativity [Pauling scale], first ionisation energy [kJ/mol] and lattice constant [pm]. Only atoms with non-null values for every magnitude were selected for the sample. Note that some of these properties have not yet been measured or calculated for some atoms, which therefore do not appear in the sample. The raw data can be downloaded from this link. The following graphical representation shows how some of these properties are distributed across the elements sorted by ascending number of protons:
Graphic 1. Distribution of the melting point, boiling point, lattice constant and the atomic radius versus the atomic number.
At first glance, this graphic shows no apparent correlation or pattern between the displayed magnitudes and the elements sorted by ascending atomic number.
Methods
Unsupervised machine learning algorithms infer models that identify hidden structure in "untagged" data.
Thus no categories are included in the observations, and the data used for learning cannot be used to evaluate the accuracy of the results. Using the machine learning library Java-ML and the non-null values of the magnitudes specified above, two exercises were performed:
1 - Clustering of elements
The scope of this exercise is to create clusters of atomic elements using three different machine learning techniques provided by the Java-ML library. The result was three atomic configurations based on the following algorithms:
K-Means clustering with 10 clusters. This algorithm divides the selected atomic elements into k clusters, where each individual is assigned to the cluster with the nearest mean.
Iterative Multi K-Means, an extension of K-Means. This algorithm iterates over different values of k, starting from kMin and increasing to kMax, and runs several iterations for each k. Each clustering result is evaluated with an evaluation score, and the result is the clustering with the best score. The evaluation applied in this exercise was the sum of squared errors.
K-Means clustering wrapped in the Weka algorithms. Algorithms from Weka are accessible from within the Java-ML library. An experiment with 3 clusters was run just to compare with the first exercise (K-Means with 10 clusters).
The results were presented using the TreeMap provided by the d3 graphic library.
Graphic 2. Applying K-Means clustering to the sample.
2 - Classification of atomic elements and the relationships between them
This exercise was intended to evaluate the degree of relationship among the atoms contained in the sample. Three algorithms were applied:
Random Forest with 30 trees to grow and 10 variables randomly sampled as candidates at each split (one for each atomic magnitude). This technique works by constructing a multitude of decision trees at training time and returning the class that is the mode of the classes predicted by the individual trees.
Bayesian Classifier. The Naive Bayes classification algorithm was used to classify the set of elements into different categories.
K nearest neighbour (KNN) classification algorithm with KD-tree support. The number of neighbours was fixed at 8, on the assumption that this number of elements could establish the boundaries for an element positioned in the centre of a square (edge and corner positions are not handled in the current hypothesis).
Graphic 3. Schema of 8 neighbours surrounding the target element
Each algorithm worked as a classifier and produced a membership distribution with an associated degree of membership. Classes with a membership degree equal to zero were not considered. In this experiment, the physical and chemical attribute values were clustered and, afterwards, each atom belonging to the same sample was classified into the set of calculated clusters. Therefore, each element is identified with a specific group, where the only requirement is that the atom being classified must be the representative of the selected category. The calculated clusters were distributed as pairs of atoms with their corresponding degree evaluation, following this structure: [Xi, Yj, Ej], where Xi is each atom in the sample, Yj is each element in the category Y, and Ej is the degree evaluation for the pair. The relationships between the individuals and their categories are shown through a chord graphic based on the Chord Viz component provided by d3.
Graphic 4. Nitrogen relationships considering the evaluation of different classifiers
Results
The three tree maps (one per clustering algorithm) in which the chemical elements have been organised show interesting groups of components. For instance, all of them place S and Se in the same group.
Other atoms (all of them gases) such as Ne, Ar, Kr and Xe are also placed in the same group by all the algorithms (remember that neither the atomic number nor the electronic configuration was included in the models). It is interesting that the configurations generated by the two K-Means algorithms place H and Li in separate single-element clusters.
Regarding the weighted relationships between the elements, a chord graphic was created for each machine learning algorithm. This representation shows how the atomic elements can be related to each other through machine learning techniques that take some of their chemical and physical properties and assign a relational degree to them. There are some interesting behaviours, such as the set of relationships found for nitrogen. The Random Forest algorithm determined that O, Ne and Ar are highly related to it, the Bayesian Classifier calculated that only oxygen was related, and the KNN method found that O, Ne, Cl, Ar, Br, Kr and I are related when the number of neighbours was fixed at 8. Some familiar associations can be found in the calculated relationships when comparing the components of the clusters with their distribution in the periodic table of elements. Nevertheless, these methods have also established other, non-evident atomic relations. Additionally, the relationships are notably non-commutative. For instance, in the results calculated with the Random Forest algorithm, nitrogen is not related to hydrogen in the reverse direction.
Conclusions
Although the atomic organisations calculated by the machine learning algorithms do not follow any physical or chemical rule, some associations arise that create groups of components with configurations similar to those provided by the periodic table of elements.
Beyond the calculated results, the library applied (Java-ML) and the algorithms used, the exercise is interesting in itself. The demonstration that chemical or physical relationships can be established among the elementary components from the similarity of their properties using machine learning can lead to new lines of research.
Acknowledgments
I want to thank Montse Torra for her work gathering the physical and chemical properties of each atom used in the sample.
References
Bostjan Kaluza. "Machine Learning in Java". Packt Publishing Ltd, Apr 29, 2016.
Eibe Frank, Mark A. Hall, and Ian H. Witten. "The WEKA Workbench". Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann, Fourth Edition, 2016.
Physical and chemical atomic properties extracted from WebElements and PeriodicTable.
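As an illustration of the Iterative Multi K-Means procedure described in the Methods section above (try every k from kMin to kMax, repeat several times per k, keep the clustering with the lowest sum of squared errors), here is a minimal Python sketch. It is not the Java-ML implementation used in the article: the function names, the random initialisation, the repeats and seed parameters, and the toy 2-D points are all assumptions made for illustration, and no real atomic data is used.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's k-means. Returns (labels, centroids, sse)."""
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


def iterative_multi_kmeans(points, k_min, k_max, repeats=5, seed=0):
    """Run k-means for every k in [k_min, k_max], several times each,
    and keep the run with the lowest sum of squared errors (SSE)."""
    rng = random.Random(seed)
    best = None
    for k in range(k_min, k_max + 1):
        for _ in range(repeats):
            result = kmeans(points, k, rng)
            if best is None or result[2] < best[2]:
                best = result
    return best


if __name__ == "__main__":
    # Hypothetical 2-D property vectors forming two well-separated groups.
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centroids, sse = iterative_multi_kmeans(toy, 2, 2)
    print(labels, round(sse, 3))
```

One design point worth keeping in mind: the raw sum of squared errors generally does not increase as k grows, so selecting purely on that score tends to favour values of k near kMax. Properties measured in different units (K, pm, kJ/mol) would also usually be rescaled before computing distances; the article does not say whether this was done.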
Very original application, Toni - thanks for sharing
What Is The Difference Between Artificial Intelligence And Machine Learning?
Artificial Intelligence (AI) and Machine Learning (ML) are two very hot buzzwords right now, and often seem to be used interchangeably. They are not quite the same thing, but the perception that they are can sometimes lead to some confusion. So I thought it would be worth writing a piece to explain the difference. https://www.google.co.uk/amp/s/www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/amp/
Machine Learning Stanford University
About this course: Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you'll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you'll learn about some of Silicon Valley's best practices in innovation as it pertains to machine learning and AI. https://www.coursera.org/learn/machine-learning