How to Ace Your Data Science Interview
With dissertation deadlines looming, data science students are gearing up to leave the academic world and find their feet in a data science role. We all know that demand for these skills, combined with the short supply of experienced data scientists, means there are opportunities everywhere and companies are looking to secure graduate talent, so finding data science jobs should not be too difficult. But before you reach that commercial goldmine, you're faced with the job interview. No matter how much experience and exposure you have from previous interviews, public speaking or data science discussions, the preparation is still hard. Data science interviews tend to cover a wide range of topics, from technical exposure, to statistical understanding, to solving and communicating complex business problems. At Eden Smith we work with a number of businesses hiring across the data science spectrum, and to help you ace your interview we have curated a list of common data science interview questions. We have enriched this list with information from online sources and insight from our data science partners, to help you prepare for the types of questions that can be thrown at you during your data science interview.

Building Models

Building data models, whether for machine learning or for pure data transformation and analysis, is one of the most common tasks of the modern data scientist. More and more businesses are developing teams, particularly with graduates, that are modelling and coding heavy, and as a result more interviews cover the various modelling techniques and statistical theories. Not all interviews will be technical, but below are some questions that will help you prepare and refamiliarise yourself with the fundamentals.

How would you create a logistic regression model?
What is linear regression?
What do the terms P-value, coefficient and R-squared value mean? What is the significance of each of these components?
Why is the Central Limit Theorem important?
Explain hash table collisions.
In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?
What are some situations where a general linear model fails?
Is it better to have too many false positives, or too many false negatives?
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
What is an example of a dataset with a non-Gaussian distribution?
Explain Bayes' Theorem. When might you use Bayesian inference?

Programming

Most data science teams are involved both in ingesting data for modelling and analysis and in putting models into production in the enterprise environment. Whether this is led by a data engineering, software engineering or database development team, you will be expected to have a strong understanding of various programming languages: those directly involved in data science and those surrounding data integration and export. Be sure to brush up on your Python, R, SQL and relevant big data languages such as Scala.

Python or R: which would you prefer for text analysis?
What modules/libraries/packages are you most familiar with? What do you like or dislike about them?
What are the different types of sorting algorithms available in R?
What is the difference between a tuple and a list in Python?
How do you split a continuous variable into different groups/ranks in R?
What is the purpose of the group functions in SQL? Give some examples of group functions.
Tell me the difference between an inner join, a left/right join, and a union.
Describe a data science project of yours with a substantial programming component. What did you learn from that experience?
How would you clean a dataset in the programming language of your preference?
What are the two main components of the Hadoop framework?
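For instance, the tuple-versus-list question can be answered in a couple of lines. This minimal Python sketch (with made-up values) shows the behaviour an interviewer is usually probing for:

```python
coords = (51.5, -0.1)   # a tuple: fixed-size and immutable
scores = [0.8, 0.9]     # a list: variable-length and mutable

scores.append(0.7)      # lists can grow and be modified in place
try:
    coords[0] = 52.0    # tuples cannot be modified after creation
except TypeError:
    pass                # assigning to a tuple element raises TypeError

# Tuples (of hashable items) are hashable, so they can key a dict; lists cannot.
lookup = {coords: "London"}
print(scores, lookup[(51.5, -0.1)])
```

In short: use a tuple for a fixed record of heterogeneous fields (and as a dictionary key), and a list for a homogeneous collection that changes over time.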
Data Science Process

Although being hands-on with data, modelling and programming is a major aspect of any data science role today, businesses often also want to understand how insights and results are created. Interviewers are looking for you to demonstrate a clear understanding of the various methods and processes used throughout a data science project, and to be able to explain their pros, cons and use cases to a non-technical audience. Practice articulating clear, simple explanations of complex data science procedures.

What are the various steps involved in an analytics project?
What is the goal of A/B testing?
Explain the use of combinatorics in data science.
What is the difference between cluster and systematic sampling?
What is logistic regression? Give an example of when you have used logistic regression recently.
Explain false negatives and false positives. Which is it better to have too many of?
What was the business impact of your last project?
Can you explain the difference between a test set and a validation set?
What makes a dataset gold standard?
What are outliers and inliers? What would you do if you found them in your dataset?

General

Data science is still a position with great variety and a lack of standardisation across the market. Every data science position and company you interview for will therefore take a slightly different approach and expect additional skills and awareness of the surrounding subjects. Be sure to explore the business you're interviewing with; check what additional products, technologies and soft skills its current employees, data scientists and analysts have experience with. Some common general questions are:

What visualisation tools are you familiar with?
Describe a time when you had to handle a stakeholder's expectations.
Describe a time when you were innovative and creative.
Which cloud services have you used, and how have you interacted with them?
What external data sources do you think could be interesting to our domain?
Present your last data science project to us.
What's a project you would want to work on at our company?
What data would you love to acquire if there were no limitations?
How important is the product in data science?

Eden Smith

If you want more advice or support on how to land your dream data science opportunity, or if you're a manager looking to scale a data science team, get in touch with us today.
Methods for dealing with missing values in datasets
AlMazloum, Amer Eddin
Heriot-Watt University. Professor: Dr. Hani Ragab

Missing Values in Data

Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject"). Some items are more likely to generate a nonresponse than others.

Missing data mechanisms

Missing completely at random (MCAR): Suppose variable Y has some missing values. We say that these values are MCAR if the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variable in the data set. In other words, missingness in Y depends on neither X nor Y.

Missing at random (MAR): The probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis (say X). In other words, missingness in Y depends on X, but not on Y.

Not missing at random (NMAR): Missing values do depend on unobserved values. In other words, the probability of a missing value depends on the variable that is missing.

Patterns of missingness

We can distinguish between two main patterns of missingness. On the one hand, data are missing monotonically if we can observe a pattern among the missing values; note that it may be necessary to reorder variables and/or individuals. On the other hand, data are missing arbitrarily if there is no way to order the variables so as to observe a clear pattern (SAS Institute, 2005).

Methods for handling missing data

Deletion methods

Listwise deletion: If a case has missing data for any of the variables, simply exclude that case from the analysis. This is usually the default in statistical packages (Briggs et al., 2003). Here, rows containing missing values are deleted.

Pairwise deletion: Analyse all cases in which the variables of interest are present. In other words, only the missing observations are ignored, and the analysis is done on the variables that are present.
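As a concrete illustration, here is a minimal Python sketch of the two deletion methods on a toy dataset (None marks a missing value; the variable names and numbers are made up for the example):

```python
# Toy dataset: four cases, two variables, with None marking missing values.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": None, "y": 4.0},
    {"x": 3.0, "y": None},
    {"x": 5.0, "y": 6.0},
]

# Listwise deletion: drop any case that is missing ANY variable,
# keeping only complete cases.
listwise = [r for r in rows if all(v is not None for v in r.values())]

# Pairwise deletion: for each variable, keep every observation present
# for THAT variable, ignoring missingness in the others.
x_vals = [r["x"] for r in rows if r["x"] is not None]
y_vals = [r["y"] for r in rows if r["y"] is not None]

print(len(listwise))          # 2 complete cases survive listwise deletion
print(len(x_vals), len(y_vals))  # pairwise retains 3 observations per variable
```

Note how pairwise deletion retains more observations per variable than listwise deletion, which is exactly the trade-off the two methods make.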
Imputation methods

Popular averaging techniques: Mean, median and mode are the most popular averaging techniques used to infer missing values. Approaches range from using a global average for the variable to averages based on groups. Put simply, replace the missing value with the sample mean, median or mode.

Conditional mean imputation: Suppose we are estimating a regression model with multiple independent variables, one of which, X, has missing values. We select those cases with complete information and regress X on all the other independent variables. Then we use the estimated equation to predict X for the cases where it is missing (Graham, 2009; Allison, 2001; Briggs et al., 2003).

Model-based methods

Maximum likelihood: We can use this method to obtain the variance-covariance matrix for the variables in the model based on all the available data points, and then use that variance-covariance matrix to estimate our regression model (Schafer, 1997). In other words, we estimate the values that are most likely to have resulted in the observed data.

Multiple imputation: The imputed values are draws from a distribution, so they inherently contain some variation. Multiple imputation (MI) thus addresses the limitations of single imputation by introducing an additional form of error, based on the variation in parameter estimates across the imputations, called "between-imputation error". It replaces each missing item with two or more acceptable values, representing a distribution of possibilities (Allison, 2001).

How do you deal with missing values: ignore or treat them? The answer depends on the percentage of missing values in the dataset, the variables affected, whether the missing values belong to the dependent or the independent variables, and so on.
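A minimal sketch of mean imputation using only Python's standard library (the numbers are made up; median or mode imputation just swaps in a different statistics function):

```python
import statistics

# Toy variable with missing values marked as None.
raw = [4.0, None, 6.0, None, 8.0]

# Compute the sample mean over the observed values only...
observed = [v for v in raw if v is not None]
mean = statistics.mean(observed)

# ...and replace each missing value with it.
imputed = [v if v is not None else mean for v in raw]
print(imputed)  # [4.0, 6.0, 6.0, 6.0, 8.0]
```

Note the limitation this illustrates: every missing value gets the same fill-in, which shrinks the variable's variance; that is the problem conditional mean imputation and multiple imputation are designed to mitigate.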
Missing value treatment is important because the insights you draw, or the performance of your predictive model, could be impacted if missing values are not appropriately handled.

In conclusion: assumptions and patterns of missingness are used to determine which methods can be used to deal with missing data.

Sources and useful resources

Reports:
http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf
https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf

References:
Allison, P., 2001. Missing Data — Quantitative Applications in the Social Sciences, Vol. 136. Thousand Oaks, CA: Sage.
Enders, Craig, 2010. Applied Missing Data Analysis.
StataCorp, 2009. Multiple Imputation. Stata 11.
Schafer, J. L., 1997. Analysis of Incomplete Multivariate Data.

Useful links:
Data sets with missing values that can be downloaded in different formats, including SAS, Stata, SPSS and S-Plus: http://www.ats.ucla.edu/stat/examples/md/default.htm
Introduction to missing data, with useful examples in SAS: http://www.ats.ucla.edu/stat/sas/modules/missing.htm
Multiple imputation in SAS, with comprehensive explanations: http://www.ats.ucla.edu/stat/sas/seminars/missing_data/part1.htm
Very nice review, thanks Amer. I have worked once with sparse matrices with missing data, and I think your article is relevant. Thanks, F
Drowning in Data?
Find real value and insights in the intersections between small data and big data.

Data: a set of facts and statistics collected together for reference (https://en.oxforddictionaries.com).

As I talk to many business leaders, time and time again I hear the same frustration: "I'm drowning in data; what I need are real insights to drive my decision-making." Despite these frustrations, and the volumes of data available, I am amazed at the degree to which decisions are still being taken on often a very small number of discrete data points, with businesses often using separate, disconnected data sets from different vendors to make decisions.

So is data a definitive thing, a trusted source, a fact or figure that can be referenced to reinforce or justify a statement or decision? If so, then surely the size of the data set doesn't matter, as long as the necessary fact can be derived? Well, it's not a yes-or-no answer; in fact, it's yes and no.

In daily life, when looking at any situation, do we look at just one side of an argument or issue? No, because in open, free-thinking societies we shift our viewpoints to consider other views. In practice, this means triangulating our understanding with many differing sources and views to get much richer information on which to make well-informed decisions. Our perspectives and rationales therefore change depending on varying factors and the information available at hand. Such an approach, however, requires a common context, with at a minimum some aligned data points for consideration.

If this is how we naturally make our daily decisions, why should our use of data to inform business decisions be any different? Seeing how we make informed decisions on a personal level, is it any surprise that there is a real desire to get deeper and richer perspectives, and therefore real insights, from the data sources we use at a corporate level? So, with so much data available, this should be easy, shouldn't it?
If everyone is drowning in data, there must be enough relevant information to provide real insights, right? Sadly, the real problem is the complexity of aligning the available data to provide the triangulation and joined-up views required. Different definitions, taxonomies and standards, together with complex data integration, make finding actionable insights from data sources a real problem.

Stepping away from ICT to show a real-world example, let's use a topical issue in the EU right now, Brexit, to make the point. How many ill-informed choices were made on both sides of the argument, with inaccurate application of individual, non-joined-up facts preventing a well-rounded argument to support decision-making? How many non-fact-based subjective views were fuelled by bias and prejudice? The answer can be seen in the general confusion and frustration arising after the Brexit vote. Prime Minister Theresa May could easily be one of the business leaders quoted in my opening gambit: "I'm drowning in data; what I need are real insights to drive my decision-making."

Returning to ICT sourcing intelligence: at Pivotal iQ we believe that value is best derived when we are able to use data the way we should use information in daily life, looking through different dimensions of interconnected facts and figures to see different perspectives of a client, contract or opportunity, and to identify the subtleties behind a situation that will inform a decision. We believe real value is actually in the intersections of data.

Let me provide an example of how value can be derived in this way, using three seemingly unrelated big data points:

Company A has an outsourcing contract with Company B due for renewal in 12 months.
Company C has an outsourcing contract due for renewal in 10 months.
Company D has an outsourcing contract with Company E due for renewal in 12 months.

What we have here in isolation are a number of data points that are individually useful but don't provide sufficiently rich insights on an opportunity.
In fact, we could look at each and make many assumptions. However, by building relationships between facts and interconnecting 'small data', we can start to build richer insights:

Company A has an outsourcing contract with Company B due for renewal in 12 months.
Company A isn't very happy with Company B's delivery performance.
Company B just released poor financial results.
Company B has just partnered with Company C.
John, a CTO at Company A, has traditionally had good relations with Company C.

A service provider looking at this opportunity may well decide to prioritise it, as the client's dissatisfaction provides an opportunity for displacement. The service provider may also seek to partner with Company C, or factor this association into their sales strategy with Company A. This type of in-depth data, when combined, produces actionable insights. Indeed, Forbes.com (2013) confirms: "Data is meaningless unless it helps make decisions that have measurable impact. Unfortunately, many decision makers are ensnared rather than enlightened by Big Data, preventing data and insights from making it to the front lines in relevant and usable forms."

I recently caught up with a global ICT service provider that used the joined-up approach I advocate to build a picture of an international customer's installed technologies across its many sites. By joining company, spending and installed-base data, they were able to see across the company's sites and installations and identify an opportunity for consolidation that the global provider was well placed to fulfil.
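The interconnection of 'small data' facts described above can be sketched in a few lines of code. Everything here is hypothetical and for illustration only: the company names, the sentiment flag and the partnership facts stand in for whatever joined-up data sources a provider actually holds.

```python
# Hypothetical 'small data' facts, keyed by company name.
contracts = {"Company A": {"supplier": "Company B", "renewal_months": 12}}
sentiment = {"Company A": "dissatisfied"}    # client-satisfaction signal
partnerships = {"Company B": ["Company C"]}  # incumbent supplier's alliances

# Interconnect the facts: a dissatisfied client whose contract is coming up
# for renewal is a displacement opportunity, and the incumbent's partners
# hint at a sales strategy.
opportunities = []
for client, deal in contracts.items():
    if sentiment.get(client) == "dissatisfied":
        allies = partnerships.get(deal["supplier"], [])
        opportunities.append((client, deal["supplier"], deal["renewal_months"], allies))

print(opportunities)
```

The point is not the code itself but the join: each fact alone is unremarkable, and the actionable insight only appears once the facts are linked by a shared key.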
The positive outcome was a huge order win, made possible by the real insights provided by the 'small data' between the 'big data' points.

At Pivotal iQ, our solution has always been to standardise, building and integrating data sources that allow for cross-sectional views of companies, opportunities, installed technologies, transactions and announcements, enabling 'small data views'. Integrating several data facts in this way makes for much richer insights.

We believe that what you see depends on what you look for. By combining a big and small data approach, we allow you to see the opportunities others can't, by giving you the ability to see value and insight in the intersections of data. I urge every business leader to challenge their data approach, and to see how it can be improved using the big and small data principles championed by Pivotal iQ, to provide the richer sourcing insights they demand.
Data Science Foundation at Big Data London, November 2017
It was great to meet a lot of prospective members. The curiosity factor was very high. Some data scientists called DSF the LinkedIn of data science. This is a good way to look at it.
Qlik. More power than you can imagine.
Search and explore vast amounts of data – all your data. With Qlik, you’re not constrained by preconceived notions of how data should be related, but can finally understand how it truly is related. Analyse, reveal, collaborate and act. https://www.qlik.com/en-gb
Top Trends in Analytics
The pace and evolution of business intelligence solutions mean that what's working now may need refining tomorrow. From natural language processing to the rise in data insurance, we interviewed customers and Tableau staff to identify the 10 impactful trends you will be talking about in 2018. Whether you're a data rockstar, an IT hero or an executive building your BI empire, these trends emphasize strategic priorities that could help take your organization to the next level. Read more at https://www.tableau.com/reports/business-intelligence-trends
An Introduction to Data Science: Making Big Data Usable
Our world is increasingly fuelled by data – an unparalleled amount of data, to be precise. More information is available today than at any other point in human history, and all of that data has value, particularly to businesses, organizations and government agencies. Data makes the world go round today. However, getting at that data and actually putting it to use can be difficult. Moreover, not all information has value, depending on the needs of the user. A means to identify, catalogue, collate and extract useful data from the sea of other information was necessary: data science.

What Is Data Science?

Depending on where you look, you’ll find a number of different definitions for data science. Some believe that it’s merely an offshoot of statistics. Others believe that it’s a combination of a few different knowledge extraction models. Yet others believe that it’s something else entirely. It’s important to note that “data science” isn’t true science in the technical definition of the term. Data scientists aren’t trying to prove or disprove a hypothesis. So, what is data science, then? It’s actually difficult to create a single, cohesive definition for the term, simply because there are so many potential applications of this discipline, from statistics to data modelling to analytics and more. Perhaps the closest we have come to a unified definition comes from Quora user Drew Conway, a PhD student at NYU, who said, “Data science most often refers to the tools and methods used to analyse large amounts of data. As such, the discipline is an amalgamation of many bits from other areas of research. For tools, the influence primarily comes from computer science, where issues of algorithmic efficiency and storage scalability form the main focus. For analysis, however, the influences are much more varied.
Modern methods are borrowed from both the so-called hard sciences (physics, statistics, graph theory) and the social sciences (economics, sociology, political sciences, etc.). Specific classes of techniques that are naturally interdisciplinary are also very popular, such as machine learning.”

Michael Driscoll, a specialist in data, analytics and visualization, has another definition: “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.” The University of California at Berkeley also sums up data science rather well: “The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design.”

However, data science is not just “using data”. That’s the end goal, yes, but the discipline focuses on how to first organize and access available data, and then put it to use effectively. That last word is the key here: “effectively”. Data science enables the effective use of vast quantities of information for other purposes, while also enabling the creation of additional data products (which are themselves data, further feeding the cycle). In the end, data science truly isn’t “science”. Data scientists don’t work in the world of academia. Their employment is contingent solely on the needs of businesses and industries, just as much as a sales clerk’s position or a particular supplier’s line of products. So, what do data scientists actually do?
Here’s a rough outline:

Ask questions in order to answer or solve known problems, or to find new solutions to problems with which businesses must deal.
Define the data necessary for a particular need.
Work with existing data to collect, store and explore it.
Determine the type of analysis needed for a particular situation or type of data.
Use algorithms and other tools to parse, clean, quality-check and utilize data.
Transform the insights learned into formats usable by non-data scientists (for use by others within the business), including the creation of graphs, charts, infographics and more.
Create software to automate data science tasks based on specific business requirements and data sources.

So, data scientists must successfully combine several professional fields, including statistics, software programming, mathematics, research and subject expertise.

A Few Basic Examples of Use Cases

Before we dive too deep down the rabbit hole, let’s consider a few basic real-world examples of data science in action.

Microsoft Word: Sure, MS Word isn’t the most advanced piece of software on the planet, but the word processing program does show us some very good applications of data science. Consider the fact that the program essentially learns more about its users through interaction: it stores and parses data on its own. Microsoft has also done a great deal of work in building the program’s spellchecking and grammar capabilities. Does it match what you’d get from a professional editor? No. However, it does an excellent job for a computer program (that’s not to say there aren’t better programs available, but Word is the industry standard, and Microsoft’s achievements in data usage are noteworthy).

Google PageRank: Google might be the world’s uncontested master when it comes to all things data. However, the search giant’s PageRank function is an ideal example of data science in action.
This function essentially uses the number of links pointing at a particular domain (data outside the page itself) to help rank the authority and relevance of different websites.

Facebook: Sure, the big blue social network has access to tons of information, but it puts that information to use in a number of innovative ways. For instance, Facebook uses the vast amounts of information it has about various users to help make suggestions for new friends and connections. These suggestions are based on patterns of friendships, including among people that you might not even be connected with on the site, and they can be extremely accurate (sometimes frighteningly so). These are just a few relatively basic examples of data science in action in the world around us, in programs and websites used every day.

Why Is Data Science Necessary?

Information is vital to every aspect of human existence, and it has been since the dawn of our species. A hunter-gatherer had to rely on his or her knowledge (information) to make decisions about various plant species (would it be deadly to eat?) as well as the animals hunted for food and pelts. Early agriculturalists had to rely on information to make informed decisions about planting, harvesting and preparation. Putting data to use has been with us since time out of mind; that hasn’t changed. What has changed, though, is the volume of data available. O’Reilly makes an excellent point here: “The question facing every company today, every start-up, every non-profit, every project site that wants to attract a community, is how to use data effectively – not just their own data, but all the data that’s available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well defined kinds of analytics. What differentiates data science from statistics is that data science is a holistic approach.
We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into tractable form, making it tell its story, and presenting that story to others.”

Given the fact that humans have been parsing, storing, organizing and using data for millennia, it’s natural to wonder why a new discipline is even necessary. Isn’t statistics enough? Isn’t current database software sufficient? To put it simply: no. According to a story in VentureBeat, “Today’s modern business needs to manage far more data than ever before, and few have the talent on staff for the job. Projections indicate that the (data scientist) market will experience meteoric growth in the next several years.” GigaOM backs that up: “Every organization will need someone wearing the data scientist hat just like every organization has people responsible for product, sales, marketing and support.” The rise of Big Data, and its increasing importance to businesses of all sizes, in every industry, and even in the government sector, has made data science not only vital but also one of the fastest growing fields in the world.

What Is Big Data, Really?

You hear the term “Big Data” thrown around a lot today, but what does it really mean? Organizations have had access to massive amounts of information for a very long time. Oil companies are prime examples of this, but there are many others, including massive retail chains like Wal-Mart. What makes today different in terms of the amount of information available? What makes it warrant the specific designation “Big Data”? According to Oxford University Press’ Oxford Dictionary, Big Data is “data sets that are too large and complex to manipulate or interrogate with standard methods or tools.” That tells us a little, but it’s not the full story. Forbes magazine weighs in with a slightly different take on the question.
In a story written by Lisa Arthur for Forbes, the author defines Big Data as “a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.” That’s a bit more illuminating. Arthur goes on to state that, “In defining big data, it’s also important to understand the mix of unstructured and multi-structured data that comprises the volume of information. Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s very text-heavy. Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.” So, unstructured information could be information that’s entered into a form field (either online or off), while multi-structured data could be derived from scraping a website.

There are several conditions that go into making Big Data, which we can describe as the Five Vs of Big Data:

Volume (the sheer amount of data)
Velocity (the speed at which information is generated and aggregated)
Variety (the types of data)
Veracity (the authenticity of the data)
Value (what the data is actually worth to the company)

However, it’s very important to understand that the meaning of Big Data will vary from business to business and organization to organization. In each instance, this information will do something different, offer something different, and mean something different.

The Explosion of Data Available Today

As mentioned, there’s more information available today than at any point in history, and that amount is growing exponentially. In a sense, it feeds itself: as more data is explored, collated, categorized and packaged, more information is created.
To truly understand not only what data science is but why it’s become such an essential skillset and discipline in the modern world, it’s important to understand the lifecycle of data: where it comes from, how that information is used, and more. We’ll begin by exploring where data originates.

It comes from everywhere. Every single search query through Google or Bing is data. Every picture uploaded is data. Every Vine video created is data. You get the idea. However, it’s not limited to data made available online. Data is literally everywhere. O’Reilly states that, “Data is everywhere: your government, your web server, your business partners, even your body. While we aren’t drowning in a sea of data, we’re finding that almost everything can (or has) been instrumented.”

For a good example of how much data is being generated today, consider a simple Amazon recommendation, called an “also bought”. Let’s say you go to Amazon searching for a new washing machine. You’re looking for the right deal, the right features, and the right size. You can sort your options, compare different models and more. You arrive at what seems to be the perfect solution, and there, near the bottom of the page, is a list of items that other customers “also bought” when viewing or purchasing the same model of washing machine. Amazon had to first identify that information, then collate and store it, and then serve it up to you based on nothing more than your online search query through their website.

You leave data behind you with every single action you take online. Using Google to search for a dog training program? That query doesn’t disappear when you close the browser. Google stores your information and uses it. And it’s more than just your query terms: Google is also storing geo-location information and a great deal more.
Now apply that to mobile apps, which leave an even richer trail of data behind when you’re done (consisting of both electronic data as well as possibly audio and video information, exact geo-location data, and a great deal more). To go beyond the online and device scenario, consider your frequent shopper card. You use it to get access to important discounts on your groceries or fuel, but every single swipe generates an immense amount of information about your shopping habits, your preferences, the specific retail store locations where you prefer to spend your time… You get the picture. Everything we do today generates data.

However, all that information would be worthless if there wasn’t a way to store it. This is the very beginning of data science. Storage for data must be more than just a data dump into a database somewhere online, though. Our storage solutions have become ever more sophisticated over time. We’ve moved from paper ledgers to electronic spreadsheets to incredibly intricate digital storage systems capable of holding immense amounts of information.

Storage is not the end of it, though. Moore’s Law (concerning the advance of computer technology) can be applied equally well to the growth of data. As stated by O’Reilly, “The importance of Moore’s Law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put in it.” It’s the same concept behind our desire for larger homes. We want larger homes to have more space, but that space inevitably becomes filled with possessions – we buy new furniture, new dishes, new computers, new televisions. This brings us to the next hurdle – the more data you store, the more sophisticated your data analysis solutions must be. Things were simple enough when businesses ran on physical ledgers and the owner’s knowledge of customers, but today things are much more complicated.
This is the very foundation of data science – increasing the ability to analyse and use the ever-growing volume of data available to us.

Making Data Useful

Data without meaning is useless. Data without context is pointless. Data without structure is unusable. Today’s organizations must do more than just warehouse information; accessing, analysing, and using that information requires more than just basic software. In order to make data useful, it must first be analysed, or “conditioned”. Really, this is nothing more than separating the wheat from the chaff, so to speak. You need to analyse the data and then determine what is useful and what is not. For a company selling athletic shoes, your purchase of gardening tools is probably irrelevant. However, your purchase of a gym membership would be valuable information. Your search on Google for running tracks near your home would also be important data.

The problem here is that while a great many new ways of delivering machine-consumable data have been created (Atom feeds and microformats, for instance), much of the data found in the wild is very messy and chaotic. This type of information cannot simply be inserted into an XML file. There’s simply too much garbage mixed into the data to make it usable by software. So, data conditioning also includes clean-up – for instance, removing the HTML code from data scraped from a website so that only the pertinent information remains, and none of the underlying code of the page.

Once the data has been conditioned and cleaned, it’s time to move to the next step, which is basically quality control. For instance, let’s say you were interested in gathering physical mailing addresses for potential customers who may be interested in your new product. You could gather much of that information online, but it may not be complete.
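The HTML clean-up step just mentioned can be done with nothing more than Python's standard library. The page snippet below is invented purely for illustration; a real scraping pipeline would also have to handle encodings, scripts, and malformed markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the visible text of an HTML page, discarding the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def strip_html(raw):
    """Return the text content of an HTML fragment, whitespace-normalised."""
    parser = TextExtractor()
    parser.feed(raw)
    return " ".join(parser.chunks)

page = "<html><body><h1>Washers</h1><p>Model X-200, <b>$350</b></p></body></html>"
print(strip_html(page))  # Washers Model X-200, $350
```

After this pass, only the pertinent text survives and the underlying code of the page is gone, which is exactly the kind of conditioning described above.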
You might have the first and last name, as well as the street address, of a particular person, but be missing their postcode (this is just a very basic example). You may also have incongruous information – that is, data that doesn’t seem to relate to your goal. According to scientists, the depletion of the ozone layer was discovered because someone decided to take a look at the incongruous data that had been gathered, rather than discarding it (deciding whether data is actually incongruous due to gathering errors, or whether there’s another story underlying that incongruity, is part of the job of a data scientist).

Now you have to add in the problem of human language. This can be considerable, even when you’re dealing with just one language. For instance, parsing data in English requires a significant understanding of elements that relate directly to the particular task. O’Reilly uses the following example, which is an excellent illustration of the complexities involved with quality assurance in data science. Roger Magoulas, head of O’Reilly’s data analysis group, recently had to search for Apple job listings that required candidates to have geolocation skills. The problem was that not only did Magoulas need an understanding of job listing formats, but also the ability to parse English and to separate Apple-specific listings from the wide range of other job postings out there (as well as other information related to geolocation, but not pertinent to Apple or employment with the company). This is a perfect example of the difficulty in parsing and quality-checking data, and it extends well beyond the search for employment with Apple. Consider running a query for information on Python, the programming language. Google returns an immense number of hits for the snake instead, leaving you to sort out the right answers for yourself.
And this is just for English – add in the hundreds of other languages spoken around the world, and you begin to get a sense of just how daunting and difficult it can be. Software is helpful here, but it’s not always the best solution. Often, data scientists are required to lean on their own understanding (human intelligence versus machine intelligence). Of course, this brings in the question of additional manpower and costs. A single data scientist simply doesn’t have the ability to sort through 10,000 potential listings to determine relevance, not in any realistic way. That means hiring help (which admittedly can be done relatively easily through sources like Amazon’s Mechanical Turk program or through sites like Fiverr.com or the like).

Using the Data

Once the data has been gathered (or harvested, if you prefer), cleaned, and analysed, it must be put to use. This is done in many different ways and will depend largely on the business in question – their needs and initiatives will inform the ways that data is used. The first step here is visualization, which can be done in any number of ways, including Venn diagrams, charts, graphs, tables and more. Again, the visualization format will need to fit the business, as well as the initiative for which the data is being analysed. The visualization format for a field map of a company’s competition would look very different from the format for data depicting potential customers within a 10-mile radius of a company’s brick-and-mortar shop, but both are examples of what’s possible with data science. With this being said, perhaps the most common visualization format is a graph. Analysis generally leads to the production of information in numeric format. Obviously, that’s understandable by machines, but it’s not so much use for human beings. Those numbers need to be plotted out in a visual format, giving meaning to the information highlighted.
For example, declining sales numbers might be interesting in a purely numeric format, but they become much more compelling when given life as a graphic depicting the rapid decline of a company’s profits. In fact, visualization is so important to the data scientist that many employ it at each stage of the process. For instance, one might use scatter plots to get a sense of what’s interesting in the information gleaned before beginning an analysis. Another might plot their data to get a sense of how skewed it is, or how much false information is included, before conditioning. Even animation can be included here to get a sense of how information (and corresponding real-world trends) changes over time.

Some of the programs used to create visualizations include the following:
- R
- Processing
- Many Eyes (IBM)
- GnuPlot

After data visualization comes data implementation – actually putting the information gleaned to use. This is generally not the role of the data scientist. After creating a visualization format for the information, it is handed off to others within the organization and the data scientist moves to the next project. Other teams or professionals will take the information and use it to move the company forward towards its objective.

In Conclusion: The Future for Data Scientists

What do the rise of Big Data and the increasing use of data science mean for the world at large? It has a number of impacts, but the most salient is this: every business will need someone to fill the position of data scientist, even if it doesn’t go by that name. Every business, every corporation, every organization and even government agencies must have qualified data scientists capable of transforming raw information into something that can be used to move the organization forward in multiple ways. According to GlassDoor.com, data science jobs have reached “critical mass” in many ways.
It is currently the 15th highest-paying job in demand, with almost 3,500 openings and an average base salary of over $100,000 annually (in the US – other nations vary). It is currently ranked as the 9th best job in the US, as well. This is not about to change, either. A study by the McKinsey Global Institute stated that “a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge.” It goes on to estimate that up to 5 million jobs in the United States alone will require skills in data science by the year 2018. These include positions for data analysts, data engineers, statisticians and data scientists.

Data science professionals have unique abilities and skills, as well. They combine the spirit of entrepreneurship with patience, mathematical skills with the ability to explore and make connections, and computer science skills with an understanding of human behaviour. UC Berkeley states, “Virtually every sector of the economy now has access to more data than would have been imaginable even a decade ago. Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it. The question facing every organization that wants to attract a community is how to use data effectively – not just their own data, but all the data that’s available and relevant.” Even the New York Times weighed in on the burgeoning field of data science, saying, “This hot new field promises to revolutionize industries from business to government, health care to academia.” Obviously, Big Data is here to stay, and the need for professionals capable of transforming that raw information into something that can be used to further organizational goals, create community, foster customer engagement and more is immense.
Sources:
- https://beta.oreilly.com/ideas/what-is-data-science
- http://datascience.berkeley.edu/about/what-is-data-science/
- http://www.revelytix.com/?q=content/what-data-science-0
- http://www.quora.com/What-is-data-science
- http://venturebeat.com/2013/11/11/data-scientists-needed/
- https://gigaom.com/2013/01/06/why-data-scientists-matter-data-science-is-the-future-of-everything/
- http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/
Good piece Chris, and well said: there is a huge difference between big (useless) data and usable (even smaller) data. If you have seen the recent Data Science survey from Kaggle, this is indeed the greatest problem for any data scientist, i.e. data cleansing and preparation to make the data usable.
Data Mining: Models and Methods
What is Data Mining?

Data mining refers to the discovery and extraction of patterns and knowledge from large sets of structured and unstructured data. Data mining techniques have been around for many decades; however, recent advances in ML (Machine Learning), computer performance, and numerical computation have made data mining methods easier to apply to large data sets and business-centric tasks. The growing popularity of data mining in business analytics and marketing is also due to the proliferation of Big Data and Cloud Computing. Large distributed databases and methods for parallel processing of data, such as MapReduce, make huge volumes of data manageable and useful for companies and academia. Similarly, the cost of storing and managing data is reduced by cloud service providers (CSPs) who offer a pay-as-you-go model to access virtualised servers, storage capacities (disc drives), GPUs (Graphics Processing Units), and distributed databases. As a result, companies can store, process, and analyze more data, getting better business insights.

By themselves, state-of-the-art data mining methods are powerful in many classes of tasks. Some of them are anomaly detection, clustering, classification, association rule learning, regression, and summarization. Each of these tasks plays a crucial role in whatever setting one might think of. For example, anomaly detection techniques help companies protect against network intrusion and data breaches. In turn, regression models are powerful in the prediction of business trends, revenues, and expenses. Clustering techniques have the highest utility in grouping huge volumes of data into cohesive entities that reveal patterns and dependencies, both within and among them, without prior knowledge of any laws that govern the observations. As these examples illustrate, data mining has the power to put data into the service of businesses and entire communities.
Data Mining Models

There exist numerous ways to organize and analyze data. Which approach to select depends much on our purpose (e.g. prediction, or the inference of relationships) and the form of the data (structured vs. unstructured). We can end up with a particular configuration of data which might be good for one task, but not so good for another. Thus, to make data usable one should be aware of the theoretical models and approaches used in data mining and realize the possible trade-offs and pitfalls of each of them.

Parametric and Non-Parametric Models

One way of looking at a data mining model is to determine whether it has parameters or not. In terms of parameters, we have a choice between parametric and non-parametric models. In the first type of model, we select a function that, in our view, is the best fit to the training data. For instance, we may choose a linear function of the form F(X) = q0 + q1x1 + q2x2 + ... + qpxp, in which the x’s are features of the input data (e.g. house size, floor, number of rooms) and the q’s are unknown parameters of the model. These parameters may be thought of as weights that determine the contribution of the different features (e.g. house size, floor, number of rooms) to the value of the function Y (e.g. house price). The task of a parametric model is then to find the parameters Q using some statistical method, such as linear regression or logistic regression. The main advantage of parametric models is that they contain intuition about relationships among the features in our data. This makes parametric models an excellent heuristic, inference, and prediction tool. At the same time, however, parametric models have several pitfalls. If the function we have selected is too simple, it may fail to properly explain patterns in complex data. This problem, known as underfitting, is frequent when linear functions are used with non-linear data.
On the other hand, if our function is too complex (e.g. with polynomials), it may end up overfitting – a scenario in which our model responds to the noise in the data rather than actual patterns and does not generalize to new examples.

Figure #1: Examples of normal, underfit, and overfit models

Non-parametric models are free from these issues because they make no assumptions about the underlying form of the function. Therefore, non-parametric models are good at dealing with unstructured data. On the other hand, since non-parametric models do not reduce the problem to the estimation of a small number of parameters, they require very large datasets in order to obtain a precise estimate of the function.

Restrictive vs. Flexible Methods

Data mining and ML models may also differ in terms of flexibility. Generally speaking, parametric models, such as linear regression, are considered highly restrictive, because they need structured data and actual responses (Y) to work. This very feature, however, makes them suitable for inference – finding relationships between features (e.g. how the crime rate in a neighborhood affects house prices). Because of this, restrictive models are interpretable and clear. This observation, though, is not true for flexible models (e.g. non-parametric models). Because flexible models make no assumptions about the form of the function that controls observations, they are less interpretable. In many settings, however, the lack of interpretability is not a concern. For example, when our only interest is the prediction of stock prices, we need not care about the interpretability of the model at all.

Supervised vs. Unsupervised Learning

Nowadays, we hear a lot about supervised and unsupervised Machine Learning. New neural networks based on these concepts are making progress in image and speech recognition and autonomous driving on a daily basis.
A natural question, though, is: what is the difference between unsupervised and supervised learning approaches? The main difference is in the form of data used and the techniques to analyze it. In a supervised learning setting, we use labeled data that consists of features/variables and a dependent variable (Y, or response). This data is then fed to the learning algorithm, which searches for patterns and for a function that controls the relationships between the independent and dependent variables. The retrieved function may then be applied to the prediction of future observations. In unsupervised learning, we also observe a vector of features (e.g. house size, floor). The difference from supervised learning, though, is that we don’t have any associated results (Y). In this case, we cannot apply a linear regression model, since there are no response values to predict. Thus, in an unsupervised setting, we are working blind in some sense.

Data Mining Methods

In this section, we are going to describe the technical details of several data mining methods. Our choice fell on linear regression, classification, and clustering methods. These methods are among the most popular in data mining because they solve a wide variety of tasks, including inference and prediction. Also, these methods perfectly illustrate the key features of data mining models described above. For example, linear regression and classification (logistic regression) are examples of parametric, supervised, and restrictive methods, whereas clustering (k-means) belongs to the subset of non-parametric, unsupervised methods.

Linear Regression for Machine Learning

Linear regression is a method of finding a linear function that reasonably approximates the relationship between the data points and a dependent variable. In other words, it finds an optimized function to represent and explain the data.
Contemporary advances in processing power and computation methods allow using linear regression in combination with ML algorithms to produce quick and efficient function optimization. In this section, we will describe an implementation of linear regression with gradient descent to produce an algorithmic fit of data to a linear function.

Image #1: Linear Regression

For this task, let’s take the case of house price prediction. Let’s assume we have a training set of 100 house examples (m = 100). Each house in this sample may be denoted x1, x2, x3, ..., xm. Correspondingly, each house has a set of features or properties, such as house size and floor. Features may be thought of as variables that determine a house price. So, for example, the size of the first house in the training sample would be one such feature variable. Finally, our training sample has a list of prices for each house, denoted y1, y2, ..., ym. This data tells us much by itself (e.g. we may apply some methods of descriptive statistics to interpret it); however, in order to run a linear regression, we should first formulate an initial hypothesis. Our hypothesis may be defined as a simple linear function with three parameters (Q):

hQ(x) = Q0 + Q1x1 + Q2x2

where x1 and x2 are the features (house size and floor) and the Q’s are the parameters of the function we want to learn with linear regression. In essence, this hypothesis says that a house price is determined by house size and floor, parametrized by certain parameters Q. Thus, we have a confirmation that linear regression is a parametric model, in which we try to fit the right parameters to find the configuration that best explains the data. However, what method should we apply to determine the right parameters? Intuitively, our task is to fit parameters that ensure our hypothesis h(x) for each house is close to y, the real-world price.

Image #2: Gradient Descent
For that purpose, we have to define a cost function that evaluates the difference between predicted values and actual values.

Equation #1: Cost Function for Linear Regression

The right-most part of this equation is a version of the popular least-squares method, which calculates the squared difference between the training value y and the value predicted by our hypothesis function hQ(x). Then, our task is to minimize the cost function so that the prediction error is as small as possible. One of the most popular solutions to this problem is the gradient descent algorithm, based on the mathematical properties of the gradient. The gradient is a vector-valued function that points in the direction of the greatest rate of increase of a function. In the case of a multi-variate function, its gradient is the vector whose components are the partial derivatives of f. Since the gradient is a vector that points in the direction of the function’s growth, it may be used to find the parameters that minimize our function. To achieve this, we simply need to move in the opposite direction. This technique of gradual movement down the function to find a local or global minimum is known as gradient descent and is demonstrated in the image above. To implement gradient descent for our linear regression, we need to start with random parameters Q, and then repeatedly update them in the direction opposite to the gradient vector until convergence to the global minimum (our linear function guarantees that such a global minimum actually exists). The gradient descent procedure is defined by the update algorithm

Equation #2: Gradient Descent Algorithm

where a is the learning rate at which we set our algorithm to learn. The learning rate should not be too large, so that we don’t jump over the global minimum, and should not be too small, because then the process would take too much time. A partial derivative in the right-most part of the equation is calculated for each parameter to construct the gradient vector.
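As a concrete illustration, here is a minimal pure-Python sketch of this fitting loop for the house-price hypothesis hQ(x) = Q0 + Q1x1 + Q2x2. The tiny training set is synthetic (generated exactly from y = 1 + 2*x1 + 0.5*x2), so the loop should recover parameters close to (1, 2, 0.5); for each parameter, the partial derivative of the squared-error cost works out to the prediction error times the corresponding feature.

```python
# Synthetic (size, floor) -> price examples, generated from y = 1 + 2*x1 + 0.5*x2.
train = [((1.0, 1.0), 3.5), ((2.0, 1.0), 5.5),
         ((1.5, 2.0), 5.0), ((3.0, 3.0), 8.5)]

def predict(theta, x):
    # Hypothesis hQ(x) = Q0 + Q1*x1 + Q2*x2
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

def gradient_step(theta, data, alpha):
    # One simultaneous update of all parameters in the direction
    # opposite to the gradient of the mean squared-error cost.
    m = len(data)
    grad = [0.0, 0.0, 0.0]
    for x, y in data:
        err = predict(theta, x) - y
        grad[0] += err          # partial derivative w.r.t. Q0 (its "feature" is 1)
        grad[1] += err * x[0]   # partial derivative w.r.t. Q1
        grad[2] += err * x[1]   # partial derivative w.r.t. Q2
    return [t - alpha * g / m for t, g in zip(theta, grad)]

theta = [0.0, 0.0, 0.0]          # initial guess for the parameters Q
for _ in range(20000):           # repeat until (approximate) convergence
    theta = gradient_step(theta, train, alpha=0.1)

print([round(t, 2) for t in theta])  # [1.0, 2.0, 0.5]
```

On real data the features would need scaling and the learning rate tuning, but the update rule itself is exactly the one described above.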
It is derived in the following way:

Equation #3: Partial Derivative of Gradient Descent

Putting the partial derivatives and learning rate together produces the final update rule

Equation #4: Gradient Descent Update Rule

which should be repeated until our algorithm finds the global minimum. That would be the point at which our parameters (Q) produce the function that best explains the training data. This learned function may now be used to predict prices for houses not included in the training sample, and employed in the inference of various relationships between the features of our model. This functionality makes linear regression with gradient descent a powerful technique in both data mining and machine learning.

Classification with Logistic Regression

Classification is the process of determining the class/category to which an object belongs. Classification techniques implemented via machine learning algorithms have numerous applications, ranging from email spam filtering to medical diagnostics and recommender systems. Similarly to linear regression, in a classification problem we work with a labeled training set that includes some features. However, observations in the data set map not to a quantitative value, as in linear regression, but to a categorical value (e.g. a class). For example, patients’ medical records may determine two classes of patients: those with benign and those with malignant cancers. The task of the classification algorithm is then to learn a function that best predicts what type of cancer (malignant vs. benign) a patient has. If there are only two classes, the problem is known as binary classification. In contrast, multi-class classification may be used when we have more classes of data. One of the most common classification techniques in data science and ML is logistic regression. Logistic regression is based on the sigmoid function, which has an interesting property: it maps any real number to the (0, 1) interval.
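This squashing property of the sigmoid is easy to verify numerically; the short Python check below is illustrative only.

```python
import math

def sigmoid(z):
    """Sigmoid (logistic) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and zero maps exactly to the midpoint 0.5.
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4))
```

Note also the symmetry sigmoid(-z) = 1 - sigmoid(z), which is what lets the output be read directly as a probability of one class versus the other.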
As a result, it may be effectively used to evaluate the probability (between 0 and 1) that an observation falls within a certain category. For example, if we define benign cancer as 0 and malignant cancer as 1, a logistic value of 0.6 would mean that there is a 60% chance that a patient’s cancer is malignant. These properties make the sigmoid function useful for binary classification, but multi-class classification is also possible.

Image #3: Sigmoid Function

The formula for the sigmoid function is:

Equation #5: Sigmoid Function

where e is the constant e (2.71828) – the base of the natural logarithm, with the property that its natural logarithm is equal to 1. To build a working classification model, we should put our hypothesis into the sigmoid function. Remember that our hypothesis function has the form hQ(x) = Q0 + Q1x1 + Q2x2. For convenience, it may be written in the vector form z = QTx, where the superscript T refers to the transpose of the vector of parameters Q. As a result of this transformation, we get the following function

Equation #6: Logistic Regression Model

where z refers to the vector representation of our initial hypothesis hQ(x). In order to fit the parameters Q of our logistic regression model, we should first redefine it in probabilistic terms. This is also needed to leverage the power of the sigmoid function as a classifier.

Equation #7: Probabilistic Interpretation of Classification

The above definition stems from the basic rule that probabilities always add up to 1. So, for example, if the probability of malignant cancer is 0.7, the probability of benign cancer is automatically 1 - 0.7 = 0.3. The equation above formalizes this obvious observation. Now that we have defined the hypothesis and the probabilistic assumptions, it’s time to construct a cost function in the same way we did for linear regression. For that purpose, we need to transform the cost, because plugging the sigmoid into the squared-error cost produces a complex non-convex function with many local minima.
If used with a least-squares cost similar to the linear regression model below, the sigmoid function will have a hard time converging.

Equation #8: Linear Regression Cost Function

Instead, to represent the intuition behind the sigmoid function, we may use the log-probability applied to the above-mentioned probabilistic definition of the classification problem.

Equation #9: Logistic Regression Cost Function

The graph below illustrates that the log function assigns a high cost if our hypothesis is wrong, and no cost if the hypothesis is right. If y = 1 and hQ(x) = 1, then the cost → 0. In contrast, if y = 1 and hQ(x) is 0, then the cost goes to infinity. The opposite happens if y = 0.

Image #4: -log(x) Function

We can make a simplified version of the cost function by merging these two cases together. The final cost function ready for use with logistic regression has the following form:

Equation #10: Logistic Regression Cost Function (Simplified)

Now that the cost function is formulated, we can use a gradient descent identical to the one applied in linear regression.

Equation #11: Logistic Regression Update Rule

A similar technique may be applied to the multi-class classification problem with more than two classes. In general, multi-class problems use a one-vs-all approach, in which we choose one class and then lump all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

Image #5: One-vs-all (Multiclass) Classification

Clustering Methods

As we have seen, clustering is an unsupervised method that is useful when the data is not labeled or there are no response values (y). Clustering the observations of a data set involves partitioning them into distinct groups so that observations within each group are quite similar to each other, while observations in different groups have less in common.
To illustrate this method, let’s take an example from marketing. Assume that we have a big volume of data about consumers. This data may involve median household income, occupation, distance from the nearest urban area, and so forth. This information may then be used for market segmentation. Our task is to identify various groups of customers without prior knowledge of the commonalities that may exist among them. Such segmentation may then be used for tailoring marketing campaigns that target specific clusters of consumers. There are many different clustering techniques to do this, but the most popular are the k-means clustering algorithm and hierarchical clustering. In this section, we are going to describe the k-means method, a very efficient algorithm that covers a wide range of use cases.

In k-means clustering, we want to partition the observations into a pre-specified number of clusters. Although setting the number of clusters before clustering is considered a limitation of the k-means algorithm, it is still a very powerful technique. In our clustering problem, we are given a training set of consumers x1, ..., xm, each with individual features xj. Features are vectors of variables that describe various properties of consumers, such as median income, age, gender, and so forth. The rule is that each observation (consumer) should belong to exactly one cluster, and no observation should belong to more than one cluster.

Image #6: Illustration of the Clustering Process

The idea behind k-means clustering is that a good cluster is one for which the within-cluster variation, or within-cluster sum of squares (WCSS), is minimal. In other words, consumers in the same cluster should have more in common with each other than with consumers from other clusters. To achieve this configuration, our task is to algorithmically minimize the WCSS across all pre-specified clusters.
This task is expressed in the following equation:

Equation #12: Within-cluster Sum of Squares (WCSS)

where |Ck| denotes the number of observations in the kth cluster. In words, the equation above says that we want to partition the observations into K clusters so that the total within-cluster variation, summed over all K clusters, is as small as possible. The within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in this cluster, divided by the total number of observations in the kth cluster. As in the case of linear regression, to minimize this function we should start from some initial guess. Our task is to find the cluster centroids (the average position of all points in the space), or means, for each cluster. This may be achieved via a three-step algorithm in which we:

1. Randomly initialize K cluster centroids – m1, m2, ..., mk.

2. Assign each observation in the data set to the cluster that yields the least WCSS. Intuitively, the least WCSS is given by the ‘nearest’ mean (centroid). To find the ‘nearest’ centroid, we calculate the Euclidean distances between observations and centroids and select the centroid with the smallest distance.

Equation #13: Assigning an Observation to the Closest Centroid

3. Move to a new centroid m by computing the centroid of each of the K clusters. The centroid of the kth cluster may be defined as the vector of the p feature means for the observations in the kth cluster. So, for example, if we have x1, x2, x3 and they belong to the same cluster (c2), then the centroid m2 is defined by the average:

Equation #14: Centroid Calculation Example

Since the arithmetic mean is a good least-squares estimator, this step also minimizes the within-cluster sum of squares. This means that as the algorithm runs, the clustering obtained will continually improve until the result no longer changes. When this happens, a local optimum has been reached and the clusters become stable.
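The three steps above translate almost line-for-line into Python. The consumer records here are invented, and for determinism this sketch seeds the centroids with the first k points instead of a random sample (a simplification of step 1); everything else follows the assign-then-recompute loop described above.

```python
def kmeans(points, k, iters=10):
    # Step 1 (simplified): seed centroids with the first k points.
    centroids = [p for p in points[:k]]
    for _ in range(iters):
        # Step 2: assign each observation to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# Invented consumer records: (income in thousands, age) - two obvious groups.
data = [(20, 25), (22, 27), (21, 24), (60, 50), (62, 52), (61, 48)]
print([(round(x, 1), round(y, 1)) for x, y in sorted(kmeans(data, 2))])
# [(21.0, 25.3), (61.0, 50.0)]
```

After a couple of iterations the centroids settle on the two consumer groups and no longer move, which is exactly the stable local optimum described above.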
Conclusion The data mining models and methods described in this paper allow data scientists to perform a wide array of tasks, including inference, prediction, and analysis. Linear regression is powerful for predicting trends and inferring relationships between features. In turn, logistic regression may be used for the automatic classification of behaviors, processes, and objects, which makes it useful in business analytics and anomaly detection. Finally, clustering allows us to derive insights from unlabeled data and infer hidden relationships that may drive effective business decisions and strategic choices.
Hi, did you have a look at this paper to integrate some models/techniques? http://www.sciencedirect.com/science/article/pii/S0957417412003077
PUBLIC SECTOR CLOUD CONFERENCE 6th December 2017
Location: University of Salford. Funded places are being offered to members of the Data Science Foundation; contact firstname.lastname@example.org for further information on funded places. The Public Sector Cloud conference will take place at the University of Salford on 6th December. The event will bring together leading experts in IoT and digital infrastructure. Speakers will share their views on the development of cloud-based systems within government and education. The constant pressure on councils to adopt a 'Cloud First' initiative is a broad concern; because integrated systems and legacy operations have been in place for so long, the transition will take time and money to achieve. Given the added cost, cloud transition seems some way off, as budget pressures are an overwhelming concern, made worse by constant scrutiny from the media spotlight. On the day we will be offering a case study of success stories experienced by local councils and public institutes. The day will be directed to ensure that our delegates are made aware of the obstacles and challenges ahead, and that they see a discernible path to updating their systems gradually and progressively.

A sample of speakers and themes:

Liam Maxwell, UK National Technology Adviser, HM Government: Promoting and supporting digital industry in the UK and internationally; the future of public sector services and migrating to cloud-based services.

James Stewart, former Director of Technical Architecture, UK Government Digital Service: Why cloud remains a key part of any "Digital by Default" agenda; new cloud developments, four years on.

Christopher Wroath, Director of Digital, NHS Education for Scotland: Cloud as part of the new Health & Social Care Delivery Plan; Cloud First, the NES journey.

Event Page: http://www.salford.ac.uk/onecpd/courses/public-sector-cloud
Funded places are being offered to members of the Data Science Foundation; contact email@example.com for further information. The funded places are limited, so if you are interested in going please get in contact ASAP.
Introduction to Artificial Neural Networks (ANNs)
White Paper 5 September 2017 Introduction Machine Learning (ML) is a subfield of computer science that stands behind the rapid development of Artificial Intelligence (AI) over the past decade. Machine Learning studies algorithms that allow machines to recognize patterns, construct prediction models, or generate images and videos through learning. ML algorithms can be implemented using a wide variety of methods, such as clustering, linear regression, decision trees, and more. In this paper, we are going to discuss the design of Artificial Neural Networks (ANNs) – an ML architecture that has gained powerful momentum in recent years as one of the most efficient and fast learning methods for solving complex computer vision, speech recognition, NLP (Natural Language Processing), and image, audio, and video generation problems. Thanks to their efficient multilayer design, which models the biological structure of the human brain, ANNs have firmly established themselves as the state-of-the-art technology driving the AI revolution. In what follows, we describe the architecture of a simple ANN and offer a useful intuition of how it may be used to solve complex nonlinear problems efficiently. What is an Artificial Neural Network? An Artificial Neural Network is an ML algorithm inspired by biological computational models of the brain and biological neural networks. In a nutshell, an ANN is a computational representation of the human neural network that underlies human intelligence, reasoning, and memory. But why should we emulate the human brain to develop efficient ML algorithms? The main rationale behind using ANNs is that neural networks are efficient at complex computations and hierarchical representations of knowledge.
Neurons connected by axons and dendrites into complex neural networks can pass and exchange information, store intermediary computation results, produce abstractions, and divide the learning process into multiple steps. A computational model of such a system can thus produce very efficient learning processes similar to the biological ones. The perceptron algorithm, invented by Frank Rosenblatt in 1957, was the first attempt to create a computational model of a biological neural network. However, complex neural networks with multiple layers, nodes, and neurons became possible only recently, thanks to the dramatic increase in computing power (Moore's Law), more efficient GPUs (Graphics Processing Units), and the proliferation of Big Data for training ML models. In the 2000s-2010s these developments gave rise to Deep Learning (DL) – a modern approach to the design of ANNs based on a deep cascade of multiple layers that extract features from data and perform transformations and hierarchical representations of knowledge. Image #1 Overfitting problem Thanks to their ability to simulate complex nonlinear processes and create hierarchical, abstract representations of data, ANNs stand behind recent breakthroughs in image recognition and computer vision, NLP (Natural Language Processing), generative models, and various other ML applications that seek to retrieve complex patterns from data. Neural networks are especially useful for studying nonlinear hypotheses with many features (e.g. n = 100). Constructing an accurate hypothesis for such a large feature space would require multiple high-order polynomials, which would inevitably lead to overfitting – a scenario in which the model describes the random noise in the data rather than the underlying relationships and patterns. The problem of overfitting is especially tangible in image recognition problems, where each pixel may represent a feature.
For example, when working with 50 x 50 pixel images, we have 2,500 pixel features, which would make manual construction of the hypothesis almost impossible. A Simple Neural Network with a Single Neuron The simplest possible neural network consists of a single "neuron" (see the diagram below). Using a biological analogy, this 'neuron' is a computational unit that takes inputs via dendrites as electrical signals (let's say "spikes") and transmits them via axons to the next layer or the network's output. Image #2 A neural network with a single neuron In the simple neural network depicted above, the dendrites are the input features (x1, x2, ...) and the outputs (axons) represent the result of our hypothesis (hw,b(x)). Besides the input features, the input layer of a neural network normally has a 'bias unit' which is equal to 1. A bias unit is needed to include a constant term in the hypothesis function. In Machine Learning terms, the network depicted above has one input layer, one hidden layer (consisting of a single neuron), and one output layer. The learning process of this network is implemented in the following way. The input layer takes the input features (e.g. pixels) for each training sample and feeds them to the activation function that computes the hypothesis in the hidden layer. The activation function is normally the logistic (sigmoid) function used for classification, although other alternatives are also possible. In the case described above, our single neuron corresponds exactly to the input-output mapping defined by logistic regression. Image #3 Logistic Regression As in the case of simple binary classification, our logistic regression has parameters; they are often called "weights" in ANN models. Multi-Layered Neural Network To understand how neural networks work, we need to formalize the model and describe it in a real-world scenario. In the image below we can see a multilayer network that consists of three layers and has several neurons.
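The single-neuron computation described above amounts to applying the logistic function to a weighted sum of the inputs plus the bias term. A minimal sketch (the input values and weights here are arbitrary, chosen only for illustration):

```python
import math

def sigmoid(z):
    """Logistic activation function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, weights, bias):
    """A single neuron: activation of the weighted inputs plus the bias unit."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothesis h(x) for one training sample with two features
output = neuron([0.5, -1.0], weights=[2.0, 1.0], bias=0.5)
```

Because the sigmoid squashes any weighted sum into (0, 1), the output can be read as the probability of the positive class, exactly as in logistic regression.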
Here, as in the single-neuron network, we have one input layer with three inputs (x1, x2, x3) plus an added bias unit (+1). The second layer of the network is a hidden layer consisting of three units/neurons represented by activation functions. We call it a hidden layer because we don't observe the values computed in it. In fact, a neural network can contain multiple hidden layers that pass complex functions and computations from the "surface" layers to the "bottom" of the network. The design of a neural network with many hidden layers is frequently used in Deep Learning (DL) – a popular approach in ML research that has gained powerful momentum in recent years. Image #4 Multilayer Perceptron The hidden layer (Layer 2) above has three neurons (a1(2), a2(2), a3(2)). In abstract terms, each unit ai(j) of a hidden layer is the activation of unit i in layer j. In our case, a1(2) is the activation of the first neuron of the second (hidden) layer. By activation, we mean the value computed by the activation function (e.g. logistic regression) in this layer and output by that node to the next layer. Finally, Layer 3 is an output layer that gets the results from the hidden layer and applies them to its own activation function. This layer computes the final value of our hypothesis. Afterwards, the cycle continues until the neural network comes up with the model and weights that best predict the values of the training data. So far, we haven't defined how the 'weights' work in the activation functions. For that reason, let's define Q(j) as the matrix of parameters/weights that controls the function mapping from layer j to layer j + 1. For example, Q(1) controls the mapping from the input layer to the hidden layer, whereas Q(2) controls the mapping from the hidden layer to the output layer. The dimensionality of the Q matrix is defined by the following rule.
If our network has sj units in layer j and sj+1 units in layer j+1, then Q(j) will have dimension sj+1 x (sj + 1). The "+ 1" comes from the addition in Q(j) of the bias unit x0 and its weight Q0(j). In other words, the output nodes do not include the bias unit, while the input nodes do. To illustrate how the dimensionality of the Q matrix works, let's assume that we have two layers with 101 and 21 units respectively. Then, by our rule, Q(j) would be a 21 x 102 matrix, with 21 rows and 102 columns. Image #5 A Neural Network Model Let's put it all together. In the image above, we see our three-layer neural network again. What we need to do is calculate the activation functions based on the input values, and then our main hypothesis function based on the set of calculations from the previous (hidden) layer. In this way, our neural network works as a cascade of calculations where each layer supplies values to the activation functions of the next one. To calculate the activations, we first have to define the dimensionality of our Q matrices. In this example, we have 3 input and 3 hidden units, so Q(1), mapping from the input layer to the hidden layer, has dimension 3 x 4 because the bias unit is included. The activation of each hidden neuron (e.g. a1(2)) is equal to our sigmoid function applied to the linear combination of the inputs with the weights retrieved from the weight matrix Q(1). In the diagram above, you can see that each activation unit is computed by the function g, which is our logistic regression function. In turn, Q(2) refers to the matrix of weights that maps from the hidden layer to the output layer. These weights may be randomly assigned before the neural network runs or be the product of previous computations. In our case, Q(2) is a 1 x 4 matrix (i.e. a row vector). To calculate the output we apply our hypothesis function (the sigmoid function) to the results calculated by the activation functions in the hidden layer.
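The cascade of calculations just described (each layer's activations, with a bias unit prepended, feeding the next layer's activation functions) can be sketched as a forward pass. The weight matrices below are arbitrary illustrative values, not trained parameters; only their shapes (3 x 4 and 1 x 4, bias column included) follow the rule from the text:

```python
import math

def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, Q):
    """Map one layer's values to the next layer's activations.
    Each row of Q holds the weights for one unit in the next layer,
    with the bias weight first."""
    a = [1.0] + inputs  # prepend the bias unit x0 = +1
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in Q]

# Q(1): 3 x 4 matrix mapping 3 inputs (+ bias) to 3 hidden units
Q1 = [[0.1, 0.4, -0.2, 0.3],
      [-0.3, 0.2, 0.5, -0.1],
      [0.2, -0.4, 0.1, 0.6]]
# Q(2): 1 x 4 row vector mapping 3 hidden units (+ bias) to the output
Q2 = [[0.3, -0.5, 0.2, 0.4]]

hidden = layer([1.0, 0.0, 1.0], Q1)   # activations a(2) of the hidden layer
output = layer(hidden, Q2)            # hypothesis h(x)
```

Adding another hidden layer would just mean adding another weight matrix and one more call to `layer`, which is exactly the sequential cascade the text describes.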
If we had several hidden layers, the results of the previous activation functions would be passed to the next hidden layer and then to the output layer. This sequential mechanism makes neural networks very powerful at computing nonlinear hypotheses and complex functions. Instead of trying to fit inputs to manually designed polynomial functions, we can create a neural network with numerous activation functions that exchange intermediary results and update weights. This automatic setup allows us to create nonlinear models that are more accurate in the prediction and classification of our data. Neural Networks in Action The power of neural networks to compute complex nonlinear functions may be illustrated using the following binary classification example taken from the Coursera Machine Learning course by Professor Andrew Ng. Consider the case when x1 and x2 can each take two binary values (0, 1). Our task is to compute y = x1 XNOR x2, a logic gate that may be interpreted as NOT (x1 XOR x2). This is the same as saying that the function is true if x1 and x2 are both 0 or both 1. To make our network calculate XNOR, we first have to describe the simple logical functions to be used as intermediary activations in the hidden layer. The first function we want to compute is the logical AND function: y = x1 AND x2. Image #6 Logical AND function As in the first example above, our AND function is a simple single-neuron network with inputs x1 and x2 and a bias unit (+1). The first thing we need to do is assign weights to the activation function and then compute it based on the input values specified in the truth table below. These input values are all possible binary values that x1 and x2 can take. By fitting 0s and 1s into the function (i.e. logistic regression) we can compute our hypothesis: hq(x) = g(-30 + 20x1 + 20x2).
To understand how the values in the third column of the truth table are found, remember that the sigmoid function is approximately 0 at z ≈ -4.6 and approximately 1 at z ≈ 4.6. As a result, we have:

x1  x2  hq(x)
0   0   g(-30) ≈ 0
0   1   g(-10) ≈ 0
1   0   g(-10) ≈ 0
1   1   g(10) ≈ 1

As we can see, the rightmost column is the definition of the logical AND function, which is true only if both x1 and x2 are true. The second function we need for our neural network is the logical OR function. In the logical OR, y is true (1) if either x1 or x2, or both of them, are 1 (true). Image #7 Logical OR function As in the previous case with the logical AND, we assign weights that fit the definition of the logical OR function. Putting these weights into our logistic function g(-10 + 20x1 + 20x2), we get the following truth table:

x1  x2  hq(x)
0   0   g(-10) ≈ 0
0   1   g(10) ≈ 1
1   0   g(10) ≈ 1
1   1   g(30) ≈ 1

As you see, the function is false (0) only if both x1 and x2 are false; in all other cases it is true. This corresponds to the logical OR function. The last function we need before running a network to find x1 XNOR x2 is (NOT x1) AND (NOT x2). In essence, this function consists of two logical negations (NOT). A single negation, NOT x1, may be presented in the following diagram; it says that y is true only if x1 is false. The logical NOT therefore has only one input unit (x1). Image #8 Logical NOT After putting the input with weights into g(10 - 20x1), we end up with the following truth table:

x1  hq(x)
0   g(10) ≈ 1
1   g(-10) ≈ 0

The output values of this table confirm that the NOT function outputs true only if x1 is false. Now we can find the values of the logical (NOT x1) AND (NOT x2) function. Image #9 Logical (NOT x1) AND (NOT x2) Putting the binary values of x1 and x2 into the function g(10 - 20x1 - 20x2), we end up with the following truth table.
x1  x2  hq(x)
0   0   g(10) ≈ 1
0   1   g(-10) ≈ 0
1   0   g(-10) ≈ 0
1   1   g(-30) ≈ 0

This table demonstrates that the logical (NOT x1) AND (NOT x2) function is true only if both x1 and x2 are false. These three simple functions (logical AND, logical OR, and the double-negation AND function) may now be used as the activation functions in our three-layer neural network to compute the nonlinear function defined at the beginning: x1 XNOR x2. To do this, we put the three simple functions together into a single network that uses them as its activation functions: the logical AND, the logical (NOT x1) AND (NOT x2), and the logical OR. Image #10 A Neural Network to Compute XNOR Function As you see, the first layer of this network consists of two inputs (x1 and x2) plus a bias unit (+1). The first unit of the hidden layer is a logical AND activation function that takes the weights specified above (-30, 20, 20). The second unit a2(2) is represented by the (NOT x1) AND (NOT x2) function with parameters 10, -20, -20. Doing our usual calculations, we get the values 0, 0, 0, 1 for a1(2) and the values 1, 0, 0, 0 for the second unit in the hidden layer. The final step is to use the set of parameters from the logical OR function that sits in the output layer: we take the values produced by the two units in the hidden layer (logical AND and (NOT x1) AND (NOT x2)) and apply them to the OR function with its parameters. The results of this computation make up our hypothesis function (1, 0, 0, 1), which is our desired XNOR function.

x1  x2  a1(2)  a2(2)  hq(x)
0   0   0      1      1
0   1   0      0      0
1   0   0      0      0
1   1   1      0      1

That's it! As this example illustrates, neural networks are powerful at computing complex nonlinear hypotheses by using a cascade of functions, since a network can use the output values of one function as the inputs of other functions.
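The composition above can be checked directly in code, using exactly the weights quoted in the text (-30, 20, 20 for AND; 10, -20, -20 for the double negation; -10, 20, 20 for OR). Rounding the sigmoid output turns the near-0/near-1 activations into clean binary values:

```python
import math

def g(z):
    """Sigmoid activation; saturates to ~0 or ~1 for large |z|."""
    return 1.0 / (1.0 + math.exp(-z))

def unit(x1, x2, w0, w1, w2):
    """One neuron with bias weight w0 and input weights w1, w2."""
    return round(g(w0 + w1 * x1 + w2 * x2))

def xnor(x1, x2):
    a1 = unit(x1, x2, -30, 20, 20)    # hidden unit 1: x1 AND x2
    a2 = unit(x1, x2, 10, -20, -20)   # hidden unit 2: (NOT x1) AND (NOT x2)
    return unit(a1, a2, -10, 20, 20)  # output unit: a1 OR a2

# Reproduce the truth table from the text
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))
```

Running this prints the hypothesis column 1, 0, 0, 1, matching the XNOR truth table derived above.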
Leveraging this functionality, we can design complex multi-layered networks that extract complex features and patterns from images, videos, and other data. Conclusion Artificial Neural Networks (ANNs) are the main drivers of the contemporary AI revolution. Inspired by the biological structure of the human brain, ANNs are powerful at modeling functions and hypotheses that would be hard to derive intuitively or logically. Instead of inventing your own function with high-order polynomials, which may lead to overfitting, one can design an efficient ANN architecture that automatically fits complex nonlinear hypotheses to data. This advantage of ANNs has been leveraged in algorithmic feature extraction for computer vision and image recognition. For example, instead of manually specifying a finite list of image features to choose from, we can design a Convolutional Neural Network (CNN) that uses the same principle as the animal visual cortex to extract features. Like the human eye, the layers of a CNN respond to stimuli only in a restricted region of the visual field. This allows the network to recognize low-level features such as points, edges, or corners and gradually merge them into high-level geometric figures and objects. This example illustrates how good ANNs are at the automatic derivation of hypotheses and models from complex data containing numerous associations and relationships.
Good overview, Kirill. It might make sense to also refer to the following picture to give a broad and quick snapshot of the different neural nets scientists can use: http://www.asimovinstitute.org/wp-content/uploads/2016/09/neuralnetworks.png