Pandas Cookbook: Develop Powerful Routines for Exploring Real-World Datasets
My name is Ted Petrou, and I am the author of the newly released Pandas Cookbook. In this article, I will discuss the overall approach I took to writing Pandas Cookbook along with highlights of each chapter.

Pandas Cookbook Guiding Principles

I had three main guiding principles when writing the book:

Use of real-world datasets
Focus on doing data analysis
Writing modern, idiomatic pandas

First, I wanted you, the reader, to explore real-world datasets and not randomly generated data. I tried very hard to find datasets that contained situations where an interesting or unique pandas operation could be performed. Descriptions of the main datasets used throughout the book can be found in this Jupyter notebook.

Second, I wanted to focus on doing actual data analysis by providing useful or surprising insights. I wanted to avoid a mechanical approach where pandas operations were learned in isolation, divorced from any contact with real data. In this regard, Pandas Cookbook teaches both how to understand pandas operations and how to generate results that would be useful for a data analysis.

Third, the pandas library has evolved quite substantially since it first started to make regular appearances in data analysis workflows in 2012. Many of the older tutorials, and especially the older answers on Stack Overflow, have not been updated to reflect newer syntax. Pandas can be confusing because there are often multiple ways to produce the same result, many of which will be slow or inefficient. Pandas Cookbook strives to provide straightforward and efficient, or 'idiomatic', pandas.
Formation of Pandas Cookbook

Pandas Cookbook was inspired by the following:

My weeklong Data Exploration Bootcamp
Answering 400+ questions on pandas on Stack Overflow
Working as a data scientist at Schlumberger
Hosting dozens of meetups for Houston Data Science

My Data Exploration Bootcamp is an intensive, weeklong class with over 700 pages of material, 250 short-answer questions and a couple of projects. Much of the material for Pandas Cookbook was inspired by this class. The material was expanded and refined each time the class was taught, thanks in part to the excellent feedback from my students. Teaching showed me firsthand exactly where the greatest pain points were.

Nothing helped me more to improve my own ability to write idiomatic pandas than answering questions on Stack Overflow. You learn an incredible amount by answering questions and discussing them with the other top users.

As a data scientist at Schlumberger, I built scripts to clean and process data that required dozens of pandas commands pieced together. Pandas Cookbook has many advanced recipes that combine operations from different parts of the library to get the required result. Also, I was given a week's worth of professional Python training, which was quite bad and sparked a desire to produce a better class. You can hear more of my story in this podcast from Undersampled Radio.

A Book Must Beat the Documentation

The official documentation is very thorough, at over 2,000 pages in total. For a book to be of any value, at a minimum, it must be better than the documentation. The documentation has some major advantages over a book. First, there is no restriction on page length, so every single aspect of the library can be covered. Second, the documentation is always up to date with the latest changes. Technical books on fast-moving libraries like pandas tend to go out of date relatively fast.
How Pandas Cookbook Demolishes the Documentation

Unfortunately, the pandas documentation does not have interesting examples using real-world datasets. Nearly all of its examples use randomly generated or contrived data, showing operations in isolation from one another. You learn how to run a single command, independent of all the other available ones. This is not at all how an analysis happens with actual data. There is certainly lots of value in learning the mechanics of all the pandas operations, and I suggest doing that in my How to Learn Pandas article. In fact, I have read through most parts of the documentation five or more times each. Pandas is a huge library, and it's difficult to keep all the commands in the forefront of your mind, even if you use it every single day.

Pandas Cookbook uses multiple operations one after the other in many of its recipes. This often yields a long chain of methods called from a DataFrame or Series. This is what makes Pandas Cookbook valuable: you are constantly working with real data, stringing together multiple pandas operations to complete a particular task.

Cookbook Format

It's a bit unfortunate/ludicrous that the title of the book sounds appalling to those not in the know. I suggest keeping this book in the kitchen next to your other cookbooks for some guaranteed extra laughs. The book is composed of approximately 100 recipes, with each one containing three major sections:

How to do it: Step-by-step code on how to complete a particular task, with some explanations embedded into the steps themselves.
How it works: Very detailed explanations of all the steps in the recipe. I read lots of reviews of other Packt cookbooks, and the most common complaint was the lack of explanations in this section. I took extra care to ensure that all steps and commands were fully explained.
There's more: Extra operations, closely related to the main recipe.
There are almost always tangents that you can follow when learning pandas. This section is often equivalent to an entirely new recipe.

Entire Focus on Pandas

This book makes one basic assumption: that you are comfortable with the fundamentals of Python. Every single recipe (except one or two) uses pandas. Thus, the scope of the book is a bit narrower than that of other similar books, in that it focuses only on doing data analysis with pandas (along with matplotlib and seaborn for visualization).

Target Audience

There is no hard requirement for prior exposure to pandas. The recipes range from very simple to advanced, so the book is suitable for novices as well as experienced pandas users.

Getting the Most out of Pandas Cookbook

To get the most out of Pandas Cookbook, I suggest doing the following:

Keep the official documentation open at all times
Run the code in the Jupyter notebooks as you read the book
Read the book sequentially, cover to cover

Pandas Cookbook strives hard to differentiate itself from the documentation. This doesn't mean it is a replacement for the documentation. Most recipes link to a specific part of the documentation, where you can get more details on a specific command. This is why I recommend keeping the documentation open as you progress through the book.

Do not just read the book. Run the code as you read through each recipe. You should be doing lots of exploration and formulating questions on your own.

I also recommend reading this book sequentially, whether you are a novice or an experienced pandas user. The recipes have a natural flow, progressing from one to the next and tending to get more and more complex. More experienced users can, of course, skip around to the recipes that appeal to them most. But I've found that, unless you are a power user of pandas, it is still good to drill the fundamentals, which is done by reading the book sequentially.
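The long chains of methods described earlier are the book's signature style. Here is a minimal sketch of what such a chain looks like, on invented data (the column names and values are mine, not the book's):

```python
import pandas as pd

# Hypothetical flights data -- illustrative only.
flights = pd.DataFrame({
    "airline": ["AA", "AA", "UA", "UA", "DL"],
    "dep_delay": [5.0, -2.0, 30.0, None, 12.0],
})

# A typical method chain: each call returns a new object,
# so the operations read top to bottom as a single pipeline.
result = (
    flights
    .dropna(subset=["dep_delay"])        # drop rows with missing delays
    .query("dep_delay > 0")              # keep only delayed flights
    .groupby("airline")["dep_delay"]     # split by airline
    .mean()                              # average delay per airline
    .sort_values(ascending=False)        # largest first
)
```

Wrapping the chain in parentheses lets each method sit on its own line with a comment, which is much easier to read and debug than one long statement.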
Chapter Highlights

Below, I discuss a few of the more important concepts and recipes of each chapter.

Chapter 1: Pandas Foundations

Chapter 1 begins by dissecting the anatomy of the DataFrame and Series, the primary objects that will handle the bulk of your workload. It's vital to be aware of the DataFrame components: the index, the columns and the data (values). The chapter continues by selecting a single column from a DataFrame as a Series. We use this Series to learn about method chaining, which is an extremely common way to use pandas. The majority of the recipes in the book string together multiple methods in succession like this.

Chapter 2: Essential DataFrame Operations

Chapter 2 focuses entirely on the DataFrame. We learn how to order columns sensibly, a commonly overlooked task that can greatly improve the readability of the data. As a practical and fun example, we determine the diversity of college campuses using many of the concepts covered up to this point.

Chapter 3: Beginning Data Analysis

Chapter 3 covers several fairly simple but complete tasks that you might do when first starting an analysis. It can be immensely helpful to establish a routine at the beginning of a data analysis. Another recipe finds the largest/smallest value in column 'x' for every unique value in column 'y' without a call to the groupby method. This is an example of one popular idiom that has arisen more recently.

Chapter 4: Selecting Subsets of Data

Chapter 4 selects subsets of DataFrames and Series in just about every way imaginable. Data selection is one of the most confusing aspects of the library, which is unfortunate, as it's used very frequently. Pandas is partially to blame here, as indexing changed with the addition of the .loc/.iloc indexers along with the recent deprecation of .ix.
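The Chapter 3 idiom of finding the largest value in one column per unique value of another, without groupby, can be sketched like this (the data is invented for illustration):

```python
import pandas as pd

# Illustrative data: find the highest score ('x') for each subject ('y')
# without calling groupby, using the sort/drop_duplicates idiom.
df = pd.DataFrame({
    "y": ["math", "math", "art", "art", "art"],
    "x": [70, 90, 85, 60, 95],
})

top = (
    df.sort_values("x", ascending=False)   # largest values first
      .drop_duplicates(subset="y")         # keep first (largest) row per group
)
```

Because drop_duplicates keeps the first occurrence by default, sorting descending first guarantees that the surviving row for each group is the maximum.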
Chapter 5: Boolean Indexing

Chapter 5 covers boolean indexing, which is used to select subsets of data by the actual contents of the columns and not by their label or integer location (as in Chapter 4). One common theme throughout Pandas Cookbook is the comparison between different methods that produce the same results. In one recipe in this chapter, we show how boolean indexing can be replicated by placing columns into the index. For those familiar with SQL, boolean indexing is also compared to the WHERE clause.

Chapter 6: Index Alignment

All of Chapter 6 is dedicated to one of the most powerful, but unexpected, features of pandas: the automatic alignment of indexes. Some users can spend years using pandas without even understanding this concept. Automatic index alignment is what separates pandas from most other data analysis libraries. An absurd example is the 'Exploding Indexes' recipe, which is used to hammer home exactly what happens when combining multiple pandas objects.

Chapter 7: Grouping for Aggregation, Filtration, and Transformation

The first six chapters cover the most fundamental parts of pandas in 200 pages. The remaining five chapters, and 300 pages, use these fundamentals in just about every recipe to do more complex and interesting analysis. The groupby method in this chapter is particularly helpful for splitting data into independent groups. One particularly fun recipe uses the transform method to calculate the results of a weight-loss bet. Also, one of the most complex recipes resides in this chapter: finding the streaks of on-time flights for each airline.

Chapter 8: Restructuring Data into a Tidy Form

Data analysis is made easier when you have tidy data, a term popularized by Hadley Wickham. Chapter 8 transforms many different formats of messy data into tidy data with the following methods: stack, unstack, melt, and pivot. You will also be exposed to the str accessor, which is used to rip apart string data to extract new variables.
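As a taste of the kind of reshaping Chapter 8 covers, here is a minimal melt sketch turning 'wide' messy data into tidy form (the column names and values are my own invention, not from the book):

```python
import pandas as pd

# 'Messy' wide data: one column per year, violating tidy principles.
wide = pd.DataFrame({
    "city": ["Houston", "Dallas"],
    "2016": [100, 80],
    "2017": [110, 90],
})

# melt gathers the year columns into a single variable column,
# producing one observation per row (tidy form).
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")
```

After melting, each row holds exactly one observation, which makes grouping, filtering and plotting far simpler.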
Chapter 8 is probably the most unique chapter in this book, as I have not seen much discussion online of how to tidy the vast assortment of datasets as is done in this chapter.

Chapter 9: Combining Pandas Objects

There are four primary methods/functions used to combine DataFrames/Series together: append, concat, merge and join. This chapter provides examples suited to each. "Comparing President Trump's and Obama's approval ratings" is one of my favorite recipes, which does intricate web scraping, moving-window analysis and visualization all in one. This chapter also connects to a relational database with multiple tables to perform an analysis one might normally do with SQL.

Chapter 10: Time Series Analysis

Pandas has powerful time series functionality that exceeds that of the datetime and NumPy libraries. You will learn how to group simultaneously by time and another variable. Also, one of the newest additions to pandas, the merge_asof function, will be used to find the last time crime was 20% lower.

Chapter 11: Visualization with Matplotlib, Pandas, and Seaborn

One of the most infuriating and confusing things about matplotlib is its dual interface. In my opinion, all matplotlib code should be written with the object-oriented interface, as it's more Pythonic. Pandas Cookbook thoroughly covers how to get started with the object-oriented interface, along with the Figure/Axes hierarchy, which is key to understanding all of plotting in matplotlib. Pandas and seaborn both use matplotlib to make plots, but in completely different ways: pandas uses wide or aggregated data, while seaborn takes long or tidy data. One particularly useful recipe for data scientists involves "uncovering Simpson's paradox", a very common finding that gets revealed whenever you look at more granular slices of your data.

Lots More!

The chapter highlights are just a small sampling of what is contained in the book.
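As a taste of the object-oriented matplotlib style recommended in the Chapter 11 highlights, here is a minimal sketch with invented data and labels:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Object-oriented interface: create the Figure and Axes explicitly,
# then call plotting methods on the Axes object itself rather than
# relying on the implicit pyplot state machine.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([2015, 2016, 2017], [10, 25, 40], marker="o")
ax.set_title("Hypothetical yearly totals")
ax.set_xlabel("Year")
ax.set_ylabel("Total")
```

Holding explicit references to fig and ax makes it obvious which object every later call modifies, which is exactly the clarity the dual pyplot interface obscures.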
I worked extremely hard to make Pandas Cookbook the very best book available for learning pandas while doing analysis with real-world data. I had lots of fun coming up with the recipes and hope you have fun exploring them.
5 Misconceptions About Data Science
In this contributed article, technology writer and blogger Kayla Matthews examines the five most common misconceptions floating around about data science and what project administrators and business managers need to be aware of. Remember these tips before getting involved, and be sure to do the necessary research. With the right people and knowledge on your side, you'll be on your way in no time, rocketing to success. https://insidebigdata.com/category/news-analysis/
How to Ace Your Data Science Interview
With dissertation deadlines looming, data science students are gearing up to leave the academic world and find their feet in a data science role. We all know that the demand for these skills and the short supply of experienced data scientists mean there are opportunities everywhere and companies are looking to secure graduate talent, so finding some data science jobs should not be too difficult. But before you reach that commercial goldmine, you're faced with the job interview. No matter how much experience and exposure you have from previous interviews, public speaking or data science discussions, this preparation is still hard.

Data science interviews tend to cover a wide range of topics, from technical exposure, to statistical understanding, to solving and communicating complex business problems. At Eden Smith we work with a number of businesses hiring across the data science spectrum, and to help you ace your interview we have curated a list of common data science interview questions. We have enriched this data with information from online sources and insight from our data science partners to help you prepare for the types of questions that can be thrown at you during your data science interview.

Building Models

Building data models for machine learning, or for pure data transformation and analysis, is one of the most common tasks of the modern data scientist. More and more businesses are developing teams, particularly with grads, that are modelling- and coding-heavy, and this is resulting in more interviews covering the various modelling techniques and statistical theories. Not all interviews will be technical, but below are some questions that will help you prepare and refamiliarize yourself:

How would you create a logistic regression model?
What is linear regression?
What do the terms P-value, coefficient and R-squared value mean? What is the significance of each of these components?
Why is the Central Limit Theorem important?
Explain hash table collisions.
In your opinion, which is more important when designing a machine learning model: model performance or model accuracy?
What are some situations where a general linear model fails?
Is it better to have too many false positives, or too many false negatives?
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
What is an example of a dataset with a non-Gaussian distribution?
Explain Bayes' Theorem. When might you use Bayesian inference?

Programming

Most data science teams are involved both in the ingestion of data for modelling and analysis and in the production of models into the enterprise environment. Whether this is led by a data engineering, software engineering or database development team, you will be expected to have a strong understanding of various programming languages, both those directly involved in data science and those surrounding data integration and exportation. Be sure to brush up on your Python, R, SQL and relevant big data programming languages such as Scala.

Python or R: which would you prefer for text analysis?
What modules/libraries/packages are you most familiar with? What do you like or dislike about them?
What are the different types of sorting algorithms available in the R language?
What is the difference between a tuple and a list in Python?
How do you split a continuous variable into different groups/ranks in R?
What is the purpose of the group functions in SQL? Give some examples of group functions.
Tell me the difference between an inner join, left join/right join, and union.
Describe a data science project in which you worked with a substantial programming component. What did you learn from that experience?
How would you clean a dataset in your programming language of preference?
What are the two main components of the Hadoop framework?
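The question above about splitting a continuous variable into groups is asked in R, but it has a direct pandas analogue worth knowing; here is a hedged sketch (the bin edges and labels are my own invention):

```python
import pandas as pd

# pd.cut bins by fixed edges; pd.qcut bins by quantiles (ranks).
ages = pd.Series([5, 17, 25, 42, 67, 80])

# Fixed-width groups: (0, 18], (18, 65], (65, 100]
groups = pd.cut(ages, bins=[0, 18, 65, 100],
                labels=["child", "adult", "senior"])

# Quantile-based groups: split at the median into two equal-sized halves
ranks = pd.qcut(ages, q=2, labels=["lower half", "upper half"])
```

In R, the equivalent tools would be cut() and quantile-based binning; being able to explain both the fixed-edge and quantile approaches is what interviewers are usually probing for.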
Data Science Process

Although being hands-on with data, modelling and programming are the major aspects of any data science role today, businesses often want to understand how insights and results are created. Interviewers are looking for you to demonstrate a clear understanding of, and to be able to explain, the various methods and processes used throughout a data science project, along with their pros, cons and use cases, to a non-technical audience. Practice articulating and giving clear, simple explanations of various complex data science procedures.

What are the various steps involved in an analytics project?
What is the goal of A/B testing?
Explain the use of combinatorics in data science.
What is the difference between cluster and systematic sampling?
What is logistic regression? Or: state an example of when you have used logistic regression recently.
Explain false negatives and false positives. Which is it better to have too many of?
What was the business impact of your last project?
Can you explain the difference between a test set and a validation set?
What makes a dataset gold standard?
What are outliers and inliers? What would you do if you found them in your dataset?

General

Data science is still a position with great variety and a lack of standardisation across the market. Therefore, every data science position and company you interview for will take a slightly different approach and expect additional skills and awareness of the surrounding subjects. Be sure to explore the business you're interviewing with; check current employees (data scientists and analysts) and see what additional products, technologies and soft skills they have experience with. Some common general questions are:

What visualisation tools are you familiar with?
Explain a time when you had to handle a stakeholder's expectations.
Describe a time when you have been innovative and creative.
Which cloud services have you used, and how have you interacted with them?
What external data sources do you think could be interesting to our domain?
Present to us your last data science project.
What's a project you would want to work on at our company?
What data would you love to acquire if there were no limitations?
How important is the product in data science?

Eden Smith

If you want more advice or support on how to land your dream data science opportunity, or if you're a manager looking to scale a data science team, get in touch with us today.
Methods for dealing with missing values in datasets
Methods for dealing with missing values in datasets
AlMazloum, Amer Eddin
Heriot-Watt University. Professor: Dr. Hani Ragab

Missing Values in Data

Missing data can occur because of nonresponse: no information is provided for one or more items or for a whole unit ("subject"). Some items are more likely to generate a nonresponse than others.

Missing data mechanisms

Missing completely at random (MCAR): Suppose variable Y has some missing values. We say that these values are MCAR if the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variable in the data set. In other words, whether y is missing depends neither on x nor on y.

Missing at random (MAR): The probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis (say X). In other words, whether y is missing may depend on x, but not on y.

Not missing at random (NMAR): The missing values do depend on unobserved values. In other words, the probability that a value is missing depends on the variable that is missing.

Patterns of Missingness

We can distinguish between two main patterns of missingness. On the one hand, data are missing monotonically if we can observe a pattern among the missing values; note that it may be necessary to reorder variables and/or individuals. On the other hand, data are missing arbitrarily if there is no way to order the variables so as to observe a clear pattern (SAS Institute, 2005).

Methods for handling missing data: Deletion Methods

Listwise deletion: If a case has missing data for any of the variables, then simply exclude that case from the analysis. This is usually the default in statistical packages (Briggs et al., 2003). In this case, rows containing missing values are deleted.

Pairwise deletion: Analysis is performed with all cases in which the variables of interest are present. In other words, only the missing observations are ignored, and analysis is done on the variables that are present.
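The two deletion methods above can be sketched in pandas on invented data (the values here are illustrative only):

```python
import numpy as np
import pandas as pd

# Tiny dataset with missing values in both columns.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})

# Listwise deletion: drop any row with a missing value in any column.
listwise = df.dropna()

# Pairwise deletion: each statistic uses all observations available for
# the variables involved. pandas does this by default -- mean() skips
# NaN per column, and DataFrame.corr() uses pairwise-complete pairs.
pairwise_mean_x = df["x"].mean()   # uses 3 of the 4 rows
```

Listwise deletion discards two of the four rows here, while the pairwise mean of x keeps all three observed values of x, illustrating why pairwise methods retain more information at the cost of different sample sizes per statistic.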
Imputation Methods

Popular averaging techniques: Mean, median and mode are the most popular averaging techniques used to infer missing values. Approaches ranging from a global average for the variable to averages based on groups are usually considered. Put simply, replace the missing value with the sample mean, median or mode.

Conditional mean imputation: Suppose we are estimating a regression model with multiple independent variables. One of them, X, has missing values. We select those cases with complete information and regress X on all the other independent variables. Then, we use the estimated equation to predict X for those cases where it is missing (Graham, 2009; Allison, 2001; Briggs et al., 2003).

Model-Based Methods

Maximum likelihood: We can use this method to get the variance-covariance matrix for the variables in the model based on all the available data points, and then use the obtained variance-covariance matrix to estimate our regression model (Schafer, 1997). In other words, the estimate is the value that is most likely to have resulted in the observed data.

Multiple imputation: The imputed values are draws from a distribution, so they inherently contain some variation. Thus, multiple imputation (MI) solves the limitations of single imputation by introducing an additional form of error based on variation in the parameter estimates across the imputations, called "between-imputation error". It replaces each missing item with two or more acceptable values, representing a distribution of possibilities (Allison, 2001).

How do you deal with missing values: ignore or treat them? The answer depends on the percentage of missing values in the dataset, the variables affected by missing values, whether those missing values are part of the dependent or the independent variables, and so on.
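The averaging techniques above can be sketched in pandas; the group-mean fill below is a rough stand-in for the regression-based conditional mean, not the full procedure (the data is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "income": [30.0, np.nan, 50.0, 70.0],
})

# Simple mean imputation: replace missing values with the overall mean.
overall = df["income"].fillna(df["income"].mean())

# Group-based (conditional) imputation: fill with each group's own mean,
# conditioning the imputed value on another variable.
by_group = df.groupby("group")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```

Note how the two approaches disagree on the filled value: the overall mean here is 50, while the group-conditional fill uses the group's own mean of 30, which is exactly why conditional imputation is generally preferred when a strong predictor is available.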
Missing value treatment is important because the data insights or the performance of your predictive model could be impacted if missing values are not appropriately handled.

In conclusion: assumptions and patterns of missingness are used to determine which methods can be used to deal with missing data.

Sources and useful resources:

Reports:
http://www.bu.edu/sph/files/2014/05/Marina-tech-report.pdf
https://liberalarts.utexas.edu/prc/_files/cs/Missing-Data.pdf

References:
Allison, P., 2001. Missing Data — Quantitative Applications in the Social Sciences, Vol. 136. Thousand Oaks, CA: Sage.
Enders, Craig, 2010. Applied Missing Data Analysis.
STATA 11, 2009. Multiple Imputation. Stata Corp.
Schafer, J. L., 1997. Analysis of Incomplete Multivariate Data.

Useful links:
Data sets with missing values that can be downloaded in different formats, including SAS, STATA, SPSS and S-Plus: http://www.ats.ucla.edu/stat/examples/md/default.htm
Introduction to missing data, with useful examples in SAS: http://www.ats.ucla.edu/stat/sas/modules/missing.htm
Multiple imputation in SAS, with comprehensive explanations: http://www.ats.ucla.edu/stat/sas/seminars/missing_data/part1.htm
Datascience Foundation in Big Data London November 2017
It was great to meet a lot of prospective members. The curiosity factor was very high. Some data scientists called DSF the LinkedIn of data science, which is a good way to look at it.
An Introduction to Data Science: Making Big Data Usable
Our world is increasingly fuelled by data – an unparalleled amount of data, to be precise. More information is available today than at any other point in human history, and all of that data has value, particularly to businesses, organizations and government agencies. Data makes the world go round today. However, getting at that data and actually putting it to use can be difficult. Moreover, not all information has value, depending on the needs of the user. A means to identify, catalogue, collate and extract useful data from the sea of other information was necessary: data science.

What Is Data Science?

Depending on where you look, you'll find a number of different definitions for data science. Some believe that it's merely an offshoot of statistics. Others believe that it's a combination of a few different knowledge extraction models. Yet others believe that it's something else entirely. It's important to note that "data science" isn't true science in the technical definition of the term. Data scientists aren't trying to prove or disprove a hypothesis. So, what is data science, then? It's actually difficult to create a single, cohesive definition for the term, simply because there are so many potential applications of this discipline, from statistics to data modelling to analytics and more. Perhaps the closest we have come to a unified definition comes from Quora user Drew Conway, a PhD student at NYU, who said, "Data science most often refers to the tools and methods used to analyse large amounts of data. As such, the discipline is an amalgamation of many bits from other areas of research. For tools, the influence primarily comes from computer science, where issues of algorithmic efficiency and storage scalability form the main focus. For analysis, however, the influences are much more varied.
Modern methods are borrowed from both the so-called hard sciences (physics, statistics, graph theory) and the social sciences (economics, sociology, political sciences, etc.). Specific classes of techniques that are naturally interdisciplinary are also very popular, such as machine learning." Michael Driscoll, a specialist in data, analytics and visualization, has another definition: "Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible." The University of California at Berkeley also sums up data science rather well: "The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design." However, data science is not just "using data". That's the end goal, yes, but the discipline focuses on how to first organize and access available data, and then put it to use effectively. That last word is the key here: "effectively". Data science enables the effective use of vast quantities of information for other purposes, while also enabling the creation of additional data products (which are themselves data, further feeding the cycle). In the end, data science truly isn't "science". Data scientists don't work in the world of academics. Their employment is solely contingent on the needs of businesses and industries, just as much as a sales clerk's position or a particular supplier's line of products. So, what do data scientists actually do?
Here's a rough outline:

Ask questions in order to answer or solve known problems, or find new solutions to problems with which businesses must deal
Define the data necessary for a particular need
Work with existing data to collect, store and explore it
Determine the type of analysis needed for a particular situation or type of data
Use algorithms and other tools to parse, clean, quality-check and utilize data
Transform the insights learned into formats usable by non-data scientists (for use by others within the business), including the creation of graphs, charts, infographics and more
Create software to automate data science tasks based on specific business requirements and data sources

So, data scientists must successfully combine several professional fields, including statistics, software programming, mathematics, research and subject expertise.

A Few Basic Examples of Use Cases

Before we dive too deep down the rabbit hole, let's consider a few basic real-world examples of data science in action.

Microsoft Word: Sure, MS Word isn't the most advanced piece of software on the planet, but the word processing program does show us some very good applications of data science. Consider the fact that the program essentially learns more about its users through interaction; it stores and parses data on its own. Microsoft has also done a great deal of work in building the program's spellchecking and grammar capabilities. Does it match what you'd get from a professional editor? No. However, it does an excellent job for a computer program (that's not to say there aren't better programs available, but Word is the industry standard, and Microsoft's achievements in data usage are noteworthy).

Google PageRank: Google might be the world's uncontested master when it comes to all things data. However, the search giant's PageRank function is an ideal example of data science in action.
This function essentially uses the number of links pointing at a particular domain (data outside the page itself) to help rank the authority and relevance of different websites.

Facebook: Sure, the big blue social network has access to tons of information, but it puts that information to use in a number of innovative ways. For instance, Facebook uses the vast amounts of information it has about various users to help make suggestions for new friends and connections. These suggestions are based on patterns of friendships, including among people that you might not even be connected with on the site, and they can be extremely accurate (sometimes frighteningly so). These are just a few relatively basic examples of data science in action in the world around us, in programs and websites used every day.

Why Is Data Science Necessary?

Information is vital to every aspect of human existence, and it has been since the dawn of our species. A hunter-gatherer had to rely on his or her knowledge (information) to make decisions about various plant species (would it be deadly to eat?) as well as the animals hunted for food and pelts. Early agriculturalists had to rely on information to make informed decisions about planting, harvesting and preparation. Putting data to use has been with us since time out of mind; that hasn't changed. What has changed, though, is the volume of data available. O'Reilly makes an excellent point here: "The question facing every company today, every start-up, every non-profit, every project site that wants to attract a community, is how to use data effectively – not just their own data, but all the data that's available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well defined kinds of analytics. What differentiates data science from statistics is that data science is a holistic approach.
We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into tractable form, making it tell its story, and presenting that story to others."

Given that humans have been parsing, storing, organizing and using data for millennia, it's natural to wonder why a new discipline is even necessary. Isn't statistics enough? Isn't current database software sufficient? To put it simply – no. According to a story in VentureBeat, "Today's modern business needs to manage far more data than ever before, and few have the talent on staff for the job. Projections indicate that the (data scientist) market will experience meteoric growth in the next several years." GigaOM backs that up: "Every organization will need someone wearing the data scientist hat just like every organization has people responsible for product, sales, marketing and support." The rise of Big Data and its increasing importance to businesses of all sizes, in every industry, and even the government sector, has made data science not only vital, but one of the fastest growing fields in the world.

What Is Big Data, Really?

You hear the term "Big Data" thrown around a lot today, but what does it mean, really? Organizations have had access to massive amounts of information for a very long time. Oil companies are prime examples of this, but there are many others, including massive retail chains like Wal-Mart. What makes today different in terms of the amount of information available? What warrants the specific designation "Big Data"? According to Oxford University Press' Oxford Dictionary, Big Data is "data sets that are too large and complex to manipulate or interrogate with standard methods or tools." That tells us a little bit, but it's not the full story. Forbes magazine weighs in with a slightly different take on the question.
In a story written by Lisa Arthur for Forbes, the author defines Big Data as "a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis." That's a bit more illuminating. Arthur goes on to state that, "In defining big data, it's also important to understand the mix of unstructured and multi-structured data that comprises the volume of information. Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it's very text-heavy. Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information." So, unstructured information could be information that's entered into a form field (either online or off), while multi-structured data could be derived from scraping a website.

Several conditions go into making Big Data, which we will describe as the Five Vs of Big Data:

- Volume (the sheer amount of data)
- Velocity (the speed at which information is generated and aggregated)
- Variety (the types of data)
- Veracity (the authenticity of the data)
- Value (what the data is actually worth to the company)

However, it's very important to understand that the meaning of Big Data will vary from business to business and organization to organization. In each instance, this information will do something different, offer something different, and mean something different.

The Explosion of Data Available Today

As mentioned, there's more information available today than at any point in history, and that amount is growing exponentially. In a sense, it feeds itself. As more data is explored, collated, categorized and packaged, more information is created.
To truly understand not only what data science is but why it has become such an essential skillset and discipline in the modern world, it's important to understand the lifecycle of data – where it comes from, how that information is used, and more. We'll begin by exploring where data originates. It comes from everywhere. Every single search query through Google or Bing is data. Every picture uploaded is data. Every Vine video created is data. You get the idea. However, it's not limited to data made available online. Data is literally everywhere. O'Reilly states that, "Data is everywhere: your government, your web server, your business partners, even your body. While we aren't drowning in a sea of data, we're finding that almost everything can (or has) been instrumented." For a good example of how much data is being generated today, consider a simple Amazon recommendation, called an "also bought". Let's say you go to Amazon searching for a new washing machine. You're looking for the right deal, the right features, and the right size. You can sort your options, compare different models and more. You arrive at what seems to be the perfect solution, and there, near the bottom of the page, is a list of items that other customers "also bought" when viewing or purchasing the same model of washing machine. Amazon had to first identify that information, then collate and store it, and then serve it up to you based on nothing more than your online search query through their website. You leave data behind you with every single action you take online. Using Google to search for a dog training program? That query doesn't disappear when you close the browser. Google stores your information and uses it. And it's more than just your query terms. Google is also storing geo-location information and a great deal more.
Now apply that to mobile apps, which leave an even richer trail of data behind when you're done (consisting of electronic data as well as possibly audio and video information, exact geo-location data, and a great deal more). To go beyond the online and device scenario, consider your frequent shopper card. You use it to get access to important discounts on your groceries or fuel, but every single swipe generates an immense amount of information about your shopping habits, your preferences, the specific retail store locations where you prefer to spend your time… You get the picture. Everything we do today generates data. However, all that information would be worthless if there weren't a way to store it. This is the very beginning of data science. Storage for data must be more than just a data dump into a database somewhere online, though. Our storage solutions have become ever more sophisticated over time. We've moved from paper ledgers to electronic spreadsheets to incredibly intricate digital storage systems capable of holding immense amounts of information. Storage is not the end of it, though. Moore's Law (concerning the advance of computer technology) can be applied equally well to the growth of data. As stated by O'Reilly, "The importance of Moore's Law as applied to data isn't just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put in it." It's the same concept behind our desire for larger homes. We want larger homes to have more space, but that space inevitably becomes filled with possessions – we buy new furniture, new dishes, new computers, new televisions. This brings us to the next hurdle – the more data you store, the more sophisticated your data analysis solutions must be. Things were simple enough when businesses ran on physical ledgers and the owner's knowledge of customers, but today things are much more complicated.
This is the very foundation of data science – expanding our ability to analyse and use the ever-growing volume of data available to us.

Making Data Useful

Data without meaning is useless. Data without context is pointless. Data without structure is unusable. Today's organizations must do more than just warehouse information; accessing, analysing, and using that information requires more than just basic software. In order to make data useful, it must first be analysed, or "conditioned". Really, this is nothing more than separating the wheat from the chaff, so to speak. You need to analyse the data and then determine what is useful and what is not. For a company selling athletic shoes, your purchase of gardening tools is probably irrelevant. However, your purchase of a gym membership would be valuable information. Your search on Google for running tracks near your home would also be important data. The problem here is that while a great many new ways of delivering machine-consumable data have been created (Atom feeds and microformats, for instance), much of the data found in the wild is very messy and chaotic. This type of information cannot simply be inserted into an XML file. There's simply too much garbage mixed into the data for software to use it directly. So, data conditioning also includes clean-up – for instance, removing the HTML code from data scraped from a website so that only the pertinent information remains, and none of the underlying code of the page. Once the data has been conditioned and cleaned, it's time to move to the next step, which is basically quality control. For instance, let's say you were interested in gathering physical mailing addresses for potential customers who may be interested in your new product. You could gather much of that information online, but it may not be complete.
You might have the first and last name, as well as the street address, of a particular person, but be missing their postcode (this is just a very basic example). You may also have incongruous information – that is, data that doesn't seem to relate to your goal. According to scientists, the depletion of the ozone layer was discovered because someone decided to take a look at the incongruous data that was gathered, rather than discarding it (deciding whether data is incongruous because of gathering errors, or whether there's another story underlying that incongruity, is part of the job of a data scientist). Now you have to add in the problem of human language. This can be considerable, even when you're dealing with just one language. For instance, parsing data in English requires a significant understanding of elements that relate directly to the particular task. O'Reilly uses the following example, which is an excellent illustration of the complexities involved with quality assurance in data science. Roger Magoulas, head of O'Reilly's data analysis group, recently had to search for Apple job listings that required candidates to have geolocation skills. The problem was that not only did Magoulas need an understanding of job listing formats, but also the ability to parse English and to separate Apple-specific listings from the wide range of other job postings out there (as well as other information related to geolocation but not pertinent to Apple or employment with the company). This is a perfect example of the difficulty of parsing and quality-checking data, and it extends well beyond the search for employment with Apple. Consider running a query for information on Python, the programming language. Google returns an immense number of hits for the snake instead, leaving you to sort out the right answers for yourself.
And this is just for English – add in the hundreds of other languages spoken around the world, and you begin to get a sense of just how daunting and difficult this can be. Software is helpful here, but it's not always the best solution. Often, data scientists are required to lean on their own understanding (human intelligence versus machine intelligence). Of course, this brings in the question of additional manpower and costs. A single data scientist simply doesn't have the ability to sort through 10,000 potential listings to determine relevance, not in any realistic way. That means hiring help (which admittedly can be done relatively easily through services like Amazon's Mechanical Turk or sites like Fiverr.com).

Using the Data

Once the data has been gathered (or harvested, if you prefer), gone through the cleaning process and been analysed, it must be put to use. This is done in many different ways and will depend largely on the business in question – its needs and initiatives will inform the ways that data is used. The first step here is visualization, which can be done in any number of ways, including Venn diagrams, charts, graphs, tables and more. Again, the visualization format will need to fit the business, as well as the initiative for which the data is being analysed. The visualization format for a field map of a company's competition would look very different from the format for data depicting potential customers within a 10-mile radius of a company's brick-and-mortar shop, but both are examples of what's possible with data science. With this being said, perhaps the most common visualization format is a graph. Analysis generally leads to the production of information in numeric format. Obviously, that's understandable by machines, but it's not of much use to human beings. Those numbers need to be plotted out in a visual format, giving meaning to the information highlighted.
For example, declining sales numbers might be interesting in a purely numeric format, but they become much more compelling when given life as a graphic depicting the rapid decline of a company's profits. In fact, visualization is so important to the data scientist that many employ it at each stage of the process. For instance, one might use scatter plots to get a sense of what's interesting in the information gleaned before beginning an analysis. Another might plot their data to get a sense of how skewed it is, or how much false information is included, before conditioning. Even animation can be included here to get a sense of how information (and corresponding real-world trends) changes over time. Some of the programs used to create visualizations include the following:

- R
- Processing
- Many Eyes (IBM)
- GnuPlot

After data visualization comes data implementation – actually putting the information gleaned to use. This is generally not the role of the data scientist. After creating a visualization format for the information, it is handed off to others within the organization and the data scientist moves on to the next project. Other teams or professionals will take the information and use it to move the company forward towards its objectives.

In Conclusion: The Future for Data Scientists

What do the rise of Big Data and the increasing use of data science mean for the world at large? They have a number of impacts, but the most salient is this: every business will need someone to fill the position of data scientist, even if it doesn't go by that name. Every business, every corporation, every organization and even government agencies must have qualified data scientists capable of transforming raw information into something that can be used to move the organization forward in multiple ways. According to GlassDoor.com, data science jobs have reached "critical mass" in many ways.
It is currently the 15th highest-paying job in demand, with almost 3,500 openings and an average base salary of over $100,000 annually (in the US – other nations vary). It is also currently ranked as the 9th best job in the US. This is not about to change, either. A study by the McKinsey Global Institute stated that "a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge." It goes on to estimate that up to 5 million jobs in the United States alone will require skills in data science by the year 2018. These include positions for data analysts, data engineers, statisticians and data scientists. Data science professionals have unique abilities and skills, as well. They combine the spirit of entrepreneurship with patience, mathematical skills with the ability to explore and make connections, and computer science skills with an understanding of human behaviour. UC Berkeley states, "Virtually every sector of the economy now has access to more data than would have been imaginable even a decade ago. Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it. The question facing every organization that wants to attract a community is how to use data effectively – not just their own data, but all the data that's available and relevant." Even the New York Times has weighed in on the burgeoning field of data science, saying, "This hot new field promises to revolutionize industries from business to government, health care to academia." Obviously, Big Data is here to stay, and the need for professionals capable of transforming that raw information into something that can be used to further organizational goals, create community, foster customer engagement and more is immense.
Sources:
- https://beta.oreilly.com/ideas/what-is-data-science
- http://datascience.berkeley.edu/about/what-is-data-science/
- http://www.revelytix.com/?q=content/what-data-science-0
- http://www.quora.com/What-is-data-science
- http://venturebeat.com/2013/11/11/data-scientists-needed/
- https://gigaom.com/2013/01/06/why-data-scientists-matter-data-science-is-the-future-of-everything/
- http://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/
Data Mining: Models and Methods
What is Data Mining?

Data mining refers to the discovery and extraction of patterns and knowledge from large sets of structured and unstructured data. Data mining techniques have been around for many decades; however, recent advances in Machine Learning (ML), computer performance, and numerical computation have made data mining methods easier to apply to large data sets and business-centric tasks. The growing popularity of data mining in business analytics and marketing is also due to the proliferation of Big Data and Cloud Computing. Large distributed databases, together with methods for the parallel processing of data such as MapReduce, make huge volumes of data manageable and useful for companies and academia. Similarly, the cost of storing and managing data is reduced by cloud service providers (CSPs), who offer a pay-as-you-go model for access to virtualised servers, storage capacity (disc drives), GPUs (Graphics Processing Units), and distributed databases. As a result, companies can store, process, and analyze more data, gaining better business insights.

By themselves, state-of-the-art data mining methods are powerful in many classes of tasks, among them anomaly detection, clustering, classification, association rule learning, regression, and summarization. Each of these tasks plays a crucial role in almost any setting one might think of. For example, anomaly detection techniques help companies protect against network intrusion and data breaches. Regression models, in turn, are powerful in the prediction of business trends, revenues, and expenses. Clustering techniques have the highest utility in grouping huge volumes of data into cohesive entities that reveal patterns and dependencies both within and among them, without prior knowledge of any laws that govern the observations. As these examples illustrate, data mining has the power to put data into the service of businesses and entire communities.
Data Mining Models

There exist numerous ways to organize and analyze data. Which approach to select depends largely on our purpose (e.g. prediction, or the inference of relationships) and the form of the data (structured vs. unstructured). We can end up with a particular configuration of data that is good for one task, but not so good for another. Thus, to make data usable, one should be aware of the theoretical models and approaches used in data mining and understand the possible trade-offs and pitfalls of each of them.

Parametric and Non-Parametric Models

One way of looking at a data mining model is to determine whether it has parameters or not. In terms of parameters, we have a choice between parametric and non-parametric models. In the first type of model, we select a function that, in our view, is the best fit to the training data. For instance, we may choose a linear function of the form F(X) = Q0 + Q1x1 + Q2x2 + ... + Qpxp, in which the x's are features of the input data (e.g. house size, floor, number of rooms) and the Q's are the unknown parameters of the model. These parameters may be thought of as weights that determine the contribution of the different features (e.g. house size, floor, number of rooms) to the value of the function Y (e.g. house price). The task of a parametric model is then to find the parameters Q using statistical methods, such as linear regression or logistic regression. The main advantage of parametric models is that they capture intuition about the relationships among the features in our data. This makes parametric models an excellent heuristic, inference, and prediction tool. At the same time, however, parametric models have several pitfalls. If the function we have selected is too simple, it may fail to properly explain patterns in complex data. This problem, known as underfitting, is frequent when linear functions are used with non-linear data.
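To make this concrete, here is a minimal Python sketch of such a parametric linear model. The house features and parameter values below are invented purely for illustration; a real parametric method would estimate the parameters from training data rather than hard-coding them.

```python
# A parametric model: the prediction is a fixed-form function of the inputs,
# and "learning" means choosing the parameters Q0..Qp.
def predict(params, features):
    """Linear model F(X) = Q0 + Q1*x1 + ... + Qp*xp."""
    q0, *weights = params
    return q0 + sum(q * x for q, x in zip(weights, features))

# Hypothetical parameters: intercept, price per square metre, price per floor
params = [50_000, 1_200, 3_000]
house = [80, 2]                # 80 square metres, 2nd floor
print(predict(params, house))  # 50000 + 1200*80 + 3000*2 = 152000
```

The entire "shape" of the model is fixed in advance; only the numbers in `params` are learned, which is exactly what makes the model parametric.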
On the other hand, if our function is too complex (e.g. with polynomials), we may end up with overfitting: a scenario in which our model responds to the noise in the data rather than actual patterns, and does not generalize to new examples.

Figure #1: Examples of normal, underfit, and overfit models.

Non-parametric models are free from these issues because they make no assumptions about the underlying form of the function. Therefore, non-parametric models are good at dealing with unstructured data. On the other hand, since non-parametric models do not reduce the problem to the estimation of a small number of parameters, they require very large datasets in order to obtain a precise estimate of the function.

Restrictive vs. Flexible Methods

Data mining and ML models may also differ in terms of flexibility. Generally speaking, parametric models, such as linear regression, are considered to be highly restrictive, because they need structured data and actual responses (Y) to work. This very feature, however, makes them suitable for inference – finding relationships between features (e.g. how the crime rate in a neighborhood affects house prices). Because of this, restrictive models are interpretable and clear. The same is not true for flexible models (e.g. non-parametric models). Because flexible models make no assumptions about the form of the function that governs the observations, they are less interpretable. In many settings, however, the lack of interpretability is not a concern. For example, when our only interest is the prediction of stock prices, the interpretability of the model hardly matters at all.

Supervised vs. Unsupervised Learning

Nowadays, we hear a lot about supervised and unsupervised Machine Learning. New neural networks based on these concepts are making daily progress in image and speech recognition and autonomous driving.
A natural question, though, is: what is the difference between the supervised and unsupervised learning approaches? The main difference is in the form of the data used and the techniques for analyzing it. In a supervised learning setting, we use labeled data that consists of features/variables and a dependent variable (Y, or response). This data is then fed to the learning algorithm, which searches for patterns and for a function that governs the relationship between the independent and dependent variables. The retrieved function may then be applied to the prediction of future observations. In unsupervised learning, we also observe a vector of features (e.g. house size, floor). The difference from supervised learning, though, is that we don't have any associated responses (Y). In this case, we cannot apply a linear regression model, since there are no response values to predict. Thus, in an unsupervised setting, we are working blind in some sense.

Data Mining Methods

In this section, we are going to describe the technical details of several data mining methods. Our choice fell on linear regression, classification, and clustering. These are among the most popular methods in data mining because they solve a wide variety of tasks, including inference and prediction. Also, these methods perfectly illustrate the key features of the data mining models described above. For example, linear regression and classification (logistic regression) are examples of parametric, supervised, and restrictive methods, whereas clustering (k-means) belongs to the subset of non-parametric unsupervised methods.

Linear Regression for Machine Learning

Linear regression is a method of finding a linear function that reasonably approximates the relationship between the data points and a dependent variable. In other words, it finds an optimized function to represent and explain the data.
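The difference in the form of the data can be sketched in a few lines of Python; the feature vectors and prices below are invented for illustration:

```python
# Supervised learning: each feature vector (house size, floor) comes
# with an observed response y (the price).
labeled_data = [
    ([120, 3], 250_000),
    ([80, 1], 150_000),
    ([95, 2], 180_000),
]

# Unsupervised learning: the same feature vectors, but with no responses.
# There is nothing to "predict", only structure to discover.
unlabeled_data = [x for x, _ in labeled_data]

features, responses = zip(*labeled_data)
print(len(features), len(responses))  # a supervised algorithm sees both
print(unlabeled_data)                 # an unsupervised algorithm sees only this
```

A supervised algorithm can measure its error against the responses; an unsupervised one can only look for structure among the feature vectors themselves.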
Contemporary advances in processing power and computation methods allow linear regression to be used in combination with ML algorithms to produce quick and efficient function optimization. In this section, we will describe an implementation of linear regression with gradient descent that fits a linear function to the data algorithmically.

Image #1: Linear regression

For this task, let's take the case of house price prediction. Let's assume we have a training set of 100 house examples (m = 100). Each house in this sample may be denoted x1, x2, x3, ..., xm. Correspondingly, each house has a set of features or properties, such as house size and floor. Features may be thought of as variables that determine a house's price. So, for example, one variable would refer to the size of the first house in the training sample. Finally, our training sample has a list of prices for each house, denoted y1, y2, ..., ym. This data tells us much by itself (e.g. we may apply some methods of descriptive statistics to interpret it); however, in order to run a linear regression, we should first formulate an initial hypothesis. Our hypothesis may be defined as a simple linear function with three parameters (Q):

hQ(x) = Q0 + Q1x1 + Q2x2

where x1 and x2 are the features (house size and floor) and the Q's are the parameters of the function we want to learn with linear regression. In essence, this hypothesis says that a house price is determined by house size and floor, parametrized by certain parameters Q. Thus, we have confirmation that linear regression is a parametric model, in which we try to fit the right parameters to find the configuration that best explains the data. However, what method should we apply to determine the right parameters? Intuitively, our task is to fit parameters that ensure our hypothesis h(x) for each house is close to y, the real-world price.

Image #2: Gradient Descent
For that purpose, we have to define a cost function that evaluates the difference between the predicted values and the actual values:

Equation #1: Cost Function for Linear Regression
J(Q) = (1 / 2m) * Σ i=1..m (hQ(xi) - yi)^2

The right-most part of this equation is a version of the popular least-squares method: it calculates the squared difference between the training value y and the value predicted by our hypothesis function hQ(x). Our task is then to minimize the cost function so that the prediction error is as small as possible. One of the most popular solutions to this problem is the gradient descent algorithm, based on the mathematical properties of the gradient. The gradient is a vector-valued function that points in the direction of the greatest rate of increase of a function. In the case of a multi-variate function, its gradient is the vector whose components are the partial derivatives of the function. Since the gradient points in the direction of the function's growth, it may be used to find parameters that minimize our function: to achieve this, we simply need to move in the opposite direction. This technique of gradually moving down the function to find a local or global minimum is known as gradient descent and is demonstrated in the image above. To implement gradient descent for our linear regression, we start with random parameters Q and then repeatedly update them in the direction opposite to the gradient vector until convergence to the global minimum (with a linear hypothesis, the squared-error cost is convex, so such a global minimum is guaranteed to exist). The gradient descent procedure is defined by the update rule

Equation #2: Gradient Descent Algorithm
Qj := Qj - a * ∂J(Q)/∂Qj

where a is the learning rate at which we set our algorithm to learn. The learning rate should not be too large, so that we don't jump over the global minimum, and should not be too small, because then the process would take too much time. The partial derivative in the right-most part of the equation is calculated for each parameter to construct the gradient vector.
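As a quick sketch, the squared-error cost can be computed directly from this definition. The two-feature hypothesis matches the house example, but the toy data and parameter values below are invented for illustration:

```python
def hypothesis(q, x):
    """h_Q(x) = Q0 + Q1*x1 + Q2*x2 - the linear hypothesis."""
    return q[0] + q[1] * x[0] + q[2] * x[1]

def cost(q, xs, ys):
    """J(Q) = (1/2m) * sum over i of (h_Q(x_i) - y_i)^2."""
    m = len(xs)
    return sum((hypothesis(q, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy training data: (size, floor) -> price
xs = [[80, 2], [120, 3]]
ys = [150_000, 250_000]

# With all-zero parameters every prediction is 0, so the cost is huge;
# better-chosen parameters bring the cost down.
print(cost([0, 0, 0], xs, ys))
print(cost([10_000, 1_500, 10_000], xs, ys) < cost([0, 0, 0], xs, ys))  # True
```

Minimizing this function over the three parameters is exactly what the gradient descent procedure described next does.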
The partial derivative is derived in the following way:

Equation #3: Partial Derivative of Gradient Descent
∂J(Q)/∂Qj = (1/m) * Σ i=1..m (hQ(xi) - yi) * xj,i

where xj,i is the value of feature j for the i-th training example. Putting the partial derivatives and the learning rate together produces the final update rule

Equation #4: Gradient Descent Update Rule
Qj := Qj - (a/m) * Σ i=1..m (hQ(xi) - yi) * xj,i

which should be repeated until our algorithm finds the global minimum. That is the point at which our parameters (Q) produce the function that best explains the training data. The learned function may now be used to predict prices for houses not included in the training sample, and employed in the inference of various relationships between the features of our model. This functionality makes linear regression with gradient descent a powerful technique in both data mining and machine learning.

Classification with Logistic Regression

Classification is the process of determining the class/category to which an object belongs. Classification techniques implemented via machine learning algorithms have numerous applications, ranging from email spam filtering to medical diagnostics and recommender systems. As with linear regression, in a classification problem we work with a labeled training set that includes some features. However, observations in the data set map not to a quantitative value, as in linear regression, but to a categorical value (i.e. a class). For example, patients' medical records may determine two classes of patients: those with benign and those with malignant cancers. The task of the classification algorithm is then to learn a function that best predicts which type of cancer (malignant vs. benign) a patient has. If there are only two classes, the problem is known as binary classification. In contrast, multi-class classification may be used when we have more classes of data. One of the most common classification techniques in data science and ML is logistic regression. Logistic regression is based on the sigmoid function, which has an interesting property: it maps any real number to the (0, 1) interval.
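A bare-bones implementation of this update loop in plain Python might look as follows. The synthetic data, learning rate, and iteration count are all invented for illustration, not taken from the 100-house example above:

```python
def gradient_descent(xs, ys, alpha=0.05, iters=20_000):
    """Fit h_Q(x) = Q0 + Q1*x1 + Q2*x2 by repeating the update
    Qj := Qj - alpha * (1/m) * sum_i (h_Q(x_i) - y_i) * xj_i."""
    m = len(xs)
    q = [0.0, 0.0, 0.0]  # initial parameters (zeros work as well as random)
    for _ in range(iters):
        preds = [q[0] + q[1] * x[0] + q[2] * x[1] for x in xs]
        errors = [p - y for p, y in zip(preds, ys)]
        grad = [
            sum(errors) / m,                                 # dJ/dQ0 (x0 = 1)
            sum(e * x[0] for e, x in zip(errors, xs)) / m,   # dJ/dQ1
            sum(e * x[1] for e, x in zip(errors, xs)) / m,   # dJ/dQ2
        ]
        q = [qj - alpha * g for qj, g in zip(q, grad)]
    return q

# Tiny synthetic set where price = 1 + 2*size + 3*floor exactly
xs = [[1, 1], [2, 1], [3, 2], [4, 3]]
ys = [1 + 2 * s + 3 * f for s, f in xs]
print([round(v, 2) for v in gradient_descent(xs, ys)])  # approaches [1.0, 2.0, 3.0]
```

Because the cost surface is convex, the loop recovers the generating parameters; with real data the features would typically be scaled first so that a single learning rate works well for all of them.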
As a result, it may be effectively used to evaluate the probability (between 0 and 1) that an observation falls within a certain category. For example, if we define benign cancer as 0 and malignant cancer as 1, a logistic value of 0.6 would mean that there is a 60% chance that a patient's cancer is malignant. These properties make the sigmoid function useful for binary classification, but multi-class classification is also possible.

Image #3: Sigmoid Function

The formula for the sigmoid function is:

Equation #5: Sigmoid Function
g(z) = 1 / (1 + e^(-z))

where e ≈ 2.71828 is the base of the natural logarithm. To build a working classification model, we should put our hypothesis into the sigmoid function. Remember that our hypothesis function has the form

hQ(x) = Q0 + Q1x1 + Q2x2

For convenience, it may be written in the vector form z = QTx, where the superscript T refers to the transpose of the parameter vector Q. As a result of this transformation, we get the following function:

Equation #6: Logistic Regression Model
hQ(x) = g(z) = 1 / (1 + e^(-QTx))

where z refers to the vector representation of our initial hypothesis hQ(x). In order to fit the parameters Q of our logistic regression model, we should first redefine it in probabilistic terms. This is also needed to leverage the power of the sigmoid function as a classifier.

Equation #7: Probabilistic Interpretation of Classification
P(y = 1 | x; Q) = hQ(x)
P(y = 0 | x; Q) = 1 - hQ(x)

The above definition stems from the basic rule that the probabilities of the two classes always add up to 1. So, for example, if the probability of malignant cancer is 0.7, the probability of benign cancer is automatically 1 - 0.7 = 0.3. The equation above formalizes this obvious observation. Now that we have defined the hypothesis and the probabilistic assumptions, it's time to construct a cost function, in the same way we did for linear regression. For that purpose, we need a transformation of our original sigmoid function, because plugging it into the squared-error cost produces a complex non-convex function with many local minima.
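As a small sketch, the sigmoid and the resulting logistic hypothesis can be written directly from these formulas; the parameter and feature values are made up for illustration:

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_hypothesis(q, x):
    """h_Q(x) = g(Q0 + Q1*x1 + ...): the probability of class 1."""
    z = q[0] + sum(qi * xi for qi, xi in zip(q[1:], x))
    return sigmoid(z)

print(sigmoid(0))  # 0.5 - the decision boundary
p = logistic_hypothesis([-1.0, 0.5], [4.0])  # hypothetical parameters/feature
print(round(p, 3))  # 0.731 -> roughly a 73% chance of class 1
```

Thresholding the returned probability at 0.5 turns this into a binary classifier.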
If used with a least-squares cost like the one from the linear regression model below, gradient descent will struggle to converge.

Equation #8: Linear Regression Cost Function

J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

Instead, to capture the intuition behind the sigmoid function, we use the log of the probabilities from the probabilistic definition of the classification problem above:

Equation #9: Logistic Regression Cost Function

Cost(hθ(x), y) = −log(hθ(x)) if y = 1, and −log(1 − hθ(x)) if y = 0

The graph below illustrates that the log function assigns a high cost when our hypothesis is wrong and no cost when it is right. If y = 1 and hθ(x) = 1, then the cost → 0. In contrast, if y = 1 and hθ(x) = 0, the cost goes to infinity. The opposite happens when y = 0.

Image #4: −log(x) Function

We can simplify the cost function by merging these two cases. The final cost function, ready for use with logistic regression, has the following form:

Equation #10: Logistic Regression Cost Function (Simplified)

J(θ) = −(1/m) Σ [y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i)))]

Now that the cost function is formulated, we can apply gradient descent identical in form to the one used in linear regression:

Equation #11: Logistic Regression Update Rule

θj := θj − α (1/m) Σ (hθ(x(i)) − y(i)) xj(i)

A similar technique may be applied to multi-class classification problems with more than two classes. In general, multi-class problems use a one-vs-all approach, in which we choose one class and lump all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returns the highest value as our prediction.

Image #5: One-vs-All (Multi-class) Classification

Clustering Methods

As we have seen, clustering is an unsupervised method that is useful when the data is not labeled, i.e., when there are no response values (y). Clustering the observations of a data set involves partitioning them into distinct groups so that observations within each group are quite similar to each other, while observations in different groups have less in common.
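Before exploring clustering further, the classification pieces above (the sigmoid hypothesis, the simplified log-loss cost, and the gradient-descent update rule) can be combined into a minimal sketch. This is an illustrative pure-Python implementation on made-up toy data, not code from the original text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, alpha=0.1, iterations=5000):
    """Fit parameters theta with batch gradient descent on the log-loss.

    X: list of feature vectors (a leading 1 is added for the intercept theta_0).
    y: list of 0/1 labels.
    """
    m = len(X)
    n = len(X[0]) + 1                      # +1 for the intercept term
    theta = [0.0] * n
    rows = [[1.0] + list(x) for x in X]    # prepend x_0 = 1
    for _ in range(iterations):
        # Gradient of the log-loss: (1/m) * sum((h(x) - y) * x_j)
        grad = [0.0] * n
        for xi, yi in zip(rows, y):
            h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
            for j in range(n):
                grad[j] += (h - yi) * xi[j]
        # Simultaneous update of every theta_j
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    """Return the estimated probability that x belongs to class 1."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))

# Toy, linearly separable data: one feature, class boundary near 3.
X = [[1.0], [2.0], [2.5], [3.5], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
theta = train_logistic(X, y)
print(predict(theta, [1.0]))   # probability close to 0
print(predict(theta, [5.0]))   # probability close to 1
```

In practice a library such as scikit-learn would be used instead, but the loop above makes the correspondence with the update rule explicit: each iteration computes hθ(x) for every observation, accumulates the gradient, and moves every θj a small step downhill.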
To illustrate this method, let's take an example from marketing. Assume that we have a large volume of data about consumers. This data may include median household income, occupation, distance from the nearest urban area, and so forth. This information may then be used for market segmentation. Our task is to identify various groups of customers without prior knowledge of the commonalities that may exist among them. Such a segmentation may then be used to tailor marketing campaigns that target specific clusters of consumers. There are many clustering techniques for doing this, but the most popular are the k-means algorithm and hierarchical clustering. In this section, we are going to describe the k-means method, a very efficient algorithm that covers a wide range of use cases.

In k-means clustering, we want to partition observations into a pre-specified number of clusters. Although having to set the number of clusters in advance is considered a limitation of the k-means algorithm, it is still a very powerful technique. In our clustering problem, we are given a training set of consumers x(1), ..., x(m), where each observation is a vector of features describing various properties of a consumer, such as median income, age, gender, and so forth. The rule is that each observation (consumer) should belong to exactly one cluster, and no observation should belong to more than one cluster.

Image #6: Illustration of the Clustering Process

The idea behind k-means clustering is that a good cluster is one for which the within-cluster variation, or within-cluster sum of squares (WCSS), is minimal. In other words, consumers in the same cluster should have more in common with each other than with consumers from other clusters. To achieve this configuration, our task is to algorithmically minimize the WCSS over all pre-specified clusters.
This task is expressed in the following equation:

Equation #12: Within-cluster Sum of Squares (WCSS)

minimize over C1, ..., CK: Σk (1/|Ck|) Σ i,i' ∈ Ck Σj (xij − xi'j)²

where |Ck| denotes the number of observations in the kth cluster. In words, the equation above says that we want to partition the observations into K clusters so that the total within-cluster variation, summed over all K clusters, is as small as possible. The within-cluster variation for the kth cluster is the sum of all of the pairwise squared Euclidean distances between the observations in this cluster, divided by the total number of observations in the kth cluster.

As in the case of linear regression, to minimize this function we start from some initial guess. Our task is to find the cluster centroids (the average position of all points in a cluster), or means, for each cluster. This may be achieved via a three-step algorithm:

1. Randomly initialize K cluster centroids m1, m2, ..., mK.

2. Assign each observation in the data set to the cluster that yields the least WCSS. Intuitively, this is the cluster with the 'nearest' mean (centroid): we calculate the Euclidean distance between the observation and each centroid and select the centroid with the smallest distance.

Equation #13: Assigning an Observation to the Closest Centroid

c(i) := argmin over k of ||x(i) − mk||²

3. Move each centroid by recomputing it for its cluster. The centroid of the kth cluster may be defined as the vector of the p feature means of the observations in that cluster. So, for example, if x1, x2, x3 all belong to the same cluster (C2), then the centroid m2 is defined by their average:

Equation #14: Centroid Calculation Example

m2 = (x1 + x2 + x3) / 3

Since the arithmetic mean is the least-squares estimator of a cluster's center, this step also minimizes the within-cluster sum of squares. Steps 2 and 3 are repeated, so that as the algorithm runs, the clustering obtained continually improves until the result no longer changes. When this happens, a local optimum has been reached and the clusters become stable.
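The three-step procedure above can be sketched as follows. This is an illustrative pure-Python implementation on made-up 2-D data (in practice a library such as scikit-learn would normally be used):

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 2: assign each point to the nearest centroid
        # (smallest squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:         # stable clusters: local optimum reached
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of 2-D points (think of rescaled income vs. age).
points = [(1, 1), (1.5, 2), (2, 1.5), (8, 8), (8.5, 9), (9, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))   # the two cluster means
```

On this toy data the algorithm converges in a few iterations to centroids near (1.5, 1.5) and (8.5, 8.5), the means of the two groups, regardless of which points the random initialization picks.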
Conclusion

The data mining models and methods described in this paper allow data scientists to perform a wide array of tasks, including inference, prediction, and analysis. Linear regression is powerful for predicting trends and inferring relationships between features. In turn, logistic regression may be used for the automatic classification of behaviors, processes, and objects, which makes it useful in business analytics and anomaly detection. Finally, clustering allows data scientists to gain insights into unlabeled data and to infer hidden relationships that may drive effective business decisions and strategic choices.
Developing a Code of Conduct for the Data Science and Analytics Sector
The Data Science Foundation is reviewing its Code of Conduct and the services it delivers to members. A six-month consultation period will commence in January 2018 with both internal and external stakeholders from industry, education and government. In advance of this, we are inviting members to participate in a fact-finding exercise that will help to shape the debate. Below this introduction you will see five questions; we would be grateful if you would send your responses by email.

Looking back, the Foundation has had a tremendous 2017: increasing membership numbers, developing contacts, and offering members more online functionality, including the development of Personal Profile Pages, Published By Pages, Messaging Facilities and the Discussion Forum. We will be launching a major new initiative with a partner at Big Data LDN on 15th November and will follow this up with the launch of the Data Science Writer of the Year Awards 2018.

It is now time to look forward. If you are a member of the Data Science Foundation, please participate in the debate to develop a Code of Conduct for the Data Science and Analytics Sector, to help shape the services we deliver and to have your say in how the sector is represented. If you are not yet a member but have a professional interest in data and advanced analytics, become a member and join the debate. Membership is free for individuals: https://datascience.foundation/joinus

Code of Conduct Initial Questions

Please email your answers to email@example.com with the subject line 'Code of Conduct'. Please answer the following questions:

1. Would you support a Code of Conduct for the Data Science and Analytics sector?
2. Should the code focus on professional, ethical or moral standards, or all three?
3. What should be included in the Code of Conduct? Please provide a short description.
4. What services would you like the Data Science Foundation to provide?
5. Would you participate in the debate? (The debate will initially be conducted by email questionnaire and then by online discussion.)

Background Information

About the Data Science Foundation

The Data Science Foundation is a professional body representing the interests of people working in the data science and advanced analytics sector. Our membership consists of both users and suppliers of data services, as well as universities offering data science courses and their students. The Foundation aims to create an active community of data scientists, to provide a platform to share ideas and to support professional development. All members of the Foundation are provided with an online Profile page and a Published By page, which showcases all articles and papers published for peer review.

Aims

The primary aims of the Data Science Foundation are to:

- Create a community of qualified and highly skilled data science professionals
- Provide the data science community with a forum to share ideas and support professional development
- Develop approved professional standards that differentiate members from others working in the data science and advanced analytics sector

The Data Science Foundation is working to:

- Raise the profile of data science in the UK, to educate business people about the benefits of knowledge-based decision making and to encourage firms to make optimal use of their data.
- Launch an education programme which includes seminars covering topics such as 'Helping business people understand big data' and 'Helping data scientists communicate with business people'. An 'Introduction to data science' talk will be offered to schools.
- Improve the way organizations use their data, by helping organizations form partnerships with universities and consultancies.

The website

The Foundation's website has been built as a communications platform, a publishing tool and a means for members to procure expertise or obtain employment.
The site is a source of information for the media and for those interested in learning about data science. We will ensure that the website:

- Contains extensive and accurate information about big data and data science education
- Displays accurate records of corporate, supplier, individual and associate members
- Is the ideal place to find a data science provider and to start a data science project
- Creates employment for members via the CV board
- Publishes the most sought-after job opportunities from leading firms
- Becomes a platform that allows individuals to gain recognition for their expertise and become leading figures within the industry

Code of Conduct

The Data Science Foundation Code of Conduct applies to all members.

Promotion of Good Practices

We hold that data scientists should follow best practices at all times, and should never encourage or suggest that a business or client take actions that are, or could be construed as, criminal or unethical. All members will strive to provide competent services at all times.

Integrity and Honesty

We believe in the value and importance of integrity and honesty in all interactions and transactions, whether providing informal advice, a formal project plan, or visualised data for information dissemination.

Capability and Expertise

The Foundation advocates for more formal guidelines on professional capability and expertise, and works with leading education centres and universities to develop degree courses to this end.

Transparency

The Data Science Foundation advocates for transparency at all levels, both within the Foundation and within partner organisations, educational institutions and government agencies. We believe in being forthright and direct, with no hidden agenda.

Confidentiality

Confidentiality is an essential consideration, particularly with sensitive business data. We adhere to the strictest confidentiality stipulations to safeguard these vital business and organisation assets.
Information created, developed, used or learned in the course of employment with a particular client, business or organisation is considered completely confidential.

Security

All members must ensure that data is secure at all times, safeguarded from all threats, including but not limited to viruses, malware, internal and external hacking attempts, theft and accident. All members must utilise industry-standard software and hardware to ensure data security at all times.

Professional Standards

We require all members of the Foundation to adhere to strict professional standards regarding integrity, honesty, quality, confidentiality and more. We also enforce professional standards in terms of working practice, data quality and standards of evidence. Misuse or misrepresentation of data is not permissible. The Foundation strictly enforces these professional standards, and repercussions can include expulsion from the Foundation.

General Membership Policy

Our general membership policy applies to all members, including corporate members. Members are granted several key benefits, including access to Foundation publications, the ability to join discussions and forums, the opportunity to network with others in the data science industry, and more. The Data Science Foundation offers different membership options, including corporate membership and associate membership, each of which delivers unique benefits and advantages.
Hey Chris, interesting piece. I did some similar work myself some time ago, but I have always wondered whether we should simply indicate how to build a code of conduct (and draft it collectively afterwards) or rather draft something that we hope people will adopt in their daily jobs.
Data Science Foundation will be at Big Data LDN
The Data Science Foundation at Big Data LDN, 15-16 November 2017, Stand 327, Olympia London

Meet us at stand 327 at Big Data LDN on 15-16 November 2017! Find out more about the work of the Data Science Foundation and see how becoming a member would help you make more Data Science Connections.

We are launching the Data Science Writer of the Year Awards 2018. The awards recognise the contribution made by individuals who create and share data science knowledge and understanding. All members of the Data Science Foundation are eligible to participate in the awards. Individual membership is free of charge.

Big Data LDN is a free-to-attend conference and exhibition open to all, hosting leading global data and analytics experts ready to arm you with the tools you need to deliver the most effective data-driven strategy. With content divided into comprehensive sections, you'll have the opportunity to ask the big questions, share ideas with forward-thinking, like-minded peers, and learn from leading members of the data community.

Big Data LDN is back for a second year and is set to be larger than ever in 2017. The two-day event is essential for businesses wanting to deliver a data-driven strategy. Get the latest updates on fast/real-time data, artificial intelligence, machine learning, GDPR, deep learning, self-service analytics and much more. Be in the vanguard of the data revolution: sign up to Big Data LDN and learn how to build a bright data-driven future for your business.

Register free here and visit our stand at the event: https://bigdataldn.com
The Data Science Foundation will be on stand 327 at Big Data LDN. It would be great to meet.
Data Science: Self-learning
Can I become a self-taught data scientist? by Priyam Kakati https://www.quora.com/Can-I-become-a-self-taught-data-scientist/answer/Priyam-Kakati?share=f8b8f366&srid=trpA