Our world is increasingly fuelled by data – an unparalleled amount of data, to be precise. More information is available today than at any other point in human history, and all of that data has value, particularly to businesses, organizations and government agencies. Data makes the world go round today. However, getting at that data and actually putting it to use can be difficult. Moreover, not all information has value, depending on the needs of the user. A means to identify, catalogue, collate and extract useful data from the sea of other information was necessary – data science.
What Is Data Science?
Depending on where you look, you’ll find a number of different definitions for data science. Some believe that it’s merely an offshoot of statistics. Others believe that it’s a combination of a few different knowledge extraction models. Yet others believe that it’s something else entirely. It’s important to note that “data science” isn’t true science in the technical definition of the term. Data scientists aren’t trying to prove or disprove a hypothesis. So, what is data science, then? It’s actually difficult to create a single, cohesive definition for the term, simply because there are so many potential applications of this discipline, from statistics to data modelling to analytics and more.

Perhaps the closest we have come to a unified definition comes from Quora user Drew Conway, a PhD student at NYU, who said, “Data science most often refers to the tools and methods used to analyse large amounts of data. As such, the discipline is an amalgamation of many bits from other areas of research. For tools, the influence primarily comes from computer science, where issues of algorithmic efficiency and storage scalability form the main focus. For analysis, however, the influences are much more varied. Modern methods are borrowed from both the so-called hard sciences (physics, statistics, graph theory) and the social sciences (economics, sociology, political sciences, etc.). Specific classes of techniques that are naturally interdisciplinary are also very popular, such as machine learning.”

Michael Driscoll, a specialist in data, analytics and visualization, has another definition: “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools and materials, coupled with a theoretical understanding of what’s possible.”
The University of California at Berkeley also sums up data science rather well: “The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design.” However, data science is not just “using data”. That’s the end goal, yes, but the discipline focuses on how to first organize and access available data, and then put it to use effectively. That last word is the key here – “effectively”. Data science enables the effective use of vast quantities of information for other purposes, while also enabling the creation of additional data products (which are themselves data, further feeding the cycle). In the end, data science truly isn’t “science”. Data scientists don’t work in the world of academia. Their employment is contingent on the needs of businesses and industries, just as a sales clerk’s position or a particular supplier’s product line is.
So, what do data scientists actually do? Here’s a rough outline:
- Ask questions in order to solve known problems, or to find new solutions to the problems businesses must deal with
- Define data necessary for a particular need
- Collect, store and explore existing data
- Determine the needed analysis type for a particular situation or type of data
- Use algorithms and other tools to parse, clean, quality-check and utilize data
- Transform insights learned into formats usable by non-data scientists (for use by others within the business), including the creation of graphs, charts, infographics and more
- Create software to automate data science tasks based on specific business requirements and data sources
So, data scientists must successfully combine several professional fields, including statistics, software programming, mathematics, research and subject expertise.
A Few Basic Examples of Use Cases
Before we dive too deep down the rabbit hole, let’s consider a few basic real-world examples of data science in action.
- Microsoft Word: Sure, MS Word isn’t the most advanced piece of software on the planet, but the word processing program does show us some very good applications of data science. Consider the fact that the program essentially learns more about its users through interaction – it stores and parses data on its own. Microsoft has also done a great deal of work in building the program’s spellchecking and grammar capabilities. Does it match what you’d get from a professional editor? No. However, it does an excellent job for a computer program (that’s not to say there aren’t better programs available, but Word is the industry standard, and Microsoft’s achievements in data usage are noteworthy).
- Google PageRank: Google might be the world’s uncontested master of all things data, and the search giant’s PageRank function is an ideal example of data science in action. This function essentially uses the number of links pointing at a particular domain (data outside the page itself) to help rank the authority and relevance of different websites (a minimal sketch of the idea appears after this list).
- Facebook: Sure, the big blue social network has access to tons of information, but they put it to use in a number of innovative ways. For instance, Facebook uses the vast amounts of information it has about various users to help make suggestions for new friends and connections. These suggestions are based on patterns of friendships, including among people that you might not even be connected with on the site, and they can be extremely accurate (sometimes frighteningly so).
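To make the PageRank idea a bit more concrete, here is a minimal Python sketch of the classic iterative formulation on a toy link graph. This is only an illustration of the underlying principle, not Google’s actual implementation; the domain names are invented and the damping value is just the textbook default.

```python
# A minimal sketch of the idea behind PageRank -- not Google's production
# algorithm, just the classic iterative formulation on a toy link graph.
links = {
    "a.com": ["b.com", "c.com"],  # a.com links out to b.com and c.com
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}

damping = 0.85                    # commonly used damping factor
pages = list(links)
rank = {page: 1 / len(pages) for page in pages}  # start with a uniform rank

for _ in range(50):               # iterate until the ranks stabilize
    new_rank = {}
    for page in pages:
        # A page's rank is fed by every page that links to it,
        # divided by how many outbound links the linking page has.
        incoming = sum(
            rank[src] / len(out)
            for src, out in links.items()
            if page in out
        )
        new_rank[page] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(rank)  # c.com scores highest: two pages link to it
```

Pages with more (and better-ranked) inbound links float to the top, which is exactly the “data outside the page itself” idea described above.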
These are just a few relatively basic examples of data science in action in the world around us, with programs and websites used every day.
Why Is Data Science Necessary?
Information is vital to every aspect of human existence, and it has been since the dawn of our species. A hunter-gatherer had to rely on his or her knowledge (information) to make decisions about various plant species (would it be deadly to eat?) as well as the animals hunted for food and pelts. Early agriculturalists had to rely on information to make informed decisions about planting, harvesting and preparation. Putting data to use has been with us since time out of mind – that hasn’t changed. What has changed, though, is the volume of data available.

O’Reilly makes an excellent point here: “The question facing every company today, every start-up, every non-profit, every project site that wants to attract a community, is how to use data effectively – not just their own data, but all the data that’s available and relevant. Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well defined kinds of analytics. What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into tractable form, making it tell its story, and presenting that story to others.”

Given that humans have been parsing, storing, organizing and using data for millennia, it’s natural to wonder why a new discipline is even necessary. Isn’t statistics enough? Isn’t current database software sufficient? To put it simply – no. According to a story in VentureBeat, “Today’s modern business needs to manage far more data than ever before, and few have the talent on staff for the job. Projections indicate that the (data scientist) market will experience meteoric growth in the next several years.” GigaOM backs that up: “Every organization will need someone wearing the data scientist hat just like every organization has people responsible for product, sales, marketing and support.” The rise of Big Data and its increasing importance to businesses of all sizes, in every industry, and even the government sector, has made data science not only vital, but one of the fastest growing fields in the world.
What Is Big Data, Really?
You hear the term “Big Data” thrown around a lot today, but what does it mean, really? Organizations have had access to massive amounts of information for a very long time. Oil companies are prime examples of this, but there are many others, including massive retail chains like Wal-Mart. What makes today different in terms of the amount of information available? What makes it warrant a specific designation of “Big Data”?

According to Oxford University Press’ Oxford Dictionary, Big Data is “data sets that are too large and complex to manipulate or interrogate with standard methods or tools.” That tells us a little bit, but it’s not the full story. Forbes magazine weighs in with a slightly different take on the question. In a story written for Forbes, Lisa Arthur defines Big Data as “a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.” That’s a bit more illuminating. Arthur goes on to state that, “In defining big data, it’s also important to understand the mix of unstructured and multi-structured data that comprises the volume of information. Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and typically, it’s very text-heavy. Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example is web log data, which includes a combination of text and visual images along with structured data like form or transactional information.”

So, unstructured data could be free-form text entered into a form field (either online or off), while multi-structured data could be derived from scraping a website.
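As a small illustration of Arthur’s web log example, here is a minimal Python sketch that turns one line of multi-structured log data into a structured record. The log line and field names are invented for the example; real log formats vary.

```python
import re

# One line of Apache-style access-log data (a made-up example): an IP address,
# a timestamp, free text and transactional fields all mixed together.
log_line = '203.0.113.7 - - [12/Mar/2015:10:15:32 +0000] "GET /shoes?q=running HTTP/1.1" 200 5316'

pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

match = pattern.match(log_line)
if match:
    record = match.groupdict()          # now a structured record
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    print(record)
```

Once parsed this way, the same messy line can be stored, queried and aggregated like any other structured data.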
Several characteristics go into making Big Data, commonly described as the Five Vs of Big Data:
- Volume (the sheer amount)
- Velocity (the speed at which information is generated and aggregated)
- Variety (types of data)
- Veracity (authenticity of data)
- Value (what the data is actually worth to the company)
However, it’s very important to understand that the meaning of Big Data will vary from business to business and organization to organization. In each instance, this information will do something different, offer something different, and mean something different.
The Explosion of Data Available Today
As mentioned, there’s more information available today than at any point in history, and that amount is growing exponentially. In a sense, it feeds itself: as more data is explored, collated, categorized and packaged, more information is created. To truly understand not only what data science is but why it’s become such an essential skillset and discipline in the modern world, it’s important to understand the lifecycle of data – where it comes from, how that information is used, and more.

We’ll begin by exploring where data originates. It comes from everywhere. Every single search query through Google or Bing is data. Every picture uploaded is data. Every Vine video created is data. You get the idea. However, it’s not limited to data made available online. Data is literally everywhere. O’Reilly states that, “Data is everywhere: your government, your web server, your business partners, even your body. While we aren’t drowning in a sea of data, we’re finding that almost everything can (or has) been instrumented.”

For a good example of how much data is being generated today, consider a simple Amazon recommendation, called an “also bought”. Let’s say you go to Amazon searching for a new washing machine. You’re looking for the right deal, the right features, and the right size. You can sort your options, compare different models and more. You arrive at what seems to be the perfect solution, and there near the bottom of the page is a list of items that other customers “also bought” when viewing or purchasing the same model of washing machine. Amazon had to first identify that information, then collate and store it, and then serve it up to you based on nothing more than your online search query through their website.

You leave data behind you with every single action you take online. Using Google to search for a dog training program? That query doesn’t disappear when you close the browser. Google stores your information and uses it. And it’s more than just your query terms: Google is also storing geo-location information and a great deal more. Now apply that to mobile apps, which leave an even richer trail of data behind (consisting of electronic data as well as, possibly, audio and video information, exact geo-location data, and a great deal more). To go beyond the online and device scenario, consider your frequent shopper card. You use it to get access to important discounts on your groceries or fuel, but every single swipe generates an immense amount of information about your shopping habits, your preferences, the specific retail store locations where you prefer to spend your time… You get the picture.
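Returning to the Amazon “also bought” example, here is a minimal sketch of how such a list can be derived from co-purchase data. The baskets and product names are invented, and Amazon’s real system is far more sophisticated; this only shows the basic co-occurrence counting idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order histories: each inner list is one customer's basket.
orders = [
    ["washing machine", "detergent", "hose kit"],
    ["washing machine", "detergent"],
    ["washing machine", "dryer"],
    ["detergent", "dryer"],
]

# Count how often each pair of products appears in the same order.
pair_counts = Counter()
for basket in orders:
    for a, b in combinations(sorted(set(basket)), 2):
        pair_counts[(a, b)] += 1

def also_bought(product, top_n=3):
    """Products most often purchased alongside the given one."""
    related = Counter()
    for (a, b), count in pair_counts.items():
        if a == product:
            related[b] = count
        elif b == product:
            related[a] = count
    return [item for item, _ in related.most_common(top_n)]

print(also_bought("washing machine"))  # ['detergent', 'hose kit', 'dryer']
```

Every basket a customer leaves behind feeds counts like these, which is precisely why each action taken online becomes valuable data.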
Everything we do today generates data.
However, all that information would be worthless if there wasn’t a way to store it. This is the very beginning of data science. Storage must be more than just a data dump into a database somewhere online, though. Our storage solutions have become ever more sophisticated over time: we’ve moved from paper ledgers to electronic spreadsheets to incredibly intricate digital storage systems capable of holding immense amounts of information. Storage is not the end of it, though. Moore’s Law (concerning the advance of computer technology) can be applied equally well to the growth of data. As stated by O’Reilly, “The importance of Moore’s Law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put in it.” It’s the same concept behind our desire for larger homes: we want larger homes to have more space, but that space inevitably becomes filled with possessions – we buy new furniture, new dishes, new computers, new televisions. This brings us to the next hurdle – the more data you store, the more sophisticated your data analysis solutions must be. Things were simple enough when businesses ran on physical ledgers and the owner’s knowledge of customers, but today things are much more complicated. This is the very foundation of data science – increasing the ability to analyse and use the ever-growing volume of data available to us.
Making Data Useful
Data without meaning is useless. Data without context is pointless. Data without structure is unusable. Today’s organizations must do more than just warehouse information, and accessing, analysing, and using that information requires more than just basic software. In order to make data useful, it must first be analysed, or “conditioned”. Really, this is nothing more than separating the wheat from the chaff, so to speak: you need to analyse the data and then determine what is useful and what is not. For a company selling athletic shoes, your purchase of gardening tools is probably irrelevant. However, your purchase of a gym membership would be valuable information. Your search on Google for running tracks near your home would also be important data. The problem here is that while a great many new ways of delivering machine-consumable data have been created (Atom feeds and microformats, for instance), much of the data found in the wild is very messy, very chaotic. This type of information cannot simply be inserted into an XML file; there’s simply too much garbage mixed into the data to make it usable by software. So, data conditioning also includes clean-up – for instance, removing the HTML code from data scraped from a website so that only the pertinent information remains, and none of the underlying code of the page.
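As a small illustration of that clean-up step, here is a sketch that strips HTML markup from a scraped snippet using Python’s standard library. The snippet is invented; for serious scraping work, a dedicated parser such as BeautifulSoup is usually the safer choice.

```python
from html.parser import HTMLParser

# A minimal tag-stripper built on the standard library: it keeps only the
# text between tags and discards the markup itself.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collect only the visible text, not the underlying page code.
        if data.strip():
            self.chunks.append(data.strip())

scraped = "<div class='price'><b>Washing machine</b> &ndash; $499</div>"
extractor = TextExtractor()
extractor.feed(scraped)
print(" ".join(extractor.chunks))  # Washing machine – $499
```

After a pass like this, only the pertinent information remains, ready for the next stage of the pipeline.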
Once the data has been conditioned and cleaned, it’s time to move to the next step, which is basically quality control. For instance, let’s say you were interested in gathering physical mailing addresses for potential customers who may be interested in your new product. You could gather much of that information online, but it may not be complete. You might have the first and last name, as well as the street address, of a particular person, but be missing their postcode (this is just a very basic example). You may also have incongruous information – that is, data that doesn’t seem to relate to your goal. According to scientists, the depletion of the ozone layer was discovered because someone decided to take a look at the incongruous data that was gathered, rather than discarding it (deciding whether data is incongruous because of gathering errors, or whether there’s another story underlying that incongruity, is part of the data scientist’s job). Now add in the problem of human language. This can be considerable, even when you’re dealing with just one language. For instance, parsing data in English requires a significant understanding of elements that relate directly to the particular task. O’Reilly uses the following example, which is an excellent illustration of the complexities involved with quality assurance in data science.
Roger Magoulas, head of O’Reilly’s data analysis group, recently had to search for Apple job listings that required candidates to have geolocation skills. The problem was that Magoulas needed not only an understanding of job listing formats, but also the ability to parse English and to separate Apple-specific listings from the wide range of other job postings out there (as well as from other information related to geolocation but not pertinent to Apple or employment with the company). This is a perfect example of the difficulty in parsing and quality-checking data, and it extends well beyond the search for employment with Apple. Consider running a query for information on Python, the programming language: Google returns an immense number of hits for the snake instead, leaving you to sort out the right answers for yourself. And this is just for English – add in the hundreds of other languages spoken around the world, and you begin to get a sense of just how daunting and difficult it can be. Software is helpful here, but it’s not always the best solution. Often, data scientists are required to lean on their own understanding (human intelligence versus machine intelligence). Of course, this brings in the question of additional manpower and costs: a single data scientist simply doesn’t have the ability to sort through 10,000 potential listings to determine relevance, not in any realistic way. That means hiring help (which admittedly can be done relatively easily through sources like Amazon’s Mechanical Turk program or sites like Fiverr.com).
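Returning to the mailing-address example above, a basic quality-control pass might look like the following sketch. The records and field names are hypothetical; note that incomplete records are set aside rather than discarded, in the spirit of the ozone-layer anecdote.

```python
# A toy quality-control pass over the mailing-list example:
# hypothetical records, some missing the postcode field.
records = [
    {"name": "A. Smith", "street": "12 High St", "postcode": "OX1 2JD"},
    {"name": "B. Jones", "street": "4 Mill Lane", "postcode": ""},
    {"name": "C. Patel", "street": "89 King's Rd", "postcode": "SW3 4LY"},
]

required = ("name", "street", "postcode")

complete, incomplete = [], []
for record in records:
    # A record passes only if every required field is present and non-empty.
    if all(record.get(field, "").strip() for field in required):
        complete.append(record)
    else:
        incomplete.append(record)  # set aside for follow-up, not thrown away

print(f"{len(complete)} usable, {len(incomplete)} need follow-up")
```

Rules like these catch the easy cases; the harder judgement calls, such as ambiguous language or genuinely incongruous data, still fall to the data scientist.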
Using the Data
Once the data has been gathered (or harvested, if you prefer), cleaned and analysed, it must be put to use. This is done in many different ways, and will depend largely on the business in question – its needs and initiatives will inform the ways that data is used. The first step here is visualization, which can be done in any number of ways, including Venn diagrams, charts, graphs, tables and more. Again, the visualization format will need to fit the business, as well as the initiative for which the data is being analysed. The visualization format for a field map of a company’s competition would look very different from the format for data depicting potential customers within a 10-mile radius of a company’s brick and mortar shop, but both are examples of what’s possible with data science. That said, perhaps the most common visualization format is a graph. Analysis generally produces information in numeric format. That’s understandable by machines, but not of much use to human beings; those numbers need to be plotted out in a visual format, giving meaning to the information highlighted. For example, declining sales numbers might be interesting in a purely numeric format, but they become much more compelling when given life as a graphic depicting the rapid decline of a company’s profits (a minimal plotting sketch follows the list below). In fact, visualization is so important to the data scientist that many employ it at each stage of the process. For instance, one might use scatter plots to get a sense of what’s interesting in the information gleaned before beginning an analysis. Another might plot their data to get a sense of how skewed it is, or how much false information is included, before conditioning. Even animation can be included here to get a sense of how information (and corresponding real-world trends) changes over time. Some of the programs used to create visualizations include the following:
- Many Eyes (IBM)
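To make the graph example concrete, here is a minimal sketch using matplotlib, a common Python plotting library. The monthly sales figures are invented purely for illustration.

```python
import matplotlib.pyplot as plt

# Invented monthly sales figures, just to show the declining-sales
# example from the text as a simple line chart.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120_000, 112_000, 98_000, 91_000, 80_000, 72_000]

plt.plot(months, sales, marker="o")  # the downward slope tells the story
plt.title("Monthly Sales")
plt.ylabel("Revenue ($)")
plt.tight_layout()
plt.show()
```

A handful of lines like these often communicates a trend more forcefully than the underlying table of numbers ever could.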
After data visualization comes data implementation – actually putting the information gleaned to use. This is generally not the role of the data scientist. Once the data scientist has created a visualization of the information, it is handed off to others within the organization, and the data scientist moves on to the next project. Other teams or professionals will take the information and use it to move the company forward towards its objectives.
In Conclusion: The Future for Data Scientists
What do the rise of Big Data and the increasing use of data science mean for the world at large? They have a number of impacts, but the most salient is this: every business will need someone to fill the position of data scientist, even if it doesn’t go by that name. Every business, every corporation, every organization and even government agencies must have qualified data scientists capable of transforming raw information into something that can be used to move the organization forward in multiple ways. According to Glassdoor.com, data science jobs have reached “critical mass” in many ways. Data scientist is currently the 15th highest-paying job in demand, with almost 3,500 openings and an average base salary of over $100,000 annually (in the US – other nations vary). It is also currently ranked as the 9th best job in the US. This is not about to change, either. A study by the McKinsey Global Institute stated that “a shortage of the analytical and managerial talent necessary to make the most of Big Data is a significant and pressing challenge.” It goes on to estimate that up to 5 million jobs in the United States alone will require skills in data science by the year 2018, including positions for data analysts, data engineers, statisticians and data scientists. Data science professionals have unique abilities and skills as well: they combine the spirit of entrepreneurship with patience, mathematical skills with the ability to explore and make connections, and computer science skills with an understanding of human behaviour.
UC Berkeley states, “Virtually every sector of the economy now has access to more data than would have been imaginable even a decade ago. Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it. The question facing every organization that wants to attract a community is how to use data effectively – not just their own data, but all the data that’s available and relevant.” Even the New York Times weighed in on the burgeoning field of data science, saying, “This hot new field promises to revolutionize industries from business to government, health care to academia.” Clearly, Big Data is here to stay, and the need for professionals capable of transforming that raw information into something that can be used to further organizational goals, create community, foster customer engagement and more is immense.