Human Resources for Big Data Professions: A Systematic Classification of Job Roles and Required Skill Sets
Andrea De Mauro *
Department of Enterprise Engineering
University of Rome Tor Vergata
Via del Politecnico, 1, 00133, Rome, Italy
* Corresponding author
Department of Civil and Mechanical Engineering
University of Cassino and the Southern Lazio
Via G. Di Biasio, 43, 03043, Cassino (FR), Italy
Department of Civil and Mechanical Engineering
University of Cassino and the Southern Lazio
Via G. Di Biasio, 43, 03043, Cassino (FR), Italy
School of Business and Management at Lappeenranta
University of Technology (LUT), Finland
The rapid expansion of Big Data Analytics is forcing companies to rethink their Human Resource (HR) needs. At the same time, it is unclear which types of job roles and skills constitute this area. This study aims to clarify the heterogeneous nature of the skills required in Big Data professions by analyzing a large number of real-world job posts published online. More precisely, we: 1) identify four Big Data 'job families'; 2) recognize nine homogeneous groups of Big Data skills (skill sets) that are being demanded by companies; 3) characterize each job family with the appropriate level of competence required within each Big Data skill set. We propose a novel, semi-automated, fully replicable analytical methodology based on a combination of machine learning algorithms and expert judgement. Our analysis leverages a significant number of online job posts, obtained through web scraping, to generate an intelligible classification of job roles and skill sets. The results can support business leaders and HR managers in establishing clear strategies for the acquisition and the development of the right skills needed to make the best use of Big Data. Moreover, the structured classification of job families and skill sets will help establish a common dictionary to be used by HR recruiters and education providers, so that supply and demand can more effectively meet in the job marketplace.
Big Data, Business Intelligence, Human Resources Management, Machine Learning, Topic Modelling.
The phenomenon of Big Data - defined as those "information assets characterized by such a high volume, velocity and variety to require specific technology and analytical methods for its transformation into value" - has reached a prominent position on the agendas of business managers around the world. Multiple previous studies have shown the crucial role of human capital in the success of companies, especially those characterized by a high degree of technology intensity [2-6]. Consequently, firms need to quickly secure the appropriate competencies in the area of Big Data: this race to acquire the right talent shows no sign of slowing down, while the labor market is unable to cope with an exponentially increasing demand. Emerging Big Data job roles are making a rapid entry into the list of openings in the hands of corporate recruiters. However, the description of skills and responsibilities of Data Analytics jobs is often nebulous and companies tend to rely on subjective interpretations of their organizational needs. For instance, we have recently witnessed the establishment of the nearly-mythological role of Data Scientist, a professional able to individually cope with most of a company's analytical needs. This simplistic vision clearly neglects the complexity of all those varied and specific skills that are required to gather information, organize it and transform it into insights that can produce an economic advantage. In fact, there is a clear research gap regarding the formal definition of the most prominent Big Data jobs and of the required educational needs [7,8].
In this study, we aim to bridge the above-mentioned gap in the current knowledge by bringing clarity over this vital aspect of Big Data, i.e. its professional workforce and the most required skill sets. In particular, we provide a data-based description of job roles and skills that companies need in order to make use of Big Data. This enriches the literature by offering a structured framework for further research on competency requirements in this business-relevant field. The results also contribute to practice by providing an overarching understanding of what constitutes the contemporary professions around Big Data in organizations, helping HR and other managers to search for better recruitments and develop human capital towards desired directions.
The main data input of this study was a mass-download of a considerable number of online job posts, obtained by means of web-scraping techniques. Web-scraping consists of systematically collecting web pages through computer software. In our case, we have acquired more than 2,700 job posts which contained the keywords 'Big Data' in either the title or the description. We have then applied a set of text mining and classification algorithms in order to recognize which skills are required within each job type and to what extent. The final results confirm that 'Data Scientist' is an umbrella term that loosely describes the complex set of interconnected skills required by companies exploiting Big Data Analytics. Business Managers and HR professionals can use our 'job family vs. skill set' classification when designing job posts and when assessing the overall sufficiency of a company's human capital with regard to Big Data.
The paper is organized as follows: in Section 2, we present previous works on Big Data and Data Scientists, stripping away the many myths related to Data Science as a discipline. In Section 3 we describe the 4-step methodology we have used to acquire the job posts and systematically analyze their content, while in Section 4 we discuss the obtained results and provide a description of the job families we identified and their related skill sets. Section 5 summarizes our conclusions and suggests future extensions to the current work.
2. Related work
2.1 Big Data
The term 'Big Data Analytics' has become popular within IT communities and scholarly research as of 2011. Its meaning is the result of a disorganized evolution and the merging of several more traditional concepts such as 'Very Large Databases' and 'Data Mining'. The vagueness of the concept of Big Data has resulted in a proliferation of multiple, sometimes contradictory definitions [10,11]. However, its essential characteristics (Information, Technology, Methods and Impact) have been identified and provide a conceptual framework for comprehending its overall meaning [1,12]:
Information: Our society is witnessing an unprecedented growth in information availability. In particular, as noticed by Hilbert, over the last two decades we have seen an exponential increase of information flow, stock and computational power. The characteristics of information are also changing quickly: data is now more 'personal', meaning that most of it is deemed to be created and consumed directly by human beings, as they interact with other individuals and machines through their personal devices. Data is now also more varied in type than it was in the past: traditional numeric datasets are becoming a small portion of the entire digital universe, which is acquiring more unstructured data types, such as audio/video, images and human speech. It is important to note that data and information refer to separate but adjacent concepts. In fact, they constitute the base components of both the Knowledge Pyramid and the Data-Information-Knowledge-Wisdom hierarchy. Within this study, we use the word data when referring to a raw collection of values, normally generated through recording of events, while information indicates the next level of contextualization and structure added to data with the purpose of enabling human cognition.
Technology: The development of increasingly cheaper and more powerful technologies for storing, transmitting and processing data is one of the fundamental enablers of the rise of Big Data. The storage capacity of integrated circuits has grown exponentially over the last 50 years, as the density of transistors has nearly doubled every 24 months, following Moore's law. Processing data has become faster and cheaper thanks to the evolution of distributed computing and the availability of faster networks. For instance, a popular technology connected with Big Data today is Hadoop, an open source framework that lets clusters of dispersed machines co-operate in order to achieve higher performance through parallel computing. Another feature of Big Data Technology is the emergence of cloud computing, which allows companies to keep fixed costs at a minimum, thanks to the pay-as-you-go financial model of cloud-based services.
Methods: The usage of Big Data entails the adoption of novel analytical methods for the transformation of big quantities of information into insights and, hence, value for the business. Furthermore, such data is often ill-structured, embedding diverse forms of coupling relationships whose modeling is critical, yet very challenging. The peculiar features of Big Data force practitioners to rethink their traditional data analysis toolkits and deepen their expertise in statistics and machine learning. A partial list of the more recent analytical techniques we most frequently encounter in Big Data applications includes: Cluster analysis, Genetic algorithms, Natural Language Processing, Speech and Image recognition, Neural Networks, Predictive modelling, Regression Models, Social Network Analysis, Sentiment Analysis, Signal Processing and Data Visualization [19-21].
Impact: The rise of Big Data has pervasively impacted a myriad of aspects of human life, ranging across science, economics, culture and society, in both positive and negative ways. Businesses are given the opportunity to create economic value through the analysis of Big Data. According to Davenport, Big Data can benefit companies through cost reduction, improvements in decision making and improvements in products and services. As a result, companies that have more aggressively shown their interest in Big Data tend to be more productive than their industry peers. Big Data also carries widespread concerns of adverse impact on society, companies and individuals. The biggest concern is related to privacy: datasets carrying digital traces of a person's life can be used to uncover private details or even predict the future behaviour of individuals. This makes privacy a primary concern for those companies who want to establish a sustainable use of Big Data when interacting with consumers.
Information, Technology, Methods and Impact correspond to the most critical components of Big Data and are explicitly called out in the consensual definition which we reported in the introduction [1,12]. The emergence of novel approaches across each of these components has brought considerable challenges for human resources management within existing companies. The advent of new sources of data, coupled with the renewal of methods and technologies used for business-impacting analytics, requires the development of new interdisciplinary competencies spanning from IT skills to business domain knowledge and communication skills. This poses a talent challenge for companies, seeking to upgrade their human capital, and for educators, who need to prepare the future generation of Big Data professionals and Analytics-savvy managers.
2.2 Myths and truths on Big Data jobs
The surge in popularity of the term 'Big Data' was shortly followed by the establishment of another popular expression, strongly connected with the previous one, which had an exponential increase in popularity as of late 2012: 'Data Science'. Figure 1 shows the steep increase in web searches including the terms 'Data Science' and 'Data Scientist', obtained through the Google Trends tool.
Figure 1: Popularity of 'Data Science' and 'Data Scientist' among web users between 2007 and 2016 (*: 2016 corresponds to average popularity through the month of October '16 only). Values are proportional to the number of queries run by Google Search users, normalized to the highest point in the chart. Source: Google Trends
Unlike more traditional disciplines of science, the boundaries of this field have not been formally clarified from an academic point of view. Provost and Fawcett suggest two reasons behind the lack of clarity around the concept of data science. Firstly, data science is strongly associated and confused with the concepts of Big Data and Data-Driven Decision-making (DDD). Secondly, this scientific discipline is still at the stage of being more practical and experimental rather than theoretical and methodological. At this point in the development of the discipline, there is a natural tendency for people to confuse the definition of the field with the description of what the practitioners of the field - data scientists, in this case - do.
The literature provides multiple, partially concurring descriptions of the traits Data Scientists should display. We notice that these items could be conceptually split into two groups. The first comprises the abilities which contribute to the transformation of raw data into insights, with an emphasis on technology and methods: this group includes more technical skills, such as the ability to use Big Data tools, and proficiency in statistics and machine learning methods. The focus of this first group is essentially on data and systems and requires "harder skills" and specific data and analytics knowledge. The second group consists of the capacity to transform insights into organizational value creation, involving the right people and core business processes in the organization. This group includes soft skills, mainly in the area of communication, general management techniques and knowledge of a specific business domain. Table I includes a non-exhaustive list of the many 'traits' a data scientist may exhibit.
|Focus||Data Scientist Traits||Source|
|Big Data Tools Expert||[7,8,25]|
|Data Ethical Manager|||
|Data Manager and Strategist||[7,8,30]|
Table I: Traits displayed by Data Scientists, as encountered in existing literature.
In our opinion, the blurred understanding around the concepts of data science and data scientist is having two adverse consequences. The first is the creation of a misleading myth around the role of data scientist, which is seen as the single most important actor in the process of creating economic value through the usage of data. The second consequence, linked with the first, is the disregard for the other fundamental players contributing to a mature exploitation of data in a firm. As also noticed by Miller, the role of Data Scientist alone does not account for the full range of talent requirements companies are experiencing within the area of Big Data Analytics. The fundamental heterogeneity of the many facets displayed in Table I is an indication of the confusion generated over this concept, which has led many roles and skills to be indiscriminately grouped under the same umbrella term of 'Data Scientist'. To address this, the current study aims to provide clarity on the job roles and skills companies need to develop and retain in order to fully reap the benefit of Big Data.
In order to produce a data-based classification of Big Data job families and skill sets, we have designed a 4-step methodology by combining a series of existing analytical practices. First, we have downloaded a substantial amount of related online job posts, by means of web scraping techniques. Second, we have analyzed the words occurring in the job post titles in order to categorize them within a number of job families through expert judgment. Third, we have identified relevant skill sets by applying a topic modeling algorithm on the content of the job posts. Fourth and final step was to assess the relative importance of skill sets within job families by analyzing the average degree of presence of each skill set within job posts of each family. In the following sections, we describe every step in more detail.
3.1 Web Scraping
The World Wide Web contains a vast amount of information in various forms and levels of structure. In order to leverage such information, a computer program or an automated script (i.e. a web crawler) can systematically retrieve webpages and store them in a central location. Web scraping consists of looking for specific data elements of interest in a series of semi-structured web pages, extracting them through crawlers and storing them in more structured data sets. In order to retrieve a large number of online job posts related to Big Data, we have used a web scraping tool, as previously done by Capiluppi and Baravalle, to retrieve the title and the job description of every job post and create a database as the basis for the subsequent analytical steps in this study.
A number of commercial and open-source web crawlers and web scraping tools are available; a thorough survey of the existing tools goes beyond the scope of this paper [38,39]. For this study, we have decided to use Portia, a visual web scraper that can be configured through a web-based guided procedure. We tested Portia on a number of online job websites in order to assess the feasibility of retrieving a relatively clean list of relevant job post titles and descriptions, considering the complexity of the web page structure and the efficacy of the search functionality.
We have compiled a list of online job websites, annotating their characteristics in terms of quantity of Big Data-related job posts, geographic scope and overall feasibility of web scraping. Table II shows the results of our assessment across job post websites. The last column reports our qualitative indication of web scraping feasibility: this depends on the level of standardization of job post pages HTML structure and is linked with the quality of scraping results obtained through Portia. After assessing these features we have selected Dice.com as a source of our study as it is the website that presented the best relative assessment across the parameters mentioned above.
|Website||# of relevant posts||Geographic scope||Web scraping feasibility|
|Infojobs||100+||Italy, Spain, Brazil||⚫⚫⚫|
Notes: The number of dots in the last column indicates how feasible it was to scrape from each website (⚫ = it is not possible to download quality results; ⚫⚫ = some elements of text can be downloaded with varying quality; ⚫⚫⚫ = nearly all posts can be successfully retrieved)
Table II: Website selection matrix.
We have let Portia run a daily web scraping session over all job posts carrying the exact phrase 'Big Data' within title or description for around 2 continuous months during the fall of 2015. After removing all duplicate and incomplete entries we were left with a data set of 2,786 job posts, which we have used as an input for the analysis of job families and skills.
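The cleaning step described above (keeping only complete, unique posts that carry the phrase 'Big Data') can be illustrated with a minimal Python sketch. It is not the authors' actual procedure: the record layout (dictionaries with `title` and `description` keys), the `JobPost` class and the `clean_posts` helper are all assumptions made for illustration, and duplicates are assumed to be exact matches on title and description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobPost:
    title: str
    description: str


def clean_posts(raw_posts, phrase="big data"):
    """Keep complete, unique posts that mention the phrase in title or description."""
    seen = set()
    cleaned = []
    for post in raw_posts:
        title = (post.get("title") or "").strip()
        description = (post.get("description") or "").strip()
        if not title or not description:
            continue  # incomplete entry: one of the two fields is missing
        if phrase not in title.lower() and phrase not in description.lower():
            continue  # the exact phrase does not appear anywhere in the post
        key = (title.lower(), description.lower())
        if key in seen:
            continue  # duplicate, e.g. the same post collected on a later daily run
        seen.add(key)
        cleaned.append(JobPost(title, description))
    return cleaned


# Hypothetical scraped records, for illustration only.
raw = [
    {"title": "Big Data Engineer", "description": "Hadoop cluster administration"},
    {"title": "Big Data Engineer", "description": "Hadoop cluster administration"},
    {"title": "Data Analyst", "description": ""},
    {"title": "Java Developer", "description": "Build Big Data pipelines"},
    {"title": "Office Manager", "description": "Front desk duties"},
]
posts = clean_posts(raw)
```

A real pipeline would probably need fuzzier duplicate detection, since the same opening can be re-posted with small wording changes.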
3.2 Identification of Job Families
After retrieving the full list of job posts we have applied basic text mining techniques and expert judgment in order to identify the essential Big Data job families. In order to do so, we have calculated all possible couples of adjacent words (bigrams) appearing in the job titles and we have sorted them by decreasing number of occurrences. Then we have collectively reviewed the list of the most frequent bigrams: by considering the type of roles we have found in literature (see section 2.2) we were able to recognize 4 essential groups of job roles:
- Business Analysts (Business-facing analysts, project/program managers, Business advisors)
- Data Scientists (Quantitative analysts, Statisticians, Modelers)
- Developers (BI coders, Machine learning implementers)
- System Managers (Architects, Infrastructure Admins)
For each job role, we have identified the most frequent bigrams: Figure 2 shows the word cloud with the top 50 words recurring in the job titles we have analyzed while Table III reports the top bigrams falling within each identified job family. On the whole, we were able to categorize 69% of the total list of downloaded job posts into non-ambiguous families.
Figure 2: Word cloud showing the top 50 words recurring in the Job Title. The font size of each word is proportional to the number of occurrences of each word.
|Business Analyst||Data Scientist||Developer||Engineer|
|Project Manager||Data Engineer||Software Engineer||Data Architect|
|Business Analyst||Data Scientist||Java Developer||DevOps Engineer|
|Product Manager||Data Analyst||Hadoop Developer||Solution Architect|
|Program Manager||Data Consultant||Software Developer||Systems Engineer|
Table III: Top occurring bigrams in job titles, grouped by job family.
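The bigram counting behind Table III can be sketched in a few lines of Python. The sample titles are invented, and `FAMILY_BIGRAMS` is only an illustrative subset of Table III; in the study the assignment of bigrams to families was made through expert judgment, so the automatic rule in `assign_family` (accept a title only when exactly one family matches) is an assumption, not the authors' procedure.

```python
import re
from collections import Counter


def words_of(title):
    """Lower-case a job title and split it into alphabetic words."""
    return re.findall(r"[a-z]+", title.lower())


def title_bigrams(titles):
    """Count every pair of adjacent words across all job titles."""
    counts = Counter()
    for title in titles:
        words = words_of(title)
        counts.update(zip(words, words[1:]))
    return counts


# Illustrative subset of Table III; the real mapping relied on expert review.
FAMILY_BIGRAMS = {
    "Business Analyst": {("project", "manager"), ("business", "analyst")},
    "Data Scientist": {("data", "scientist"), ("data", "analyst")},
    "Developer": {("java", "developer"), ("hadoop", "developer")},
    "Engineer": {("data", "architect"), ("devops", "engineer")},
}


def assign_family(title):
    """Return the single matching family, or None if none or several match."""
    words = words_of(title)
    grams = set(zip(words, words[1:]))
    matches = [family for family, keys in FAMILY_BIGRAMS.items() if grams & keys]
    return matches[0] if len(matches) == 1 else None


titles = [
    "Senior Data Scientist",
    "Data Scientist - Big Data",
    "Hadoop Developer",
    "Big Data Hadoop Developer",
    "Data Architect",
]
bigram_counts = title_bigrams(titles)
```

Titles that match no family, or more than one, return `None`, which mirrors the idea that only part of the posts could be placed into non-ambiguous families.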
4.1 Identification of Skill sets
The objective of the third phase of the process was to cluster skills within homogeneous groups, which we call skill sets. In this context, homogeneity refers to the reasonable assumption that skills belonging to the same skill set (like 'risk management' and 'planning' within the skill set of 'Project Management') are more likely to appear together in the same job descriptions. We can draw an immediate analogy with topics within text documents: each document contains homogeneous groups of keywords that characterize the topics dealt with in the document. It is important to note that multiple skill sets, in different proportions, can be required by one single job role. Hence, a traditional clustering technique based on a mixture model (such as k-means) would not suffice to represent the complex set of competency requirements included in job posts. Instead, we needed to rely on mixed-membership models, where the assumption that a unit belongs to a single cluster is violated. For the sake of identifying skill sets within job posts, we decided to adopt the mixed-membership model Latent Dirichlet Allocation (LDA), which has proven to work effectively at analyzing user-generated content like job posts.
LDA uses Bayesian Estimation Techniques in order to infer a vector representing the degree of membership (topic proportion) of each element (document) to each group (topic). By applying LDA, each topic can be seen as a distribution over the dictionary of words included in the corpus of the documents under study. The list of the most common words within a topic (keywords) can be used by an expert to deduce a meaningful description of the topic. For the sake of the current study, we have applied LDA on the description of the job posts: the 'keywords' referred to 'job skills' while the concept of 'topic' was substituted by the one of 'skill sets'.
As suggested by Moro et al. , in order to keep the scope within a manageable list of skills, we have defined a dictionary that encompasses the more common terms which could be unambiguously linked to relevant skills within the domain of Big Data Analytics. We have again used the R package 'tm'  in order to run LDA on job descriptions and retrieve the skills required in Big Data jobs.
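Restricting job descriptions to a predefined dictionary of skill terms can be sketched as follows. The dictionary entries and the `extract_skills` helper are hypothetical (the study's actual dictionary is not reproduced here), and Python is used for illustration only, whereas the study relied on R.

```python
import re

# Hypothetical skill dictionary; multi-word terms are allowed.
SKILL_DICTIONARY = ["machine learning", "project management", "hadoop", "sql", "python"]


def extract_skills(description, dictionary=SKILL_DICTIONARY):
    """Return every dictionary term found in the text, once per occurrence."""
    text = description.lower()
    found = []
    for term in dictionary:
        # Lookarounds stop 'sql' from matching inside a longer token such as 'nosql'.
        pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
        found.extend([term] * len(re.findall(pattern, text)))
    return found


skills = extract_skills("Python and SQL; Hadoop, NoSQL, machine learning. SQL tuning.")
```

Each description is thereby reduced to a list of skill terms, which is the kind of tokenized input a topic model expects.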
The inputs to LDA are the documents to be analyzed (in our case the job descriptions downloaded from dice.com) and the number of topics k to be identified. As suggested by Chang et al., and confirmed by Blei, we can select k by applying human evaluation among alternative values, so that the interpretation of the machine-generated model is as meaningful as possible for a human. The authors have collectively evaluated multiple outputs of LDA with k ranging from 2 to 30 and have agreed by consensus that the most significant set of topics was reached with k=9.
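To make the two inputs (documents and k) and the two outputs (topic proportions per document, word counts per topic) concrete, here is a minimal pure-Python sketch of LDA fitted with collapsed Gibbs sampling. It is an illustration under stated assumptions, not the implementation used in the study: the inference method, the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all choices made here for brevity.

```python
import random


def lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs is a list of token lists; returns (presence, topic_word, vocab), where
    presence[d][t] is the estimated proportion of topic t in document d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic = [[0] * k for _ in docs]       # tokens of doc d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # tokens assigned to topic t overall
    assignments = []
    for d, doc in enumerate(docs):            # random initial topic per token
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][index[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wi = index[w]
                z = assignments[d][n]
                doc_topic[d][z] -= 1          # remove the token's current assignment
                topic_word[z][wi] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wi] + beta) / (topic_total[t] + v * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][n] = z         # resample and restore the counts
                doc_topic[d][z] += 1
                topic_word[z][wi] += 1
                topic_total[z] += 1
    presence = [
        [(doc_topic[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
        for d, doc in enumerate(docs)
    ]
    return presence, topic_word, vocab


# Toy documents made of skill terms, for illustration only.
docs = [
    ["hadoop", "spark", "hadoop"],
    ["statistics", "regression", "statistics"],
    ["hadoop", "spark", "regression", "statistics"],
]
presence, topic_word, vocab = lda_gibbs(docs, k=2)
```

Each row of `presence` sums to 1, which is what allows a single job post to draw on several skill sets in different proportions. Running the function for several values of `k` and reading the highest-count words in each row of `topic_word` is one way to support the human evaluation of k described above.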
Table IV shows the 9 skill sets we have identified through LDA: for each of those we have provided a title and the list of the top 20 keywords associated within each skill set.
|Skill set #||1||2||3||4||5|
|Skill set #||6||7||8||9|
Table IV: The 20 most popular keywords referring to skills, grouped by skill sets, as per LDA output. The title in bold is a human interpretation of the generic focus of each skill set.
4.2 Mapping of Skill Sets by Job Family
The objective of the fourth and last step of our analytical process was to characterize each job family recognized within phase 2 with a mix of relevant skill sets, identified in phase 3.
The presence (or topic proportion, in the context of topic modeling) is a measure of the extent to which each skill set is represented within each job post description. This measure is an output of LDA and is stored as a "presence matrix" where the rows are the job posts and the 9 columns correspond to the skill sets. By analyzing the presence of each skill set within each job description we have inferred the degree of centrality of each skill set in each job family. In order to do so, we have first grouped the presence factors by job family, using the classification obtained through phase 2, and calculated the average presence of each skill set within every job family. The resulting matrix $C$ showed the average level of centrality of a skill set within every job family. We have then normalized matrix $C$ by dividing every column by its average value, obtaining matrix $\hat{C}$, whose elements $\hat{C}_{i,j}$ can be used to assess the centrality of each specific skill $j$ in a job family $i$ by means of the following relation:
- $\hat{C}_{i,j} < 1$: Skill $j$ is not typically relevant within job family $i$;
Notes: the number of dots in each cell indicates the relevancy of a Big Data skill set within a Big Data job family and is based on the matrix $\hat{C}$ ($\hat{C}_{i,j} < 0.85 \rightarrow$ no dots, $\hat{C}_{i,j} \in [0.85, 1] \rightarrow$ 1 dot, $\hat{C}_{i,j} \in [1, 1.15] \rightarrow$ 2 dots, $\hat{C}_{i,j} > 1.15 \rightarrow$ 3 dots)
Table V: Big Data Job Families vs. Skill sets:
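The averaging and normalization behind Table V can be sketched in Python as follows. The presence values and family labels are invented for illustration. The thresholds 0.85 and 1.15 come from the notes to the table, while the split at 1 between one and two dots, and the treatment of a value exactly equal to 1 as one dot, are assumptions based on the relevance relation stated earlier.

```python
def centrality_matrix(presence, families):
    """Average presence per family (matrix C), then divide each column by its mean."""
    grouped = {}
    for row, family in zip(presence, families):
        if family is None:
            continue  # posts without an unambiguous family are left out
        grouped.setdefault(family, []).append(row)
    names = sorted(grouped)
    n_skills = len(presence[0])
    c = [
        [sum(r[j] for r in grouped[f]) / len(grouped[f]) for j in range(n_skills)]
        for f in names
    ]
    col_mean = [sum(row[j] for row in c) / len(c) for j in range(n_skills)]
    c_hat = [[row[j] / col_mean[j] for j in range(n_skills)] for row in c]
    return names, c_hat


def dots(value):
    """Map a normalized centrality value to a number of dots (0 to 3)."""
    if value < 0.85:
        return 0
    if value <= 1.0:  # assumed boundary between one and two dots
        return 1
    if value <= 1.15:
        return 2
    return 3


# Hypothetical presence matrix: 4 job posts, 2 skill sets.
presence = [[0.8, 0.2], [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
families = ["Developer", "Developer", "Business Analyst", None]
names, c_hat = centrality_matrix(presence, families)
dot_table = [[dots(value) for value in row] for row in c_hat]
```

Because every column of the normalized matrix averages 1 by construction, a value above 1 means that the skill set is more present in that family than in the average family.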
In this section, we provide an identikit of the job roles belonging to each Big Data job family. This description was obtained primarily by leveraging the family vs. skill set assessment which we obtained through the process described above, and that we convey graphically through the alluvial diagram reported in Figure 3. We have also read a sample of job posts for each family and extracted some recurring features which helped to characterize the family. In the following section, some snippets of text (within quotation marks and in italic) are extracted from the downloaded job posts in order to illustrate actual examples of the type of responsibilities related to each job family.
Figure 3: Alluvial Diagram of Big Data Job Families vs. Big Data Skill
5.1 Job Family 1: Business Analyst
The role of a Business analyst focuses on transforming relevant insights into actual business impact and includes elements of organizational effectiveness. Job posts within this family mention responsibilities in the area of analytical business advisory ("drive decision making through analytics, influence sales and marketing strategies, provide analytical support to business initiatives, report strategic insights to partners") and project management ("analyze and document business needs, communicate progress and results effectively, bring to life recommended actions"). Our research shows that the primary skills for a Business Analyst are in the domain of project management and business impact: we have used this last term to identify a mix of industry-specific skills and broader management competencies, such as effective communication, business process transformation and financial acumen. Business Analysts are the bridge between business decision makers and more technical roles: as a consequence, they also have a practical understanding of analytical methods and related technology which they relate to mainly as users.
5.2 Job Family 2: Data Scientist
The focus for a data scientist is on data itself and on the analytical methods for the transformation of data into insights. Job roles in this family include the responsibility to "identify patterns, apply context and intelligence, extract relevant information hidden in the large volumes of data, design and implement data models and statistical methods, integrate research and best practices into problem avoidance and continuous improvement". Posts usually mention specific analytical techniques ("classification, collaborative filtering, association rules, neural networks, heuristic approaches"), scripting or programming languages ("Python, SQL, Java, Ruby") and statistical platforms ("R, SAS, Matlab") deemed to be critical for the advertised position. According to our analysis, Data Scientists' main skill set is definitely analytics, as they know and leverage Big Data methods like no one else in their company. Data Scientists also need to understand the business context in which they operate and use project management techniques in order to interact effectively with the rest of the organization. They should also be confident in accessing corporate data warehouses and able to write scripts for querying databases.
5.3 Job Family 3: Big Data Developer
The main objective of Big Data Developers is to design, develop and modify data-reliant application software. Job descriptions within this family mention that candidates will "develop dashboards and data solutions, design, build, and deliver new reports, prototype working proof of concepts for multi-threaded, multi-server applications, integrate third party applications through Application Programming Interfaces". Posts also refer to responsibilities over the application lifecycle of the analytical product, which include "design, development and implementation of automation innovations, development of automated testing scripts, on-going advanced application support". Their primary skill is undoubtedly coding, but they also need solid expertise in systems management, cloud computing and distributed technologies. Big Data Developers also require a basic understanding of database management and corporate data architecture, and need to know how analytics are used in the context of their company.
5.4 Job Family 4: Big Data Engineer
Big Data Engineers focus on building and maintaining the full technology infrastructure which enables storage and processing of Big Data. Roles in this family are responsible to: "manage the enterprise analytics server platform, support all processes to load and manage the analytics data store and integrate new data sources, ensure capacity, backups, failover, and disaster recovery processes are in place, deploy custom cloud-based applications, scale backend data storage platforms." Job posts usually include specific mentions of the technologies adopted in the company's Big Data stack and might include (in no specific order) "Hadoop, Cassandra, MongoDB, MySQL, Hana, Ceph, GlusterFS, Azure, Amazon Web Services". The primary skill set for Big Data Engineers relates to data architecture and includes the competencies needed to construct and manage the corporate Big Data ecosystem in a sustainable manner. This includes the ability to take care of the variety of complexities inherent to systems management (spanning from information security to performance monitoring), cloud computing and distributed processing. Job posts also refer to the ability to interact with databases, adopt project management processes and have a general understanding of how data can support the company's strategy.
6. Conclusions and Managerial Implications
Data scientists have been put under the spotlight as the supposed protagonists of the Big Data revolution in companies. Firms need to get the right analytical skills and expertise added to their human capital, but this goes well beyond acquiring data scientists alone. Managers still ask themselves which new talent they need and how to upgrade the skills of their current human resources [22,34]. With the present study, we have provided structure and clarity to the multifaceted landscape of Big Data-related human resource needs, by offering a systematized nomenclature and characterization of job roles and skills. This contributes to the literature by providing a coherent framework upon which to build future investigations, as called for by multiple researchers [7,8].
We have assembled a semi-automated analytical process, based on web scraping, expert judgment, text mining and topic modelling techniques, in order to systematically review current job offers related to Big Data, using more than 2,700 job descriptions posted online. Our findings confirm the ideas of Miller, who suggested that Data Scientists and their deep expertise in analytical methods are far from sufficient to grant companies a real competitive advantage. The evidence from our analysis suggests that there are four different job families related to Big Data: Business Analysts, Data Scientists, Big Data Developers and Big Data Engineers. We have characterized each of them with a data-based assessment of the skill sets it requires and the corresponding level of proficiency. We have built a 'Big Data Job Families vs. Skill sets matrix' (Table V), which can be used by business managers to structure their recruitment programs and functional career paths, and by universities to shape their curricula and degree programs. The matrix also suggests a natural clustering of Big Data Job Families into two separate groups:
Technology-Enabler professionals: Developers and Engineers have multiple overlaps in terms of skill sets, and their roles tend to be more technical and focused on systems and applications;
Business-Impacting professionals: Business Analysts and Data Scientists share multiple skill requirements and have a more business-oriented role, focusing on data analysis, in direct connection with economic impact and organizational value creation.
These results provide particularly useful insights for organizations and managers working in industries that are being transformed by 'digital disruption'. For instance, functional managers can use our results to build more meaningful and structured job descriptions for hiring. Moreover, HR managers can design Big Data career and competency development frameworks in a way that is coherent with the most prominent business needs and industry trends. The results also provide useful guidance to educational institutions (such as universities and their master's programmes) that aim to focus their efforts on developing the skills and competences that will be needed in the future. To this end, our results suggest that the Data Scientist profession is not homogeneous: it includes both 'hard' and 'soft' skills, as well as different connotations towards organizational processes, technologies, and value creation. Thus, educational programmes could be tailored accordingly, taking into account the specific industry needs.
Besides the results described above, the present study contains an original methodological proposal that can be reapplied to bring clarity to other domains. In fact, the combined use of web scraping and machine learning techniques for classifying jobs and describing them in terms of skill requirements is innovative and can be reused in similar future studies focusing on any other professional field.
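To give a concrete flavour of the topic modelling step mentioned above, the following is a minimal sketch and not the implementation used in this study. It assumes latent Dirichlet allocation fitted by collapsed Gibbs sampling, written in plain Python. The four job-post snippets, the stop-word list, the hyperparameters `alpha` and `beta`, and the function names `tokenize` and `fit_lda` are all hypothetical and chosen for illustration only.

```python
import random
import re

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"and", "the", "of", "to", "on", "in", "a", "for", "with"}


def tokenize(text):
    """Lowercase a job post and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (theta, top_words): theta[d][k] is the estimated share of topic k
    in document d; top_words[k] lists the five most frequent words of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []                                       # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Gibbs sweeps: resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[k][v])[:5]]
        for k in range(n_topics)
    ]
    return theta, top_words


# Invented job-post snippets (not taken from the study's data set).
posts = [
    "Manage Hadoop and Cassandra clusters, backups and failover on cloud platforms",
    "Scale MongoDB storage, monitor Hadoop clusters and cloud backups",
    "Build statistical models and present data analysis to business stakeholders",
    "Analyse business data, build models and report analysis to stakeholders",
]
docs = [tokenize(p) for p in posts]
theta, top_words = fit_lda(docs, n_topics=2)

for k, words in enumerate(top_words):
    print(f"topic {k}: {', '.join(words)}")
for d, shares in enumerate(theta):
    print(f"post {d}: {[round(s, 2) for s in shares]}")
```

Each row of `theta` is a per-post mixture over topics, which is the kind of quantity that could then be aggregated by job family. Interpreting and labelling the resulting topics as skill sets would still require expert judgement, and a real replication would start from scraped job posts rather than hand-written snippets.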
The current investigation is affected by a number of known limitations, which provide stimulating opportunities for future research. Firstly, the analysis of job posts was largely based on US-based positions and might not consider relevant trends generated in other geographical regions. Secondly, we provided clarity on the features of job roles, each considered individually, while neither offering details on their mutual interactions nor examining how the organization as a whole should be structured. Lastly, the characterization of the various Big Data skill sets does not provide a precise indication of the behaviors professionals should display: the addition of a proficiency assessment tool for those skills would enable companies to rate the maturity of their analytical workforce.
- A. De Mauro, M. Greco, M. Grimaldi, A formal definition of Big Data based on its essential features, Libr. Rev. 65 (2016) 122-135. http://www.emeraldinsight.com/doi/abs/10.1108/LR-06-2015-0061 (accessed April 1, 2016).
- M.G. Colombo, L. Grilli, On growth drivers of high-tech start-ups: Exploring the role of founders' human capital and venture capital, J. Bus. Ventur. 25 (2010) 610-626. doi:10.1016/j.jbusvent.2009.01.005.
- M. Delgado-Verde, G. Martin-De Castro, J. Amores-Salvado, Intellectual capital and radical innovation: Exploring the quadratic effects in technology-based manufacturing firms, Technovation. 54 (2016) 35-47. doi:10.1016/j.technovation.2016.02.002.
- J. Siepel, M. Cowling, A. Coad, Non-founder human capital and the long-run growth and survival of high-tech ventures, Technovation. 59 (2017) 34-43. doi:10.1016/j.technovation.2016.09.001.
- J. Saenz, N. Aramburu, M. Buenchea, M. Vanhala, P. Ritala, How much does firm-specific intellectual capital vary? Cross-industry and cross-national comparison, Eur. J. Int. Manag. 11 (2017) 129-152. doi:10.1504/EJIM.2017.082529.
- G. Morales-Alonso, I. Pablo-Lerchundi, M.C. Nuez-Del-Rio, Entrepreneurial intention of engineering students and associated influence of contextual factors, Int. J. Soc. Psychol. (2016). doi:10.1080/02134748.2015.1101314.
- S. Miller, Collaborative Approaches Needed to Close the Big Data Skills Gap, J. Organ. Des. 3 (2014) 26-30. doi:10.7146/jod.3.1.9823.
- I.-Y. Song, Y. Zhu, Big data and data science: what should we teach?, Expert Syst. (2015). doi:10.1111/exsy.12130.
- A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics, Int. J. Inf. Manage. 35 (2014) 137-144. doi:10.1016/j.ijinfomgt.2014.10.007.
- J.S. Ward, A. Barker, Undefined By Data: A Survey of Big Data Definitions, (2013) 2.
- O. Ylijoki, J. Porras, Perspectives to Definition of Big Data: A Mapping Study and Discussion, J. Innov. Manag. 1 (2016) 69-91. http://hdl.handle.net/10216/83250.
- A.J. van Altena, P.D. Moerland, A.H. Zwinderman, S.D. Olabarriaga, Understanding big data themes from scientific biomedical literature through topic modeling, J. Big Data. 3 (2016) 23. doi:10.1186/s40537-016-0057-0.
- M. Hilbert, Big Data for Development: A Review of Promises and Challenges, Dev. Policy Rev. 34 (2016) 135-174. doi:10.1111/dpr.12142.
- P. Russom, Big data analytics, TDWI Best Pract. Rep. (2011).
- J. Rowley, The wisdom hierarchy: representations of the DIKW hierarchy, J. Inf. Sci. 33 (2007) 163-180. http://jis.sagepub.com/content/33/2/163.abstract (accessed May 11, 2015).
- G.E. Moore, Cramming more components onto integrated circuits, Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ff., IEEE Solid-State Circuits Newsl. 11 (2006) 33-35. doi:10.1109/N-SSC.2006.4785860.
- T.H. Davenport, J. Dyché, Big Data in Big Companies, International Institute for Analytics, Portland, OR, 2013. http://resources.idgenterprise.com/original/AST-0109216_Big_Data_in_Big_Companies.pdf
- L. Cao, Coupling learning of complex interactions, Inf. Process. Manag. 51 (2015) 167-186. doi:10.1016/j.ipm.2014.08.007.
- J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A. Hung Byers, Big data: The next frontier for innovation, competition, and productivity, 2011. doi:10.1080/01443610903114527.
- H. Chen, R. Chiang, V. Storey, Business Intelligence and Analytics: From Big Data to Big Impact, MIS Q. 36 (2012) 1165-1188.
- B. Marr, Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance, Wiley, 2015.
- T.H. Davenport, Big Data at Work: Dispelling the Myths, Uncovering the Opportunities, Harvard Business Review Press, 2014.
- J. Bughin, Big data, Big bang?, J. Big Data. 3 (2016) 2. doi:10.1186/s40537-015-0014-3.
- D. Boyd, K. Crawford, Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon, Information, Commun. Soc. 15 (2012) 662-679. doi:10.1080/1369118X.2012.678878.
- F. Provost, T. Fawcett, Data Science and its Relationship to Big Data and Data-Driven Decision Making, Data Sci. Big Data. 1 (2013) 51-59. doi:10.1089/big.2013.1508.
- E. Brynjolfsson, L.M. Hitt, H.H. Kim, Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?, SSRN Electron. J. (2011) 1-28. doi:10.2139/ssrn.1819486.
- T.H. Davenport, D.J. Patil, Data Scientist: The Sexiest Job Of the 21st Century, Harv. Bus. Rev. 90 (2012) 70-76.
- D. Conway, The Data Science Venn Diagram, (2010). http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram (accessed April 16, 2016).
- V. Mayer-Schonberger, K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work and Think, John Murray, London, 2013.
- B. Wixom, T. Ariyachandra, D. Douglas, M. Goul, B. Gupta, L. Iyer, U. Kulkarni, B.J.G. Mooney, G. Phillips-Wren, O. Turetken, The current state of business intelligence in academia: The arrival of big data, Commun. Assoc. Inf. Syst. 34 (2014) 1-13. doi:10.1016/j.cub.2013.07.021.
- H. Chen, R. Chiang, V. Storey, Business Intelligence and Analytics: From Big Data to Big Impact, MIS Q. 36 (2012) 1165-1188. http://search.ebscohost.com/login.aspx?direct=true&profile=ehost&scope=site&authtype=crawler&jrnl=02767783&AN=83466038&h=7OlQMXoL1qrJ31m3%2FMaAtQ0gADgZ7JWseTKbP14GdBSeWYAE4xXJsD3ON9pvSGHFSHfOx9tniPXX%2BMaOgEVTkQ%3D%3D&crl=c (accessed March 26, 2014).
- I.-Y. Song, Y. Zhu, Big data and data science: what should we teach?, Expert Syst. (2015). doi:10.1111/exsy.12130.
- T.H. Davenport, D.J. Patil, Data Scientist: The Sexiest Job Of the 21st Century, Harv. Bus. Rev. 90 (2012) 70-76. http://18.104.22.168/strategic/articles/data_scientist-the_sexiest_job_of_the_21st_century.pdf (accessed July 20, 2014).
- A. McAfee, E. Brynjolfsson, Big data: the management revolution, Harv. Bus. Rev. 90 (2012) 61-67. doi:10.1007/s12599-013-0249-5.
- M. Kobayashi, K. Takeda, Information retrieval on the web, ACM Comput. Surv. 32 (2000) 144-173. doi:10.1145/358923.358934.
- E. Vargiu, M. Urru, Exploiting web scraping in a collaborative filtering-based approach to web advertising, Artif. Intell. Res. 2 (2012) 44-54. doi:10.5430/air.v2n1p44.
- A. Capiluppi, A. Baravalle, Matching demand and offer in on-line provision: A longitudinal study of monster.com, Proc. - 12th IEEE Int. Symp. Web Syst. Evol. WSE 2010. (2010) 13-21. doi:10.1109/WSE.2010.5623576.
- T. Ozacar, A tool for producing structured interoperable data from product features on the web, Inf. Syst. 56 (2016) 36-54. doi:10.1016/j.is.2015.09.002.
- C. Girardi, F. Ricca, P. Tonella, Web crawlers compared, Int. J. Web Inf. Syst. 2 (2006) 85-94. doi:10.1108/17440080680000104.
- E.M. Airoldi, D.M. Blei, S.E. Fienberg, E.P. Xing, Mixed membership stochastic blockmodels, J. Mach. Learn. Res. 9 (2008) 1981-2014. doi:10.1016/j.bbi.2008.05.010.
- E.M. Airoldi, D.M. Blei, E.A. Erosheva, S.E. Fienberg, Handbook of Mixed Membership Models and Their Applications, CRC Press, 2014.
- D.M. Blei, Introduction to Probabilistic Topic Models, Commun. ACM. 55 (2012) 77-84. doi:10.1145/2133806.2133826.
- B. Ma, N. Zhang, G. Liu, L. Li, H. Yuan, Semantic search for public opinions on urban affairs: probabilistic topic modeling-based approach, Inf. Process. Manag. 52 (2016) 430-445. doi:10.1016/j.ipm.2015.10.004.
- S.M.C. Moro, P.A.R. Cortez, P.M.R.F. Rita, Business intelligence in banking: A literature analysis from 2002 to 2013 using text mining and latent Dirichlet allocation, Expert Syst. Appl. 42 (2014) 1314-1324. doi:10.1016/j.eswa.2014.09.024.
- I. Feinerer, K. Hornik, D. Meyer, Text mining infrastructure in R, J. .... 25 (2008). http://onlinelibrary.wiley.com/doi/10.1002/wics.10/full (accessed February 2, 2014).
- J. Chang, S. Gerrish, C. Wang, D.M. Blei, Reading Tea Leaves: How Humans Interpret Topic Models, Adv. Neural Inf. Process. Syst. 22. (2009) 288-296. doi:10.1.1.100.1089.