Data Science : Brief understanding of Typical Project Life-cycle, Tools, Techniques and skills

A DSF Whitepaper
01 August 2019
Dibyendu Banerjee

Data science projects do not have a nice, clean lifecycle with well-defined steps like the software development lifecycle (SDLC). Discussions of data science, business analytics, big data and machine learning appear in many forums, yet nobody seems able to give a compact explanation of the steps such projects follow or how the whole process goes.

People often confuse the lifecycle of a data science project with that of a software engineering project. That should not be the case, as data science is more of science and less of engineering. There is no one-size-fits-all workflow process for all data science projects and data scientists have to determine which workflow best fits the business requirements.

Every step in the lifecycle of a data science project depends on various data scientist skills and data science tools. The typical lifecycle of a data science project involves jumping back and forth among various interdependent data science tasks using a variety of tools, techniques (mostly statistical methods and formulae), programming and so on.

Let us try to see what could be a typical life cycle.

Before you can even start on a data science project, it is critical that you understand the problem you are trying to solve.


It is important to understand the various specifications, requirements, priorities and required budget. You must possess the ability to ask the right questions. Here, you assess if you have the required resources present in terms of people, technology, time and data to support the project. In this phase, you also need to frame the business problem and formulate initial hypotheses (IH) to test.

According to Microsoft Azure's blog, we typically use data science to answer five types of questions:

  1. How much or how many? (regression)
  2. Which category? (classification)
  3. Which group? (clustering)
  4. Is this weird? (anomaly detection)
  5. Which option should be taken? (recommendation)

In this stage, you should also identify the central objective of your project by determining the variables that need to be predicted.

Data Wrangling, sometimes referred to as Data Munging


Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis. It involves transforming and mapping data from one "raw" form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

In other words, it is a data cleaning activity, so why not call it scrubbing?

MUNGING = SCRUBBING = DATA CLEANING

In this phase, you require an analytical sandbox in which you can perform analytics for the entire duration of the project. Before that, however, you will perform ETL (extract, transform, load) to get data into the sandbox.

Data might need to be collected from multiple types of data sources.

A few examples of data sources:

  • File format data (spreadsheets, CSV, text files, XML, JSON)
  • Relational databases
  • Non-relational databases (NoSQL)
  • Scraping website data using tools

SKILLS/Tools/Techniques

  • Database management: MySQL, PostgreSQL, MongoDB
  • Querying relational databases
  • Retrieving unstructured data: text, videos, audio files, documents
  • Distributed storage: Hadoop, Apache Spark/Flink
  • R packages for reading various file formats
  • Python libraries for reading various file formats
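As an illustrative sketch using only the Python standard library, the snippet below reads the same small dataset from CSV and JSON text; the data itself is made up:

```python
import csv
import io
import json

# Hypothetical raw inputs; in practice these would be files on disk
# or responses from an API.
csv_text = "name,age\nAda,36\nGrace,45\n"
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'

# CSV: each row becomes a dict keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: parsed directly into Python lists and dicts.
json_rows = json.loads(json_text)

print(csv_rows[0]["name"])   # Ada
print(json_rows[1]["age"])   # 45
```

Note that the CSV reader yields strings ("36"), while JSON preserves numeric types, a small example of why wrangling is needed before analysis.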

EXPLORE… EXPLORE… EXPLORE


The object of this step is to apply scientific (statistical) methods to make the data more feasible to feed into models; in other words, choosing a baseline model is the outcome of this phase.

Exploratory analysis is often described as a philosophy, and there are no fixed rules for how you approach it. There are no shortcuts for data exploration.

Remember, the quality of your inputs decides the quality of your output. Therefore, once you have your business hypothesis ready, it makes sense to spend a lot of time and effort here.

Below are some of the standard practices involved in understanding, cleaning and preparing your data for building your predictive model:

  1. Variable Identification
  2. Univariate Analysis
  3. Bi-variate Analysis
  4. Missing values treatment
  5. Outlier treatment
  6. Variable transformation
  7. Variable creation

Finally, we will need to iterate over steps 4–7 multiple times before we arrive at our refined model.
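Steps 4 and 5 above can be sketched in plain Python; the toy column and the 1.5 × IQR rule are illustrative choices, not the only way to treat missing values and outliers:

```python
import statistics

# Toy numeric column with a missing value (None) and an obvious outlier.
values = [12.0, 14.5, None, 13.2, 15.1, 98.0, 14.0]

# Step 4 - missing values treatment: impute with the median of observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
filled = [median if v is None else v for v in values]

# Step 5 - outlier treatment: clip values outside 1.5 * IQR of the quartiles.
q1, q2, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
treated = [min(max(v, lo), hi) for v in filled]

print(max(treated))  # the 98.0 outlier has been clipped down
```

Whether to impute, drop or flag missing values, and whether to clip, remove or transform outliers, are exactly the judgment calls this phase is about.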

Model Building is the core activity of a data science project. It is carried out either through statistically driven analysis (statistical analytics) or using machine learning techniques.


A typical flow of data science activities moves through the different stages of analytics, with statistical modeling followed by machine learning.

Difference between statistical modeling and ML

Machine learning is an approach in which an algorithm can learn from data without relying on rules-based programming. Statistical modelling is the formalization of relationships between variables in the form of mathematical equations.

MACHINE LEARNING?

Undoubtedly, Machine Learning is the most in-demand technology in today’s market. Its applications range from self-driving cars to predicting deadly diseases.

Machine Learning Terminologies

Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques used to learn patterns from data and draw significant information from it. It is the logic behind a Machine Learning model. An example of a Machine Learning algorithm is the Linear Regression algorithm.

 

Model: A model is the main component of Machine Learning. A model is trained by using a Machine Learning Algorithm. An algorithm maps all the decisions that a model is supposed to take based on the given input, in order to get the correct output.

Predictor Variable: It is a feature (or features) of the data that can be used to predict the output.

Response Variable: It is the feature or the output variable that needs to be predicted by using the predictor variable(s).

Training Data: The Machine Learning model is built using the training data. The training data helps the model to identify key trends and patterns essential to predict the output.

Testing Data: After the model is trained, it must be tested to evaluate how accurately it can predict an outcome. This is done using the testing data set.
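A minimal sketch of how a dataset might be divided into training and testing sets; the 80/20 split, the fixed seed and the toy data are arbitrary choices:

```python
import random

# Hypothetical labeled dataset: (feature, label) pairs.
data = [(i, i % 2) for i in range(100)]

# Shuffle with a fixed seed for reproducibility, then hold out 20% for testing.
random.seed(42)
shuffled = data[:]
random.shuffle(shuffled)

split = int(len(shuffled) * 0.8)
train_set, test_set = shuffled[:split], shuffled[split:]

print(len(train_set), len(test_set))  # 80 20
```

Shuffling before splitting matters: if the data is ordered (say, by date or by class), a naive head/tail split would give the model an unrepresentative training set.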

Machine Learning Types

A machine can learn to solve a problem by following any one of three approaches. These are the ways in which a machine can learn:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning (out of scope for this document)
Supervised Learning

Supervised learning is a technique in which we teach or train the machine using data, which is well labeled.

To understand Supervised Learning let us consider an analogy. As kids we all needed guidance to solve math problems. Our teachers helped us understand what addition is and how it is done. Similarly, you can think of supervised learning as a type of Machine Learning that involves a guide. The labeled data set is the teacher that will train you to understand patterns in the data. The labeled data set is nothing but the training data set.
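As a minimal sketch of this idea, and not any particular library's API, the snippet below "teaches" a nearest-centroid classifier from a small labeled data set and then asks it to label a new point; all values are made up for illustration:

```python
# Labeled training data: each point comes with its class, playing the role
# of the "teacher" in the analogy above.
train = [
    ((1.0, 1.2), "small"),
    ((0.8, 0.9), "small"),
    ((5.1, 4.8), "large"),
    ((4.9, 5.3), "large"),
]

# "Training": compute the mean (centroid) of each labeled class.
totals = {}
for (x, y), label in train:
    sx, sy, n = totals.get(label, (0.0, 0.0, 0))
    totals[label] = (sx + x, sy + y, n + 1)
centroids = {lbl: (sx / n, sy / n) for lbl, (sx, sy, n) in totals.items()}

# "Prediction": assign a new, unlabeled point to the nearest centroid.
def predict(point):
    return min(centroids, key=lambda lbl: (point[0] - centroids[lbl][0]) ** 2
                                        + (point[1] - centroids[lbl][1]) ** 2)

print(predict((5.0, 5.0)))  # large
```

The labels do all the teaching here: without them, the model would have no way to know what the two groups mean.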

In supervised learning, you train the machine using labeled data, and there is a well-defined training phase carried out with the help of that labeled data.
Unsupervised Learning

Unsupervised learning involves training by using unlabeled data and allowing the model to act on that information without guidance. Think of unsupervised learning as a smart kid that learns without any guidance.

List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

Regression Algorithms

Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model. Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning.

The most popular regression algorithms are:

  • Ordinary Least Squares Regression (OLSR)
  • Linear Regression
  • Logistic Regression
  • Stepwise Regression
  • Multivariate Adaptive Regression Splines (MARS)
  • Locally Estimated Scatterplot Smoothing (LOESS)
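For the simplest of these, ordinary least squares with a single predictor, the slope and intercept have a closed form; the data points below are illustrative:

```python
# Ordinary least squares for one predictor, using the closed-form
# slope = cov(x, y) / var(x) and intercept = mean_y - slope * mean_x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.2, 5.9, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))
```

The "measure of error" being minimised here is the sum of squared residuals, which is what makes this closed form possible.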
Clustering Algorithms

Clustering, like regression, describes both a class of problem and a class of methods. Clustering methods are typically organized by modeling approach, such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize it into groups of maximum commonality.

The most popular clustering algorithms are:

  • k-Means
  • k-Medians
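A bare-bones k-means on one-dimensional data shows the assignment and update steps; the points and starting centroids are made up, and real implementations handle multiple dimensions, random restarts and empty clusters:

```python
# Two obvious groups of points and two (deliberately bad) starting centroids.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9]
centroids = [0.0, 10.0]

for _ in range(10):  # fixed number of refinement iterations
    # Assignment step: attach each point to its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(sorted(round(c, 2) for c in centroids))  # [1.0, 8.03]
```

k-medians follows the same loop but moves each centroid to the median of its cluster, which makes it less sensitive to outliers.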
Dimensionality Reduction Algorithms

Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the data, but in this case in an unsupervised manner, in order to summarize or describe the data using less information.

This can be useful for visualizing high-dimensional data or for simplifying data, which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

  • Principal Component Analysis (PCA)
  • Principal Component Regression (PCR)
  • Partial Least Squares Regression (PLSR)
  • Linear Discriminant Analysis (LDA)
  • Mixture Discriminant Analysis (MDA)
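As an example, PCA can be sketched with NumPy by centering the data and taking its singular vectors; the toy matrix below is illustrative:

```python
import numpy as np

# Toy 2-D data that mostly varies along one direction.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# PCA: center the data, then take the right-singular vectors of the
# centered matrix as the principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first principal component: 2 features reduced to 1,
# while keeping most of the variance (reflected in the singular values).
X1 = Xc @ Vt[0]

print(X1.shape)  # (8,)
```

The singular values in `S` show how much variation each component captures, which is how you decide how many components to keep.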
Instance-based Algorithms

An instance-based learning model addresses a decision problem with instances, or examples, of training data that are deemed important or required by the model. Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and the similarity measures used between instances.

The most popular instance-based algorithms are:

  • k-Nearest Neighbor (kNN)
  • Learning Vector Quantization (LVQ)
  • Self-Organizing Map (SOM)
  • Locally Weighted Learning (LWL)
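The "winner-take-all" idea is easy to see in a tiny k-nearest-neighbour sketch (k = 3, with made-up points):

```python
from collections import Counter

# Hypothetical labeled points: the "database" of stored instances.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

def knn_predict(point, k=3):
    # Rank stored instances by squared distance to the query point
    # (the similarity measure).
    ranked = sorted(train, key=lambda item: (item[0][0] - point[0]) ** 2
                                          + (item[0][1] - point[1]) ** 2)
    # Majority vote among the k nearest neighbours: winner takes all.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((2, 2)))  # A
print(knn_predict((7, 8)))  # B
```

Note that there is no training phase at all: the model is the stored data, which is why these methods are called memory-based learning.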
Decision Tree Algorithms

Decision tree methods construct a model of decisions made based on actual values of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.

The most popular decision tree algorithms are:

  • Classification and Regression Tree (CART)
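A single best-split search, the building block that CART applies recursively, can be sketched as follows; the data and candidate thresholds are illustrative:

```python
# A one-level decision tree (a "stump"): find the threshold on a numeric
# attribute that best separates the two classes.
data = [(1.0, "no"), (2.0, "no"), (3.0, "no"), (7.0, "yes"), (8.0, "yes")]

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions (0 = pure).
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

# Try each candidate split and keep the one with the lowest weighted impurity.
best = None
for threshold in [1.5, 2.5, 5.0, 7.5]:
    left = [lbl for x, lbl in data if x <= threshold]
    right = [lbl for x, lbl in data if x > threshold]
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
    if best is None or score < best[0]:
        best = (score, threshold)

print(best)  # the 5.0 threshold separates the classes perfectly
```

A full CART implementation repeats this search inside each resulting branch until the leaves are pure enough or a depth limit is reached.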
Other important algorithms are:
  1. Naive Bayes
  2. Random Forest
  3. Dimensionality Reduction Algorithms
  4. Neural Network Algorithms
  5. Natural Language Processing (NLP)
Tools

There are various R packages available for model building.

Python – scikit-learn is a free library containing simple and efficient tools for data analysis and mining. You can implement various algorithms, such as logistic regression or time-series methods, using scikit-learn.
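A typical scikit-learn workflow for the logistic regression example mentioned above might look like this; the synthetic dataset and split proportions are arbitrary:

```python
# Logistic regression with scikit-learn on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a small labeled dataset with a fixed seed for reproducibility.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training split, then score accuracy on the held-out test split.
model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(round(accuracy, 2))
```

The fit/predict/score pattern is the same across scikit-learn estimators, which is why swapping in a different algorithm usually means changing one line.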

Interpreting data refers to presenting your results to a non-technical audience. We deliver the results to answer the business questions we asked when we first started the project, together with the actionable insights found through the data science process.

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

On top of that, you will need to visualize your findings accordingly, keeping them driven by your business questions. It is essential to present your findings in a way that is useful to the organization; otherwise it would be pointless to your stakeholders.

Note: data visualization is a technique in which data is visualized using certain tools. Visualization is used by data scientists as and when required, whether during EDA, data wrangling or elsewhere. Hence, from a general life-cycle perspective, data visualization can more generically be called getting insights.

SKILLS

In this process, technical skills alone are not sufficient. One essential skill is being able to tell a clear and actionable story. If your presentation does not trigger actions in your audience, your communication was not effective. Remember that you will be presenting to an audience with no technical background, so the way you communicate the message is key.

Tools

  • Tableau
  • Power BI
  • R – ggplot2, lattice
  • Python – Matplotlib, Seaborn, Plotly
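A minimal Matplotlib sketch of the kind of chart such tools produce; the sales figures and output file name are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Illustrative monthly figures for a simple trend line.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Quarterly sales trend")

# Save to a file instead of opening a window.
fig.savefig("sales_trend.png")
plt.close(fig)
```

Labelled axes and a title are the minimum for a chart a non-technical stakeholder can read without you standing next to it.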