Data science is the study of data, which may be structured or unstructured. It involves understanding data, extracting value from it and visualising it, using various machine learning algorithms and statistical methods. It is one of the most prominent fields of the 21st century, and a central goal is to predict new information from existing data. Business intelligence (BI), which analyses and reports on data, is a subset of data science. Building predictive models helps businesses and markets to accelerate growth and development.
The following skills are required to be a data scientist:
- Data Mining
- Data Analysis
- Data Visualisation
- Machine learning
- Programming Languages
Data mining is the technique of discovering patterns in data and extracting useful information from it. It is also known as Knowledge Discovery in Databases (KDD). In general, more data allows more accurate models.
Stages of data mining:
Data exploration is the first stage of data mining. It consists of collecting data, then cleaning and transforming it according to the needs of the problem. It can be done automatically or manually. For manual data exploration, queries and programming language scripts can be used.
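As a minimal sketch of the collect, clean and transform steps, the following Python script uses only the standard library. The CSV content, the column names and the `clean_rows` helper are hypothetical and exist purely for illustration.

```python
import csv
import io

# Hypothetical raw export: stray spaces, inconsistent casing,
# one missing value and one duplicate record.
RAW_CSV = """name,age,city
 Alice ,34,london
Bob,,Paris
 Alice ,34,london
Carol,29, BERLIN
"""

def clean_rows(raw_text):
    """Collect rows, drop incomplete or duplicate records, normalise fields."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_text)):
        name = row["name"].strip()
        age = row["age"].strip()
        city = row["city"].strip().title()
        if not name or not age or not city:
            continue  # cleaning: skip records with missing values
        record = (name, int(age), city)  # transforming: age becomes an integer
        if record in seen:
            continue  # cleaning: skip exact duplicates
        seen.add(record)
        cleaned.append({"name": name, "age": record[1], "city": city})
    return cleaned

rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

Which records count as "incomplete" or "duplicate" depends on the problem, so the rules above are one possible choice rather than a general recipe.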
Data modeling is the application of algorithms to data. The aim is to choose the best model for the problem, so different models are applied to the same data and compared in order to select the best and most relevant one. Bagging, boosting and meta-learning are some popular techniques.
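To illustrate bagging (bootstrap aggregating), here is a rough sketch in standard-library Python. It fits a simple least-squares line on many bootstrap resamples of the data and averages the predictions. The base model, the toy data and the function names `fit_line` and `bagged_predict` are assumptions made for this example only.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y  # all x values equal: fall back to the mean of y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bagged_predict(points, x_new, n_models=50, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) items with replacement.
        sample = [rng.choice(points) for _ in points]
        slope, intercept = fit_line(sample)
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)

# Toy data roughly following y = 2x, with small hand-picked noise.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 11.9)]
prediction = bagged_predict(data, x_new=7)
print(round(prediction, 2))
```

Averaging over resamples tends to reduce the variance of an unstable base model. With a stable model such as a straight-line fit the benefit is small, so this example mainly shows the mechanics.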
The final stage is deployment of the model that proved to be the best fit in the previous stage. This stage is important because the whole study builds towards it. Before deployment, we make sure the chosen model is the one least affected by noise.
Data analysis is the process of discovering useful results. Mined and cleaned data goes to analytic tools, where patterns in the data are found. In simpler terms, it is the analysis of past data or the prediction of future data. A data analyst uses various techniques for analysing data, and the work can be done manually or automatically. Programming languages and analytic tools such as R and Python are used.
Types of data analysis:
Text analysis is analysis performed on text data. It is a method for converting text into useful information that can be applied across many industries. Sentiment analysis and lexical analysis are parts of text analysis. For example, text analysis helps us to sort and rank web pages.
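A very small lexicon-based sketch can show how a lexical step (tokenising) feeds a sentiment score. The word lists, the scoring rule and the sample reviews below are hypothetical. Real systems rely on much larger lexicons or trained models.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}

def tokenize(text):
    """Lexical step: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great product, fast delivery. I love it!",
    "Terrible support and slow shipping.",
    "It arrived on Tuesday.",
]
for review in reviews:
    print(label(review), "|", review)
```

A word-counting rule like this ignores negation ("not good") and context, which is one reason practical sentiment analysis usually goes beyond simple lexicons.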
Predictive analysis is the analysis of unknown future outcomes. It uses many techniques, including machine learning and artificial intelligence. It combines statistics with computational intelligence and produces results in the form of expected future values. Fraud detection and risk management are applications of predictive analysis.
Data visualisation is the technique of presenting analysed data visually. Large amounts of data are very difficult to understand. Data visualisation techniques such as graphs and charts help us to see trends and patterns in complex data sets.
Types of Data Visualisation
There are also many data visualisation tools, such as QlikView and FusionCharts, which help us to visualise data without writing programs. Manual data visualisation can be done with Python and R.
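As a small illustration of manual visualisation in Python, the sketch below draws a horizontal bar chart as plain text using only the standard library. In practice a plotting library would normally be used. The `bar_chart` helper and the sales figures are hypothetical.

```python
def bar_chart(values, width=40, symbol="#"):
    """Return a horizontal bar chart as text lines, scaled to the largest value."""
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        bar_length = round(width * value / largest) if largest > 0 else 0
        lines.append(f"{label.ljust(label_width)} | {symbol * bar_length} {value}")
    return lines

# Hypothetical monthly sales figures, invented purely for illustration.
monthly_sales = {"Jan": 120, "Feb": 95, "Mar": 160, "Apr": 80}
chart = bar_chart(monthly_sales)
print("\n".join(chart))
```

Even in this crude form, the relative bar lengths make the highest and lowest months easier to spot than the raw numbers do.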
Statistics is the building block of all machine learning algorithms. It gives us deep and precise knowledge of the data, which helps us to study it. Without statistics, we wouldn’t be able to do machine learning or data science.
Two categories of statistics:
Descriptive statistics provides information about, or a description of, the data. Data is categorised and organised based on a given parameter. The description can take the form of numerical values, tables or graphs.
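For instance, a numerical description of a small data set can be produced with Python's built-in `statistics` module. The scores below are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical exam scores, invented for illustration.
scores = [72, 85, 90, 66, 85, 78, 94]

summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

Each entry summarises the data that was actually observed. None of them makes a claim about data outside the sample.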
Inferential statistics predicts outcomes based on past data. Its methods are based on the estimation of parameters and the testing of hypotheses.
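The two ideas, estimating a parameter and testing a hypothesis, can be sketched as follows. This example assumes a large-sample normal approximation (a z-interval and z-test). For small samples a t-distribution would usually be more appropriate. The sample values and function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Estimate the population mean with a normal-approximation interval."""
    centre = mean(sample)
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error

def z_test_p_value(sample, hypothesised_mean):
    """Two-sided p-value for the null hypothesis: mean == hypothesised_mean."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = (mean(sample) - hypothesised_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical sample of delivery times in minutes, invented for illustration.
sample = [31, 28, 35, 30, 33, 29, 34, 32, 30, 36, 27, 33]

low, high = mean_confidence_interval(sample)
p_value = z_test_p_value(sample, hypothesised_mean=30)
print(f"Estimated mean: {mean(sample):.2f} (95% interval {low:.2f} to {high:.2f})")
print(f"p-value for a hypothesised mean of 30: {p_value:.3f}")
```

The interval addresses the estimation of a parameter, while the p-value addresses the testing of a hypothesis about that parameter.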
Machine learning is a part of data science and an application of artificial intelligence in which systems have the ability to learn automatically and to improve with experience. Machine learning algorithms are used for classification, regression and clustering.
Regression is a technique used to predict a dependent variable from a set of independent variables.
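A minimal sketch of predicting a dependent variable from a single independent variable is ordinary least-squares linear regression, shown below in plain Python. The data (hours studied and exam score) and the function name are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (independent) and exam score (dependent).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_simple_regression(hours, scores)
predicted = slope * 6 + intercept  # predicted score after 6 hours of study
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

With several independent variables the same idea generalises to multiple linear regression, which is usually solved with a numerical library rather than by hand.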
Classification is a technique used for approximating a mapping function (f) from input variables (X) to discrete output variables (y).
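One simple way to approximate such a mapping from inputs X to discrete labels y is the k-nearest-neighbours rule, sketched here with the standard library. The training points, the labels "A" and "B" and the choice of k=3 are assumptions for illustration.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled points: (feature_1, feature_2) -> class label.
training = [
    ((1.0, 1.2), "A"), ((1.5, 0.8), "A"), ((0.8, 1.0), "A"),
    ((4.0, 4.2), "B"), ((4.5, 3.8), "B"), ((3.9, 4.4), "B"),
]

print(knn_classify(training, (1.1, 1.0)))
print(knn_classify(training, (4.2, 4.0)))
```

Here the "learned" function f is implicit: it is defined by the stored training points and the voting rule rather than by fitted parameters.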
Clustering is a technique for dividing the population or data points into a number of groups such that the data points in the same group are more similar to each other than to the data points in other groups.
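The grouping idea can be sketched with k-means, one common clustering algorithm, written here in plain Python. The points and the explicitly chosen starting centroids are hypothetical. In practice the starting centroids are usually picked at random or by a seeding heuristic.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                updated.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                updated.append(centroids[i])  # keep a centroid that has no points
        centroids = updated
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Unlike classification, no labels are supplied: the groups emerge only from the distances between the points.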
Knowledge of programming languages is a must for data scientists. There are many languages, with Python and R being the most popular.