
Understanding Linear Regression with Python: Practical Guide 2

A DSF Whitepaper
22 April 2020
Mayank Tripathi

This second practical guide will help you brush up your knowledge and go to the top of the class. We start with a simple statistical algorithm known as Linear Regression and develop our skills by understanding the principles that underpin how it works.

Before we dive deep into the theories surrounding Linear Regression we need to start with a clear view on why it is needed in the first place.

So, what does the term Regression mean?

Understanding a Regression Problem

Example 1: Suppose I am planning a road trip with two of my best friends from Nashville, TN to Las Vegas. To make the trip enjoyable and stress-free, and to ensure that we arrive in one piece, I would be well advised to create a schedule. The first thing I will create as part of this is a budget: how much money should I allocate for gas (petrol/diesel), for food, and for accommodation along the way? My approach would be to find a way to estimate the amount of money needed based on the distance we are travelling and the number of stops required. We can all appreciate that the greater the distance travelled, the more expensive the trip will be.

Example 2: You are asked to examine the relationship between the age and the price of used cars sold in the previous year by a car dealership. We are all aware of the general rule that as a car's age increases its price goes down; this is an example of a negative relationship between car price (Y) and car age (X).

Example 3: Suppose we are planning to buy a new house. For this we will gather sale prices from various property dealers, along with information such as the number of bedrooms, number of bathrooms, location, and so on. Based on all this data, if we are then presented with the same details for another property but not its sale price, we can estimate the sale price within a reasonably accurate range. And if given the asking price of a house on the market, we can judge whether it is a good deal or overpriced.

Other examples include sleep hours vs test scores, experience vs salary, and so on.

The point of these examples is that once we have established a relationship between 2 or more variables or have identified a statistically significant relationship, we can then proceed to forecast, predict, or estimate the value for new observation(s).

And the best thing about this? You are already doing it, day in day out. You just didn’t realize that it was called linear regression.

Meaning of Regression & Linear Regression

Regression analysis attempts to predict a dependent variable or target (usually denoted by Y) from a series of independent variables or features (usually denoted by X).

Linear regression is a statistical approach for modelling the relationship between a dependent variable and a set of explanatory variables. It is a common statistical data analysis technique.

Problem-solving using linear regression has many applications in business, social, biological, and many other areas.

Types of Linear Regression

In general, there are two types of linear regression:

  • Simple Linear Regression
  • Multiple Linear Regression

Simple Linear Regression allows us to study the relationship between two variables.

In simple linear regression a single independent variable is used to predict the value of a dependent variable.

i.e. one independent variable (X) and one dependent variable (Y).

This can be denoted by:

y = b0 + b1x + e

In school we became familiar with equations like the one below. So how is it different from the equation above?

y = mx + c

Both are the same type of equation; we have just renamed the variables and added an extra term, e, to account for the error (the part of y that the line does not explain). In the school equation m is the slope and c is the intercept; in the first equation b0 is the intercept and b1 is the slope.

Slope direction

The slope of a line can be positive, negative, zero or undefined.

Positive slope: y increases as x increases, so the line slopes upwards to the right.

Negative slope: y decreases as x increases, so the line slopes downwards to the right. We saw an example of this earlier: as the age of the car increases, the price decreases.

Zero slope: y does not change as x increases, so the line remains horizontal. The slope of any horizontal line is always zero.

Undefined slope: when the line is exactly vertical, it does not have a defined slope. Any two points on the line share the same x coordinate, so the change in x is zero and the slope formula would divide by zero.

Multiple Linear Regression allows us to study the relationship between three or more variables.

In Multiple Linear Regression two or more independent variables are used to predict the value of a dependent variable.

i.e. two or more independent variables (X1, X2, X3, ...) and one dependent variable (Y).

y = b0 + b1X1 + b2X2 + ... + bnXn + e

The difference between linear regression and multiple linear regression is the number of independent variables (X). In both cases there is only a single dependent variable (Y).
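To make the multiple case concrete, here is a minimal sketch using sklearn's LinearRegression on two invented features (the data, and the coefficients 3, 2 and intercept 5 used to build it, are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables (X1, X2) and one dependent variable (y).
# y is constructed as 3*X1 + 2*X2 + 5, so we know what to expect back.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]])
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

model = LinearRegression()
model.fit(X, y)
print(model.coef_)       # close to [3, 2]
print(model.intercept_)  # close to 5
```

Because the data was built exactly from a linear rule, the model recovers the coefficients; with real data the fit would only approximate them.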

There is one more type of linear regression, Polynomial Regression. This is a generalised case of linear regression which we will cover in another paper; there we will find that many of the challenges we encounter can be solved with Multiple Linear Regression.

Practical 1

Let’s start by working our way through a basic example so that we understand how to use the equation before we convert it into Python code.

Our input data is as follows:

X:   1    3   10   16   26   36
Y:  42   50   75  100  150  200

Here our goal is to solve the equation y = mx + c.

There is a formula for finding the Intercept and Slope from the data. As the data values are few, we will be able to do it manually. Let us do it step by step.

  • Sum of X = 1+3+10+16+26+36 = 92
  • Sum of Y = 42+50+75+100+150+200 = 617
  • Next is to get the Sum of X squared = (1²) + (3²) + (10²) + (16²) + (26²) + (36²) = 2338
  • Similarly get the Sum of Y squared = (42²) + (50²) + (75²) + (100²) + (150²) + (200²) = 82389
  • Next is to multiply each x by its y and sum the results = (1 × 42) + (3 × 50) + (10 × 75) + (16 × 100) + (26 × 150) + (36 × 200) = 13642

Cool, now we have the values, and we can find the slope (m) and intercept (c).

Below is the formula to find the slope (m):

m = (N × Σxy − Σx × Σy) / (N × Σx² − (Σx)²)

Substituting our values:

m = (6 × 13642 − 92 × 617) / (6 × 2338 − 92²) = 25088 / 5564 ≈ 4.51

And the formula to get the intercept (c) is:

c = (Σy − m × Σx) / N = (617 − 4.51 × 92) / 6 ≈ 33.70

So we have identified the slope (m) as 4.51 and the intercept (c) as 33.70.

Now that we have these values we can easily predict y for any new x. For example, if x is 12:

y = mx + c = 4.51 × 12 + 33.70 = 87.82

Wow, we did it. So for x = 12, y will be approximately 87.82.
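The manual steps above translate directly into a few lines of Python (a from-scratch sketch of the same formulas):

```python
# From-scratch computation of the slope (m) and intercept (c)
# using the sums worked out above.
x = [1, 3, 10, 16, 26, 36]
y = [42, 50, 75, 100, 150, 200]
n = len(x)

sum_x = sum(x)                                  # 92
sum_y = sum(y)                                  # 617
sum_x2 = sum(xi ** 2 for xi in x)               # 2338
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 13642

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n
print(round(m, 2), round(c, 2))   # 4.51 33.7

# Predict y for x = 12
print(round(m * 12 + c, 2))       # 87.8
```

Keeping full precision throughout gives 87.80; rounding m and c to two decimal places before predicting gives 87.82 instead.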

This seems a bit tedious, which is why most of us ignored it during our time at school, and why we are brushing up on it now. Today we have Python to make life easy, but it is still important to understand how the equation works.

Getting Started with Python

First, gather the data (observations or samples): the X and Y values from the table above.

Next, import the required library for linear regression and obtain the values of the slope and intercept.

Now calculate the value of y when x is 12, which comes out as 87.80, close to what we calculated manually (87.82). The small difference is due to rounding: working manually, we kept only two decimal places.

So, in Python, with 5 or 6 lines of code we are able to solve the equation and start predicting values.
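The code from the original screenshots is not reproduced here, but the same steps can be sketched with scipy.stats.linregress (an assumption; the original does not show which library it used):

```python
import numpy as np
from scipy import stats

# Step 1: gather the data (observations)
x = np.array([1, 3, 10, 16, 26, 36])
y = np.array([42, 50, 75, 100, 150, 200])

# Step 2: fit the regression line and read off slope (m) and intercept (c)
result = stats.linregress(x, y)
m, c = result.slope, result.intercept
print(f"m = {m:.2f}, c = {c:.2f}")  # m = 4.51, c = 33.70

# Step 3: predict y for x = 12
print(f"{m * 12 + c:.2f}")  # 87.80
```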

It helps to visualise. In the graph (not reproduced here) the blue dots are the observed values and the green line is the one our model has predicted.

If you look carefully, the blue dot with x between 15 and 20 does not sit on the green line; that gap, or distance, is the error, represented by e.

This doesn’t end here: after we have created the model, the linear regression, we have to test its accuracy. This can be done with the R-squared method, denoted as

R² = Σ(ŷᵢ − ȳ)² / Σ(yᵢ − ȳ)²

The R-squared value is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination. It compares the distance between each predicted value and the mean with the distance between each actual value and the mean.

R-squared values range from 0 to 1. A model whose R-squared is close to 1 is considered a good fit, and conversely a model whose R-squared is close to 0 is considered a bad fit. This assumes any outliers have been dealt with.

However, this is not always the case; it depends on the problem we are solving. In some fields it is entirely expected that the R-squared value will be low. For example, any field that attempts to predict human behaviour, such as psychology, typically has R-squared values below 0.5 (50%): the actions and thoughts of humans are simply harder to predict.
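As a quick check, the R-squared for our six-point example from Practical 1 can be computed directly; squaring the rvalue returned by scipy's linregress is one convenient way (a sketch):

```python
import numpy as np
from scipy import stats

x = np.array([1, 3, 10, 16, 26, 36])
y = np.array([42, 50, 75, 100, 150, 200])

result = stats.linregress(x, y)
r_squared = result.rvalue ** 2  # coefficient of determination
print(round(r_squared, 3))  # very close to 1: a good fit for this data
```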

I hope by now you have a basic understanding of Linear Regression.

What we have seen above is an example of Simple Linear Regression. Just imagine if we had multiple independent variables: working through this manually would be a difficult and tedious task, but it is easily handled by Python and many other programming languages.

Practical 2

Let’s do some hands-on work with an actual dataset. The best thing is we won’t have any difficulty finding one: scikit-learn provides datasets for our use.

We have already seen Simple Linear Regression, so this time we will work on Multiple Linear Regression.

You can follow along or directly check the source code, link is provided at the end of the paper.

We will use the sklearn Boston house-prices dataset in our regression. There are 506 instances (records or observations), 13 attributes or features, and 1 target.

As a general practice, first gather the dataset, which we have already done.

The next step is to load the required libraries, along the lines of:

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Now load the Boston dataset into a variable. Here I am using boston as a variable to hold the entire dataset:

boston = load_boston(return_X_y=False)

(Note: load_boston was removed in scikit-learn 1.2; with a recent version you will need an older scikit-learn or an alternative copy of the dataset.)

Note we have used the parameter return_X_y=False, so the whole dataset is returned in a single variable; if we set return_X_y to True, the result is returned as two variables, X and y.

The Boston dataset is dictionary-like, holding keys and the values for each key. We can view the keys with the keys() method.

boston.keys()

Though this is not specific to linear regression, it is important to know how to work with a dataset in dictionary format, whether it comes from the sklearn library or any other source.

As we can see from the output, there are 506 rows of data and 13 columns, so we must know what data the 13 columns contain. The feature_names attribute provides the names of the columns.

This can be verified using the below code.

print(boston.feature_names)

The DESCR attribute provides the dataset characteristics for the Boston dataset.

print (boston.DESCR)

Now it’s time to convert the dataset held in the boston variable into a pandas DataFrame.

boston_df = pd.DataFrame(boston.data)

The resulting DataFrame has no column names in it, so let’s add them by assigning the feature_names attribute to the columns of boston_df.

boston_df.columns = boston.feature_names
boston_df.head()

 

From the results we can see that we now have column names, and there are 13 columns, which are our features. So Feature 1 (X1) is CRIM, Feature 2 (X2) is ZN, and so on.

As already mentioned, to get details on the features, please check the DESCR attribute.

Now we have all the features, but we have not yet talked about the target. So here we go:

boston.target

holds the target column, which is the final Boston housing price based on the features.

Now we can split the data into X and y:

X = boston.data
y = boston.target

Next we split X and y into training and test datasets so that we can train the model and then test or validate it. As usual, we will use 75% for the training set and 25% for the test set.

# splitting X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

The next step is to create the instance, or object, for our Linear Regression model, which is similar to what we did in Practical 1.

lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

Also, we can get the slope and intercept value.

m = lr.coef_

print('Value for m (slope): \n', m)

c = lr.intercept_

print ('Value of c (Intercept) ', c)
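To put a number on the model's accuracy, sklearn's r2_score (or equivalently lr.score(X_test, y_test)) can be used. A tiny illustration with made-up actual and predicted values (not taken from the Boston run):

```python
from sklearn.metrics import r2_score

# Invented actual and predicted values, purely to show the call.
y_actual = [24.0, 21.6, 34.7, 33.4, 36.2]
y_predicted = [25.0, 22.0, 33.5, 34.0, 35.5]

r2 = r2_score(y_actual, y_predicted)
print(round(r2, 3))  # → 0.981
```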

And finally we arrive at the predicted values, which we can visualise using various graphing methods. Here I am using simple matplotlib pyplot and seaborn (the plots are not reproduced here).

Note: you cannot plot graphs for multiple regression in the same way we did for Simple Linear Regression. The dimensions of the graph increase as the number of features increases; in our case, X has 13 features.

I hope by now you have a better understanding of Linear Regression.

For Logistic Regression, please visit:

https://datascience.foundation/sciencewhitepaper/understanding-logistic-regression-with-python

Source Code

For Practical 1: https://colab.research.google.com/drive/1riQbz0VgGbgG3GhJo7cC8iEkXQcHu-1R

For Practical 2: https://colab.research.google.com/drive/1izDUnGYXhCqOQgjLGlJnsMTQbSJqZRIz

Boston Housing Dataset:

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston

Scikit-learn site for other datasets:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
