It has been a while since my official work kept me busy, but I finally got some time to write another article on Data Analysis with Pandas. This is a continuation of the previous article, Python for Data Analysis (https://datascience.foundation/sciencewhitepaper/python-for-data-analysis), so please do refer to it. In the previous part we went through the Python fundamentals; in this article we will walk through the next step, which is understanding our dataset using Pandas. Before working on any Machine Learning algorithm, one must understand the data and the problem. The data analysis step can be considered EDA (Exploratory Data Analysis), Data Visualization, or both. The purpose of the step is understanding, and sometimes visualization makes the data much easier to see. However, I will not focus on the visualization part for now; we will see it in part 3.
Here the main focus is on how to get the most out of the dataset.
Disclaimer: all the code is copied and modified from the official Pandas documentation. The motive is to gather the very basic Pandas methods and attributes in one place; they are essential to know, and they make data analysis a lot easier.
We will be using a sample weather dataset to demonstrate the use of each method or attribute for data analysis; this dataset can be downloaded from the link at the end of this article.
Let us begin and get our hands dirty.
If you are new to Python and Machine Learning and are wondering what setup is required, then don't worry: just refer to the article https://datascience.foundation/datatalk/setting-up-a-python-jupyter-notebook-online-working-with-python-on-the-cloud and you will not have to install anything on your local workstation.
As usual, we first have to import the library; as we are working with Pandas, let's import the Pandas library.
At the time I created the Python notebook, I was using pandas version 1.1.5.
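For example (the version printed will depend on your own installation):

```python
import pandas as pd

# Check which version of pandas is installed; the article's notebook used 1.1.5
print(pd.__version__)
```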
Next is to load the dataset.
To load the dataset, we have various options based on the source format. Here the source format is csv, so we will be using the Pandas read_csv() method, which reads the csv and loads the entire dataset into a Data Frame.
So, in case you are not sure what a Data Frame is --
A Pandas Data Frame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dictionary-like container for Series objects. This is the primary data structure of Pandas.
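A minimal, self-contained sketch of loading a csv: the file name and the two sample rows below are made up for illustration, and we write the tiny file first only so the snippet runs anywhere; the real weather dataset is read in exactly the same way.

```python
import pandas as pd

# Write a tiny sample csv (hypothetical rows) so the snippet is self-contained
with open("weather_sample.csv", "w") as f:
    f.write("Date/Time,Weather,Wind Speed_km/h\n"
            "1/1/2012 0:00,Fog,4\n"
            "1/1/2012 1:00,Clear,7\n")

# read_csv() parses the file and loads it into a Data Frame
df = pd.read_csv("weather_sample.csv")
print(type(df))  # <class 'pandas.core.frame.DataFrame'>
```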
The general rule of thumb in any Machine Learning problem is to know and understand your data. Get the basic information out of it, such as:
- How many data points or observations does our dataset have?
- Does it have any missing values?
- How many columns / features does our dataset have?
- What is the data type of each column?
- and so on.
Based on this, the next step is decided.
You may call this step by any name you like, such as EDA (Exploratory Data Analysis), Understanding Data, Exploring Data, Performing Data Analysis, etc.
The purpose here is to get a basic understanding of the dataset, and the problem we are trying to solve.
Without wasting time, let us see what methods Pandas provides for exploring the data.
The very first method is head().
The first thing to do after loading the data is to see how it looks; for this we need a glance at the first few data points, and for this we have a method called head().
Pandas head() method is used to return top n (5 by default) rows of a data frame or series.
We can see the first (top) 5 records; as we did not provide any argument to the head() method, it defaults to 5, thus returning the first 5 records.
Pandas Data Frames are indexed from 0, so we have indexes 0 to 4, which is a total of 5 records.
When we provide 8 as the argument to the head() method, it returns (displays) 8 records, with indexes from 0 to 7.
Feel free to provide any integer value and check the outcome.
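A sketch of head() on a hypothetical 6-row miniature of the weather dataset (the values are made up; the column names follow the dataset described in this article, and the real file has 8784 rows):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.head())   # first 5 rows (default n=5), indexes 0..4
print(df.head(2))  # first 2 rows, indexes 0..1
```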
Similar to the head() method, pandas has a tail() method to show the last n rows.
Pandas tail() method is used to return the last n (5 by default) rows of a data frame or series.
We can see the last 5 records; as we did not provide any argument, it defaults to 5, thus returning the last 5 records.
Here we see the last 2 records, as we requested only the last two.
Also, the last record is at index 8783.
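tail() can be sketched on the same hypothetical miniature dataset (made-up values; the real file's last index is 8783):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.tail())   # last 5 rows (default n=5)
print(df.tail(2))  # last 2 rows, indexes 4 and 5
```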
The .shape attribute shows the total number of rows (data points or observations) and the total number of columns of the data frame, i.e. (rows, columns).
Along with .shape, Pandas also provides .size and .ndim, which return the size and the number of dimensions, respectively, of data frames and series.
So, there are a total of 8784 records or data points with 8 columns.
From this alone we can say that the data frame is 2-dimensional and that its size is 8784 * 8 = 70272.
The same information can be checked using the Pandas .size and .ndim attributes.
So we can see that the size is just the number of rows multiplied by the number of columns, which can alternatively be written as df.shape[0] * df.shape[1].
We already knew that our data frame is 2-dimensional; ML problems usually have a 2-dimensional data frame, while when working on images (especially grayscale or coloured images) we have 3-dimensional arrays. Please do explore this if you are interested.
What if we need to verify the dimension of an individual column, i.e. a series? No problem, let's do it.
So, when we work with an individual column of the data frame it is returned as a Series, and a Series is 1-dimensional.
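The shape, size, and dimension checks can be sketched together on the hypothetical miniature dataset (made-up values):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.shape)            # (rows, columns) -> (6, 3)
print(df.size)             # rows * columns, i.e. df.shape[0] * df.shape[1]
print(df.ndim)             # 2 -- a DataFrame is 2-dimensional
print(df["Weather"].ndim)  # 1 -- a single column is a Series, 1-dimensional
```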
A Pandas Series is a one-dimensional labelled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It has to be remembered that, unlike a Python list, a Series always has a single data type.
The .index attribute provides the index, or range index, of the data frame.
In our sample dataset we know that the index has started from zero (0) which we validated using .head() and the last row has index as 8783, which we validated using .tail() method.
The same can be verified using the .index attribute.
Oops, here we got a RangeIndex which says that it starts from 0, agreed, but how come it stops at 8784 when we knew that the last row's index is 8783?
Here is the catch: with a range index, the stop value is excluded. Thus, when it says it stops at 8784, the actual index ends at 8784 - 1 = 8783.
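The exclusive stop value can be seen on the hypothetical miniature dataset (6 made-up rows, so stop is 6 while the last label is 5; on the real dataset stop is 8784 and the last label is 8783):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.index)      # RangeIndex(start=0, stop=6, step=1) -- stop is excluded
print(df.index[-1])  # the actual last label: stop - 1 = 5
```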
Next is the .columns attribute, which is used to get all the column names in the data frame.
Here we get the list of columns of the data frame.
There are times when we need to get the names of just the integer-type columns, or just the object-type columns, and so on; at those times the .columns attribute is very useful.
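A sketch of the .columns attribute (same hypothetical miniature dataset):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.columns)        # an Index object holding all column names
print(list(df.columns))  # ['Date/Time', 'Weather', 'Wind Speed_km/h']
```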
Pandas .dtypes attribute is used to get the data type of each column in the DataFrame.
Above result shows the data type of each and every column in the data frame.
What if we are just interested in one column, or a specific set of columns? Let's see that as well.
Here is the data type of the individual column “Wind Speed_km/h”; I have used Wind Speed_km/h as an example, and you can similarly use any column.
Here is the data type of multiple columns, which can be any subset of columns. When checking the data type of multiple columns, make sure to provide them in double brackets, i.e. df[['col1', 'col2']].
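The three dtype checks can be sketched as follows (hypothetical miniature dataset):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.dtypes)                   # dtype of every column in the frame
print(df["Wind Speed_km/h"].dtype) # dtype of a single column
# Subset of columns: note the double brackets [[ ]]
print(df[["Weather", "Wind Speed_km/h"]].dtypes)
```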
Pandas .unique() method is used to show all unique values for a given column. Remember that this .unique() method can be applied on a single column only, and not on the whole data frame.
Here we get an array of the unsorted unique (distinct) values of the given column.
When working with columns we could use it in two ways.
- df.Weather --> works when the column name does not have any spaces in it.
- df['Weather'] --> always recommended to use; easier to read.
The .unique() method above returns all unique values of the given column; as the input column Weather holds categorical data, the returned array is of object dtype.
So please do remember, the return can be:
- Index : when the input is an Index
- Categorical or Object : when the input is a Categorical dtype
- ndarray : when the input is a Series/ndarray
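A sketch of .unique() (hypothetical miniature dataset; the real Weather column has 50 distinct values):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

# .unique() works on a single column (a Series), not on the whole frame
print(df["Weather"].unique())  # ndarray of distinct values, in order of appearance
```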
Pandas .nunique() method is to show the total number of unique values in each column. As mentioned, this can be applied on the whole data frame as well as on a single or subset of columns.
Here I have used the nunique with the entire data frame. Now let’s see how to use it with a single column and then with a subset of columns.
As mentioned earlier, we can use either df.column_name or df['column_name'] for a single column, and on applying nunique() to the Weather column we get the count 50.
Similarly, we could apply on subset of columns as below.
The above two code snippets show that the total number of unique values in the Weather column is 50; when the column is checked as part of a subset, it shows the same count, along with the counts of all the other provided columns.
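All three usages of .nunique() can be sketched together (hypothetical miniature dataset, so the counts differ from the real 50):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.nunique())                # unique count per column, whole frame
print(df["Weather"].nunique())     # single column: 3 distinct values here
print(df[["Weather", "Wind Speed_km/h"]].nunique())  # subset of columns
```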
Pandas .count() method is used to show the total number of non-null values in each column.
The count() method can be applied to a single column as well as to the whole data frame.
From this we can easily say that there are no null or NaN values in our data frame: the data frame contains 8784 rows, and each column also has the same 8784 values (irrespective of duplicates or uniqueness).
As mentioned above, the .count() method can be applied on a single column as well, so let’s see the example of it.
Also getting the count for a subset of columns.
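A sketch of .count() (hypothetical miniature dataset with no missing values, like the real one):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.count())                             # non-null count per column
print(df["Weather"].count())                  # single column
print(df[["Weather", "Wind Speed_km/h"]].count())  # subset of columns
```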
isnull() & notnull()
While making a Data Frame from a csv file, many blank fields are imported as null values into the Data Frame, which later creates problems when operating on that data frame.
Sometimes our dataset itself contains missing data points, which when imported into the Data Frame will be converted into null values or NaN.
Pandas .isnull() and .notnull() methods are used to check and manage NULL values in a data frame.
.isnull() returns True wherever a NaN value is present, and False otherwise.
However, scanning such a big dataset of True/False values would be very tedious; to make it simpler, we can aggregate the result and check whether there are any null values by chaining .sum() on top of .isnull(), as below.
From the count() method we already knew that our dataset has no null values, which we have now also confirmed with the .isnull().sum() method.
Opposite to .isnull() we do have .notnull() method which will return True for every NON-NULL value and False for a null value.
Here also we could use .sum() aggregate function to get the count.
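A sketch of both methods on a tiny made-up sample with two deliberately missing values (the real weather dataset has none):

```python
import pandas as pd
import numpy as np

# Hypothetical sample with two deliberately missing values
df = pd.DataFrame({
    "Weather": ["Fog", None, "Clear"],
    "Wind Speed_km/h": [4.0, 7.0, np.nan],
})

print(df.isnull())         # True where a value is missing
print(df.isnull().sum())   # missing-value count per column
print(df.notnull().sum())  # non-null count per column
```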
Pandas .value_counts() is another method to show all the unique values, with their counts, of a given column. This method is applied to a single column only.
The resulting object will be in descending order so that the first element is the most frequently occurring element. Excludes NA or NaN values by default.
Above I mentioned that .value_counts() is applied to a single column, but what if we have a requirement to get the unique-value counts for a subset of columns, or for the whole data frame?
There are ways to solve such problems. We just have to either
- loop over each column, or
- use apply and pass pd.Series.value_counts.
For example, as below.
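A sketch of both the single-column call and the apply approach (hypothetical miniature dataset; on the real data the top Weather count is 1326 for "Clear"):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

# Single column: counts in descending order, most frequent value first
print(df["Weather"].value_counts())

# Subset of columns: apply value_counts column-wise; cells where a value
# does not occur in that column come out as NaN
print(df[["Weather", "Wind Speed_km/h"]].apply(pd.Series.value_counts))
```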
Pandas .info() method is used to show a concise summary of a DataFrame. This method displays basic information about a DataFrame including the
- index range
- column dtypes,
- count of non-null values
- memory usage
We have seen all these details above using individual Pandas methods and attributes; however, with the .info() method we can get them all with a single line of code.
So always remember there are multiple ways to solve a problem, it just depends on the tools we are using, and its performance matters as well.
For example, when we just need the shape of the data frame we can simply use the .shape attribute; however, when we need additional details such as the non-null count and data type of each column, then instead of running individual pieces of code one can use .info(), and with this single call we get all the required information.
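A sketch of .info() (hypothetical miniature dataset):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

# Prints the index range, column dtypes, non-null counts, and memory usage
df.info()
```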
Last but not least is the Pandas .describe() method, which is used to view some basic statistical details, such as the percentiles, mean, and standard deviation, of a data frame or a series of numeric values.
Remember that the .describe() method by default returns details for numeric columns only; however, we can check the details of object-type columns as well, in which case the method returns a different output that includes the total count of values, the number of unique values, and the top (most frequent) value with its frequency of occurrence for each object-type column.
If we observe the output carefully, we can easily see that the columns Weather and Date/Time are not displayed, as their data type is object.
Let’s display them now.
If you need to view all columns, including all data types, use describe with include='all'.
In that output, the object-typed columns (such as Weather and Date/Time) show count, unique, top, and freq, while the numeric (integer and float) columns show the statistical summary.
The .describe() method has two more parameters apart from include, namely percentiles and exclude. I recommend that you do check and explore them.
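A sketch of both forms of .describe() (hypothetical miniature dataset):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

print(df.describe())               # numeric columns only: count, mean, std, ...
print(df.describe(include="all"))  # all columns; object ones get unique/top/freq
```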
Now as we are familiar with some of the basic methods and attributes, let's try to answer the following questions.
Please note that there are multiple ways to solve any problem, and here we are just focusing on one or two of them. Please do write in the comments if you have used any other method, so that others, including me, can benefit from that code.
Question 1 : Determine how many times the Weather was "Clear"?
As stated above, there are multiple ways to solve any problem, let's check two approaches.
Approach 1 : value_counts()
Use of .value_counts() method to display all the counts, and just pick the count for "Clear".
From the above output we can say that the weather was "Clear" 1326 times.
However, this is not a good way to present the result; one cannot just show the entire output and ask the customer to pick out the value they need.
Approach 2 : Filter
Filter the required data, to get the count of specific value from a column.
df['Weather'] == 'Clear' returns a boolean value for each record of the data frame: True where the condition is satisfied, i.e. where the Weather is Clear, and False where it is not.
But is this what we are looking for? No: we need the count of the True values.
Perfect, this is what we are looking for. So, 1326 times the weather was "Clear".
If you want the subset of our data frame with "Clear" weather, then just use the code df[df['Weather'] == 'Clear'], it will return all columns which satisfy the condition.
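The filter approach can be sketched on the hypothetical miniature dataset (here "Clear" occurs 3 times; on the full dataset the answer is 1326):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

# Summing a boolean Series counts the True values (True counts as 1)
clear_count = (df["Weather"] == "Clear").sum()
print(clear_count)

# The matching rows themselves, all columns
print(df[df["Weather"] == "Clear"])
```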
Question 2 : On which days was the Wind Speed exactly 4 km per hour, and what was the Weather on those days?
In this example as well, we will use the filter approach.
Here you go: there are 474 records where the wind speed was 4 km per hour, and we have the date/time along with the weather for each of them.
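The same filter sketched on the hypothetical miniature dataset (4 matching rows here; 474 on the full dataset):

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

# Keep only the rows where the wind speed is exactly 4 km/h
four_kmh = df[df["Wind Speed_km/h"] == 4]
print(four_kmh[["Date/Time", "Weather"]])  # the days and the weather on them
```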
Question 3 : Get the list of columns which are of object type.
To solve this let's go with 2 different approaches as below.
Approach 1 : using .dtypes and .columns
Approach 2 : using the alternative .select_dtypes() method together with the .columns attribute.
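Both approaches sketched on the hypothetical miniature dataset:

```python
import pandas as pd

# Hypothetical 6-row miniature of the weather dataset (real file: 8784 rows)
df = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 2:00",
                  "1/1/2012 3:00", "1/1/2012 4:00", "1/1/2012 5:00"],
    "Weather": ["Fog", "Fog", "Clear", "Clear", "Snow", "Clear"],
    "Wind Speed_km/h": [4, 4, 7, 4, 9, 4],
})

# Approach 1: boolean-mask the dtypes Series, then take the column labels
obj_cols_1 = df.dtypes[df.dtypes == object].index.tolist()

# Approach 2: select_dtypes() keeps only object-typed columns
obj_cols_2 = df.select_dtypes(include="object").columns.tolist()

print(obj_cols_1)
print(obj_cols_2)
```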
Hurray, we just completed the basic data analysis step.
Today we just checked the basic methods and attributes from Pandas library, there are other ways as well to check the same or similar things. Another important part of data analysis is visualizing the data.
Once we are comfortable with this we could jump onto Data Cleaning; EDA; pre-processing of data.
I hope you have learned quite a bit today. Let me know if you want me to cover any specific topic related to data science, Machine learning, etc. To do so, kindly leave a comment on my blog and I will make sure to get back to you.
Also follow me on DSF & LinkedIn to get my blog updates, and follow me on Kaggle and subscribe to my notebook for hands-on practice. If you have any feedback on this article, please comment below!
Stay Happy, Stay Safe, Stay Fit, Stay Humble…!
Python Notebook can be accessed or downloaded from https://colab.research.google.com/drive/1-cY3OznSPDofG5nRHps4tS-5yoOOCWiQ?usp=sharing
Dataset can be downloaded from https://drive.google.com/file/d/1uklXAlxzGThMXWVB2OiHB3G-X1rsrnIs/view?usp=sharing