An introduction to Pandas and its capabilities
Before you can venture into machine learning or its potential use cases, the fundamental prerequisite is data. As a data scientist, you'll find Pandas one of the best weapons in your arsenal for wrangling data.
So what is Pandas?
Pandas is an open-source library that is part of the Python ecosystem. It offers a flexible way of exploring, cleaning, and manipulating data, with operations similar to those in Microsoft Excel or SQL.
Pandas can load and process data from different formats such as CSV, SQL queries, Excel, web pages, JSON, and pickle files. Now, let’s get started with Pandas.
If you have Anaconda with an environment already set up, just run the commands below in your terminal:
source activate <environment_name>
conda install pandas
However, if you don't use Anaconda, the alternative is:
pip install pandas
You can also use virtualenv to install the libraries with the pip command above.
Data Structures in Pandas
Pandas supports two data structures: Series and DataFrames. Let's start with Series.
A Series is Pandas' representation of a one-dimensional array.
A Pandas Series takes in an input of the format:
series_example = pd.Series(data, index=index)
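For instance, here is a minimal sketch of building a Series from a list (the product names and quantities are made up for illustration):

```python
import pandas as pd

# Hypothetical grocery quantities, indexed by product name
quantities = pd.Series([10, 5, 8], index=["Apples", "Bananas", "Carrots"])
print(quantities["Bananas"])  # 5
```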
Now, let's see what happens when we pass np.array as the input instead of a list:
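A quick sketch with illustrative values — the Series keeps the NumPy array's dtype:

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, 2.5, 3.5])
s = pd.Series(arr, index=["a", "b", "c"])
# The Series inherits the array's dtype (float64 here)
print(s.dtype)
```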
You get the idea; a Series is a representation of one-dimensional data with an index. In the format mentioned above, the index is an optional parameter; it is not mandatory to pass one in. If no index is passed, Pandas will default it to [0, 1, ..., len(data) - 1]. For example:
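A minimal illustration of the default index (sample values are made up):

```python
import pandas as pd

# No index passed, so Pandas assigns 0, 1, 2 automatically
s = pd.Series(["milk", "eggs", "bread"])
print(list(s.index))  # [0, 1, 2]
```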
A Pandas Series can also take a dictionary as input, automatically indexing the dictionary's values by their respective keys so that the values can easily be retrieved. For example:
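A sketch with a hypothetical grocery inventory dictionary:

```python
import pandas as pd

inventory = {"Zucchini": 12, "Parsley": 4, "Tomatoes": 30}
s = pd.Series(inventory)  # keys become the index
print(s["Parsley"])  # 4
```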
In the above example, if we'd like to shorten the indexes, we can do this:
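The original code for this step isn't available; one common reading is passing an explicit index to keep only a subset of the dictionary's keys, sketched here with the same hypothetical data:

```python
import pandas as pd

inventory = {"Zucchini": 12, "Parsley": 4, "Tomatoes": 30}
# Passing an index keeps only the listed keys
s = pd.Series(inventory, index=["Zucchini", "Parsley"])
print(list(s.index))  # ['Zucchini', 'Parsley']
```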
The cool part about a Pandas Series is not just that list items can be indexed (automatically or not), but that objects of any type can be indexed, and a whole set of operations can be performed on the Series object to gain insight into the input data. For example:
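A few such operations, sketched on the hypothetical inventory Series:

```python
import pandas as pd

quantities = pd.Series({"Zucchini": 12, "Parsley": 4, "Tomatoes": 30})
print(quantities.sum())   # 46
print(quantities.mean())  # 15.333...
print(quantities.max())   # 30
doubled = quantities * 2  # element-wise arithmetic, index preserved
```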
That was a high-level overview of Pandas Series objects. Now, let's talk about Pandas DataFrames. Extending the grocery inventory example, let's pretend that some other data resides in a CSV that looks like this:
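The original file isn't included here, but based on the details mentioned later (a duplicate Zucchini row, a misspelled "Parsely", and a stray space in a column name), a hypothetical version might look like this:

```
Product,Category ,Quantity
Zucchini,Vegetables,12
Zucchini,Vegetables,12
Parsely,Herbs,4
Tomatoes,Vegetables,30
```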
The following code will help us load the data and get some initial insights.
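A self-contained sketch of that loading step; the file contents and name are assumptions, so an in-memory buffer stands in for the real CSV here:

```python
import io
import pandas as pd

# Stand-in for the article's CSV; in practice this would be
# pd.read_csv("grocery_inventory.csv") with an actual file.
csv_data = io.StringIO(
    "Product,Category ,Quantity\n"
    "Zucchini,Vegetables,12\n"
    "Zucchini,Vegetables,12\n"
    "Parsely,Herbs,4\n"
    "Tomatoes,Vegetables,30\n"
)
df = pd.read_csv(csv_data)
print(df.head(5))  # first five rows
df.info()          # entry count, column names, dtypes
```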
What just happened there? The first line loads the data from the CSV into a Pandas DataFrame; now the data is in memory. We verify this using the head() method. Notice how we pass a value of 5 as a parameter to the method; this limits the view to the first five rows of the DataFrame. We can pass any number between 0 and len(dataframe); if we pass a greater number, Pandas will still limit the results to the length of the DataFrame. Next, the info() method tells us how many entries there are, the number of columns, and their data types. By default, Pandas treats the first row of the CSV file as the column names.
To get some stats we can use the describe() method. Pretty cool!
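A minimal sketch of describe(), which summarizes the numeric columns (count, mean, standard deviation, min, quartiles, max); the data here is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Parsley", "Tomatoes"],
                   "Quantity": [12, 4, 30]})
stats = df.describe()  # summary statistics for numeric columns
print(stats)
```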
The following code snippet helps us to retrieve the list of columns in our DataFrame. Also, after realizing there is an extra space in one of the column names, we can immediately correct it using the rename() method.
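A sketch of both steps, assuming the stray space is in a "Category " column as in the hypothetical data above:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Parsley"],
                   "Category ": ["Vegetables", "Herbs"]})  # note the stray space
print(df.columns.tolist())  # reveals the badly named column
df = df.rename(columns={"Category ": "Category"})
print(df.columns.tolist())
```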
Now, how do we know what the unique products and categories in this data are? Answer: by using the unique() method, whose results we can easily convert to a list.
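A minimal sketch with illustrative data — unique() returns the distinct values in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Zucchini", "Tomatoes"],
                   "Category": ["Vegetables", "Vegetables", "Vegetables"]})
products = df["Product"].unique()   # NumPy array of distinct values
print(list(products))  # ['Zucchini', 'Tomatoes']
print(list(df["Category"].unique()))  # ['Vegetables']
```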
But in the original data we had a duplicate for Zucchini. We can get rid of that duplicate by using the drop_duplicates() method.
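Sketched on a small illustrative frame containing the duplicate:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Zucchini", "Parsley"],
                   "Quantity": [12, 12, 4]})
df = df.drop_duplicates()  # drops rows that repeat all column values
print(len(df))  # 2
```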
If there were any null entries in this DataFrame, we could use the dropna() method to get rid of null or NaN entries. We can also use isna() or isnull() to determine whether there are empty entries in the first place. In addition, we can use fillna() to replace those entries with a value that suits our use case.
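A sketch of all three on hypothetical data with deliberate gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", None, "Parsley"],
                   "Quantity": [12, np.nan, 4]})
print(df.isna().sum())               # missing values per column
filled = df.fillna({"Quantity": 0})  # replace missing quantities with 0
cleaned = df.dropna()                # or drop rows with any missing entry
print(len(cleaned))  # 2
```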
There is another typo here for Parsley in the column value. We can fix that as well!
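A sketch of one way to do it with replace(); the misspelling "Parsely" is assumed from context:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Parsely", "Tomatoes"]})
# Fix the misspelled product name in place
df["Product"] = df["Product"].replace("Parsely", "Parsley")
print(df["Product"].tolist())
```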
As you can see, we can do a lot of data cleaning with Pandas.
Another interesting feature of Pandas is the apply() method, which lets us apply a custom function that we define ourselves. For example, let's define a method called set_availability() that returns availability depending on the Quantity value and apply it this way:
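The article's exact rule isn't shown, so this sketch assumes a simple threshold (anything in stock counts as available):

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Parsley"],
                   "Quantity": [12, 0]})

def set_availability(quantity):
    # Hypothetical rule: positive quantity means the product is available
    return "Available" if quantity > 0 else "Out of stock"

df["Availability"] = df["Quantity"].apply(set_availability)
print(df)
```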
The sort_values method lets us view our data in ascending or descending order. Also, the groupby method groups data elements based on the group passed in, and performs aggregations on those groups.
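Both sketched on illustrative inventory data — sorting by quantity, then totaling quantities per category:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Vegetables", "Herbs", "Vegetables"],
                   "Quantity": [12, 4, 30]})
by_quantity = df.sort_values("Quantity", ascending=False)
totals = df.groupby("Category")["Quantity"].sum()  # aggregate per group
print(totals["Vegetables"])  # 42
```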
Want some visualizations of the data? (Note: We need matplotlib installed for this)
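A minimal sketch of a bar chart from the illustrative data; the non-interactive Agg backend is selected here only so the snippet runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Parsley", "Tomatoes"],
                   "Quantity": [12, 4, 30]})
ax = df.plot.bar(x="Product", y="Quantity", title="Inventory by product")
plt.savefig("inventory.png")  # hypothetical output filename
```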
Write to the local file system
Once we are happy with our data cleaning and manipulations, we can write the cleaned data in a CSV format like this:
Or we can write it as JSON. It is usually best practice to load the file again to check that everything is as expected.
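A sketch of both writes plus the round-trip check; the output filenames are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Zucchini", "Parsley"],
                   "Quantity": [12, 4]})
df.to_csv("cleaned_inventory.csv", index=False)  # hypothetical filename
df.to_json("cleaned_inventory.json")

# Best practice: reload and confirm the data round-tripped intact
check = pd.read_csv("cleaned_inventory.csv")
print(check.equals(df))  # True
```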
To recap what we learned about the capabilities of Pandas DataFrames:
- Loaded the data representing the inventory of a grocery store.
- Got insights into how the data looks and retrieved metrics using Pandas' built-in methods.
- Fixed some typos on column names and column values.
- Defined our own custom methods, applied them to existing data, and created new columns.
- Performed visualizations on the data.
- Rewrote the cleaned data back into the local file system.
Some more interesting concepts are concatenating, merging, and joining one or more Pandas DataFrames (see the Pandas documentation for more information).
In this article, we have barely scratched the surface of the capabilities offered by Pandas. It helps immensely in cleaning, manipulating, and exploring data. Hopefully this article gives you a head start with Pandas. Now, enjoy diving deeper into its capabilities with your own data!