Code has been added to clipboard!

How to Use the Pandas Groupby for Grouping Data and Applying Functions

Reading time 4 min
Published Nov 28, 2019
Updated Nov 28, 2019

TL;DR – Pandas groupby is a function in the Pandas library that groups data according to different sets of variables. In this case, splitting refers to the process of grouping data according to specified conditions. Applying refers to the function that you can use on these groups. Combining means that you form results in a data structure.

What is the Pandas groupby function?

Pandas groupby is a function for grouping data objects into Series (columns) or DataFrames (a group of Series) based on particular indicators. In simpler terms, group by in Python makes the management of datasets easier since you can put related records into groups.

Note: essentially, it is a map of labels intended to make data easier to sort and analyze.

Using the groupby function: syntax rules

The basic Python groupby syntax typically consists of clauses such as from, group by, select, and aggregation.

To understand what each of them means, let’s first look at the syntax:

Table_name.groupby(['Group'])['Feature'].aggregation()

Which can be broken down into these parts:

  • Table_name: this would be the name of the DataFrame, the source of the data you are working on.
  • groupby: the group by in Python is for sorting data based on different criteria. In this case, the condition is Group.
  • Feature: the part of the data or feature you want to be inserted in the computation.
  • aggregation(): the specific function name or aggregation you wish to execute with this operation.

Note: before using Python groupby function, you need to prepare the Pandas library. For instance, you can get Anaconda, and most of the necessary modules are already installed.

Splitting, applying and combining

How to split using Pandas groupby?

Splitting with groupby works by dividing a DataFrame into several categories and assigning labels to each one.

Note: frequently, developers mention split-apply-combine technique. It means that you divide your data into groups based on specific conditions, then you apply some changes to each group and combine old and new data.

In the following example, we are creating a DataFrame with some information about employees (their age, city, and hours they have worked):

Example
import pandas as pd
import numpy as np
df = pd.DataFrame( {     
"Employee" : ["Susan", "Bart", "Emily", "Charles", "David", "Charles", "Julia", "Bart"] ,           
"City" : ["London", "London", "Philadelphia", "London", "London", "Philadelphia", "London", "Philadelphia"] ,
"Age" : [20, 40, 18, 24, 37, 40, 44, 20 ],
"Hours" : [24, 40, 50, 36, 54, 44, 41, 35]} ) 
df

In the next snapshot, you can see how the data looks before we start applying the Pandas groupby function:

Now, we can use the Pandas groupby() to arrange records in alphabetical order, group similar records and count the sums of hours and age:

df.groupby(['Employee']).sum()

Here is an outcome that will be presented to you:

Applying functions with groupby

In this example, we will use this Python group by function to count how many employees are from the same city:

df.groupby('City').count()

In the following example, we add the values of identical records and present them in ascending order:

Example
df.groupby('Employee')['Hours'].sum().to_frame().reset_index().sort_values(by='Hours')

Here is the output:

You can also import matplotlib.pyplot to visualize your data in graphs. For instance, the following example visualizes the age and hours from the table. Add the following code to the first example before visualizing data:

import matplotlib.pyplot as plt

The following code will put data into a bar chart:

Example
plt.clf()
df.groupby('Employee').sum().plot(kind='bar')
plt.show()

The next example will display values of every group according to their ages:

df.groupby('Employee')['Age'].apply(lambda group_series: group_series.tolist()).reset_index()

The following example shows how to use the collections you create with Pandas groupby and count their average value. It keeps the individual values unchanged.

df.groupby(['Employee']).mean()

You can also find the number of even numbers in your groups. However, before you can complete this task with the Python group by function, you need to define the method for it to work:

Example
def count_even_numbers(series):
  return len([elem for elem in series if elem % 2 == 0 ])
df.groupby('Employee')['Age'].apply(count_even_numbers).reset_index(name='num_even_numbers')

Filtering

You can filter data according to the age of people as well. For instance, this code will only include people that are younger than 30:

Example
df_filtered = df.query('Age < 30')
print(df_filtered)

The filtering operation selects data groups by using true/false conditions. Here’s another example of a simple DataFrame that consists of employee, salary, and year variables:

Example
import pandas as pd
import numpy as np
df = pd.DataFrame( {     
"Employee" : ["Susan", "Kevin", "Charles", "David", "Ben"] ,           
"Salary" : [60000, 35000, 31000, 10000, 20000] ,
"Year" : [2019, 2019, 2019, 2019, 2019]} ) 
df

The following code will filter employees according to their salary:

Example
df_filtered = df.query('Salary > 30000')
print(df_filtered)

Here is the output after we apply the filtering function:

Combining results

You might need to group data, apply a specific function on these collections, and then place it with the original data. For this purpose, you can use transform(). In the following example, we are grouping employees according to their age, adding values together and including a column of the sum to the table:

Example
df['sum']=df.groupby(['Employee'])['Age'].transform('sum') 
df

Append
Array
Class
Command Line Arguments
Comment
Enumerate
Functions
Groupby
If... else
Map
Not Equal
Print
Queue
Random
Range
Round
Set
Sorting Lists
Split
Time
Train_test_split
Variables
While Loop