
Python Pandas is a powerful data manipulation library widely used in data science and machine learning. It provides a plethora of functionalities for performing various operations on data, including filtering rows and columns. Filtering is an important step in data analysis and it involves selecting a subset of data based on specific criteria or conditions.

  1. How to Filter Rows Based on a Single Condition
  2. How to Filter Rows Based on Multiple Conditions
  3. How to Filter Rows Based on a Condition with Logical Operators
  4. How to Filter Rows Based on a Condition with String Methods
  5. How to Filter Rows Based on a Condition with Regular Expressions
  6. How to Filter Rows Based on a Condition with a User-Defined Function
  7. How to Filter Columns Based on a Condition
  8. How to Filter Columns Based on a List of Column Names
  9. How to Filter Columns Based on a Condition with Regular Expressions
  10. How to Filter Rows and Columns Simultaneously

In this tutorial, we will explore different methods of filtering data in Pandas using conditional statements. We will learn how to filter rows based on one or multiple conditions, filter rows based on logical operators, filter rows based on string methods, filter rows based on regular expressions, and filter rows based on user-defined functions. Additionally, we will also learn how to filter columns based on specific conditions or a list of column names, and how to filter both rows and columns simultaneously.

How to Filter Rows Based on a Single Condition

Filtering rows based on a single condition involves selecting rows that meet a specific criterion. For example, we may want to select all rows in a dataset where a certain column value is greater than a certain threshold. In Pandas, we can achieve this using boolean indexing.

To filter rows based on a single condition, we can use the following syntax:

df[df['column_name'] condition]

Here, df refers to the Pandas DataFrame, column_name is the name of the column we want to filter, and condition is the condition that we want to apply. The condition can be any comparison operator such as >, <, >=, <=, ==, !=, or any other condition that evaluates to a boolean value.

For example, suppose we have a DataFrame containing information about students, and we want to select all rows where the student’s score is greater than or equal to 80. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'score': [75, 82, 89, 65, 95]}

df = pd.DataFrame(data)

# filter rows based on a single condition
filtered_df = df[df['score'] >= 80]

print(filtered_df)

Output:

      name  score
1      Bob     82
2  Charlie     89
4    Emily     95

In the above example, we first created a DataFrame containing student names and scores. We then filtered the rows based on the condition that the score should be greater than or equal to 80. The resulting DataFrame filtered_df contains only the rows that meet this condition.
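The same condition can also be expressed with the DataFrame.query() method, which takes the condition as a string; some readers find this more legible for simple filters. A brief sketch using the same student data:

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'score': [75, 82, 89, 65, 95]}
df = pd.DataFrame(data)

# query() takes the condition as a string expression over column names
filtered_df = df.query('score >= 80')

print(filtered_df)
```

This produces the same rows as the boolean-indexing version; query() is simply an alternative syntax.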

How to Filter Rows Based on Multiple Conditions

Filtering rows based on multiple conditions involves selecting rows that satisfy more than one criterion. For example, we may want to select all rows in a dataset where one column value is greater than a certain threshold and another column value is less than a certain threshold. In Pandas, we can achieve this by combining multiple conditions using the logical operators & (and) and | (or).

To filter rows based on multiple conditions, we can use the following syntax:

df[(df['column_name1'] condition1) & (df['column_name2'] condition2)]

Here, df refers to the Pandas DataFrame, column_name1 and column_name2 are the names of the columns we want to filter, condition1 and condition2 are the conditions that we want to apply to each column, and & is the logical operator for “and”. We can also use the | operator for “or”.

For example, suppose we have a DataFrame containing information about employees, and we want to select all rows where the employee’s salary is greater than or equal to 50000 and the employee’s department is “Sales”. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'salary': [60000, 55000, 48000, 70000, 65000],
        'department': ['Sales', 'Engineering', 'Sales', 'Finance', 'Sales']}

df = pd.DataFrame(data)

# filter rows based on multiple conditions
filtered_df = df[(df['salary'] >= 50000) & (df['department'] == 'Sales')]

print(filtered_df)

Output:

    name  salary department
0  Alice   60000      Sales
4  Emily   65000      Sales

In the above example, we first created a DataFrame containing employee names, salaries, and departments. We then filtered the rows based on the condition that the salary should be greater than or equal to 50000 and the department should be “Sales”. The resulting DataFrame filtered_df contains only the rows that meet both conditions.
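When one of the conditions checks a column against several allowed values, chaining == tests with | gets verbose; the isin() method expresses the same membership test in one call. A brief sketch using the same employee data:

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'salary': [60000, 55000, 48000, 70000, 65000],
        'department': ['Sales', 'Engineering', 'Sales', 'Finance', 'Sales']}
df = pd.DataFrame(data)

# isin() builds a single boolean mask for membership in a list of values
filtered_df = df[df['department'].isin(['Sales', 'Finance'])]

print(filtered_df)
```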

How to Filter Rows Based on a Condition with Logical Operators

Filtering rows based on a condition with logical operators involves selecting rows that meet a complex condition involving multiple criteria and logical operators. For example, we may want to select all rows in a dataset where a certain column value is greater than a certain threshold and another column value is less than a certain threshold, or where a third column value is equal to a certain value. In Pandas, we can achieve this by combining multiple conditions using the logical operators & (and), | (or), and ~ (not).

To filter rows based on a condition with logical operators, we can use the following syntax:

df[(condition1) & (condition2) | ~(condition3)]

Here, df refers to the Pandas DataFrame, condition1, condition2, and condition3 are the conditions that we want to apply, and & is the logical operator for “and”, | is the logical operator for “or”, and ~ is the logical operator for “not”. We can use parentheses to group conditions together and change the order of evaluation.

For example, suppose we have a DataFrame containing information about books, and we want to select all rows where the book’s price is greater than or equal to 50 and the book’s rating is greater than or equal to 4, or where the book’s author is not “J.K. Rowling”. We can use the following code:

import pandas as pd

data = {'title': ['The Great Gatsby', 'To Kill a Mockingbird', 'Harry Potter and the Philosopher\'s Stone', '1984', 'Animal Farm'],
        'author': ['F. Scott Fitzgerald', 'Harper Lee', 'J.K. Rowling', 'George Orwell', 'George Orwell'],
        'price': [25, 30, 40, 50, 60],
        'rating': [4.5, 4.7, 4.9, 3.8, 4.1]}

df = pd.DataFrame(data)

# filter rows based on a complex condition with logical operators
filtered_df = df[(df['price'] >= 50) & (df['rating'] >= 4) | ~(df['author'] == 'J.K. Rowling')]

print(filtered_df)

Output:

                   title               author  price  rating
0       The Great Gatsby  F. Scott Fitzgerald     25     4.5
1  To Kill a Mockingbird           Harper Lee     30     4.7
3                   1984        George Orwell     50     3.8
4            Animal Farm        George Orwell     60     4.1

In the above example, we first created a DataFrame containing book titles, authors, prices, and ratings. We then filtered the rows based on the condition that the price should be greater than or equal to 50 and the rating should be greater than or equal to 4, or the author should not be “J.K. Rowling”. The resulting DataFrame filtered_df contains only the rows that meet this complex condition.
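A common pitfall with these operators: &, |, and ~ bind more tightly than comparison operators such as >=, so each comparison must be wrapped in its own parentheses. A minimal demonstration:

```python
import pandas as pd

df = pd.DataFrame({'price': [25, 60], 'rating': [4.5, 4.1]})

# Correct: each comparison is parenthesized before combining with &
mask = (df['price'] >= 50) & (df['rating'] >= 4)

# Without the parentheses, Python would evaluate 50 & df['rating'] first,
# which is not the intended comparison and typically raises an error.
print(df[mask])
```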

How to Filter Rows Based on a Condition with String Methods

Filtering rows based on a condition with string methods involves selecting rows based on specific string operations applied to a column. For example, we may want to select all rows in a dataset where a certain column value contains a specific substring. In Pandas, we can achieve this by using the .str accessor to apply string methods to a column.

To filter rows based on a condition with string methods, we can use the following syntax:

df[df['column_name'].str.method_name(arguments)]

Here, df refers to the Pandas DataFrame, column_name is the name of the column we want to filter, str is the accessor used to apply string methods, and method_name is the string method we want to apply, such as contains(), startswith(), or endswith(). These methods return a boolean Series, which is then used as a mask to select the matching rows.

For example, suppose we have a DataFrame containing information about books, and we want to select all rows where the book’s title contains the substring “Harry”. We can use the following code:

import pandas as pd

data = {'title': ['The Great Gatsby', 'To Kill a Mockingbird', 'Harry Potter and the Philosopher\'s Stone', '1984', 'Animal Farm'],
        'author': ['F. Scott Fitzgerald', 'Harper Lee', 'J.K. Rowling', 'George Orwell', 'George Orwell'],
        'price': [25, 30, 40, 50, 60],
        'rating': [4.5, 4.7, 4.9, 3.8, 4.1]}

df = pd.DataFrame(data)

# filter rows based on a condition with string methods
filtered_df = df[df['title'].str.contains('Harry')]

print(filtered_df)

Output:

                                      title        author  price  rating
2  Harry Potter and the Philosopher's Stone  J.K. Rowling     40     4.9

In the above example, we first created a DataFrame containing book titles, authors, prices, and ratings. We then filtered the rows based on the condition that the title should contain the substring “Harry”. The .str.contains() method is applied to the 'title' column to create a boolean mask, and this mask is used to filter the rows of the DataFrame. The resulting DataFrame filtered_df contains only the rows that meet this condition.
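One practical caveat: if the column contains missing values, .str.contains() returns NaN for them and the resulting mask cannot be used for indexing. Passing na=False treats missing entries as non-matches. A brief sketch, assuming a title column with a missing value:

```python
import pandas as pd

df = pd.DataFrame({'title': ['Harry Potter', None, 'Animal Farm']})

# na=False makes missing titles count as "does not contain"
filtered_df = df[df['title'].str.contains('Harry', na=False)]

print(filtered_df)
```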

How to Filter Rows Based on a Condition with Regular Expressions

Filtering rows based on a condition with regular expressions involves selecting rows based on a specific pattern or set of patterns that can be defined using regular expressions. In Pandas, we can use the .str accessor along with regular expressions to apply string methods to a column.

To filter rows based on a condition with regular expressions, we can use the following syntax:

df[df['column_name'].str.contains(pattern, regex=True)]

Here, df refers to the Pandas DataFrame, column_name is the name of the column we want to filter, str is the accessor used to apply string methods, contains is the string method we want to apply, pattern is the regular expression pattern we want to match, and regex=True indicates that the pattern is a regular expression.

For example, suppose we have a DataFrame containing information about emails, and we want to select all rows where the email address contains the substring “gmail.com” followed by a period and a two-letter country code. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'email': ['alice@gmail.com', 'bob@yahoo.com', 'charlie@gmail.com.au', 'david@hotmail.com', 'emily@gmail.com.sg']}

df = pd.DataFrame(data)

# filter rows based on a condition with regular expressions
filtered_df = df[df['email'].str.contains(r'gmail\.com\.[a-z]{2}', regex=True)]

print(filtered_df)

Output:

      name                 email
2  Charlie  charlie@gmail.com.au
4    Emily    emily@gmail.com.sg

In the above example, we first created a DataFrame containing names and email addresses. We then filtered the rows based on the condition that the email address should contain “gmail.com” followed by a period and a two-letter country code. The regular expression gmail\.com\.[a-z]{2} matches any string containing the literal substring “gmail.com” followed by a period and any two lowercase letters; the dots are escaped with backslashes so they match literal periods rather than any character. The .str.contains() method is used to create a boolean mask based on this pattern, and this mask is used to filter the rows of the DataFrame. The resulting DataFrame filtered_df contains only the rows that meet this condition.
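Two related options are worth knowing: case=False makes the pattern case-insensitive, and .str.match() anchors the pattern at the start of the string rather than searching anywhere in it. A brief sketch with slightly varied email data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'email': ['alice@GMAIL.com', 'bob@yahoo.com', 'charlie@gmail.com.au']})

# case=False matches the pattern regardless of letter case
gmail_users = df[df['email'].str.contains(r'gmail\.com', case=False, regex=True)]

# match() only succeeds if the pattern matches from the start of the string
starts_with_bob = df[df['email'].str.match(r'bob')]

print(gmail_users)
print(starts_with_bob)
```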

How to Filter Rows Based on a Condition with a User-Defined Function

Filtering rows based on a condition with a user-defined function involves selecting rows based on a specific condition that is defined by a custom function. In Pandas, we can use the .apply() method to apply a function to each element of a column, and then use the resulting Series to filter the rows of the DataFrame.

To filter rows based on a condition with a user-defined function, we can use the following syntax:

df[df['column_name'].apply(function)]

Here, df refers to the Pandas DataFrame, column_name is the name of the column we want to filter, apply is the method used to apply the function to each element of the column, and function is the user-defined function that we want to apply to each element. The function should return a boolean value indicating whether the element meets the condition or not.

For example, suppose we have a DataFrame containing information about students, and we want to select all rows where the student’s score is above the average score. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'score': [75, 82, 89, 65, 95]}

df = pd.DataFrame(data)

# define a function to filter rows based on a condition
def above_average_score(score):
    return score > df['score'].mean()

# filter rows based on a condition with a user-defined function
filtered_df = df[df['score'].apply(above_average_score)]

print(filtered_df)

Output:

      name  score
1      Bob     82
2  Charlie     89
4    Emily     95

In the above example, we first created a DataFrame containing student names and scores. We then defined a function above_average_score() that takes a score as input and returns True if the score is greater than the average score, and False otherwise. The .apply() method is used to apply this function to each element of the 'score' column, creating a boolean mask. Finally, this mask is used to filter the rows of the DataFrame, selecting only the rows where the student’s score is above the average score.
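For a simple numeric condition like this one, .apply() is not strictly necessary: the same mask can be built with a single vectorized comparison, which avoids calling a Python function per element (and recomputing the mean on every call). A sketch on the same data:

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'score': [75, 82, 89, 65, 95]}
df = pd.DataFrame(data)

# one vectorized comparison over the whole column; mean() is computed once
filtered_df = df[df['score'] > df['score'].mean()]

print(filtered_df)
```

.apply() remains useful when the condition cannot be expressed with vectorized operations.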

How to Filter Columns Based on a Condition

Filtering columns based on a condition involves selecting columns that meet a specific criterion. For example, we may want to select only those columns in a dataset that have a certain data type. In Pandas, we can achieve this with the .select_dtypes() method.

To filter columns based on a condition, we can use the following syntax:

df.select_dtypes(include=data_types, exclude=data_types)

Here, df refers to the Pandas DataFrame, and include and exclude are parameters used to select columns by data type. Each accepts a single data type or a list of data types; we can pass either parameter or both, but the same data type cannot appear in both at once.

For example, suppose we have a DataFrame containing information about students, including their name, age, and scores in different subjects, and we want to select only the columns that contain integer values. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [18, 19, 20, 19, 18],
        'math_score': [75, 82, 89, 65, 95],
        'english_score': [80, 85, 87, 78, 92],
        'history_score': [88, 76, 91, 80, 83]}

df = pd.DataFrame(data)

# filter columns based on a condition
filtered_df = df.select_dtypes(include='int64')

print(filtered_df)

Output:

   age  math_score  english_score  history_score
0   18          75             80             88
1   19          82             85             76
2   20          89             87             91
3   19          65             78             80
4   18          95             92             83

In the above example, we first created a DataFrame containing student names, ages, and scores in different subjects. We then used the .select_dtypes() method to select only the columns that contain integer values. The resulting DataFrame filtered_df contains only the columns that meet this condition.
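If the goal is simply "all numeric columns" rather than a specific dtype such as int64, select_dtypes() also accepts the broader 'number' alias, which covers integers and floats alike. A brief sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'age': [18, 19],
                   'gpa': [3.5, 3.8]})

# 'number' selects every numeric column, whatever its exact dtype
numeric_df = df.select_dtypes(include='number')

print(numeric_df)
```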

How to Filter Columns Based on a List of Column Names

Filtering columns based on a list of column names involves selecting only the columns in a dataset that appear in a given list. In Pandas, we can achieve this with the .loc[] indexer.

To filter columns based on a list of column names, we can use the following syntax:

df.loc[:, column_names]

Here, df refers to the Pandas DataFrame, and column_names is a list of column names that we want to select. The loc[] method is used to slice the DataFrame and select only the columns specified in the list.

For example, suppose we have a DataFrame containing information about students, including their name, age, and scores in different subjects, and we want to select only the columns 'name' and 'math_score'. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [18, 19, 20, 19, 18],
        'math_score': [75, 82, 89, 65, 95],
        'english_score': [80, 85, 87, 78, 92],
        'history_score': [88, 76, 91, 80, 83]}

df = pd.DataFrame(data)

# filter columns based on a list of column names
filtered_df = df.loc[:, ['name', 'math_score']]

print(filtered_df)

Output:

      name  math_score
0    Alice          75
1      Bob          82
2  Charlie          89
3    David          65
4    Emily          95

In the above example, we first created a DataFrame containing student names, ages, and scores in different subjects. We then used the loc[] method to select only the columns 'name' and 'math_score' by passing a list of column names to the second parameter of the method. The resulting DataFrame filtered_df contains only the columns that are specified in the list.
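The same selection can also be written with plain bracket indexing, which is the more common shorthand; .loc[] becomes necessary when you want to restrict rows in the same step. A brief sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'age': [18, 19, 20],
                   'math_score': [75, 82, 89]})

# passing a list of column names inside [] selects just those columns
filtered_df = df[['name', 'math_score']]

print(filtered_df)
```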

How to Filter Columns Based on a Condition with Regular Expressions

Filtering columns based on a condition with regular expressions involves selecting only the columns in a dataset that match a specific pattern or set of patterns defined using regular expressions. In Pandas, we can achieve this using the .filter() method along with a regular expression pattern.

To filter columns based on a condition with regular expressions, we can use the following syntax:

df.filter(regex=pattern)

Here, df refers to the Pandas DataFrame, and pattern is the regular expression pattern that we want to match against the column names. The .filter() method is used to select only the columns that match the specified regular expression pattern.

For example, suppose we have a DataFrame containing information about students, including their name, age, and scores in different subjects, and we want to select only the columns that contain the word “score” in their names. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [18, 19, 20, 19, 18],
        'math_score': [75, 82, 89, 65, 95],
        'english_score': [80, 85, 87, 78, 92],
        'history_score': [88, 76, 91, 80, 83]}

df = pd.DataFrame(data)

# filter columns based on a condition with regular expressions
filtered_df = df.filter(regex='score')

print(filtered_df)

Output:

   math_score  english_score  history_score
0          75             80             88
1          82             85             76
2          89             87             91
3          65             78             80
4          95             92             83

In the above example, we first created a DataFrame containing student names, ages, and scores in different subjects. We then used the .filter() method to select only the columns that contain the word “score” in their names, by passing the regular expression pattern 'score' to the regex parameter. The resulting DataFrame filtered_df contains only the columns that meet this condition.
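For a plain substring match like this, .filter() also accepts a like parameter, which avoids regular-expression syntax entirely; a brief sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'],
                   'math_score': [75, 82],
                   'english_score': [80, 85]})

# like= keeps every column whose name contains the given substring
filtered_df = df.filter(like='score')

print(filtered_df)
```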

How to Filter Rows and Columns Simultaneously

Filtering rows and columns simultaneously involves selecting only the rows and columns that meet specific criteria or conditions. In Pandas, we can achieve this using boolean indexing and the .loc[] method.

To filter rows and columns simultaneously, we can use the following syntax:

df.loc[row_condition, column_condition]

Here, df refers to the Pandas DataFrame, row_condition is the condition that we want to apply to the rows, and column_condition is the condition that we want to apply to the columns. The .loc[] method is used to slice the DataFrame and select only the rows and columns that meet the specified conditions.

For example, suppose we have a DataFrame containing information about students, including their name, age, and scores in different subjects, and we want to select only the rows where the student’s score is above the average score, and only the columns 'name', 'age', and 'math_score'. We can use the following code:

import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [18, 19, 20, 19, 18],
        'math_score': [75, 82, 89, 65, 95],
        'english_score': [80, 85, 87, 78, 92],
        'history_score': [88, 76, 91, 80, 83]}

df = pd.DataFrame(data)

# filter rows and columns simultaneously
filtered_df = df.loc[df['math_score'] > df['math_score'].mean(), ['name', 'age', 'math_score']]

print(filtered_df)

Output:

      name  age  math_score
1      Bob   19          82
2  Charlie   20          89
4    Emily   18          95

In the above example, we first created a DataFrame containing student names, ages, and scores in different subjects. We then used boolean indexing with the .loc[] method to select only the rows where the student’s score in math is above the average score, and only the columns 'name', 'age', and 'math_score'. The resulting DataFrame filtered_df contains only the rows and columns that meet these conditions.
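The same result can be obtained by chaining query() with a column list, which some readers find easier to scan; the @ prefix lets the query string reference a local Python variable. A brief sketch on the same data:

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [18, 19, 20, 19, 18],
        'math_score': [75, 82, 89, 65, 95]}
df = pd.DataFrame(data)

# compute the threshold once, then reference it in the query string via @
avg = df['math_score'].mean()
filtered_df = df.query('math_score > @avg')[['name', 'age', 'math_score']]

print(filtered_df)
```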
