Count Rows With Null Values in PySpark

Missing values in tabular data are a common problem. When we load tabular data with missing values into a pyspark dataframe, the empty values are replaced with null values. In this article, we will discuss how to count rows with null values in a given pyspark dataframe.

Count Rows With Null Values in a Column in PySpark DataFrame

To count rows with null values in a column in a pyspark dataframe, we can use the following approaches.

  1. Using the filter() method with the isNull() and count() methods
  2. Using the where() method with the isNull() and count() methods
  3. Using the SQL IS NULL clause with the COUNT() function

Let us discuss all these approaches one by one.
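
As a quick reference, here is a minimal sketch showing all three approaches side by side. It assumes a SparkSession named spark and a dataframe dfs containing a Physics column, as used in the examples that follow.

# 1. filter() with isNull()
null_count = dfs.filter(dfs.Physics.isNull()).count()

# 2. where() with isNull()
null_count = dfs.where(dfs.Physics.isNull()).count()

# 3. Spark SQL with IS NULL and COUNT(*)
dfs.createOrReplaceTempView("df_sql")
null_count = spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NULL").first()[0]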

Count Rows With Null Values Using The filter() Method

To count rows with null values in a particular column in a pyspark dataframe, we will first invoke the isNull() method on the given column. The isNull() method returns a boolean mask column that is True for the rows in which the given column is null and False otherwise. We will pass this mask column object to the filter() method, which returns only the rows with null values in the given column.

Once we get the dataframe with rows having null values in the specified column, we will use the count() method to get the count of rows with null values.

You can observe this in the following example.

import pyspark.sql as ps

# Start a local SparkSession
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()

# Read the CSV file into a pyspark dataframe
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()

# Keep only the rows in which the Physics column is null
dfs=dfs.filter(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()

# count() now returns the number of rows with null Physics values
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics column is:
2

In the above example, we first read a csv file into a pyspark dataframe. Then, we used the isNull() method and the filter() method to filter rows with null values in the Physics column. In the output, you can observe that the final dataframe returned by the filter() method contains only those rows in which the Physics column is null. Hence, the count() method returns the value 2.

Count Rows With Null Values Using the where() Method

Instead of the filter() method, you can also use the where() method to count rows with null values in a given column. The where() method also takes the mask returned by the isNull() method and returns a dataframe containing the rows where the mask column contains True.

We can invoke the count() method on the dataframe returned by the where() method to get the number of rows with null values as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics column is:
2

In this example, we have used the where() method instead of the filter() method. However, you can observe that the output remains the same.

Instead of the syntax used in the above examples, you can use the col() function with the isNull() method to create the mask containing True and False values. The col() function is defined in the pyspark.sql.functions module. It takes a column name as an input argument and returns the column object containing all the values in the given column in a pyspark dataframe.

We can invoke the isNull() method on the column object to select rows with null values from the pyspark dataframe. Then, we can use the count() method to get the count of rows in the dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics column is:
2

In this example, we have used the col() function to access the Physics column instead of referring to it through the dataframe name. You can observe that the number of rows with null values in the Physics column is still 2.

Get the Number of Rows With Null Values Using SQL Syntax

We can also use Spark SQL to get the number of rows with null values from a pyspark dataframe. For this, we will first create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.

After creating the view, we can use the SQL SELECT statement with the IS NULL clause and the COUNT(*) function to count rows with null values in a given column of the pyspark dataframe.

For this, we can execute the SQL query using the sql() function as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NULL")
print("Count of the rows with null values in Physics column is:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics column is:
+--------+
|count(1)|
+--------+
|       2|
+--------+

We first created a view of the input dataframe in the above example. This is because the sql() function executes queries against tables and views rather than directly against a pyspark dataframe. Then, we used the sql() function to execute the SQL query for retrieving the number of rows in which the Physics column is null. The sql() function takes the SQL query as its input and returns an output dataframe containing the count.

Instead of creating a view of the dataframe and executing the sql() function, you can pass the condition from the WHERE clause of the SQL statement directly to the filter() method as a string. Then, you can invoke the count() method on the dataframe returned by the filter() method to get the number of rows in which the given column has null values as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics column is:
2

In this example, we have passed the "Physics IS NULL" string to the filter() method. The filter() method treats the string as a SQL WHERE condition and returns an output dataframe containing only the rows in which the Physics column is null. Then, we get the count of the output dataframe using the count() method and print it.

Get the Number of Rows With Not Null Values in a Column

We can use the approaches used for counting rows with null values to count rows with not null values. The only difference is that we need to use the isNotNull() method in place of the isNull() method.
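
For instance, the two statements below (a minimal sketch assuming the same dfs dataframe with a Physics column) differ only in the method invoked on the column.

# Number of rows in which Physics is null
null_count = dfs.filter(dfs.Physics.isNull()).count()

# Number of rows in which Physics is not null
not_null_count = dfs.filter(dfs.Physics.isNotNull()).count()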

Count Rows With Not Null Values Using The filter() Method

To get the number of rows with no null values in a particular column in a pyspark dataframe, we will first invoke the isNotNull() method on the given column. The isNotNull() method will return a masked column containing True and False values. Next, we will pass the mask column object returned by the isNotNull() method to the filter() method. After this, we will get the dataframe with rows having no null values in the given column. Finally, we will use the count() method to get the number of rows in the dataframe returned by the filter() method.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics column is:
8

In this example, we used the isNotNull() method with the filter() method and the count() method to get the number of rows with not null values from the pyspark dataframe.

Get the Number Of Rows With Not Null Values Using the where() Method

Instead of the filter() method, you can also use the where() method to count rows with not null values in a given column. The where() method also takes the mask returned by the isNotNull() method and returns a dataframe containing the rows where the mask column contains True. We can use the count() method on the dataframe returned by the where() method to get the number of rows in which the specified column is not null.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics column is:
8

In this example, we used the where() method instead of the filter() method. However, the output remains the same.

Instead of the syntax used in the above examples, you can also use the col() function with the isNotNull() method to count rows with not null values in a pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics column is:
8

Get the Number of Rows With Not Null Values Using the dropna() Method

We can also use the dropna() method to get the number of rows with not null values in a single column from a pyspark dataframe. For this, we can invoke the dropna() method on the pyspark dataframe and pass the column name as input to the subset parameter. After execution of the dropna() method, we will get a dataframe with rows having no null values in the specified column. Then, we can use the count() method to get the number of rows having no null values in the dataframe in the given column.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropna(subset="Physics")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics column is:
8

In this example, we invoked the dropna() method on the input pyspark dataframe. Here, we have passed the column name Physics as input to the subset parameter in the dropna() method. Due to this, all the rows in which the Physics column contains null values are dropped from the dataframe. Thus, we get the required rows with no null values in the Physics column in the dataframe returned by the dropna() method. Finally, we use the count() method to get the number of rows in the dataframe returned by the dropna() method.
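
As a side note, the subset parameter of dropna() also accepts a list of column names, so the same idea extends to several columns at once. The following minimal sketch (assuming the same dfs dataframe) counts the rows in which neither the Physics nor the Chemistry column is null.

# Drop rows that have a null in Physics or Chemistry, then count what remains
not_null_count = dfs.dropna(subset=["Physics", "Chemistry"]).count()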

Count Rows With Not Null Values Using SQL in a PySpark DataFrame

We can also use pyspark SQL syntax to get the number of rows with no null values in a column from a pyspark dataframe. For this, we will use the following steps.

  • First, we will create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
  • After creating the view, we can use the SQL SELECT statement with the IS NOT NULL clause and the COUNT(*) function to get the count of rows with not null values in the given column of the pyspark dataframe. For this, we can execute the SQL query using the sql() function.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NOT NULL")
print("Count of the rows with not null values in Physics column is:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics column is:
+--------+
|count(1)|
+--------+
|       8|
+--------+

In this example, we have used the sql() function to execute the sql statement for obtaining the number of rows with not null values in the Physics column.

Instead of using the above approach, you can pass the condition from the WHERE clause of the SQL statement directly to the filter() method as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics column is:
8

In this example, we passed the string "Physics IS NOT NULL" to the filter() method instead of executing the entire SQL query. However, the output remains the same.

Count Rows With Null Values in Multiple Columns in a DataFrame

To count rows with null values in multiple columns, we can combine the isNull() conditions for each column with logical operators inside the filter() or where() method. Then, we can use the count() method on the dataframe returned by the filter() method or the where() method to get the number of rows with null values. For this, we will use the following steps.

  • We will first invoke the isNull() method on all the required columns.
  • Next, we will enclose the mask columns returned by the isNull() method inside parentheses and combine them with the logical operators AND (&) and OR (|) to create a conditional statement.
  • After this, we will pass the conditional statement to the filter() method. The filter() method will return a dataframe containing only the rows that satisfy the combined condition.
  • Finally, we will use the count() method to get the number of rows in the dataframe returned by the filter() method.

After executing the above statements, we can count rows with null values in multiple columns in a pyspark dataframe. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())& (dfs.Chemistry.isNull()))
print("Rows with null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics and Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics and Chemistry column is:
0

In this example, we have passed the conditions (dfs.Physics.isNull()) and (dfs.Chemistry.isNull()), combined with the & operator, to the filter() method to filter rows in which both of these columns contain null values. Then, we have used the count() method. Observe that each condition is enclosed in parentheses. This keeps the combined condition unambiguous; when the individual conditions involve comparison operators such as > or ==, the parentheses are required because & and | bind more tightly than comparisons, and omitting them makes the program run into an error.

In the output, you can observe that we get an empty dataframe because there are no rows in which both the Physics and Chemistry columns are null. Hence, the count is 0.

In a similar manner, you can use the OR operator (|) to count rows in which the Physics column or the Chemistry column contains null values, as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics or Chemistry column is:
5

In the above example, we have used the filter() method with the conditions to filter rows with null values in Physics or Chemistry column from the pyspark dataframe. Then, we used the count() method to get the number of rows in the dataframe returned by the filter() method.

Instead of the filter() method, you can also use the where() method in the previous example as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics or Chemistry column is:
5

We can also use multiple conditional statements inside SQL syntax to get the number of rows with null values in multiple columns in a pyspark dataframe. For this, you can use the IS NULL clause with conditional operators in the WHERE clause of the SQL statement as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NULL OR Chemistry IS NULL")
print("Count of the rows with null values in Physics or Chemistry column is:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics or Chemistry column is:
+--------+
|count(1)|
+--------+
|       5|
+--------+

Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter() method as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL OR Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

Count of the rows with null values in Physics or Chemistry column is:
5

Suggested reading: PySpark vs Pandas

Get The Number of Rows with Not Null Values in Multiple Columns

To count rows with no null values in multiple columns, we can combine the isNotNull() conditions for each column with logical operators inside the filter() or where() method. For this, we will use the following steps.

  • We will first invoke the isNotNull() method on all the required columns.
  • Next, we will enclose the mask columns returned by the isNotNull() method inside parentheses and combine them with the logical operators AND (&) and OR (|) to create a conditional statement.
  • After this, we will pass the conditional statement to the filter() method. The filter() method will return a dataframe containing only the rows that satisfy the combined condition.
  • Finally, we will use the count() method to get the count of rows with not null values.

After executing the above statements, we can count rows with not null values in multiple columns in a pyspark dataframe. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with not null values in Physics and Chemistry columns is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics and Chemistry columns is:
5

Instead of the filter() method, you can also use the where() method in the previous example as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with not null values in Physics and Chemistry columns is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics and Chemistry columns is:
5

We can also use multiple conditional statements inside the SQL WHERE clause with the COUNT() function to get the number of rows with not null values in multiple columns in a pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Count of the rows with not null values in Physics and Chemistry columns is:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics and Chemistry columns is:
+--------+
|count(1)|
+--------+
|       5|
+--------+

Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter() method as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("count_null_example") \
      .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with not null values in Physics and Chemistry columns is:")
print(dfs.count())
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Count of the rows with not null values in Physics and Chemistry columns is:
5

Conclusion

In this article, we discussed different ways to count the number of rows with null values in a pyspark dataframe. To learn more about pyspark dataframes, you can read this article on how to sort a pyspark dataframe. You might also like this article on how to select distinct rows from a pyspark dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

PySpark OrderBy One or Multiple Columns

While working with pyspark dataframes, we often need to order the rows according to one or multiple columns. In this article, we will discuss different ways to orderby a pyspark dataframe using the orderBy() method.

The pyspark orderBy() Method

The orderBy() method in pyspark is used to order the rows of a dataframe by one or multiple columns. It has the following syntax.

df.orderBy(*column_names, ascending=True)

Here,

  • The parameter *column_names represents one or multiple columns by which we need to order the pyspark dataframe.
  • The ascending parameter specifies whether we want to order the dataframe in ascending or descending order by the given column names. If there are multiple columns by which you want to sort the dataframe, you can also pass a list of True and False values to the ascending parameter to specify, for each column, whether the dataframe should be ordered in ascending or descending order by that column.

OrderBy PySpark DataFrame By Column Name

To orderby a pyspark dataframe by a given column name, we can use the orderBy() method as shown in the following example.

import pyspark.sql as ps

# Start a local SparkSession
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

# Read the CSV file into a pyspark dataframe
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()

# Order the rows in ascending order by the Physics column
dfs=dfs.orderBy('Physics')
print("The dataframe ordered by Physics column is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Physics column is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|Katrina|   49|     47|       83|
|    Sam|   99|     62|       95|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|  Chris| null|     85|       82|
| Aditya|   45|     89|       71|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

In this example, we first created a SparkSession on our local machine. Then, we read a csv file to create a pyspark dataframe. Next, we used the orderBy() method to order the dataframe by the 'Physics' column. In the output dataframe, you can observe that the rows are ordered in ascending order by the Physics column.

Instead of the above approach, you can also use the col() function to orderby the pyspark dataframe. The col() function is defined in the pyspark.sql.functions module. It takes a column name as its input argument and returns a column object. We can pass the column object to the orderBy() method to get the pyspark dataframe ordered by a given column.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Physics'))
print("The dataframe ordered by Physics column is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Physics column is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|Katrina|   49|     47|       83|
|    Sam|   99|     62|       95|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|  Chris| null|     85|       82|
| Aditya|   45|     89|       71|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

In this example, we used the col() function inside the orderBy() method to order the pyspark dataframe by the Physics column.

PySpark OrderBy DataFrame in Descending Order

To order a pyspark dataframe by a column in descending order, you can set the ascending parameter to False in the orderBy() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Physics',ascending=False)
print("The dataframe ordered by Physics column in descending order is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Physics column in descending order is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
| Aditya|   65|     89|       71|
|  Chris| null|     85|       82|
| Agatha|   77|     76|       93|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|    Sam|   99|     62|       95|
|Katrina|   49|     47|       83|
+-------+-----+-------+---------+

In the above example, we have set the ascending parameter to False in the orderBy() method. Hence, the output dataframe is ordered by the Physics column in descending order.

If you are using the col() function to order the pyspark dataframe, you can use the desc() method on the column object.

When we invoke the desc() method on the column obtained using the col() function, the orderBy() method sorts the pyspark dataframe in descending order. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Physics').desc())
print("The dataframe ordered by Physics column in descending order is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Physics column in descending order is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
| Aditya|   65|     89|       71|
|  Chris| null|     85|       82|
| Agatha|   77|     76|       93|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|    Sam|   99|     62|       95|
|Katrina|   49|     47|       83|
+-------+-----+-------+---------+

Order PySpark DataFrame by Multiple Columns

To orderby a pyspark dataframe by multiple columns, you can pass all the column names to the orderBy() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Maths','Physics')
print("The dataframe ordered by Maths and Physics column is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Maths and Physics column is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   65|     89|       71|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
+-------+-----+-------+---------+

In the above example, we passed the column names 'Maths' and 'Physics' to the orderBy() method. Hence, the output dataframe is first sorted by the Maths column. For the rows in which the Maths column has the same value, the order is decided using the Physics column.

By default, the orderBy() method sorts the pyspark dataframe in ascending order by all the columns passed to it. To sort the dataframe in descending order by all the columns using the orderBy() method, you can set the ascending parameter to False as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Maths','Physics',ascending=False)
print("The dataframe ordered by Maths and Physics column in descending order is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Maths and Physics column in descending order is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|    Sam|   99|     62|       95|
| Agatha|   77|     76|       93|
| Aditya|   65|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|  Chris| null|     85|       82|
+-------+-----+-------+---------+

If you want to change the sorting order for each column, you can pass a list of True and False values to the ascending parameter in the orderBy() method. Here, the number of boolean values should be equal to the number of column names passed to the orderBy() method. Each value in the list corresponds to a single column at the same position in the parameter list.

If we want to orderby the pyspark dataframe in ascending order by the ith column name passed to the orderBy() method, the ith element in the list passed to the ascending parameter should be True. Similarly, if we want to orderby the pyspark dataframe in descending order by the jth column name passed to the orderBy() method, the jth element in the list should be False.

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Maths','Physics',ascending=[True, False])
print("The dataframe ordered by Maths and Physics column is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Maths and Physics column is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris| null|     85|       82|
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
| Aditya|   65|     89|       71|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
+-------+-----+-------+---------+

In the above example,  we passed the list [True, False] to the ascending parameter in the orderBy() method. Hence, the output dataframe is sorted by the Maths column in ascending order. For the rows in which the Maths column has the same value, the rows are sorted in descending order by the Physics column.

If you are using the col() function to orderby the pyspark dataframe, you can use the asc() and desc() methods on each column to sort the dataframe by that column in ascending or descending order respectively.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').asc(),col('Physics').desc())
print("The dataframe ordered by Maths and Physics column is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The dataframe ordered by Maths and Physics column is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris| null|     85|       82|
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
| Aditya|   65|     89|       71|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
+-------+-----+-------+---------+

In this example, we have invoked the asc() method on the Maths column and desc() method on the Physics column. Hence, the output dataframe is sorted by the Maths column in ascending order. For the rows in which the Maths column has the same value, the rows are sorted in descending order by the Physics column.

Orderby PySpark DataFrame Nulls First

If there are null values in the column by which we want to orderby a pyspark dataframe, the rows containing them are placed at the top of the ordered dataframe by default when we sort in ascending order.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths'))
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam| null|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The ordered dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris| null|     85|       82|
|    Sam| null|     62|       95|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   65|     89|       71|
| Agatha|   77|     76|       93|
+-------+-----+-------+---------+

In this example, the input dataframe contains two rows with null values in the Maths column. Hence, when we sort the dataframe by the Maths column in ascending order, the rows with null values are kept at the top of the output dataframe by default.

When we sort a pyspark dataframe by a column with null values in descending order, the rows with null values are placed at the bottom. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').desc())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam| null|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The ordered dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha|   77|     76|       93|
| Aditya|   65|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|  Chris| null|     85|       82|
|    Sam| null|     62|       95|
+-------+-----+-------+---------+

To put the rows containing the null values first in the ordered dataframe while sorting in descending order, we can use the desc_nulls_first() method on the columns given in the orderBy() method. After this, the dataframe will be ordered in descending order with the rows containing null values at the top of the dataframe. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').desc_nulls_first())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam| null|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The ordered dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris| null|     85|       82|
|    Sam| null|     62|       95|
| Agatha|   77|     76|       93|
| Aditya|   65|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
+-------+-----+-------+---------+

Here, we have used the desc_nulls_first() method on the Maths column. Hence, even if the dataframe is sorted in descending order, the rows with null values are kept at the top of the output dataframe.

If you want to sort the pyspark dataframe in ascending order and put the rows containing nulls at the top of the dataframe, you can use the asc_nulls_first() method on the column passed to the orderBy() method. However, using the asc_nulls_first() method is redundant, as the rows with null values are put at the top of the ordered dataframe by default when we sort in ascending order.
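
For completeness, here is a minimal sketch of what an explicit asc_nulls_first() call would look like, assuming the same sample_csv_file.csv used in the examples above.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
# asc_nulls_first() sorts in ascending order and explicitly keeps the null rows at the top,
# which matches the default behaviour of a plain ascending sort
dfs=dfs.orderBy(col('Maths').asc_nulls_first())
dfs.show()
spark.sparkContext.stop()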

Orderby PySpark DataFrame Nulls Last

If there are null values in the column by which we want to orderby a pyspark dataframe, the rows containing them are placed at the bottom of the ordered dataframe by default when we order it in descending order. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').desc())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam| null|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The ordered dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha|   77|     76|       93|
| Aditya|   65|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|  Chris| null|     85|       82|
|    Sam| null|     62|       95|
+-------+-----+-------+---------+

When we sort a pyspark dataframe by a column with null values in ascending order, the rows with null values are placed at the top. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths'))
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam| null|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The ordered dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris| null|     85|       82|
|    Sam| null|     62|       95|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
| Aditya|   65|     89|       71|
| Agatha|   77|     76|       93|
+-------+-----+-------+---------+

To put the rows containing the null values last in the ordered dataframe while sorting in ascending order, we can use the asc_nulls_last() method on the columns given in the orderBy() method. After this, the dataframe will be ordered in ascending order with the rows containing null values at the bottom of the dataframe. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').asc_nulls_last())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam| null|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The ordered dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|   Joel|   45|     75|       87|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
| Aditya|   65|     89|       71|
| Agatha|   77|     76|       93|
|  Chris| null|     85|       82|
|    Sam| null|     62|       95|
+-------+-----+-------+---------+

Here, we have used the asc_nulls_last() method on the Maths column. Hence, even though the dataframe is sorted in ascending order, the rows with null values are kept at the bottom of the output dataframe.

If you want to sort the pyspark dataframe in descending order and put the rows containing nulls at the bottom of the dataframe, you can use the desc_nulls_last() method on the column passed to the orderBy() method. However, using the desc_nulls_last() method is redundant, as the rows with null values are put at the bottom of the ordered dataframe by default when we sort in descending order.
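
If you still want to make that intent explicit in code, here is a minimal sketch of a desc_nulls_last() call, again assuming the same sample_csv_file.csv used above.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("orderby_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
# desc_nulls_last() sorts in descending order and explicitly pushes the null rows to the bottom,
# which matches the default behaviour of a plain descending sort
dfs=dfs.orderBy(col('Maths').desc_nulls_last())
dfs.show()
spark.sparkContext.stop()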

Conclusion

In this article, we discussed how to sort a pyspark dataframe using the orderBy() method. To learn more about Python programming, you can read this article on how to select rows with null values in a pyspark dataframe. You might also like this article on list of lists in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning!

The post PySpark OrderBy One or Multiple Columns appeared first on PythonForBeginners.com.

Select Rows with Null values in PySpark https://www.pythonforbeginners.com/pyspark/select-rows-with-null-values-in-pyspark Wed, 19 Jul 2023 13:00:00 +0000

Missing values in tabular data are a common problem. When we load tabular data with missing values into a pyspark dataframe, the empty values are replaced with null values. In this article, we will discuss how to select rows with null values in a given pyspark dataframe.

The isNull() Method in PySpark

The isNull() method is used to check for null values in a pyspark dataframe column. When we invoke the isNull() method on a dataframe column, it returns a masked column having True and False values. Here, the values in the mask are set to True at the positions where no values are present. Otherwise, the value in the mask is set to False.

The length of the mask column is equal to the number of rows in the pyspark dataframe. We can use the isNull() method with the filter() or where() method to filter rows with null values from a pyspark dataframe.
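
To see the mask itself rather than the filtered rows, you can select the isNull() column next to the original column. The following is a minimal sketch that assumes the same StudentMarks.csv file used in the examples below; the Physics_is_null column name is just an illustrative alias.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
# isNull() produces a boolean column: True where Physics is null, False otherwise
dfs.select("Student","Physics",dfs.Physics.isNull().alias("Physics_is_null")).show()
spark.sparkContext.stop()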

The isNotNull() Method in PySpark

The isNotNull() method is the negation of the isNull() method. It is used to check for not null values in pyspark. If we invoke the isNotNull() method on a dataframe column, it also returns a mask having True and False values. Here, the values in the mask are set to False at the positions where no values are present. Otherwise, the value in the mask is set to True. Again, the length of the mask column is equal to the number of rows in the pyspark dataframe.

We can use the isNotNull() method with the filter() or where() method to select rows with not null values from a pyspark dataframe.
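
Similarly, you can inspect the mask produced by isNotNull() before using it to filter. Here is a minimal sketch, again assuming the StudentMarks.csv file used below; Physics_not_null is just an illustrative alias.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
# isNotNull() produces the opposite mask: False where Physics is null, True otherwise
dfs.select("Student","Physics",dfs.Physics.isNotNull().alias("Physics_not_null")).show()
spark.sparkContext.stop()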

Select Rows With Null Values in a Column in PySpark DataFrame

To select rows with null values in a column in a pyspark dataframe, we can use the following approaches.

  1. Using filter() method and the isNull() Method
  2. By using the where() method and the isNull() Method
  3. By Using sql IS NULL statement.

Let us discuss all these approaches one by one.

Select Rows With Null Values Using The filter() Method

To filter rows with null values in a particular column in a pyspark dataframe, we will first invoke the isNull() method on the given column. The isNull() method will return a masked column having True and False values. We will pass the mask column object returned by the isNull() method to the filter() method. After this, we will get the rows with null values in the given column.

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

In the above example, we first read a csv file into a pyspark dataframe. Then, we used the isNull() method and the filter() method to filter rows with Null values in the Physics column. In the output, you can observe that the final dataframe contains only those rows in which the Physics column is set to Null.

Filter Rows With Null Values using the where() Method

Instead of the filter() method, you can also use the where() method to filter rows with null values in a given column. The where() method also takes the mask returned by the isNull() method and returns a dataframe containing the rows where the mask column contains True.

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

In this example, we have used the where() method instead of the filter() method. However, you can observe that the output remains the same.

Instead of the syntax used in the above examples, you can also use the col() function with the isNull() method to create the mask containing True and False values. The col() function is defined in the pyspark.sql.functions module. It takes a column name as an input argument and returns a column object that refers to the given column of the pyspark dataframe.

We can invoke the isNull() method on the column object to select rows with null values from the pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

In this example, we have used the col() function to access the Physics column of the dataframe instead of using the name of the dataframe.

Select Rows With Null Values Using SQL syntax

We can also use spark SQL to filter rows with null values from a pyspark dataframe. For this, we will first create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a view of the dataframe.

After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NULL clause to select rows with null values in a given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

In the above example, we first created a view of the input dataframe. This is because we cannot execute SQL statements directly on a pyspark dataframe; we need a named view or table to query. Then, we used the sql() function to execute the SQL query for retrieving rows in which the Physics column is Null. The sql() function takes the SQL query as its input and returns the output dataframe.

Instead of creating the view of the dataframe and executing the sql() function, you can directly pass the statement in the WHERE clause in the sql statement to the filter() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|    Sam|   null|       62|  75|     95|  63|
|    Tom|   null|       94|  61|     65|  69|
+-------+-------+---------+----+-------+----+

In this example, we have passed the "Physics IS NULL" string to the filter() method. Hence, the filter() method treats the string as a statement of the WHERE clause of the SQL statement and returns the output dataframe.

Select Rows With Not Null Values in a Column

We can use the approaches used for selecting rows with null values to filter rows with not null values. The only difference is that we need to use the isNotNull() method in place of the isNull() method.

Filter Rows With Not Null Values Using The filter() Method

To select rows with not null values in a particular column in a pyspark dataframe, we will first invoke the isNotNull() method on the given column. The isNotNull() method will return a masked column containing True and False values. Next, we will pass the mask column object returned by the isNotNull() method to the filter() method. After this, we will get the rows with not null values in the given column.

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

In this example, we used the isNotNull() method with the filter() method to select rows with not null values from the pyspark dataframe.

Select Rows With Not Null Values Using the where() Method

Instead of the filter() method, you can also use the where() method to filter rows with not null values in a given column. The where() method also takes the mask returned by the isNotNull() method and returns a dataframe containing the rows where the mask column contains True.

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

In this example, we used the where() method instead of the filter() method. However, the output remains the same.

Instead of the syntax used in the above examples, you can also use the col() function with the isNotNull() method to filter rows with not null values in a pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Select Rows With Not Null Values Using the dropna() Method

We can also use the dropna() method to select rows with not null values in a single column from a pyspark dataframe. For this, we can invoke the dropna() method on the pyspark dataframe and pass the column name as input to the subset parameter. After execution of the dropna() method, we will get rows with not null values in the specified column.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropna(subset="Physics")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

In this example, we invoked the dropna() method on the input pyspark dataframe. Here, we have passed the column name Physics as input to the subset parameter in the dropna() method. Due to this, all the rows in which the Physics column contains null values are dropped from the dataframe. Thus, we get the required rows with no null values in the Physics column in the dataframe.

Filter Rows With Not Null Values using SQL From a PySpark DataFrame

We can also use pyspark SQL syntax to filter rows with no null values in a column from a pyspark dataframe. For this, we will use the following steps.

  • First, we will create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a view of the dataframe.
  • After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NOT NULL clause to select rows with not null values in the given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function.

You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

In this example, we have used the sql() function to execute the sql statement for selecting rows with not null values in the Physics column.

Instead of using the above approach, you can directly pass the statement in the WHERE clause of the sql statement to the filter() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

In this example, we passed the string "Physics IS NOT NULL" to the filter() method instead of executing the entire SQL query. However, the output remains the same.

Select Rows with Null Values in Multiple columns in a DataFrame

To select rows with Null values in Multiple columns, we can use the conditional operators along with the isNull() method inside the filter() and where() method. For this, we will use the following steps.

  • We will first invoke the isNull() method on all the required columns.
  • Next, we will enclose the masked columns returned by the isNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
  • Finally, we will pass the conditional statement to the filter() method.

After executing the above statements, we can select rows with null values in multiple columns in a pyspark dataframe. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())& (dfs.Chemistry.isNull()))
print("Rows with null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
+-------+-------+---------+----+-------+----+

In this example, we have passed the conditions (dfs.Physics.isNull()) and (dfs.Chemistry.isNull()) to the filter() method to filter rows with null values in both of these columns. Observe that the output of the isNull() method is enclosed inside parentheses for the & operator to execute successfully. If we don't use parentheses here, the program will run into an error.

In the output, you can observe that we get an empty dataframe because there are no rows where both the Physics and Chemistry columns are null.

In a similar manner, you can use the OR conditional operator (|) to filter rows in which the Physics column or Chemistry column contains null values as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

In the above example, we have used the filter() method with the conditions to filter rows with null values in Physics or Chemistry column from the pyspark dataframe.

Instead of the filter() method, you can also use the where() method in the previous example as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

We can also use multiple conditional statements inside SQL syntax to select rows with null values in multiple columns in a pyspark dataframe. For this, you can use the IS NULL clause with conditional operators in the WHERE clause of the SQL statement as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NULL or Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL OR Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|   Golu|     74|     null|  96|     76|  64|
|    Tom|   null|       94|  61|     65|  69|
|  Clara|     93|     null|  78|   null|  71|
+-------+-------+---------+----+-------+----+

Suggested reading: PySpark vs Pandas

Select Rows with Not Null Values in Multiple Columns

To select rows with not Null values in Multiple columns, we can use the conditional operators along with the isNotNull() method inside the filter() and where() method. For this, we will use the following steps.

  • We will first invoke the isNotNull() method on all the required columns.
  • Next, we will enclose the masked columns returned by the isNotNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
  • Finally, we will pass the conditional statement to the filter() method.

After executing the above statements, we can select rows with not null values in multiple columns in a pyspark dataframe. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Instead of the filter() method, you can also use the where() method in the previous example as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNotNull()) & (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+
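
The examples above require non-null values in both columns. If a non-null value in at least one of the two columns is enough, we can combine the same masks with the OR operator (|) instead of AND (&). The following is only a minimal sketch, assuming the dfs dataframe has already been read from StudentMarks.csv as in the previous examples:

# Keep rows that have a non-null value in Physics OR Chemistry
dfs_any = dfs.filter((dfs.Physics.isNotNull()) | (dfs.Chemistry.isNotNull()))
dfs_any.show()

With the sample data shown above, every student has a value in at least one of these two columns, so this particular condition keeps all ten rows.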

We can also use multiple conditional statements inside the SQL WHERE clause to select rows with not null values in multiple columns in a pyspark dataframe as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with not null values in Physics and Chemistry Columns are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Instead of using an SQL statement, we can also pass the conditional expression from the WHERE clause directly to the filter() method, as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectnull_example") \
      .getOrCreate()

dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with notn null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Chris|     95|     null|  79|     71|  93|
|    Sam|   null|       62|  75|     95|  63|
|  Harry|     68|       92|  69|     66|  98|
|   Golu|     74|     null|  96|     76|  64|
|   Joel|     99|       79|null|   null|  61|
|    Tom|   null|       94|  61|     65|  69|
|  Harsh|     98|       99|  93|     95|  91|
|  Clara|     93|     null|  78|   null|  71|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya|     92|       76|  95|     73|  91|
|  Harry|     68|       92|  69|     66|  98|
|   Joel|     99|       79|null|   null|  61|
|  Harsh|     98|       99|  93|     95|  91|
|   Tina|     99|       76|  78|     94|  95|
+-------+-------+---------+----+-------+----+

Conclusion

In this article, we discussed different ways to select rows with null values in a pyspark dataframe. To learn more about pyspark dataframes, you can read this article on how to sort a pyspark dataframe. You might also like this article on how to select distinct rows from a pyspark dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

PySpark Count Distinct Values in One or Multiple Columns https://www.pythonforbeginners.com/pyspark/pyspark-count-distinct-values-in-one-or-multiple-columns Mon, 17 Jul 2023 13:00:00 +0000

While handling data in pyspark, we often need to find the count of distinct values in one or multiple columns in a pyspark dataframe. In this article, we will discuss how to count distinct values in one or multiple columns in pyspark.

Pyspark Count Rows in A DataFrame

The count() method counts the number of rows in a pyspark dataframe. When we invoke the count() method on a dataframe, it returns the number of rows in the dataframe, as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.count()
print("The number of rows in the dataframe are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of rows in the dataframe are:
8

In this example, we first read a csv file into a pyspark dataframe. Then, we used the count() method to find the number of rows in the dataframe. As there are eight rows in the data, the count() method returns the value 8.

Count Distinct Rows in a PySpark DataFrame

To count distinct rows in a dataframe, we will first use the distinct() method to select distinct rows in the pyspark dataframe. Then, we can use the count() method to count unique rows in a given dataframe as shown in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.distinct().count()
print("The number of distinct rows in the dataframe are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct rows in the dataframe are:
7

The dataframe that we create using the csv file has duplicate rows. Hence, when we invoke the distinct() method on the pyspark dataframe, the duplicate rows are dropped. After this, when we invoke the count() method on the output of the distinct() method, we get the number of distinct rows in the given pyspark dataframe.

Instead of the distinct() method, you can also use the dropDuplicates() method to count distinct rows in a dataframe as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.dropDuplicates().count()
print("The number of distinct rows in the dataframe are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct rows in the dataframe are:
7

The dropDuplicates() method, when invoked on a pyspark dataframe, drops all the duplicate rows. Hence, when we invoke the count() method on the dataframe returned by the dropDuplicates() method, we get the count of distinct rows in the dataframe.

Pyspark Count Values in a Column

To count the values in a column in a pyspark dataframe, we can use the select() method and the count() method. The select() method takes the column names as its input and returns a dataframe containing the specified columns.

To count the values in a column of a pyspark dataframe, we will first select the particular column by passing the column name as input to the select() method.

Next, we will use the count() method to count the number of values in the selected column as shown in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.select("Name").count()
print("The number of values in the Name column are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of values in the Name column are:
8

A column always has the same number of values as there are rows in the dataframe. Hence, this example on its own is not very useful. However, we can combine the select() method with the distinct() method to count distinct values in a column of the pyspark dataframe.

Count Distinct Values in a Column in PySpark DataFrame

To count distinct values in a column in a pyspark dataframe, we will use the following steps. 

  1. First, we will select the particular column from the dataframe using the select() method.
  2. Next, we will use the distinct() method to get a column with distinct values. 
  3. Finally, we will use the count() method to count distinct values in the column. 

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.select("Name").distinct().count()
print("The number of distinct values in the Name column are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name column are:
6

In the above example, we first selected the Name column using the select() method. Then, we used the distinct() method to get the distinct values in the Name column. Finally, we used the count() method to count distinct values in the specified column.

Instead of the distinct() method, you can also use the dropDuplicates() method to count distinct values in a column as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.select("Name").dropDuplicates().count()
print("The number of distinct values in the Name column are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name column are:
6

PySpark Count Distinct Multiple Columns

To count the number of distinct values in multiple columns, we will use the following steps.

  1. We will first select the specified columns using the select() method.
  2. Next, we will use the distinct() method to find the distinct pairs of values in the given columns.
  3. Finally, we will use the count() method to get the count of distinct values.

You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.select("Name","Maths").distinct().count()
print("The number of distinct values in the Name and Maths column are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name and Maths column are:
7

In this example, we have counted the distinct values in the Name and Maths column. For this, we first selected both these columns using the select() method. Next, we used the distinct() method to drop duplicate pairs from both columns. Finally, we used the count() method to count distinct values in multiple columns in the given pyspark dataframe.

Instead of the distinct() method, you can also use the dropDuplicates() method in the previous approach to count distinct values in multiple columns in a pyspark dataframe. 

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.select("Name","Maths").dropDuplicates().count()
print("The number of distinct values in the Name and Maths column are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name and Maths column are:
7

We can also avoid the select() statement altogether while finding distinct counts for multiple columns. For this, we can directly invoke the dropDuplicates() method on the original dataframe.

Here, we will pass the column names for which we want to find distinct values as a list to the subset parameter of the dropDuplicates() method. After this, we get a dataframe with all the original columns, but only one row for each unique combination of values in the columns passed to the dropDuplicates() method.

After getting the dataframe, you can use the count() method to find the number of distinct values in multiple columns as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
num_rows=dfs.dropDuplicates(subset=["Name","Maths"]).count()
print("The number of distinct values in the Name and Maths column are:")
print(num_rows)
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name and Maths column are:
7

In the output, you can observe that we get the same result as the previous example even though we haven't used the select() statement.

Count Unique Values in Columns Using the countDistinct() Function

The countDistinct() function is defined in the pyspark.sql.functions module. It is often used with the groupBy() method to count distinct values in different subsets of a pyspark dataframe, as illustrated in the sketch that follows the next example. However, we can also use the countDistinct() function on its own to count distinct values in one or multiple columns.

To count the number of distinct values in a column in pyspark using the countDistinct() function, we will use the agg() method. Here, we will pass the countDistinct() function to the agg() method as input. Also, we will pass the column name for which we want to count the distinct values as input to the countDistinct() function. After execution of the agg() method, we get a dataframe containing the count of unique values in the given column. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
count_df=dfs.agg(countDistinct("Name"))
print("The number of distinct values in the Name column are:")
count_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name column are:
+-----------+
|count(Name)|
+-----------+
|          6|
+-----------+
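
As mentioned earlier, the countDistinct() function is also commonly combined with the groupBy() method to get distinct counts within each group. The following is a minimal sketch, assuming the dfs dataframe from the example above is still available (i.e. the SparkSession has not been stopped yet):

from pyspark.sql.functions import countDistinct

# Count the distinct Maths scores recorded for each Name
grouped_counts = dfs.groupBy("Name").agg(countDistinct("Maths").alias("distinct_maths"))
grouped_counts.show()

For instance, Aditya appears twice with Maths values 45 and 65, so his distinct_maths count would be 2, while Joel's two identical rows would give a count of 1.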

If you want to count unique values in multiple columns using the countDistinct() function, you can pass the required column names to the countDistinct() function as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("countdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
count_df=dfs.agg(countDistinct("Name","Maths"))
print("The number of distinct values in the Name and Maths column are:")
count_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The number of distinct values in the Name and Maths column are:
+------------------+
|count(Name, Maths)|
+------------------+
|                 6|
+------------------+

In this output, you can observe that the countDistinct() function gives the count of distinct values in the Name and Maths columns as 6, whereas the previous example gave 7. This is because the countDistinct() function ignores rows that contain null values in the specified columns.
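
To see this null-handling difference in isolation, we can build a tiny dataframe by hand and compare the two approaches. This is only an illustrative sketch with made-up values, assuming an active SparkSession named spark:

from pyspark.sql.functions import countDistinct

# The 'val' column contains two distinct non-null values and one null
small_df = spark.createDataFrame([(1, "a"), (2, None), (3, "b")], ["id", "val"])

# distinct() treats null as its own value, so this prints 3
print(small_df.select("val").distinct().count())

# countDistinct() ignores the null, so this shows a count of 2
small_df.agg(countDistinct("val")).show()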

Conclusion

In this article, we discussed different ways to count distinct values in one or multiple columns in a pyspark dataframe. To learn more about pyspark, you can read this article on how to sort a pyspark dataframe. You might also like this article on tuple unpacking in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

PySpark Filter Rows in a DataFrame by Condition https://www.pythonforbeginners.com/pyspark/pyspark-filter-rows-in-a-dataframe-by-condition Fri, 14 Jul 2023 13:00:00 +0000

While working with pyspark dataframes, we often need to filter rows based on different criteria. In this article, we will discuss different ways to filter rows in a pyspark dataframe.

The filter() Method

The filter() method, when invoked on a pyspark dataframe, takes a conditional statement as its input. The conditional statement generally uses one or multiple columns of the dataframe and produces a mask column containing True or False values. The filter() method then keeps only the rows for which this mask column has the value True and discards the rest.

PySpark Filter DataFrame by Column Value

To filter a pyspark dataframe by a column value, we will use the filter() method. Here, we will check for the column value in a conditional statement and pass it to the filter() method. After execution, we will get a pyspark dataframe with rows satisfying the condition. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
new_df=dfs.filter(dfs.Physics>80)
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The filtered rows are:
+------+-----+-------+---------+
|  Name|Maths|Physics|Chemistry|
+------+-----+-------+---------+
|Aditya|   45|     89|       71|
| Chris| null|     85|       82|
|Aditya|   65|     89|       71|
+------+-----+-------+---------+

In this example, we first read a csv file into a pyspark dataframe. Then, we used the filter() method to filter rows from the dataframe. In the filter() method, we passed the condition dfs.Physics>80. Here, dfs is the dataframe created from the csv file and Physics is the column name. Hence, the filter() method will return a dataframe having values greater than 80 in the Physics column.

Filter PySpark DataFrame Using SQL Statement

You can also use SQL statements to filter a pyspark dataframe by column value. For this, we can use the SQL SELECT statement with a WHERE clause to check for the condition in the given column name.  To filter a dataframe by column value using SQL in PySpark, we can use the following steps.

  1. First, we will create a view of the pyspark dataframe using the createOrReplaceTempView() function. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument. After execution, it creates a view of the dataframe with the given name. We can execute SQL statements on this view to filter data.
  2. Next, we will create an SQL statement to filter rows using the SELECT statement and WHERE clause.
  3. Finally, we will use the sql() function to execute the SQL statement.

After executing the sql() function, we will get the output dataframe with filtered rows. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics>80")
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The filtered rows are:
+------+-----+-------+---------+
|  Name|Maths|Physics|Chemistry|
+------+-----+-------+---------+
|Aditya|   45|     89|       71|
| Chris| null|     85|       82|
|Aditya|   65|     89|       71|
+------+-----+-------+---------+

In the above example, we first created a dataframe from the csv file. Then, we used the createOrReplaceTempView() method to create a view of the pyspark dataframe as we cannot directly execute sql statements on a dataframe. Next, we used the sql() function to execute the SQL statement for filtering the rows of the dataframe based on the Physics column.

Instead of using the above approach, you can also pass the statement in the sql WHERE clause directly to the filter() method by invoking it on the input dataframe. After this, you will get the output dataframe with desired rows as shown in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
new_df=dfs.filter("Physics>80")
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The filtered rows are:
+------+-----+-------+---------+
|  Name|Maths|Physics|Chemistry|
+------+-----+-------+---------+
|Aditya|   45|     89|       71|
| Chris| null|     85|       82|
|Aditya|   65|     89|       71|
+------+-----+-------+---------+

In this example, we directly passed the SQL sub-query that we used in the WHERE clause in the previous example to the filter() method. However, the output remains unchanged.

Filter PySpark DataFrame by Multiple Conditions

You can also filter pyspark dataframes by multiple conditions. For this, you need to include all the conditions inside the filter() method or in the sql WHERE clause using conditional operators.

For instance, we can filter rows in the pyspark dataframe by multiple conditions using the filter() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
new_df=dfs.filter((dfs.Physics>70) & (dfs.Chemistry<90))
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The filtered rows are:
+------+-----+-------+---------+
|  Name|Maths|Physics|Chemistry|
+------+-----+-------+---------+
|Aditya|   45|     89|       71|
| Chris| null|     85|       82|
|  Joel|   45|     75|       87|
|  Joel|   45|     75|       87|
|Aditya|   65|     89|       71|
+------+-----+-------+---------+

In the above example, we have passed two conditions to the filter() method using the AND operator (&). Hence, the filter() method filters the input dataframe by both conditions and produces the result. Here, you need to keep in mind that each condition is enclosed in parentheses before the conditions are combined using conditional operators.

If you don't use parentheses around the conditional statements in the filter() method, the program will run into an error. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
new_df=dfs.filter(dfs.Physics>70 & dfs.Chemistry<90)
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

Py4JError: An error occurred while calling o240.and. Trace:
py4j.Py4JException: Method and([class java.lang.Integer]) does not exist

In the above code, we haven't used parentheses around the conditions in the filter() method. Because the & operator has a higher precedence than the comparison operators in Python, the expression is evaluated as dfs.Physics > (70 & dfs.Chemistry) < 90, which is invalid. Hence, the program runs into a Py4JError.

PySpark Filter DataFrame by Multiple Conditions Using SQL

Instead of the filter() method, you can also use the SQL WHERE clause to filter a pyspark dataframe by multiple conditions. For this, you can pass all the conditions in the WHERE clause and combine them using conditional operators. After execution of the SQL statement, you will get the desired output.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics>70 AND Chemistry<90")
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The filtered rows are:
+------+-----+-------+---------+
|  Name|Maths|Physics|Chemistry|
+------+-----+-------+---------+
|Aditya|   45|     89|       71|
| Chris| null|     85|       82|
|  Joel|   45|     75|       87|
|  Joel|   45|     75|       87|
|Aditya|   65|     89|       71|
+------+-----+-------+---------+

Instead of creating a view of the dataframe and executing the SQL statement using the sql() function, you can also pass the sub-query used in the WHERE clause in the SQL statement to the filter() method. After this, you will get the same output as the previous example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("filter_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
new_df=dfs.filter("Physics>70 AND Chemistry<90")
print("The filtered rows are:")
new_df.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   77|     76|       93|
|    Sam|   99|     62|       95|
| Aditya|   65|     89|       71|
+-------+-----+-------+---------+

The filtered rows are:
+------+-----+-------+---------+
|  Name|Maths|Physics|Chemistry|
+------+-----+-------+---------+
|Aditya|   45|     89|       71|
| Chris| null|     85|       82|
|  Joel|   45|     75|       87|
|  Joel|   45|     75|       87|
|Aditya|   65|     89|       71|
+------+-----+-------+---------+

Conclusion

In this article, we discussed multiple ways to filter rows in a pyspark dataframe. To learn more about pyspark, you can read this article on how to select distinct rows from a pyspark dataframe. You might also like this article on how to sort a pyspark dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning!

PySpark Select Distinct Rows From DataFrame https://www.pythonforbeginners.com/pyspark/pyspark-select-distinct-rows-from-dataframe Wed, 12 Jul 2023 13:00:00 +0000

When using a pyspark dataframe, we sometimes need to select unique rows or unique values from a particular column. In this article, we will discuss how to select distinct rows or values in a column of a pyspark dataframe using three different ways.

Select Distinct Rows From PySpark DataFrame

You can use the following three approaches to select distinct rows in a dataframe in pyspark.

  1. Using the distinct() method
  2. By using the dropDuplicates() method
  3. Using SQL Statement

Let us discuss each method one at a time.

PySpark Select Distinct Rows Using The distinct() Method

The distinct() method, when invoked on a pyspark dataframe, returns all the unique rows in the dataframe. Hence, we can directly invoke the distinct() method on a pyspark dataframe to select unique rows.

To observe this, we will first read a csv file into a pyspark dataframe. You can download the file using this link.

After reading the csv file into the pyspark dataframe, you can invoke the distinct() method on the pyspark dataframe to get distinct rows as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.distinct()
print("The dataframe with distinct rows is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
| Agatha|   76|     93|       83|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

In the above example, you can observe that the original dataframe contains eight rows. After using the distinct() method, we get seven distinct rows from the dataframe.

Select Distinct Rows Using The dropDuplicates() Method

The dropDuplicates() method works in a similar manner to the distinct() method. When we invoke the dropDuplicates() method on a dataframe, it returns unique rows in the dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropDuplicates()
print("The dataframe with distinct rows is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
| Agatha|   76|     93|       83|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

In this example, you can observe that we have used the dropDuplicates() method instead of the distinct() method.

Distinct Rows From PySpark DataFrame Using SQL

We can also select distinct rows in a pyspark dataframe using SQL syntax. For this, we will use the following steps.

  1. First, we need to create a temporary view of the dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method takes the desired name of the temporary view as its input and creates a temporary view of the dataframe with that name.
  2. Once we get the temporary view, we can execute the SELECT DISTINCT SQL statement on the view of the dataframe to select unique rows. For this, we will create the string containing the SQL statement and pass it to the sql() function.
  3. After execution of the sql() function, we get the output dataframe with distinct rows.

After executing the above statements, we can get the pyspark dataframe with distinct rows as shown in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
dfs=spark.sql("SELECT DISTINCT * FROM df_sql")
print("The dataframe with distinct rows is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     75|       87|
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
| Agatha|   76|     93|       83|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

Select Distinct Rows Based on Multiple Columns in PySpark DataFrame

In the previous examples, we have selected unique rows based on all the columns. However, we can also use specific columns to decide on unique rows.

To select distinct rows based on multiple columns, we can pass a list of the column names that should decide the uniqueness of the rows to the dropDuplicates() method. After execution, the dropDuplicates() method will return a dataframe containing only one row for each unique combination of values in the specified columns. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropDuplicates(["Name","Maths"])
print("The dataframe with distinct rows is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The dataframe with distinct rows is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
| Agatha|   76|     93|       83|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|    Sam|   99|     98|       95|
+-------+-----+-------+---------+

In this example, we first read a csv file to create a pyspark dataframe. Then, we used the dropDuplicates() method to select distinct rows having unique values in the Name and Maths Column. For this, we passed the list ["Name", "Maths"] to the dropDuplicates() method. In the output, you can observe that the pyspark dataframe contains all the columns. However, the combination of the Name and Maths columns is unique in each row.

PySpark Select Unique Values in A Column

To select distinct values from one column in a pyspark dataframe, we first need to select the particular column using the select() method. Then, we can get distinct values from the column using the distinct() method as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.select("Name").distinct()
print("The distinct values in the column are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the column are:
+-------+
|   Name|
+-------+
|Katrina|
|  Chris|
| Agatha|
|    Sam|
|   Joel|
| Aditya|
+-------+

In this example, we first selected the Name column using the select() method. Then, we invoked the distinct() method on the selected column to get all the unique values.

Instead of the distinct() method, you can use the dropDuplicates() method to select unique values from a column in a pyspark dataframe as shown in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.select("Name").dropDuplicates()
print("The distinct values in the column are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the column are:
+-------+
|   Name|
+-------+
|Katrina|
|  Chris|
| Agatha|
|    Sam|
|   Joel|
| Aditya|
+-------+

We can also use the SQL SELECT DISTINCT statement to select distinct values from a column in a pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
dfs=spark.sql("SELECT DISTINCT Name FROM df_sql")
print("The distinct values in the column are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the column are:
+-------+
|   Name|
+-------+
|Katrina|
|  Chris|
| Agatha|
|    Sam|
|   Joel|
| Aditya|
+-------+

Pyspark Select Distinct From Multiple Columns

To select distinct values from multiple columns, we will first select the desired columns using the select() statement. After this, we will use the distinct() method to get the unique values from the selected columns as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.select("Name","Maths").distinct()
print("The distinct values in the columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the columns are:
+-------+-----+
|   Name|Maths|
+-------+-----+
| Aditya|   45|
|Katrina|   49|
| Agatha|   76|
|  Chris|   78|
|    Sam|   99|
|   Joel|   45|
+-------+-----+

In this example, we have selected distinct values from the Name and Maths column. For this, we first selected both these columns from the original dataframe using the select() method. Then, we used the distinct() method to select distinct values in the columns.

Instead of the distinct() method, you can also use the dropDuplicates() method to select distinct values from multiple columns as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col,countDistinct
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.select("Name","Maths").dropDuplicates()
print("The distinct values in the columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the columns are:
+-------+-----+
|   Name|Maths|
+-------+-----+
| Aditya|   45|
|Katrina|   49|
| Agatha|   76|
|  Chris|   78|
|    Sam|   99|
|   Joel|   45|
+-------+-----+

You can also use the SQL SELECT DISTINCT statement with column names to select unique values from multiple columns as shown in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("selectdistinct_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
dfs=spark.sql("SELECT DISTINCT Name, Maths FROM df_sql")
print("The distinct values in the columns are:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   78|     85|       82|
|   Joel|   45|     75|       87|
|Katrina|   49|     47|       83|
|   Joel|   45|     75|       87|
| Agatha|   76|     93|       83|
|    Sam|   99|     98|       95|
| Aditya|   45|     98|       71|
+-------+-----+-------+---------+

The distinct values in the columns are:
+-------+-----+
|   Name|Maths|
+-------+-----+
| Aditya|   45|
|Katrina|   49|
| Agatha|   76|
|  Chris|   78|
|    Sam|   99|
|   Joel|   45|
+-------+-----+

Conclusion

In this article, we discussed different ways to select distinct rows or values from a PySpark dataframe. To learn more about PySpark, you can read this article on how to sort dataframe in pyspark. You might also like this article on fill nan values in pandas dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

PySpark Sort DataFrame With Examples https://www.pythonforbeginners.com/pyspark/pyspark-sort-method-with-examples Mon, 10 Jul 2023 13:00:00 +0000

While working with pyspark dataframes, we often need to sort them according to one or multiple columns. In this article, we will discuss different ways to sort a pyspark dataframe using the sort() method.

The PySpark sort() Method

The sort() method in pyspark is used to sort a dataframe by one or multiple columns. It has the following syntax.

df.sort(*columns, ascending=True)

Here,

  • The parameter *columns represents one or multiple columns by which we need to sort the dataframe.
  • The ascending parameter specifies whether we want to sort the dataframe in ascending or descending order. If you sort by multiple columns, you can also pass a list of True and False values to control the sort direction of each column individually, as shown in the sketch below this list.
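
For example, the following minimal sketch (assuming a dataframe df that has Maths and Physics columns) sorts by Maths in ascending order and then breaks ties by Physics in descending order:

# Sort by Maths ascending, then break ties by Physics descending
sorted_df = df.sort("Maths", "Physics", ascending=[True, False])
sorted_df.show()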

PySpark Sort DataFrame by Column Name

To sort a pyspark dataframe by column name, we can use the sort() method as shown in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort('Maths')
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     85|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
|    Sam|   76|     98|       95|
| Agatha|   76|     89|       83|
|  Chris|   86|     85|       82|
+-------+-----+-------+---------+

In this example, we first created a SparkSession on our local machine. Then, we read a csv file to create a pyspark dataframe. You can download the file using the following link.

Once we get the dataframe, we sort it using the ‘Maths’ column. In the output dataframe, you can observe that the rows of the dataframe are sorted in ascending order by the Maths column.

Instead of the above approach, you can also use the col() function to sort the pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths'))
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     85|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
|    Sam|   76|     98|       95|
| Agatha|   76|     89|       83|
|  Chris|   86|     85|       82|
+-------+-----+-------+---------+

The col() function is used to select a dataframe column as a column object. In this example, we have used the col() function to select the Maths column in the dataframe. Then, the dataframe is sorted by the selected column using the sort() method.

Sort DataFrame in Descending Order

To sort a pyspark dataframe by a column in descending order, you can set the ascending parameter to False in the sort() method as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort('Maths',ascending=False)
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris|   86|     85|       82|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     85|       87|
+-------+-----+-------+---------+

In this example, we have set the ascending parameter to False in the sort() method. Hence, the dataframe is sorted by the Maths column in descending order.

If you are using the col() function to sort the pyspark dataframe, you can use the desc() method on the column object. When we invoke the desc() method on the column obtained using the col() function, the sort() method sorts the pyspark dataframe by that column in descending order. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris|   86|     85|       82|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     85|       87|
+-------+-----+-------+---------+

PySpark Sort DataFrame by Multiple Columns

To sort a pyspark dataframe by multiple columns, you can pass all the column names to the sort() method in a sequence. Here, the dataframe is first sorted by the first column name given to the sort() method. If two rows have the same values in the first column, they are sorted using the values in the next column. This process is repeated until the rows are sorted by all the given columns. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort('Maths','Physics')
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|     85|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
|  Chris|   86|     85|       82|
+-------+-----+-------+---------+

In this example, we have sorted the pyspark dataframe using the Maths and Physics columns. During execution, the sort() method first sorts the rows of the dataframe by the Maths column, as it is given as the first input argument. The rows in which the Maths column has the same values are then sorted by the Physics column, as you can observe in the output above.

By default, the sort() method sorts the pyspark dataframe in ascending order. To sort the dataframe in descending order by all the columns using the sort() method, you can set the ascending parameter to False as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort('Maths','Physics',ascending=False)
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|  Chris|   86|     85|       82|
|    Sam|   76|     98|       95|
| Agatha|   76|     89|       83|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|     85|       87|
+-------+-----+-------+---------+

Sort DataFrame by Multiple Columns With Different Sorting Order

If you want to change the sorting order for each column, you can pass a list of True and False values to the ascending parameter in the sort() method. Here, the number of boolean values should be equal to the number of column names passed to the sort() method. Each value in the list corresponds to a single column at the same position in the parameter list.

If we want to sort the pyspark dataframe in ascending order by the ith column name passed to the sort() method, the ith element in the list passed to the ascending parameter should be True. Similarly, if we want to sort the pyspark dataframe in descending order by the jth column name passed to the sort() method, the jth element in the list should be False. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort('Maths','Physics',ascending=[True, False])
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
|    Sam|   76|     98|       95|
| Agatha|   76|     89|       83|
|  Chris|   86|     85|       82|
+-------+-----+-------+---------+

In this example, we have passed the list [True, False] to the ascending parameter in the sort() method.  Hence, the sort() method first sorts the dataframe by the Maths column in ascending order. For the rows where the Maths column contains equal values, the dataframe is sorted by the Physics column in descending order as the second element in the list passed to the ascending parameter is False.

If you are using the col() function to sort the dataframe, you can use the asc() method and desc() method on each column to sort the dataframe by the column in ascending order or descending order respectively.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths').asc(),col('Physics').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris|   86|     85|       82|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|     89|       83|
|    Sam|   76|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|   Joel|   45|     85|       87|
|Katrina|   49|     47|       83|
|    Sam|   76|     98|       95|
| Agatha|   76|     89|       83|
|  Chris|   86|     85|       82|
+-------+-----+-------+---------+

In this example, we have used the asc() method on the Maths column and the desc() method on the Physics column object. Hence, the sort() method first sorts the dataframe by the Maths column in ascending order. If two rows have the same values in the Maths column, they are sorted by the Physics column in descending order, as we have used the desc() method on the Physics column.

Sort PySpark DataFrame With Null Values

If the column by which we are sorting a pyspark dataframe contains null values, the rows with null values are placed at the top of the sorted dataframe when sorting in ascending order. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths'))
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|   null|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|   null|       83|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|    Sam| null|     98|       95|
|  Chris| null|     85|       82|
| Aditya|   45|     89|       71|
|   Joel|   45|   null|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|   null|       83|
+-------+-----+-------+---------+

In this example, the Maths column contains null values in two rows. Hence, when we sort the pyspark dataframe by the Maths column in ascending order, the rows with null values are put at the top of the dataframe in the output.

When we sort the dataframe in descending order, the null values are put at the bottom of the dataframe.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|   null|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|   null|       83|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha|   76|   null|       83|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|   null|       87|
|  Chris| null|     85|       82|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

Here, we have sorted the pyspark dataframe by the Maths column in descending order. Hence, the rows with null values in the Maths column are positioned at the bottom of the output dataframe.

Pyspark Sort DataFrame Nulls Last

To put the rows containing null values at the end of the sorted dataframe, we can use the asc_nulls_last() method on the column given to the sort() method. After this, the dataframe will be sorted in ascending order with the rows containing null values placed at the bottom of the dataframe. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths').asc_nulls_last())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|   null|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|   null|       83|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|   Joel|   45|   null|       87|
| Aditya|   45|     89|       71|
|Katrina|   49|     47|       83|
| Agatha|   76|   null|       83|
|  Chris| null|     85|       82|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

In this example, we have invoked the asc_nulls_last() method on the column object passed to the sort() method. Hence, even if we are sorting the dataframe in ascending order, the rows with null values are positioned at the bottom of the output dataframe.

Similarly, if you want to sort the pyspark dataframe in descending order and put the rows containing nulls at the bottom of the dataframe, you can use the desc_nulls_last() method in the sort() method as shown below. 

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.sort(col('Maths').desc_nulls_last())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       71|
|  Chris| null|     85|       82|
|   Joel|   45|   null|       87|
|Katrina|   49|     47|       83|
| Agatha|   76|   null|       83|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

The sorted dataframe is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha|   76|   null|       83|
|Katrina|   49|     47|       83|
| Aditya|   45|     89|       71|
|   Joel|   45|   null|       87|
|  Chris| null|     85|       82|
|    Sam| null|     98|       95|
+-------+-----+-------+---------+

Pyspark Sort DataFrame Nulls First

To sort a pyspark dataframe in descending order with the null values at the top of the sorted dataframe, you can use the desc_nulls_first() method. When we invoke the desc_nulls_first() method on a column object, the sort() method returns the pyspark dataframe sorted in descending order with the null values at the top of the dataframe.

You can also use the asc_nulls_first() method to sort the pyspark dataframe in ascending order and place the rows containing null values at the top of the dataframe, which is the default behavior of the sort() method.
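
As a minimal sketch, assuming the same sample_csv_file.csv used in the earlier examples, the desc_nulls_first() method can be used as follows. In the sorted output, the rows with null values in the Maths column would appear at the top, followed by the remaining rows in descending order of Maths.

import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("sort_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input dataframe is:")
dfs.show()
# desc_nulls_first() sorts the Maths column in descending order
# and places the rows with null Maths values at the top
dfs=dfs.sort(col('Maths').desc_nulls_first())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()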

Conclusion

In this article, we discussed how to sort a pyspark dataframe using the sort() method. To learn more about Python programming, you can read this article on Pyspark vs Pandas. You might also like this article on list of lists in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning!

The post PySpark Sort DataFrame With Examples appeared first on PythonForBeginners.com.

]]>
11899
Tuple Unpacking in Python https://www.pythonforbeginners.com/basics/tuple-unpacking-in-python Fri, 07 Jul 2023 13:00:00 +0000 https://www.pythonforbeginners.com/?p=11779 Python provides us with different shorthand methods to perform various tasks. In this article, we will discuss tuple unpacking in Python with examples. What is Unpacking in Python? In Python, the unpacking operation is used to assign elements of a collection object like a list, tuple, dictionary, etc directly to different variables. We use the […]

The post Tuple Unpacking in Python appeared first on PythonForBeginners.com.

]]>
Python provides us with different shorthand methods to perform various tasks. In this article, we will discuss tuple unpacking in Python with examples.

What is Unpacking in Python?

In Python, the unpacking operation is used to assign the elements of a collection object, such as a list, tuple, or dictionary, directly to different variables. We use parallel assignment and the Python unpacking operator (*) to unpack collection objects.
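
For a quick preview, here is a minimal sketch of both approaches (the tuple and variable names are purely illustrative):

myTuple=(3,4)
x,y=myTuple           # parallel assignment: x=3, y=4
first,*rest=myTuple   # unpacking operator: first=3, rest=[4]
print("x and y are:",x,y)
print("first and rest are:",first,rest)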

Let us discuss both ways for tuple unpacking in Python.

Tuple Unpacking Using Parallel Assignment

If a tuple has N elements, we can perform tuple unpacking in Python using parallel assignment as shown in the following syntax.

var1, var2, var3, ..., varN-1, varN = (e1, e2, e3, ..., eN-1, eN)

In the above syntax, the number of variables on the left-hand side of the assignment operator must be equal to the number of elements in the tuple on the right side of the assignment operator. 

After tuple unpacking, the elements in the tuple are assigned to the variables at the corresponding position on the left-hand side of the assignment operator.  The first element of the tuple is assigned to the first variable, the second element of the tuple is assigned to the second variable on LHS, and so on. You can observe this in the following example.

myTuple=(1,12,11,14)
print("The tuple is:",myTuple)
a,b,c,d=myTuple
print("The variables are:",a,b,c,d)

Output:

The tuple is: (1, 12, 11, 14)
The variables are: 1 12 11 14

In the above example, we have defined a tuple with four elements. Then, we unpack the tuple using the parallel assignment. You can observe that the elements in the tuple are assigned to the variable at the same position on the left-hand side of the assignment operator. 

In the parallel assignment operation, if the number of variables is less than the elements in the tuple, the program runs into a Python ValueError exception. You can observe this in the following example.

myTuple=(1,12,11,14,15,-1)
print("The tuple is:",myTuple)
a,b,c,d=myTuple
print("The variables are:",a,b,c,d)

Output:

The tuple is: (1, 12, 11, 14, 15, -1)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_115443/3055827636.py in <module>
      1 myTuple=(1,12,11,14,15,-1)
      2 print("The tuple is:",myTuple)
----> 3 a,b,c,d=myTuple
      4 print("The variables are:",a,b,c,d)

ValueError: too many values to unpack (expected 4)

In the above code, you can observe that the tuple contains six elements whereas there are only four variables on the left side of the assignment operator during tuple unpacking. Hence, the program runs into a ValueError exception.

In a similar manner, if the number of variables exceeds the number of elements in the tuple, the program runs into a ValueError exception saying that there aren’t enough values to unpack. You can observe this in the following example.

myTuple=(1,12,11,14)
print("The tuple is:",myTuple)
a,b,c,d,e,f=myTuple
print("The variables are:",a,b,c,d)

Output:

The tuple is: (1, 12, 11, 14)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_115443/1429134448.py in <module>
      1 myTuple=(1,12,11,14)
      2 print("The tuple is:",myTuple)
----> 3 a,b,c,d,e,f=myTuple
      4 print("The variables are:",a,b,c,d)

ValueError: not enough values to unpack (expected 6, got 4)

Here, we have only four elements in the tuple. However, the assignment statement contains six variables. Due to this, the program runs into a ValueError exception.

Tuple Unpacking Using The * Operator

If we have fewer variables than the number of elements in the tuple, we can use the * operator for tuple unpacking in Python. For tuple unpacking using the * operator, we will use the following syntax

var1, var2, var3, ..., varN, *var = myTuple

Here, if the tuple myTuple contains more than N elements, the first N elements of the tuple are assigned to the variables var1, var2, ..., varN. The rest of the elements are stored in the variable var as a list. You can observe this in the following example.

myTuple=(1,12,11,14,123,32,76,114,56)
print("The tuple is:",myTuple)
a,b,c,*d=myTuple
print("The variables are:",a,b,c,d)

Output:

The tuple is: (1, 12, 11, 14, 123, 32, 76, 114, 56)
The variables are: 1 12 11 [14, 123, 32, 76, 114, 56]

In the above example, there are nine elements in the tuple, but we have only four variables. You can observe that the fourth variable d in the code uses the * operator. Hence, the first three elements of the tuple are assigned to the variables a, b, and c. The rest of the elements are packed into a list and assigned to the variable d.

Instead of the above approach, you can also use the following syntax for tuple unpacking in Python.

*var, var1, var2, var3, ..., varN = myTuple

When we use the above syntax, the last N elements of the tuple myTuple are assigned to the variables var1, var2, ..., varN. The remaining elements at the start of the tuple are stored in the variable var as a list. You can observe this in the following example.

myTuple=(1,12,11,14,123,32,76,114,56)
print("The tuple is:",myTuple)
*a,b,c,d=myTuple
print("The variables are:",a,b,c,d)

Output:

The tuple is: (1, 12, 11, 14, 123, 32, 76, 114, 56)
The variables are: [1, 12, 11, 14, 123, 32] 76 114 56

In the above example, the first variable in the assignment statement uses the * operator. Hence, the last three elements of the tuple are assigned to the variables b, c, and d. The rest of the elements are packed into a list and assigned to the variable a.

You can also choose to put the variable containing the * operator in the middle of the variables on the left-hand side of the assignment operator. For instance, consider the following syntax.

var1, var2, ..., varM, *var, varM+1, varM+2, ..., varN = myTuple

If we use the above syntax, the first M elements of myTuple are assigned to the variables var1, var2, till varM. The last N-M elements are assigned to the variables varM+1, varM+2,… till varN. The rest of the elements in the middle of the tuple are assigned to the variable var in a list. You can observe this in the following example.

myTuple=(1,12,11,14,123,32,76,114,56)
print("The tuple is:",myTuple)
a,b,*c,d=myTuple
print("The variables are:",a,b,c,d)

Output:

The tuple is: (1, 12, 11, 14, 123, 32, 76, 114, 56)
The variables are: 1 12 [11, 14, 123, 32, 76, 114] 56

In this example, we have used the * operator with the third variable in the assignment statement. Hence, the first two elements of the tuple are assigned to the variables a and b. The last element of the tuple is assigned to the variable d. The rest of the elements are assigned as a list to the variable c.

Unpack Tuple into a List in Python

You can also use the tuple unpacking operation to create a list directly from a tuple. For this, we will use the following syntax.

myList=[*myTuple]

After executing the above statement, we get a list myList containing elements from the tuple myTuple.

myTuple=(1,12,11,14,123,32,76,114,56)
print("The tuple is:",myTuple)
myList=[*myTuple]
print("The list is:",myList)

Output:

The tuple is: (1, 12, 11, 14, 123, 32, 76, 114, 56)
The list is: [1, 12, 11, 14, 123, 32, 76, 114, 56]

Unpack Tuple into a Set

Similar to a list, we can also create a set from a tuple using the unpacking operator in Python as shown below.

myTuple=(1,12,11,14,12,32,76,11,56)
print("The tuple is:",myTuple)
mySet={*myTuple}
print("The set is:",mySet)

Output:

The tuple is: (1, 12, 11, 14, 12, 32, 76, 11, 56)
The set is: {32, 1, 11, 12, 76, 14, 56}

Conclusion

In this article, we discussed the basics of tuple unpacking in Python. To learn more about tuples, you can read this article on tuple comprehension. You might also like this article on tuple vs list in Python

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning!

The post Tuple Unpacking in Python appeared first on PythonForBeginners.com.

]]>
11779
PySpark Read CSV File With Examples https://www.pythonforbeginners.com/pyspark/pyspark-read-csv-file-with-examples Wed, 05 Jul 2023 13:00:00 +0000 https://www.pythonforbeginners.com/?p=11895 The csv file format is one of the most used file formats to store tabular data. In this article, we will discuss different ways to read a csv file in PySpark. Pyspark Read CSV File Using The csv() Method To read a csv file to create a pyspark dataframe, we can use the DataFrame.csv() method. […]

The post PySpark Read CSV File With Examples appeared first on PythonForBeginners.com.

]]>
The csv file format is one of the most used file formats to store tabular data. In this article, we will discuss different ways to read a csv file in PySpark.

Pyspark Read CSV File Using The csv() Method

To read a csv file into a pyspark dataframe, we can use the csv() method of the DataFrameReader object, which is available through the read attribute of a SparkSession (spark.read). The csv() method takes the filename of the csv file and returns a pyspark dataframe as shown below.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv")
print("The input csv file is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input csv file is:
+-------+-----+-------+---------+
|    _c0|  _c1|    _c2|      _c3|
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
| Aditya|   45|     89|       1 |
|  Chris|   86|     85|        2|
|   Joel| null|     85|       3 |
|Katrina|   49|     47|        4|
| Agatha|   76|     89|        5|
|    Sam|   76|     98|        6|
+-------+-----+-------+---------+

In the above example, we have used the following CSV file.

In the output, you can observe that the column names are given as _c0, _c1, _c2, and _c3. However, the actual column names should be Name, Maths, Physics, and Chemistry. Hence, we need a way to read the csv file with its column names.

Read CSV With Header as Column Names

To read a csv file with column names, you can use the header parameter in the csv() method. When we set the header parameter to True in the csv() method, the first row of the csv file is treated as the column names. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input csv file is:")
dfs.show()
spark.sparkContext.stop()

Output:

The input csv file is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       1 |
|  Chris|   86|     85|        2|
|   Joel| null|     85|       3 |
|Katrina|   49|     47|        4|
| Agatha|   76|     89|        5|
|    Sam|   76|     98|        6|
+-------+-----+-------+---------+

In this example, we have set the header parameter to True in the csv() method. Hence, the first line of the csv file is read as column names.

Read CSV With inferSchema Parameter

By default, the csv() method reads all the values as strings. For example, if we print the data types using the dtypes attribute of the pyspark dataframe, you can observe that all the columns have the string data type.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True)
print("The input csv file is:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv file is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|       1 |
|  Chris|   86|     85|        2|
|   Joel| null|     85|       3 |
|Katrina|   49|     47|        4|
| Agatha|   76|     89|        5|
|    Sam|   76|     98|        6|
+-------+-----+-------+---------+

The data type of columns is:
[('Name', 'string'), ('Maths', 'string'), ('Physics', 'string'), ('Chemistry', 'string')]

In the above output, you can observe that all the columns have string data types irrespective of the values in the columns.

To read a csv file with correct data types for columns, we can use the inferSchema parameter in the csv() method. When we set the inferSchema parameter to True, the program scans all the values in the dataframe and assigns the best data type to each column. You can observe this in the following example.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()

dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input csv file is:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv file is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|      1.0|
|  Chris|   86|     85|      2.0|
|   Joel| null|     85|      3.0|
|Katrina|   49|     47|      4.0|
| Agatha|   76|     89|      5.0|
|    Sam|   76|     98|      6.0|
+-------+-----+-------+---------+

The data type of columns is:
[('Name', 'string'), ('Maths', 'int'), ('Physics', 'int'), ('Chemistry', 'double')]

In this example, we have set the inferSchema parameter to True. Hence, the columns are given proper data types.

Why Should You Avoid Using The inferSchema Parameter in PySpark?

Using the inferSchema parameter to decide the data types of the columns in a pyspark dataframe is a costly operation. When we set the inferSchema parameter to True, the program needs to scan all the values in the csv file before it can decide the data type of each column. This extra pass over the data can be expensive for large datasets, which is why setting the inferSchema parameter to True isn't recommended for large files.
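
If you still need schema inference on a large file, one option worth checking is the samplingRatio parameter of the csv() method, which infers the schema from only a fraction of the rows; its availability may depend on your Spark version, and an inferred type can be wrong if the sampled rows are not representative. The sketch below assumes the same sample_csv_file.csv used above.

import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()

# samplingRatio asks Spark to infer the schema from roughly 10% of the rows
# instead of scanning the entire file (assumed available in your Spark version)
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True,samplingRatio=0.1)
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()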

PySpark Read CSV File With Schema

Instead of using the inferSchema parameter, we can read csv files with specified schemas. 

A schema contains the column names, their data types, and a boolean value nullable to specify if a particular column can contain null values or not. 

To define the schema for a pyspark dataframe, we use the StructType() function and the StructField() function. 

The StructField() function is used to define the name and data type of a particular column. It takes the column name as its first input argument and the data type of the column as its second input argument.  To specify the data type of the column names, we use the StringType(), IntegerType(), FloatType(), DoubleType(), and other functions defined in the pyspark.sql.types module. 

In the third input argument to the StructField() function, we pass True or False specifying if the column can contain null values or not. If we set the third parameter to True, the column will allow null values. Otherwise, it will not.

The StructType() function is used to create the schema for the pyspark dataframe. It takes a list of StructField objects as its input argument and returns a StructType object that we can use as a schema.

To read a csv file with schema using pyspark, we will use the following steps.

  1. First, we will define the data type for each column using the StructField() function.
  2. Next, we will pass a list of all the StructField objects to the StructType() function to create a schema.
  3. Finally, we will pass the StructType object to the schema parameter in the csv() function while reading the csv file.

By executing the above steps, we can read a csv file in pyspark with a given schema. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()
list_of_cols=[StructField("Name",StringType(),True),
             StructField("Maths",IntegerType(),True),
             StructField("Physics",IntegerType(),True),
             StructField("Chemistry",IntegerType(),True)]
schema=StructType(list_of_cols)
print("The schema is:")
print(schema)
spark.sparkContext.stop()

Output:

The schema is:
StructType([StructField('Name', StringType(), True), StructField('Maths', IntegerType(), True), StructField('Physics', IntegerType(), True), StructField('Chemistry', IntegerType(), True)])

In the above code, we have defined the schema for the csv file using the StructField() function and the StructType() function.

After defining the schema, you can pass it to the csv() method to read the csv file with a proper data type for each column as shown in the following example.

import pyspark.sql as ps
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()
list_of_cols=[StructField("Name",StringType(),True),
             StructField("Maths",IntegerType(),True),
             StructField("Physics",IntegerType(),True),
             StructField("Chemistry",IntegerType(),True)]
schema=StructType(list_of_cols)
dfs=spark.read.csv("sample_csv_file.csv",header=True,schema=schema)
print("The input csv file is:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv file is:
+-------+-----+-------+---------+
|   Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya|   45|     89|     null|
|  Chris|   86|     85|        2|
|   Joel| null|     85|     null|
|Katrina|   49|     47|        4|
| Agatha|   76|     89|        5|
|    Sam|   76|     98|        6|
+-------+-----+-------+---------+

The data type of columns is:
[('Name', 'string'), ('Maths', 'int'), ('Physics', 'int'), ('Chemistry', 'int')]

In the above example, we have read a csv using schema. Observe that the values in a column that cannot be converted to the given data type in the schema are replaced with null values.

Read CSV With Different Delimiter in PySpark

The csv files need not contain the comma character as its delimiter. They might also contain characters like tabs, spaces, colons (:), semi-colons (;), pipe characters (|), etc as delimiters. For example, let us take the following file that uses the pipe character as the delimiter.

To read a csv file in pyspark with a given delimiter, you can use the sep parameter in the csv() method. The csv() method takes the delimiter as an input argument to the sep parameter and returns the pyspark dataframe as shown below.

import pyspark.sql as ps
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()
dfs=spark.read.csv("demo_file.csv",header=True,inferSchema=True, sep="|")
print("The input csv file is:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv file is:
+------+----+----------+-----+
|  Name|Roll|  Language|Extra|
+------+----+----------+-----+
|Aditya|   1|    Python|   11|
|   Sam|   2|      Java|   12|
| Chris|   3|       C++|   13|
|  Joel|   4|TypeScript|   14|
+------+----+----------+-----+

The data type of columns is:
[('Name', 'string'), ('Roll', 'int'), ('Language', 'string'), ('Extra', 'int')]

In the above example, the csv file contains the | character as its delimiter. To read the file, we have passed the | character to the sep parameter as input in the csv() method.

Read Multiple CSV Files into a Single PySpark DataFrame

To read multiple csv files into a pyspark dataframe at once, you can pass the list of filenames to the csv() method as its first input argument. After execution, the csv() method will return the pyspark dataframe with data from all files as shown below.

import pyspark.sql as ps
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()
dfs=spark.read.csv(["demo_file.csv","demo_file2.csv"],header=True,inferSchema=True, sep="|")
print("The input csv files are:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv files are:
+------+----+----------+-----+
|  Name|Roll|  Language|Extra|
+------+----+----------+-----+
|Aditya|   1|    Python|   11|
|   Sam|   2|      Java|   12|
| Chris|   3|       C++|   13|
|  Joel|   4|TypeScript|   14|
|George|  12|        C#|   15|
|  Sean|  13|       SQL|   16|
|   Joe|  14|       PHP|   17|
|   Sam|  15|JavaScript|   18|
+------+----+----------+-----+

The data type of columns is:
[('Name', 'string'), ('Roll', 'int'), ('Language', 'string'), ('Extra', 'int')]

In the above example, we have used the following files.

In the output, you can observe that the rows of the files are stacked vertically, in the order in which the files are passed to the csv() function.

Multiple CSV Files With Different Column Names

If the files that we pass to the csv() method have the same number of columns but different column names, the output dataframe will contain the column names of the first csv file. The data in the columns are stacked by their positions to create the output dataframe. You can observe this in the following example.

import pyspark.sql as ps
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()
dfs=spark.read.csv(["demo_file.csv","demo_file2.csv"],header=True,inferSchema=True, sep="|")
print("The input csv files are:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv files are:
23/07/09 04:54:17 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: Name, Roll, Language, Extra
 Schema: Name, Roll, Language, Ratings
Expected: Ratings but found: Extra
CSV file: file:///home/aditya1117/codes/demo_file2.csv
+------+----+----------+-------+
|  Name|Roll|  Language|Ratings|
+------+----+----------+-------+
|Aditya|   1|    Python|     11|
|   Sam|   2|      Java|     12|
| Chris|   3|       C++|     13|
|  Joel|   4|TypeScript|     14|
|George|  12|        C#|     15|
|  Sean|  13|       SQL|     16|
|   Joe|  14|       PHP|     17|
|   Sam|  15|JavaScript|     18|
+------+----+----------+-------+

The data type of columns is:
[('Name', 'string'), ('Roll', 'string'), ('Language', 'string'), ('Ratings', 'string')]

In the above example, the first csv file has the column names Name, Roll, Language, and Ratings. The second csv file has Extra as the last column instead of Ratings.

In the output, you can observe that the column names of the first csv files are selected as schema. Hence, the csv() function prints a warning when it encounters a different column name.

CSV Files With Different Numbers of Columns in PySpark

If the input files contain a different number of columns, the column names in the schema of the output dataframe are selected from the CSV file with more columns. Here, the rows from the csv file with fewer columns are filled with null values in the extra columns.

To understand this, let us add an extra column to the demo_file.csv. The updated file is as follows.

Now, let us read both files into a pyspark dataframe using the csv() function.

import pyspark.sql as ps
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("readcsv_example") \
      .getOrCreate()
dfs=spark.read.csv(["demo_file2.csv","demo_file.csv"],header=True,inferSchema=True, sep="|")
print("The input csv files are:")
dfs.show()
print("The data type of columns is:")
print(dfs.dtypes)
spark.sparkContext.stop()

Output:

The input csv files are:
23/07/09 04:57:08 WARN CSVHeaderChecker: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 4, schema size: 5
CSV file: file:///home/aditya1117/codes/demo_file2.csv
+------+----+----------+-------+-----+
|  Name|Roll|  Language|Ratings|Grade|
+------+----+----------+-------+-----+
|Aditya|   1|    Python|     11|    A|
|   Sam|   2|      Java|     12|    A|
| Chris|   3|       C++|     13|   A+|
|  Joel|   4|TypeScript|     14|   A+|
|George|  12|        C#|     15| null|
|  Sean|  13|       SQL|     16| null|
|   Joe|  14|       PHP|     17| null|
|   Sam|  15|JavaScript|     18| null|
+------+----+----------+-------+-----+

The data type of columns is:
[('Name', 'string'), ('Roll', 'string'), ('Language', 'string'), ('Ratings', 'string'), ('Grade', 'string')]

In the above code, the updated demo_file.csv contains 5 columns, whereas demo_file2.csv contains only 4. Hence, the column names given in demo_file.csv are selected for the schema, despite the fact that we have passed it as the second file to the csv() function. You can also observe that the output pyspark dataframe contains the data from demo_file.csv at the top of the dataframe, as the schema is selected from this file.

Conclusion

In this article, we have discussed different ways to read a CSV file in Pyspark. To learn more about pyspark, you can read this article on pyspark vs pandas. You might also like this article on how to create an empty pyspark dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles. 

Happy Learning!

The post PySpark Read CSV File With Examples appeared first on PythonForBeginners.com.

]]>
11895
PySpark Create Empty DataFrame https://www.pythonforbeginners.com/pyspark/pyspark-create-empty-dataframe Mon, 03 Jul 2023 13:00:00 +0000 https://www.pythonforbeginners.com/?p=11897 We use pyspark dataframes to handle tabular data. Sometimes, we need to create empty pyspark dataframes. This article will discuss how to create an empty dataframe in Pyspark. Create An Empty DataFrame With Column Names in PySpark We need to perform three steps to create an empty pyspark dataframe with column names.  First, we will […]

The post PySpark Create Empty DataFrame appeared first on PythonForBeginners.com.

]]>
We use pyspark dataframes to handle tabular data. Sometimes, we need to create empty pyspark dataframes. This article will discuss how to create an empty dataframe in Pyspark.

Create An Empty DataFrame With Column Names in PySpark

We need to perform three steps to create an empty pyspark dataframe with column names. 

  1. First, we will create an empty RDD object. 
  2. Next, we will define the schema for the dataframe using the column names and data types. 
  3. Finally, we will convert the RDD to a dataframe using the schema. 

Let us discuss all these steps one by one. 

Create an Empty RDD in Pyspark

To create an empty dataframe in pyspark, we will first create an empty RDD. To create an empty RDD, you just need to use the emptyRDD() function on the sparkContext attribute of a spark session. After execution, the emptyRDD() function returns an empty RDD as shown below. 

from pyspark.sql.types import StructType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_rdd=spark.sparkContext.emptyRDD()
print("The empty RDD object is:")
print(empty_rdd)

Output:

The empty RDD object is:
EmptyRDD[1] at emptyRDD at NativeMethodAccessorImpl.java:0

Here, we have created an emptyRDD object using the emptyRDD() method.

Define Schema For The DataFrame in PySpark

To define the schema for a pyspark dataframe, we use the StructType() function and the StructField() function. 

The StructField() function is used to define the name and data type of a particular column. It takes the column name as its first input argument and the data type of the column as its second input argument.  To specify the data type of the column names, we use the StringType(), IntegerType(), FloatType(), DoubleType(), and other functions defined in the pyspark.sql.types module. 

In the third input argument to the StructField() function, we pass True or False specifying if the column can contain null values or not. If we set the third parameter to True, the column will allow null values. Otherwise, it will not.

After specifying the column names using the StructField() function, we can pass all the StructField objects to the StructType() function to create a schema for the dataframe. As shown below.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_rdd=spark.sparkContext.emptyRDD()
list_of_cols=[StructField("Roll",IntegerType(),True),
             StructField("Name",StringType(),True),
             StructField("Percentage",FloatType(),True)]
schema=StructType(list_of_cols)
print("The schema is:")
print(schema)

Output:

The schema is:
StructType([StructField('Roll', IntegerType(), True), StructField('Name', StringType(), True), StructField('Percentage', FloatType(), True)])

In this example, we have defined the schema for a dataframe having three columns i.e. Roll, Name, and Percentage.

Convert Empty RDD to PySpark DataFrame Using the Schema

Once we get the empty RDD and the schema, we can use the createDataFrame() function to create an empty pyspark dataframe with column names. The createDataFrame() function takes the empty RDD object as its first input argument and the schema as its second input argument. After execution, it returns an empty dataframe with column names as shown below.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_rdd=spark.sparkContext.emptyRDD()
list_of_cols=[StructField("Roll",IntegerType(),True),
             StructField("Name",StringType(),True),
             StructField("Percentage",FloatType(),True)]
schema=StructType(list_of_cols)
df=spark.createDataFrame(empty_rdd,schema=schema)
print("The empty dataframe is:")
df.show()

Output:

The empty dataframe is:
+----+----+----------+
|Roll|Name|Percentage|
+----+----+----------+
+----+----+----------+

In this example, you can observe that the createDataFrame() function takes an emptyRDD object and the schema for the dataframe and returns an empty dataframe with given column names.

Instead of the createDataFrame() function, you can also use the toDF() method to convert an empty RDD to an empty pyspark dataframe with column names. The toDF() method, when invoked on an emptyRDD object, takes the schema as its input argument and returns an empty pyspark dataframe with column names. You can observe this in the following example.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_rdd=spark.sparkContext.emptyRDD()
list_of_cols=[StructField("Roll",IntegerType(),True),
             StructField("Name",StringType(),True),
             StructField("Percentage",FloatType(),True)]
schema=StructType(list_of_cols)
df=empty_rdd.toDF(schema=schema)
print("The empty dataframe is:")
df.show()

Output:

The empty dataframe is:
+----+----+----------+
|Roll|Name|Percentage|
+----+----+----------+
+----+----+----------+

In this example, we used the toDF() method instead of the createDataFrame() function to create an empty pyspark dataframe.

Create an Empty PySpark DataFrame Directly Using Schema

To create an empty dataframe directly using schema, you can pass an empty list to the createDataFrame() function as its first input argument and the schema created using the column names as its second input argument. After execution of the createDataFrame() function, you will get the empty dataframe as shown below.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
list_of_cols=[StructField("Roll",IntegerType(),True),
             StructField("Name",StringType(),True),
             StructField("Percentage",FloatType(),True)]
schema=StructType(list_of_cols)
df=spark.createDataFrame([],schema=schema)
print("The empty dataframe is:")
df.show()

Output:

The empty dataframe is:
+----+----+----------+
|Roll|Name|Percentage|
+----+----+----------+
+----+----+----------+

In the above code, you can observe that we haven’t used an empty RDD object to create the empty dataframe. Instead, we directly passed an empty list and a schema to the createDataFrame() function to obtain the empty dataframe with column names.

Create an Empty Data Frame Without Column Names

We can also create empty dataframes without column names. For this, we can pass an empty StructType object to the functions discussed in the previous sections instead of a schema. 

To create an empty StructType object, we will first pass an empty list to the StructType() function. After this, we can use an emptyRDD object or createDataFrame() function directly to create an empty pyspark dataframe without columns.

Empty DataFrame Without Column Names Using The emptyRDD Object

You can pass an emptyRDD object and an empty StructType object to the createDataFrame() function as input arguments to create an empty pyspark dataframe without column names, as shown below.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_rdd=spark.sparkContext.emptyRDD()
empty_schema=StructType([])
df=spark.createDataFrame(empty_rdd,schema=empty_schema)
print("The empty dataframe is:")
df.show()

Output:

The empty dataframe is:
++
||
++
++

In the above example, we first created an empty schema using the StructType() function. Then, we passed the empty schema along with the emptyRDD object to create an empty dataframe without column names.

Alternatively, you can also invoke the toDF() method on the emptyRDD object and pass the empty StructType object to the toDF() function as its input argument. After executing the toDF() method, you will get an empty dataframe without column names as shown in the following example.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_rdd=spark.sparkContext.emptyRDD()
empty_schema=StructType([])
df=empty_rdd.toDF(schema=empty_schema)
print("The empty dataframe is:")
df.show()

Output:

The empty dataframe is:
++
||
++
++

Empty DataFrame Without Column Names Using The createDataFrame() Function

To create an empty pyspark dataframe using the createDataFrame() function, we will pass an empty list as the first argument and the empty StructType object as the second input argument to the createDataFrame() function. After executing the function, we will get an empty dataframe without column names as shown below.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType
import pyspark.sql as ps
spark = ps.SparkSession.builder \
      .master("local[*]") \
      .appName("create_dataframe") \
      .getOrCreate()
empty_schema=StructType([])
df=spark.createDataFrame([],schema=empty_schema)
print("The empty dataframe is:")
df.show()

Output:

The empty dataframe is:
++
||
++
++

Conclusion

In this article, we discussed different ways to create an empty pyspark dataframe. To learn more about Pyspark, you can read this article on pyspark vs pandas. You might also like this article on list of lists in Python.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

The post PySpark Create Empty DataFrame appeared first on PythonForBeginners.com.

]]>
11897