'Is Not in' With PySpark

In SQL it’s easy to find people in one list who are not in a second list (i.e., the NOT IN operator), but there is no equivalent command in PySpark. Well, at least not one that doesn’t involve collecting the second list onto the driver.

EDIT
Check the note at the bottom regarding “anti joins”. Using an anti join is much cleaner than the code described here.

Here is a tidbit of code that replicates SQL’s NOT IN, while keeping your data distributed across the workers (it will require a shuffle).

I start by creating some small dataframes.

import pyspark
from pyspark.sql import functions as F
a = sc.parallelize([[1, 'a'], [2, 'b'], [3, 'c']]).toDF(['id', 'valueA'])
b = sc.parallelize([[1, 'a'], [4, 'd'], [5, 'e']]).toDF(['id', 'valueB'])
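
A quick note: these snippets assume the interactive pyspark shell, where sc (the SparkContext) is predefined. In a standalone script you have to build the session yourself; here is a minimal sketch (the appName and variable names are my own):

from pyspark.sql import SparkSession

# The pyspark shell creates a session automatically; a script builds its own.
spark = SparkSession.builder.appName('not-in-example').getOrCreate()
sc = spark.sparkContext

# Equivalent dataframes, built without going through an RDD.
a = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'valueA'])
b = spark.createDataFrame([(1, 'a'), (4, 'd'), (5, 'e')], ['id', 'valueB'])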

Take a quick look at dataframe a.

a.show()
+---+------+
| id|valueA|
+---+------+
|  1|     a|
|  2|     b|
|  3|     c|
+---+------+

And dataframe b.

b.show()
+---+------+
| id|valueB|
+---+------+
|  1|     a|
|  4|     d|
|  5|     e|
+---+------+
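
Before building the worker-friendly version, it’s worth seeing the approach I want to avoid: collect a’s ids onto the driver, then filter b with isin. A minimal sketch, using the dataframes a and b from above; it works, but the collected list has to fit in driver memory.

# Naive "not in": pull every id in a back to the driver...
ids_in_a = [row.id for row in a.select('id').collect()]
# ...then filter b against that in-memory Python list.
b.filter(~F.col('id').isin(ids_in_a)).show()

The rest of the post keeps everything distributed instead.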

I create a new column in a that is all ones. I could have used an existing column, but this way I know the column is never null.

a = a.withColumn('inA', F.lit(1))
a.show()
+---+------+---+
| id|valueA|inA|
+---+------+---+
|  1|     a|  1|
|  2|     b|  1|
|  3|     c|  1|
+---+------+---+

I join a onto b with a left join. This way, every row of b whose id is not in a ends up with a null value in the “inA” column.

b.join(a, 'id', 'left').show()
+---+------+------+----+
| id|valueB|valueA| inA|
+---+------+------+----+
|  5|     e|  null|null|
|  1|     a|     a|   1|
|  4|     d|  null|null|
+---+------+------+----+

By keeping only the rows where “inA” is null, I create a new dataframe, c, which contains just the rows of b that were not in a.

c = b.join(a, 'id', 'left').filter(F.col('inA').isNull())
c.show()
+---+------+------+----+
| id|valueB|valueA| inA|
+---+------+------+----+
|  5|     e|  null|null|
|  4|     d|  null|null|
+---+------+------+----+

EDIT
I recently gave the PySpark documentation a more thorough reading and realized that PySpark’s join command has a left_anti option. A left_anti join provides the same functionality as described above, but in a single join command (no need to create a dummy column and filter).

For example, the following code will produce rows in b where the id value is not present in a.

c = b.join(a, 'id', 'left_anti')
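
To confirm, showing c yields the rows of b whose id is absent from a. Note that the left_anti join returns only b’s columns, so there is no dummy “inA” column to clean up (row order may vary):

c.show()

+---+------+
| id|valueB|
+---+------+
|  5|     e|
|  4|     d|
+---+------+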
