Running a SQL notebook without duplicates in Databricks
My task was to write a script that combines data from two silver tables and generates a gold table. If you are not familiar with the Lakehouse architecture, you can check here. A Lakehouse stores data in layers based on the level of processing: bronze, silver and gold.
The goal was to generate a gold table that would help monitor monthly platform usage. I had two options: SQL and PySpark. I tried both approaches, but the SQL version, which used a plain INSERT INTO, introduced duplicates in the gold table whenever I ran the notebook twice with the same parameters. That method was not robust enough.
After a long search, the WHERE NOT EXISTS clause got the job done. I have not experimented with the MERGE statement's WHEN NOT MATCHED clause, but I assume it can also work (a sketch is shown after the script below). What follows is not a mirror of my exact script, but it should be helpful if you want to ingest data into a table with SQL, either as a stored procedure or as a job.
USE database;

INSERT INTO gold_table
SELECT
    name,
    id,
    year,
    date,
    total_charges
FROM silver_table AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM gold_table AS g
    WHERE c.name = g.name
      AND c.id = g.id
      AND c.year = g.year
      AND c.date = g.date
      AND c.total_charges = g.total_charges
);
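For completeness, here is a rough sketch of the MERGE alternative mentioned above. I have not run this version myself, so treat it as a starting point rather than a tested solution; it reuses the same table and column names and only inserts rows that do not already exist in the gold table.

-- Sketch of the MERGE / WHEN NOT MATCHED alternative (untested).
-- With no WHEN MATCHED clause, existing gold rows are left untouched
-- and only unmatched source rows are inserted.
MERGE INTO gold_table AS g
USING (
    SELECT name, id, year, date, total_charges
    FROM silver_table
) AS c
ON  c.name = g.name
AND c.id = g.id
AND c.year = g.year
AND c.date = g.date
AND c.total_charges = g.total_charges
WHEN NOT MATCHED THEN
    INSERT (name, id, year, date, total_charges)
    VALUES (c.name, c.id, c.year, c.date, c.total_charges);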
The assumptions are that the database and the gold table have already been created and that the table's schema is defined; a rough sketch of such a table definition is shown below.
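If the gold table does not exist yet, a minimal DDL sketch could look like the following. The column types here are my assumptions based on the column names in the query above, so adjust them to match your actual silver schema.

-- Minimal sketch of the gold table; column types are assumptions.
CREATE TABLE IF NOT EXISTS gold_table (
    name          STRING,
    id            STRING,
    year          INT,
    date          DATE,
    total_charges DOUBLE
)
USING DELTA;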
I believe this is not the only approach. You can also use Scala or PySpark, but if SQL is your choice, you can adopt this approach.