Databricks and Spark crash course. Delta and more!

In this article I will cover a short introduction to Databricks by having a look at the sample NYC Taxi datasets bundled inside Databricks clusters. If you want to go straight to the Databricks notebook and import it into your workspace, click here.

We will load some sample data from the NYC Taxi dataset available in Databricks and store it as a table. We will then use Python to do some manipulation (extracting the month and year from the trip time), which adds two new columns to our DataFrame, and we will check how the file is saved in the Hive warehouse. We will observe some junk data: partitioning created folders for months and years that we are not supposed to have, so we will filter out these bad records both the Python way and the SQL way, as in the sketch below.
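Here is a minimal sketch of those steps, assuming the classic yellow-taxi CSVs under /databricks-datasets/nyctaxi/ and their usual column names (`spark` and `dbutils` come predefined in a Databricks notebook; adjust the path and columns to whatever your cluster actually ships):

```python
from pyspark.sql import functions as F

# Load one month of the bundled NYC Taxi sample data. The exact path is an
# assumption -- run display(dbutils.fs.ls("/databricks-datasets/nyctaxi/"))
# to confirm what your runtime includes.
trips = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/databricks-datasets/nyctaxi/tripdata/yellow/yellow_tripdata_2019-01.csv.gz"))

# Derive the two extra columns from the pickup timestamp
# (column name assumed -- check trips.printSchema()).
trips = (trips
         .withColumn("tpep_pickup_datetime", F.to_timestamp("tpep_pickup_datetime"))
         .withColumn("trip_year", F.year("tpep_pickup_datetime"))
         .withColumn("trip_month", F.month("tpep_pickup_datetime")))

# Save as a managed table partitioned by the new columns. The Hive warehouse
# folder will then contain one subfolder per year/month value -- including
# the junk partitions created by bad rows.
(trips.write
      .mode("overwrite")
      .partitionBy("trip_year", "trip_month")
      .saveAsTable("nyc_trips"))

# Filter out the bad records the Python way...
clean = trips.filter((F.col("trip_year") == 2019) & (F.col("trip_month") == 1))

# ...and the SQL way.
clean_sql = spark.sql(
    "SELECT * FROM nyc_trips WHERE trip_year = 2019 AND trip_month = 1")
```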

Then, we will load another month of data as a temporary view and contrast it with a Delta table, where we can run updates and all sorts of DML.
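A sketch of that contrast, under the same path and schema assumptions as above: the temporary view is a session-scoped, read-only name over a DataFrame, while the Delta table accepts DML such as UPDATE.

```python
# Temporary view: just a queryable name over the DataFrame, no DML.
feb = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/databricks-datasets/nyctaxi/tripdata/yellow/yellow_tripdata_2019-02.csv.gz"))
feb.createOrReplaceTempView("trips_feb_v")

# Delta table: the same data, but now UPDATE / DELETE / MERGE all work.
feb.write.format("delta").mode("overwrite").saveAsTable("trips_feb_delta")

spark.sql("""
    UPDATE trips_feb_delta
    SET passenger_count = 1
    WHERE passenger_count = 0
""")
# Running the same UPDATE against trips_feb_v would fail:
# temporary views do not support DML.
```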

As a last step, we will load some master data and perform a join. For more on Delta Lake you can follow this tutorial: https://delta.io/tutorials/delta-lake-workshop-primer/
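As an illustration of the join step, here is a sketch with a hypothetical payment-type lookup standing in for the master data (the codes follow the public TLC data dictionary; the actual notebook may use a different master dataset):

```python
# Hypothetical master data: TLC payment-type codes as a small lookup
# DataFrame. In the notebook this would come from a separate file or table.
payment_types = spark.createDataFrame(
    [(1, "Credit card"), (2, "Cash"), (3, "No charge"), (4, "Dispute")],
    ["payment_type", "payment_desc"],
)

# Enrich the trips with the description via a left join on the code column.
enriched = (spark.table("trips_feb_delta")
            .join(payment_types, on="payment_type", how="left"))

enriched.select("tpep_pickup_datetime", "payment_type", "payment_desc").show(5)
```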

You can run this small notebook with the Community Edition of Databricks, which offers a cluster with one driver (no executors) with 16 GB of RAM, 2 CPUs, and 1 DBU. It's not big, but it is enough to follow this notebook. You can apply for the Community Edition here.

Get the notebook here. Happy coding!
